Quality Assurance Processes

Introduction

This page describes the various Quality Assurance facilities that are available in FreeBMD; these facilities are used to:

Identify anomalies - Reports
Determine why the anomalies have occurred - Analysis Tools
Correct the anomalies - Correction Tools

Normally the process would be to follow the above order.

By "anomaly" we are referring to entries or groups of entries that do not correspond to the entries in the index or, infrequently, entries in the index that appear to be wrong. In the latter case the entries are left unchanged but we add information to indicate to researchers the possible correction.

The purpose of the Reports is to identify anomalies systematically, but anomalies may be identified in a number of other ways; for example, a researcher may report a missing entry, which results in a missing page being identified. The analysis and correction tools can also be used in these cases.

While the Reports and Analysis Tools relate to the current data, the Correction Tools will normally only apply after the next update (exceptions here are Postem and Scan corrections which take place immediately).

Not all facilities are freely available; in particular, the Correction Tools normally require privileged access. However, many of the Correction Tools operate statically, that is, they are files of correction data, and this data can be generated by anyone and then submitted to an administrator for inclusion.

An explanation of the concepts referred to in this page is given here.

Analysis Tools

Overview

This section describes tools that can be used in the quality assurance process. These tools are not necessarily specifically or solely designed for use in this process.

Examining files

Probably the most important tool is Show File, which is used to display the contents of a particular file. It can be invoked directly by going to Show File and then filling in the details of the file in the form provided. However, it is also often available from Reports or other Analysis Tools as a link that goes directly to the file concerned; details are included with each Report or Tool. In this case it may also take you directly to one or more lines (identified by having a grey background) that are the subject of the reference. Line numbers are included in the listing when this facility is used.

Provenance of entries

It is often important to go from a search result to the file that contains the entry, that is, the provenance of the entry. Having clicked on the button next to an entry, the transcriptions that make up that entry are displayed in the Information page; shift+clicking on one of the entries will take you to provenance.pl, which will display the usernames and files corresponding to the entry. Clicking on one of these will take you to the line in the file containing the entry (using Show File).

However, having done this once, the next time you go to the Information page you will find that the user/filename is displayed next to the entry and you can click on this to go directly to the entry in the file (using Show File).

Predicting Quarters

Sometimes it is suspected that a transcription is for the wrong quarter, i.e. the year, quarter or event is wrong in the file header. However, it may be difficult to work out what it should be.

Predict from Volume and Page will go through a file and using data about page ranges (see here) work out the possible quarters that fit. The quarters are presented with the highest probability first, each one having the percentage of lines that would be correct for that quarter. Note that this can only be done when there are sufficient entries available for a quarter to gather the data about page ranges and the prediction needs to be checked against the actual source (e.g. scan) to confirm which quarter it really should be.

It should also be noted that this facility is only available for files that contain entries from a single quarter and event (i.e. RANDOM and OneName files are probably not suitable).
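
The scoring idea behind the prediction can be sketched as follows. This is only an illustration, not the code behind Predict from Volume and Page: the function name, the shape of the page-range data (keyed here by quarter and then by volume) and the sample figures are all assumptions made for the example.

```python
def predict_quarters(lines, page_ranges):
    """Rank candidate quarters by the percentage of lines that fit.

    lines:       list of (volume, page) pairs taken from one file
    page_ranges: {quarter: {volume: (lowest_page, highest_page)}}
                 (hypothetical structure, invented for this sketch)
    """
    if not lines:
        return []
    scores = []
    for quarter, ranges in page_ranges.items():
        fits = 0
        for volume, page in lines:
            if volume in ranges:
                low, high = ranges[volume]
                if low <= page <= high:
                    fits += 1
        scores.append((quarter, 100.0 * fits / len(lines)))
    # Highest percentage first, as in the tool's output.
    return sorted(scores, key=lambda score: score[1], reverse=True)


# Invented page ranges and file lines, for illustration only.
ranges = {"1863M3": {"6a": (1, 400)}, "1863M4": {"6a": (401, 900)}}
file_lines = [("6a", 312), ("6a", 350), ("6a", 420), ("6a", 10)]
print(predict_quarters(file_lines, ranges))
```

With the invented data above, three of the four lines fit 1863M3 and one fits 1863M4, so 1863M3 is listed first. As the text says, the result is only a suggestion and still has to be checked against the source.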

Accession display

The Show Accession page enables the records in an accession to be displayed. The accession can be specified by Accession Number or by the Record Number of an entry in the Accession.

Chunk display

The Show Records in Chunk page enables the records in a chunk to be displayed. The chunk can be specified by Chunk Number or by the Record Number of an entry in the Chunk. Chunk numbers are shown in the Superchunk Report.

Reports

The reports themselves can often be useful analysis tools, in that they display information systematically, which can help in starting to determine the cause of an anomaly.

Correction Tools

Overview

Correction tools operate either immediately on the data in the database (Postems and Scans) or they operate when the update takes place. The tools are either interactive (e.g. Show Postems) or they consist of entries in files that are read during the update. See below for an explanation of the syntactic convention used to describe the data in these files.

When doing corrections the principle of Least Impact should apply. That is, where there is more than one way to effect a correction, the one that makes the least impact should be used. So if a change could be effected by changing a ONENAME file to RANDOM or by removing the file, then the former should be chosen because it is the most conservative and has the least overall impact on the data in the system.

Entry Corrections

One of the most straightforward ways of implementing a correction is just to request that an entry is changed (which implies that it does not correspond to the entry in the index). On the Information page click on the link to submit a correction in the same way that any researcher could.

You can report that there is a systematic error (e.g. a computer assisted error such as all the surnames being misspelt) by noting in the Source field of the correction that the amendment applies to a number of entries (normally you attach the correction to the first entry in error and refer to "also the next n entries").

Making Alignments

The page Make Alignments is used to initiate and manage informing the update that two non-identical entries should be considered to be identical. There are several ways that such alignments can be specified and these are explained on the Make Alignments page.

Making alignments requires authorisation.

Omitting quarters from files

It is possible to omit the entries in a file for a particular quarter from the update, or, indeed, the whole file. This is normally only done for OneName files, but can also be done for RANDOM files, for the following reasons:

Entries specify the wrong event, year or quarter
Entries all have UCF characters (and therefore cannot be aligned)
Entries are not in index

This facility is effected through entries in a file that have the following format:

<user>/<filename> [ yyyyeq [, yyyyeq ]* ]

and can span lines if the preceding line ends with a backslash (\). Example:

jonesa/jones_study 1837M3,1867B1,1901M3,1888D1, \
                   1867M3

Comments start with hash (#).

Submit entries for quarters to be omitted, together with justification (preferably as a comment), to the update corrections coordinator.
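
Before submitting entries it can be useful to check that they follow the format above. The following is a minimal sketch of such a check, not the parser used by the update; the function name is hypothetical, and it assumes that the event letter is one of B, M or D, that the quarter digit is 1 to 4, and that a line with no quarters means the whole file.

```python
import re

# yyyyeq, e.g. 1837M3 (assumed: event is B, M or D; quarter is 1 to 4)
QUARTER_RE = re.compile(r"^\d{4}[BMD][1-4]$")


def parse_omissions(text):
    """Return {"user/filename": [quarters]}; an empty list means the whole file."""
    logical_lines, pending = [], ""
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].rstrip()  # drop comments
        if line.endswith("\\"):               # continued on the next line
            pending += line[:-1] + " "
            continue
        logical_lines.append(pending + line)
        pending = ""
    if pending:
        logical_lines.append(pending)

    omissions = {}
    for line in logical_lines:
        parts = line.split(None, 1)
        if not parts:
            continue  # blank or comment-only line
        file_ref = parts[0]
        if "/" not in file_ref:
            raise ValueError(f"expected <user>/<filename>, got {file_ref!r}")
        quarters = []
        if len(parts) > 1:
            quarters = [q.strip() for q in parts[1].split(",") if q.strip()]
        for quarter in quarters:
            if not QUARTER_RE.match(quarter):
                raise ValueError(f"bad quarter {quarter!r} for {file_ref}")
        omissions.setdefault(file_ref, []).extend(quarters)
    return omissions


sample = (
    "# wrong quarter in header\n"
    "jonesa/jones_study 1837M3,1867B1,1901M3,1888D1, \\\n"
    "                   1867M3\n"
)
print(parse_omissions(sample))
```

Run on the example above, this yields a single entry for jonesa/jones_study with all five quarters, the continuation line having been joined to the first.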

Forcing SuperChunks

Where two chunks are contiguous but have not been joined together into a superchunk it is possible to force them to be joined. This may be done for the following reasons:

There is a gap due to change of the first letter of surname
The transcriptions do not have enough information for the update to create a superchunk (e.g. no +PAGE in file)
The number in the +PAGE entry is incorrect
There is a blank scan

This facility is effected through entries in a file which have the following format:

<user1>/<filename1> [ ,<page1> ] <whitespace> <user2>/<filename2> [ ,<page2> ] <whitespace> [ <datetime> ]

which will force the chunk containing the first file (and page if given) to be in a superchunk with the second file (and page if given), provided neither file has been modified after the date and time (if present). If the first page is omitted the last page in the file is assumed. If the second page is omitted the first page in the file is assumed. The following is an example:

jonesa/1839M3A0024,25 jonesa/1839M3B0001 17/11/07 23:19

Comments start with hash (#).

If there are no page numbers on the +PAGE lines in a file the system assumes a starting page of 1, incrementing by 1.

Submit entries for superchunks to be forced, together with justification (preferably as a preceding comment), to the update corrections coordinator.
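
The line format above can be read mechanically, as the following sketch shows. It is an illustration under stated assumptions rather than the update's own code: the type and function names are hypothetical, pages are assumed to be plain integers, and only the most common timestamp format (DD/MM/YY hh:mm) is handled.

```python
from datetime import datetime
from typing import NamedTuple, Optional


class FileRef(NamedTuple):
    user: str
    filename: str
    page: Optional[int]  # None means the page was not given


class ForcedJoin(NamedTuple):
    first: FileRef
    second: FileRef
    not_modified_after: Optional[datetime]


def parse_file_ref(token):
    """Split '<user>/<filename>[,<page>]' into its parts."""
    path, _, page = token.partition(",")
    user, _, filename = path.partition("/")
    if not user or not filename:
        raise ValueError(f"expected <user>/<filename>, got {token!r}")
    return FileRef(user, filename, int(page) if page else None)


def parse_forced_join(line):
    """Parse one line; returns None for blank or comment-only lines."""
    tokens = line.split("#", 1)[0].split()
    if not tokens:
        return None
    if len(tokens) < 2:
        raise ValueError("two file references are required")
    stamp = " ".join(tokens[2:])  # the timestamp itself contains a space
    when = datetime.strptime(stamp, "%d/%m/%y %H:%M") if stamp else None
    return ForcedJoin(parse_file_ref(tokens[0]), parse_file_ref(tokens[1]), when)


def applies(join, modified_times):
    """True if no file was modified after the timestamp (or none was given)."""
    if join.not_modified_after is None:
        return True
    return all(t <= join.not_modified_after for t in modified_times)


join = parse_forced_join("jonesa/1839M3A0024,25 jonesa/1839M3B0001 17/11/07 23:19")
print(join)
```

For the example line, the first reference carries page 25, the second has no page (so the first page in the file would be assumed), and the timestamp is 17 November 2007 at 23:19.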

Accession Order Adjustments

In order to align entries the system has to put accessions in the right order, that is, the order they are in the index. It does this by sorting on the names in the file; normally this is just the name in the first entry of the accession but, if that is unreadable, entries are scanned until a suitable entry can be found.

Occasionally this does not produce the correct order because the entries are out of order in the index or because of a mis-transcription. To overcome this it is possible to tell the system to use an entry other than the first in the accession.

This facility is effected through entries in a file which have the following format:

<user>/<filename> [ ,<page> ] <whitespace> <entrynumber> <whitespace> [ <datetime> ]

and if <page> is omitted the first page in the file is assumed. Entries in the page are numbered from 1 and so sorting will be done on the <entrynumber> entry in the page provided the modification timestamp of the file is before <datetime> (if present). The following is an example

jonesa/1839M3A0024,25 5 17/11/07 23:19

Comments start with hash (#).

If there are no page numbers on the +PAGE lines the system assumes a starting page of 1, incrementing by 1.

Submit entries for accession order adjustments, together with justification (preferably as a preceding comment), to the update corrections coordinator.
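
The effect of an adjustment on the ordering can be sketched as follows. This is a simplified illustration, not the system's sorting code: the function names and sample surnames are invented, only surnames are compared, and the set of characters treated as making a name unreadable is an assumption.

```python
UCF_MARKERS = "*?_[]{}"  # assumption: characters that mark an uncertain reading


def sort_name(names, entry_number=None):
    """Pick the name an accession is sorted on.

    names:        surnames in the accession, in file order
    entry_number: 1-based entry to use instead of the default (an adjustment)
    """
    if entry_number is not None:
        return names[entry_number - 1].upper()
    for name in names:  # default: first readable name
        if name and not any(ch in UCF_MARKERS for ch in name):
            return name.upper()
    return ""  # nothing readable in the accession


def order_accessions(accessions, adjustments=None):
    """Return accession keys in sorted order; adjustments maps key -> entry number."""
    adjustments = adjustments or {}
    return sorted(
        accessions,
        key=lambda ref: sort_name(accessions[ref], adjustments.get(ref)),
    )


# Invented data: page 25 starts with a mis-transcribed surname.
pages = {
    "page24": ["Dowding", "Dowell"],
    "page25": ["Bowell", "Dowell", "Dowie"],
    "page26": ["Dowie", "Dowler"],
}
print(order_accessions(pages))                 # page25 sorts first (wrong)
print(order_accessions(pages, {"page25": 2}))  # sorted on entry 2 instead
```

Without the adjustment, page25 is sorted on "Bowell" and lands before page24; telling the sort to use entry 2 restores the order page24, page25, page26.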

Accession Alignments

The system will attempt to align two accessions if a sufficient proportion of the entries in the accessions are identical and do not contain UCF characters¹. The proportion can be adjusted but is currently set to 20%. Provided at least this proportion of the entries are identical the system will attempt to align the accessions, including using UCF comparison to determine if two entries are the same. Where UCF has been used in this way it is reported in the UCF Alignments Report (which is not normally used for Quality Assurance purposes).

Where less than this proportion is identical it is possible to instruct the system to attempt to align the accessions anyway. This can be done by manually using Make Alignments to align sufficient entries but there is a simpler way with Accession Alignments. This facility is effected through entries in a file which have the following format:

<user1>/<filename1> [ ,<page1> ] <whitespace> <user2>/<filename2> [ ,<page2> ] <whitespace> [ <datetime> ]

and if <page1> or <page2> is omitted the first page in the file is assumed. Entries in the two pages will be aligned provided the modification timestamp of the file is before <datetime> (if present). The following is an example:

jonesa/1839M3A0024,25 thomas/39M30024 17/11/07 23:19

Comments start with hash (#).

If there are no page numbers on the +PAGE lines the system assumes a starting page of 1, incrementing by 1.

Submit entries for accession alignments, together with justification (preferably as a preceding comment), to the update corrections coordinator.
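
The threshold test described at the start of this section can be sketched as follows. This is an illustration only: the names are hypothetical, the set of UCF characters is an assumption, and so is the choice to measure the proportion against the shorter of the two accessions (the text does not say how the proportion is calculated).

```python
UCF_MARKERS = set("*?_[]{}")  # assumption: characters that mark an uncertain reading
THRESHOLD = 0.20              # "currently set to 20%"


def has_ucf(entry):
    """True if any field of the entry contains a UCF character."""
    return any(ch in UCF_MARKERS for field in entry for ch in field)


def should_attempt_alignment(acc_a, acc_b, threshold=THRESHOLD):
    """Decide whether two accessions share enough identical, UCF-free entries."""
    clean_a = {entry for entry in acc_a if not has_ucf(entry)}
    clean_b = {entry for entry in acc_b if not has_ucf(entry)}
    identical = len(clean_a & clean_b)
    size = min(len(acc_a), len(acc_b))  # assumed denominator
    return size > 0 and identical / size >= threshold


# Invented entries (surname, given name) from two transcriptions of one page.
first = [("Dowell", "John"), ("Dowie", "Ann"), ("*", "*"),
         ("Dowler", "Tom"), ("Dowse", "Mary")]
second = [("Dowell", "John"), ("Dowie", "Ann"), ("*", "*"),
          ("Dowler", "Thos"), ("Dowse", "May")]
print(should_attempt_alignment(first, second))
```

In the invented data two of the five entries are identical and free of UCF characters, so the test passes. The matching *,* entries are ignored, which is the point of excluding UCF entries from the count.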

Force File Type

Because of the frequency with which files have been incorrectly given a type of OneName or SEQUENCED (when they should be RANDOM or RANDOM/OneName respectively) there is a facility to force a file to be considered to be RANDOM or OneName irrespective of what its header says. This facility is effected through entries in a file which have the following format:

<user>/<filename>

Comments start with hash (#).

Submit entries for files to be forced to be RANDOM or OneName, together with justification (preferably as a preceding comment), to the update corrections coordinator. Please keep RANDOM and OneName separate and clearly indicate which is required.

Reports

Introduction

The system produces reports on a regular basis, mostly during an update, but there are also some reports that are produced independently of the update, normally on a weekly schedule. These are:

Suspect files
Duplicate files

Suspect Files

This report contains a list of files that have some aspect of the name of the file and/or the content that could mean there is an error in the file (or its name). The report is produced weekly (the date it was produced is given in the preamble).

Please note that many of these issues are also identified when a file is uploaded so they should not occur for new files. However, the checking of uploaded files has tightened over the years and some old files may still have these issues.

The following is a list of the most common issues found.

Message	Meaning	Example
Year mismatch	The year in the file disagrees with the year implied by the file name	In file 1863M30001 the +S line specifies a year of 1873, i.e. +S,1873,Sep
Quarter mismatch	The quarter (Mar,Jun,Sep,Dec) in the file disagrees with the quarter implied by the file name	In file 1863M30001 the +S line specifies a quarter of March, i.e. +S,1873,Mar
Event mismatch	The event (Births,Deaths,Marriages) in the file disagrees with the event implied by the file name	In file 1863M30001 the +INFO line specifies an event of Births, i.e. +INFO,,,SEQUENCED,BIRTHS
Page number mismatch	The page number in the file disagrees with the page number implied by the file name	In file 1863M30001 the first +PAGE line specifies a page number of 10, i.e. +PAGE,10
Pages outside range	The file contains a high percentage of lines in which the page is outside the range expected for the district (x - y). Note that this refers to the current update - changing the file will only take effect at the next update
No data between +PAGE lines	There are two consecutive +PAGE lines
Duplicate page number	The same number appears in two +PAGE lines
No page number at start of file	There is no +PAGE line at the start of the data
Too many entries between +PAGE	There should be a +PAGE at the start of each page of the index but there are more lines between +PAGE lines than could be on a page of the index
Age at Death (or DOB) missing from file	The Age at Death field (or Date of Birth) is missing from all records in a file from 1st Jan 1866 onwards
Possible alternative name (alias/or)	A name in the file contains the characters 'alias' or 'or' indicating an alternative name that should be transcribed as two entries.	Bonus or Chapman,John,Aston,6a,312

Corrective Action

Since these errors relate to the content of the file without reference to other data (e.g. similar transcriptions) the normal corrective action is to arrange for the file to be changed.
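
The first four checks in the table compare the file name with the header lines, and can be sketched as follows. This is not the code that produces the report: the function name is hypothetical, it only handles file names of the form used in the table (year, event letter, quarter digit, four-digit page), and the mapping of quarter digits 1 to 4 onto Mar, Jun, Sep and Dec is assumed.

```python
import re

QUARTER_MONTH = {"1": "Mar", "2": "Jun", "3": "Sep", "4": "Dec"}  # assumed mapping
EVENT_NAME = {"B": "BIRTHS", "M": "MARRIAGES", "D": "DEATHS"}
NAME_RE = re.compile(r"^(\d{4})([BMD])([1-4])(\d{4})$")  # e.g. 1863M30001


def check_file(filename, lines):
    """Return the mismatch messages that apply to one file."""
    match = NAME_RE.match(filename)
    if not match:
        return []  # name does not follow the convention this sketch handles
    year, event, quarter, page = match.groups()
    issues = []
    first_page_seen = False
    for line in lines:
        fields = [field.strip() for field in line.split(",")]
        if fields[0] == "+S" and len(fields) >= 3:
            if fields[1] != year:
                issues.append("Year mismatch")
            if fields[2] != QUARTER_MONTH[quarter]:
                issues.append("Quarter mismatch")
        elif fields[0] == "+INFO" and len(fields) >= 5:
            if fields[4].upper() != EVENT_NAME[event]:
                issues.append("Event mismatch")
        elif fields[0] == "+PAGE" and not first_page_seen:
            first_page_seen = True  # only the first +PAGE is compared
            if len(fields) >= 2 and fields[1].isdigit() and int(fields[1]) != int(page):
                issues.append("Page number mismatch")
    return issues


print(check_file("1863M30001", ["+INFO,,,SEQUENCED,BIRTHS", "+S,1873,Mar", "+PAGE,10"]))
```

The sample call combines the four examples from the table in one file, so all four messages are reported.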

Duplicates

The list of suspected duplicate files gives a list of files that have been transcribed by different users but are identical. They are suspect because in many cases this results from the same file being uploaded by two different users (typically the user and their syndicate coordinator). A facility is available to exclude from this report files that are identical but have been individually keyed (a gratifyingly common occurrence); report such situations as described on the page.

A slightly more sophisticated check is done and presented at the end of the report. This is files that have versions that were previously identical and covers the case where a coordinator uploads the same file as a user but then makes some changes. Because the previous version was the same this gets reported.

Corrective Action

Corrections are normally done by the Syndicate Coordinator.

If files are the same transcription, one of them needs to be removed. Contact the Syndicate Coordinator.

Alternatively, if two transcriptions are different, then follow the instructions on the page.

Misalignments

Alignment is the process of merging the contents of two transcriptions that refer to the same data (normally to the same page of the index). If this has been done but some entries do not match then we get Misalignments.

See here for additional information on alignments.

The Misalignments Report contains, for each quarter, a list of entries that have not been aligned but, being in a similar position on the page, perhaps should be. This putative alignment is often right but sometimes wrong.

Corrective Action

Use Make Alignments to align the entries (requires administrative privilege). The Misalignments Report enables access to Make Alignments by control+click on an entry, thus considerably simplifying the process of making alignments (see details in the report). Furthermore, cells that have been aligned, or are in the process of being aligned, are coloured.

Superchunks

See here for an explanation of superchunking.

The Superchunk Report lists all the quarters for which Superchunk information is available. Next to each quarter are two numbers in the form (n,m), where m is the number of chunks in the quarter (normally approximately the number of pages) and n is the number of superchunks. The objective is to get n down to 1.

The information about each quarter is arranged in columns:

Column	Contains
1	the superchunk number
2	the files in the first chunk in the superchunk
3	the chunk number of the first chunk in the superchunk
4	the files in the last chunk in the superchunk
5	the chunk number of the last chunk in the superchunk

The lists of files are arranged in order of SEQUENCED, OneName and then RANDOM. Within SEQUENCED the files are arranged in order of page number (it is uncommon for a chunk to contain more than one page number, but when it does occur it is very useful to have them ordered).

When examining the data we are looking to understand why chunks have not been joined into one superchunk. So typically we would look at the last chunk of one superchunk and the first chunk of the next and try to determine why they have not been joined. There can be numerous reasons which are explained below.

Take care with page numbers; the page number of the file may not be the same as the page number of the chunk, for example if it is the second page of a double page scan.

Possible problems

Blank pages

Symptom: Page numbers not contiguous between end of one superchunk and start of next and no evidence of missing pages
Cause: Possibly the simplest reason is that there are blank pages where the first letter of the surname changes and therefore a gap in the page numbering; verify that the missing scans don't contain entries.
Corrective Action: Force superchunk

Entries not in index

Symptom: There is a chunk that is inserted between apparently contiguous chunks.
Cause: The chunk contains entries from a one name study but the entries do not appear in the index, probably because the quarter and/or event is wrong.
Corrective Action: Exclude the quarter. Getting the entries corrected is also an option but probably not worth the effort if there is an alternative transcription (which by implication there must be).

Wrong order of entries

Symptom: The superchunks start with pages in the wrong order.
Cause: The most probable cause is that one of the transcriptions, not necessarily the one that is obviously in the wrong place, has an entry at the start of the transcription that sorts it into the wrong order. This can be due to a mis-transcription or due to the index not being in alphabetic order.
Corrective action: Use Accession Order Adjustment to select an entry that will put the transcription in the right order. If the problem is due to a mis-transcription also request a correction.

Bad transcription

Symptom: There is a chunk from a one name study between two contiguous pages
Cause: The quarter has not been transcribed completely, e.g. every page number in the quarter has been transcribed as *.
Corrective action: Exclude OneName quarter

Poor scan

Symptom: The same page appears at the end of one superchunk and the start of another (note: actual page number not just the page number in the filename).
Cause: One (or possibly both) transcriptions has a large number of UCF characters, because it was transcribed from a poor scan, and therefore the two transcriptions have not been aligned.
Corrective action: If the start of the transcriptions coincides (i.e. no missing entries in the first 20%) use Accession Alignment. Otherwise use Make Alignments on individually equivalent entries.

Wrong header

Symptom: A sequenced file appears on its own, between contiguous page numbers
Cause: The file has the wrong header and is therefore for a different quarter/event
Corrective action: Get file corrected; in the meantime exclude with timestamp (so when it is corrected it will be incorporated again).

Wrong file type

Symptom: General chaos!
Cause: Typically this is caused by a RANDOM file being given a type of ONENAME or even SEQUENCED. So the system may effectively be being told that two entries, e.g. one for Jones and one for Wilcox, are contiguous in the index. Depending on the order in which the files are processed this can cause the system to get extremely confused as it tries to work with conflicting information.
Corrective action: Get the offending file corrected. This can be done by getting the file changed but because it is such a common fault (ONENAME should be RANDOM) the Force Random corrective tool is provided and can be used to correct the problem.

Out of place record

Symptom: Accessions not in the correct order or chunks not joined as Superchunks when expected
Cause: This can be caused by a 'rogue' record within a chunk, particularly one that is badly misplaced, for example at the end of a transcription of the surname Dowell there is an entry with the surname Thomas. This is relatively rare and quite difficult to diagnose although the Show Records in Chunk analysis tool can be used to display the records and scanning them can identify the culprit.
Corrective action: Get the offending entry corrected by submitting a correction.

System fault

Symptom: No obvious reason why chunks have not been joined or are not in the right order.
Cause: There is a problem with the way the system sorts the chunks or analyses the superchunks. This is now comparatively rare but still a possibility.
Corrective action: Get correction done by reporting to

Unlinked Postems

Postems are linked to entries through the content of the entry. It follows therefore that if an entry changes any postem for that entry will no longer be linked to it. By far the most common cause of this is that the entry has been corrected. Unfortunately, although understandably, researchers often submit a correction and put the same information in a postem, so when the entry gets corrected the postem becomes unlinked. After each update there will be a new batch of unlinked postems.

Corrective Action

When looking at the Postem listing unlinked postems are shown in red and it is possible to request a listing of only unlinked postems (check box "Unlinked only").

Unlinked postems can be

Deleted: if, for example, the postem gives a correction that has been completed
Relinked: if the postem contains any other information (as well as a correction)

In order to do this it is necessary to access the Postem listing with administrative privilege. In this mode facilities are provided to

Delete postems
Relink postems
Link from postems to the search page
Link from postems to the transcription
Have the system suggest entries to which postems can be relinked

Scan Links

Where entries have been transcribed from scans held by FreeBMD the system attempts to link each search result to the scan of the page (or pages) containing the entry. However, this relies on transcribers putting the correct page number at the head of each transcription of a page and, inevitably, errors are made or the rules are not followed. As a result of this the selection of the right scan is a complex process.

Where the wrong scan is selected by the system, or no appropriate scan can be found (even though scans are available), researchers are invited to leave feedback, either negative (the scan shown does not contain the entry) or positive (the scan does contain the entry). In addition researchers can attempt to find the right scan themselves and leave positive feedback when they find it.

The way the linking process works, once one entry on a page has been linked to the correct scan (for example, through positive feedback) all other entries on the same page will be correctly linked.

Facilities are available for those with administrative privilege to view the linkages and provide corrections through the Manage Scan Links page. These facilities are used to correct mistakes made in the feedback process and to provide definitive linkages that feedback cannot change.

Self Confirmed Entries

The Self Confirmed page contains a list of entries that have apparently been entered twice by the same transcriber, thus giving rise to the condition whereby the entry has been double keyed by the same person. This is not permitted under FreeBMD rules.

Corrective Action

Each entry (or more likely the entire file) that has been self confirmed needs to be examined and one of the entries (or the entire file) removed.

Dropped Records

The Dropped Records pages contain a list of records that have been dropped (omitted) from the database because:

they were in a RANDOM or ONENAME file
they were a single entry on their own
they are not System Entries
the quarter containing them is completely transcribed
they did not match any other entry

The reason for dropping such records is that analysis has shown that, where a quarter has been fully transcribed, there are a considerable number of such unaligned records and very few of them are accurate. This in part results from the issue that because RANDOM records have no "context" it is very difficult to align them if they do not match exactly.
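
Read together, the conditions in the list above amount to a single test, sketched below. This is an interpretation rather than the system's code: it assumes that all five conditions must hold at once, that "a single entry on their own" means an accession of one entry, and the field names are invented for the example.

```python
from dataclasses import dataclass


@dataclass
class Record:
    file_type: str             # SEQUENCED, RANDOM or ONENAME
    accession_size: int        # number of entries in the record's accession
    is_system_entry: bool
    quarter_complete: bool     # quarter is completely transcribed
    matched_other_entry: bool  # record matched some other entry


def is_dropped(record):
    """True if every one of the listed conditions holds (assumed reading)."""
    return (
        record.file_type.upper() in {"RANDOM", "ONENAME"}
        and record.accession_size == 1
        and not record.is_system_entry
        and record.quarter_complete
        and not record.matched_other_entry
    )


lone_random = Record("RANDOM", 1, False, True, False)
print(is_dropped(lone_random))
```

Under this reading, the same record in a quarter that is not yet completely transcribed would be kept.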

Corrective Action

There is no specific corrective action; this report is useful for checking why any entry is not in the database. If it is found that a correct record has been dropped the normal approach will be to align it with the related (incorrect) record in the database but specifying it as the "primary" entry (so it is the one displayed in the search results). In the rare instances where no such related entry exists, a System Entry will be generated.

Important Concepts

File type

The file type is specified in the first line of every transcription (the +INFO line) and can take the following values:

SEQUENCED: Used to transcribe complete pages from the index.
RANDOM: Used to transcribe entries that are not related to the location in the index page, for example where only isolated entries from a particular surname are being transcribed.
ONENAME: Used to transcribe sections of the index that relate to a single name. The system assumes that the entries for the same surname are contiguous in the index.

File Structure

There are two formats in which transcribers can record transcriptions:

Standard: This is the normal structure for SEQUENCED files and it is defined in the standard format definition.
Flat: This is a format often used for RANDOM and OneName files and it is defined here.

The key difference is that the Flat file structure has the quarter defined in each line whereas for the Standard file structure the quarter is defined by special lines that apply to all subsequent lines. Irrespective of the file structure, files may only contain events of one type.

Note that there is no mandatory connection between File Type and File Structure, even though as recorded above there is normally a relationship. Hence it is possible to find SEQUENCED files that use the Flat file structure.

Alignment

Please see the description of system processes for information on what alignment means.

Accession

An accession is created for each set of contiguous entries in a file. For SEQUENCED files this normally corresponds to a page (although +BREAK would affect this). For OneName files it is a contiguous list of entries with the same surname (unless interrupted by +BREAK). For RANDOM files an accession is always a single entry.

It follows, therefore, that each file consists of one or more accessions.
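
The rules above can be sketched as a small function that splits the lines of a file into accessions. It is a simplified illustration, not the update's code: the function names are hypothetical, it assumes that each +PAGE line starts a new page, that data lines begin with the surname followed by a comma, and that any other line starting with + is a control line to be skipped.

```python
def surname(line):
    """Surname is taken to be the first comma-separated field."""
    return line.split(",")[0].strip().upper()


def split_accessions(file_type, lines):
    """Group the data lines of one file into accessions."""
    file_type = file_type.upper()
    accessions, current = [], []

    def close():
        nonlocal current
        if current:
            accessions.append(current)
        current = []

    for line in lines:
        if line.startswith("+"):
            if line.startswith(("+PAGE", "+BREAK")):
                close()  # a new page or an explicit break ends the accession
            continue
        if file_type == "RANDOM":
            accessions.append([line])  # always a single entry
            continue
        if file_type == "ONENAME" and current:
            if surname(line) != surname(current[-1]):
                close()  # change of surname ends the accession
        current.append(line)
    close()
    return accessions


sequenced = [
    "+PAGE,24",
    "Dowell,John,Aston,6a,312",
    "Dowie,Ann,Aston,6a,313",
    "+PAGE,25",
    "Dowler,Tom,Aston,6a,314",
]
print(len(split_accessions("SEQUENCED", sequenced)))
```

The invented SEQUENCED file above gives two accessions, one per page; the same three data lines in a RANDOM file would give three.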

Chunks

A chunk is a set of contiguous entries displayed in search results. A fuller explanation of chunks is given here.

Aligning Accessions

When determining if accessions should be aligned the system will only consider entries that do not contain UCF characters. This restriction is necessary in order to avoid problems where very bad scans have been transcribed, which can result in files having multiple identical entries, for example *,*,*,*,*, which we would not, of course, want to consider identical.

Syntax convention

Where the syntax of a line is specified above, the following conventions are used:

[ ... ]: The elements between the brackets are optional.
[ ... ]*: The elements between the brackets can be repeated any number of times.
<whitespace>: Any combination of space and tab (but at least one of either).
<datetime>: A timestamp; a number of formats are accepted but the most common is DD/MM/YY hh:mm.

Search engine, layout and database Copyright © 1998-2022 Free UK Genealogy CIO, a charity registered in England and Wales, Number 1167484.
We make no warranty whatsoever as to the accuracy or completeness of the FreeBMD data.
Use of the FreeBMD website is conditional upon acceptance of the Terms and Conditions