Submission Formats
This page describes the main format for submitting transcriptions to FreeBMD.
If you transcribe with one of the add-on packages, for example WinBMD or
SpeedBMD, you do not need to understand all the detail on this page, although
it will be a useful reference source if you have problems. For help with using
one of the add-on packages please see
these help instructions or use the help with the package.
If you already have entries which you need to submit to FreeBMD, you may find the
alternative flat format more convenient.
Help with transcribing
This page describes the format of data to be submitted to FreeBMD. You can produce
these formats using your favourite wordprocessor, spreadsheet or database. There
will almost certainly be a way to "save as" or "export" to comma delimited ASCII
text. Alternatively, use one of the previously mentioned add-on
packages written by FreeBMD volunteers.
If you are new to transcribing we suggest that you prepare a small number of entries
and try submitting them. This way, if you haven't quite achieved the correct layout,
there won't be too much to alter. You will find the submission mechanism will help to
some extent in identifying any problems there might be. If you can't resolve the
errors, check the Transcribers' Knowledge Base or email the
mailing list; see here for how to subscribe to the Admins
mailing list.
Definition of the Standard Format
The standard format consists of
- a header
- which gives information that applies to the whole file
- one or more source information lines
- each of which identifies the source and date of the
following data lines
- data lines
- which give the actual entries together with information about the
context of the entries (e.g. where pages start in the index)
Example
This is an example of a submission in the standard format (note that the ellipses (...)
indicate other entries omitted from the example).
+INFO,camilla@algroup.co.uk,,SEQUENCED,BIRTHS,cp850
+CREDIT,Ben Laurie,ben@algroup.co.uk,CREDIT
+F,1837,Sep,COL-GRE,2
+PAGE,123
Forden,Henry,Shaftsbury,8,69
Forden,male,Devizes,8,243
...
Giddins,Catherine,Manchester,20,290
Giddins,Emma,Hatfield&Welwyn,6,358
Giddins,George,Oxford,16,71
Giddins,George John,Hertford,6,381
Gidd_ns,Thomas,Oundle,15,203
+PAGE,124
+S,1837,Dec,ANC-01
+PAGE,1045
Powers,Edith Frances,Colne,8,213
Powers,George William,Devizes,8,243
....
Rushton,Martha Maria,Devizes,8,249
Rushton,Mary Ann,Huntingdon,14,145
+PAGE,1046
Rushton,Naomi,St. Neot's,14,168
Rushthorpe,John,Marylebone,1,123
+BREAK
Note that there is a variation of the standard format called (for reasons
lost in antiquity) the Flat Format. This format is particularly useful
for files that consist of unrelated individual entries and its definition is
here.
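As a rough illustration of the three parts described above, the following Python sketch sorts the lines of a standard-format file into categories. The function name and the category labels ("header", "source", "context", "data", "comment") are assumptions made for this example and are not part of the FreeBMD format; the sketch also assumes keywords are written in upper case, as in the example above.

```python
def classify_line(line: str) -> str:
    """Return a rough category for one line of a standard-format file.

    The category names are illustrative labels only.
    """
    text = line.strip()
    if not text:
        return "blank"
    if text.startswith("#"):
        return "comment"
    if text.startswith("+"):
        # The keyword is everything before the first comma.
        keyword = text.split(",", 1)[0]
        if keyword in ("+INFO", "+CREDIT"):
            return "header"
        if keyword in ("+F", "+M", "+B", "+S", "+U"):
            return "source"
        if keyword in ("+PAGE", "+BREAK"):
            return "context"
        return "unknown"
    return "data"


if __name__ == "__main__":
    for sample in ("+F,1837,Sep,COL-GRE,2", "+PAGE,123",
                   "Forden,Henry,Shaftsbury,8,69", "+BREAK"):
        print(sample, "->", classify_line(sample))
```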
Conventions used in describing the Standard Format
In the following sections, looking at the different parts of the Standard Format,
the conventions used are these:
- Text in italics should be typed exactly as shown.
- Fields written in plain or underlined text are where
your information goes.
- The underlined sections must be entered, or your submission
will be rejected.
- The plain sections are optional.
The Information Line
+INFO,Email,Password,Sequenced,RecordType,CharacterSet
- Email
- is the email address of the transcriber and it is
optional (it can just be omitted).
- Password
- should not be used and should be omitted.
(This field is a hangover from early in the project's history.)
- Sequenced
- is one of SEQUENCED,
RANDOM or ONENAME.
This information is used to assist with the correlation of transcriptions. This determines what entries have been double keyed and thereby also assists in identifying suspect entries.
- SEQUENCED
- Should be used to transcribe complete pages from the index. If only part of a page has been transcribed, +BREAK should be used to indicate where the index contains more entries than the transcription.
- RANDOM
- Should be used to transcribe entries that are not related to the location in the index page, for example where only isolated entries from a particular surname are being transcribed.
- ONENAME
- Should be used to transcribe sections of pages that relate to a single name. Normally the source specifier +B should be used, this being where the transcription is from the paper indexes. Several years/quarters can be mixed in a file by using multiple +B lines. (Note that events (Births, Deaths, Marriages) cannot be mixed in a file.) +BREAK should be used if a section of surnames is omitted (e.g. between transcriptions of Brown, John and Brown, Martin). The system puts an implicit +BREAK when the surname changes.
Please note:
- Only transcriptions from index pages are allowed to be uploaded to FreeBMD
- The Flat File format can be used with any type of file although it is most commonly used with RANDOM
- See below for use of +PAGE (required particularly for SEQUENCED)
- For ONENAME and SEQUENCED, the entries in a sequence (between +BREAK or +PAGE) cannot all contain UCF characters.
- RecordType
- is one of BIRTHS,
MARRIAGES, or DEATHS.
- CharacterSet
- is a supported character set. FreeBMD's standard
character set is ISO 8859-1, but most others are supported. In particular, if you
are transcribing on a DOS or Windows machine (unless it is Windows NT -
except in a DOS window under NT, of course!), it is pretty likely that you are
using the character set known as "code page 850" - in which case, use
cp850 in this field. Macintosh users should probably use
macintosh, but this may vary according to software used. This
information is used to correctly recognise accented characters. This is a complex
area, so if you need to use some other character set, please contact us
for advice.
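The +INFO fields can be checked mechanically before upload. The sketch below is one possible way to do that in Python; the names `InfoLine` and `parse_info_line` are hypothetical, and the sketch assumes that all five fields are present (empty where omitted) and that the keywords are written in upper case as listed above.

```python
from typing import NamedTuple

SEQUENCED_VALUES = {"SEQUENCED", "RANDOM", "ONENAME"}
RECORD_TYPES = {"BIRTHS", "MARRIAGES", "DEATHS"}


class InfoLine(NamedTuple):
    email: str
    password: str
    sequenced: str
    record_type: str
    character_set: str


def parse_info_line(line: str) -> InfoLine:
    """Split a +INFO line into its five fields and check the keyword fields."""
    fields = line.strip().split(",")
    if len(fields) != 6 or fields[0] != "+INFO":
        raise ValueError("expected +INFO followed by five fields")
    info = InfoLine(*fields[1:])
    if info.sequenced not in SEQUENCED_VALUES:
        raise ValueError(f"unknown Sequenced value: {info.sequenced!r}")
    if info.record_type not in RECORD_TYPES:
        raise ValueError(f"unknown RecordType value: {info.record_type!r}")
    return info


if __name__ == "__main__":
    print(parse_info_line("+INFO,camilla@algroup.co.uk,,SEQUENCED,BIRTHS,cp850"))
```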
The Credit Line
When other people have been involved in producing the data, or the file,
then this line can be used to credit them. The information put here will be
available to users of the search system.
+CREDIT,Name,EMail,Comment
If there is a Credit line present then the entries will be credited to the
transcriber identified by the credit line, otherwise they will be credited
to the submitter.
- Name
- The name of the person who actually transcribed the entries
- Email
- The email address of the person who actually transcribed
the entries
- Comment
- One of the following values:
- CreditAnon
- Don't report Name or EMail
- Credit or CreditReport
- Report only Name
- CreditInvite
- Report Name and EMail and invite research enquiries
- Other
- Credit line is ignored
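The effect of the Comment value can be summarised in a few lines of Python. This is only a sketch of the rules listed above: the function name `reported_credit` is hypothetical, and the case-insensitive comparison is an assumption based on the example file, which writes the value as CREDIT.

```python
from typing import Optional


def reported_credit(name: str, email: str, comment: str) -> Optional[dict]:
    """Return what would be reported for a +CREDIT line.

    Returns None when the credit line would be ignored.
    Comparison is case-insensitive (an assumption, see the lead-in).
    """
    key = comment.strip().lower()
    if key == "creditanon":
        return {"name": None, "email": None}
    if key in ("credit", "creditreport"):
        return {"name": name, "email": None}
    if key == "creditinvite":
        return {"name": name, "email": email}
    return None  # any other value: the credit line is ignored


if __name__ == "__main__":
    print(reported_credit("Ben Laurie", "ben@algroup.co.uk", "CREDIT"))
```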
Source Information
Entries can be gathered, for example, from microfiche, microfilm or the original
index books and the source information defines the type of the source and some
additional information. For each type the following information is mandatory:
- Year
- The year of the transcribed information.
- Quarter
- The quarter in which the information appears in the index,
one of March, June, September
and December.
Some types have optional fields as follows
- Source
- where the fiche/film/book was accessed - this is to allow
the possibility in the future of identifying different versions of the sources, which
may be useful for error correction.
- TranscriptionDate
- the preferred format for the date is "day monthname year" (e.g. 25 March 1960).
Fiche Info
+F,Year,Quarter,FicheRange,FicheNumber,Source,TranscriptionDate
- FicheRange
- the start and end letters of the fiche separated by a hyphen, e.g. LAN-MON.
- FicheNumber
- the number as it appears on the fiche.
Microfilm Info
+M,Year,Quarter,FilmRange,FilmNumber,Source,TranscriptionDate
Book Info
+B,Year,Quarter,Source,TranscriptionDate
Scan Info
+S,Year,Quarter,FreeBMDReference,TranscriptionDate
This source type is only for use with scans provided by the
FreeBMD project.
- FreeBMDReference
- is allocated by us to keep track of
the various scans and to ensure, when there is more than one set of scans,
we can tell which one was used for the transcription.
The FreeBMDReference
for a scan occurs between the month and either the range or scan file name, so for
1840/Deaths/June/UKD-01/A-C/1840D2-A-C-0010.tif
the FreeBMDReference is
UKD-01, and for
1893/Births/September/LDS-211-000-0951147/1893b3-001.tif
the FreeBMDReference is
LDS-211-000-0951147.
Typical values are:
UKD-01
GRO-B2108
LDS-211-000-0951131
It is permitted, although not required, to include the scan filename,
separated by space or / or \. Thus
LDS-211-000-0951147/1893b3-001.tif
is a valid FreeBMDReference
although it should be noted that the filename must be the actual name of
the scan file even if this differs
from the +PAGE value in the transcription.
When a file is uploaded, if FreeBMDReference does not conform
to these rules it will be ignored.
See the scan filename format for more
information.
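The rule for locating the FreeBMDReference in a scan path can be sketched in Python as follows. The function name is hypothetical, and the sketch assumes the path always starts with year, event and month, as in the two examples above.

```python
def freebmd_reference(scan_path: str) -> str:
    """Pick the FreeBMDReference out of a scan path.

    Assumes the layout Year/Event/Month/Reference/[Range/]File.
    Backslashes are accepted as separators as well as forward slashes.
    """
    parts = [p for p in scan_path.replace("\\", "/").split("/") if p]
    if len(parts) < 5:
        raise ValueError("path is too short to contain a reference")
    # The reference is the component immediately after the month.
    return parts[3]


if __name__ == "__main__":
    print(freebmd_reference("1840/Deaths/June/UKD-01/A-C/1840D2-A-C-0010.tif"))
    print(freebmd_reference("1893/Births/September/LDS-211-000-0951147/1893b3-001.tif"))
```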
Unknown Source Info
+U,Year,Quarter
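The five source line layouts can be held in a small lookup table. The Python sketch below pairs each field with its name; `SOURCE_FIELDS` and `parse_source_line` are names invented for this example. It treats Year and Quarter as mandatory and pads any omitted optional fields with empty strings, and it does not check the Quarter value because the example file uses abbreviations such as Sep and Dec.

```python
SOURCE_FIELDS = {
    "+F": ["Year", "Quarter", "FicheRange", "FicheNumber", "Source", "TranscriptionDate"],
    "+M": ["Year", "Quarter", "FilmRange", "FilmNumber", "Source", "TranscriptionDate"],
    "+B": ["Year", "Quarter", "Source", "TranscriptionDate"],
    "+S": ["Year", "Quarter", "FreeBMDReference", "TranscriptionDate"],
    "+U": ["Year", "Quarter"],
}


def parse_source_line(line: str) -> dict:
    """Return the fields of a source information line as a dictionary."""
    fields = line.strip().split(",")
    names = SOURCE_FIELDS.get(fields[0])
    if names is None:
        raise ValueError(f"not a source information line: {line!r}")
    values = fields[1:]
    if len(values) < 2:
        raise ValueError("Year and Quarter are mandatory")
    if len(values) > len(names):
        raise ValueError("too many fields for this source type")
    # Pad omitted trailing fields so every name has a value.
    values += [""] * (len(names) - len(values))
    record = dict(zip(names, values))
    record["Type"] = fields[0]
    return record


if __name__ == "__main__":
    print(parse_source_line("+F,1837,Sep,COL-GRE,2"))
    print(parse_source_line("+S,1837,Dec,ANC-01"))
```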
Data line
Each data line is transcribed using the rule "Type What You See", that is the
line should be an accurate representation of what is in the index. If you think
what is in the index is wrong you can add a #THEORY line
but the entry itself should still be what you see.
In applying "Type What You See" you do not transcribe:
- Commas between fields;
- The rows of identical dots that separate fields in the later printed index
(see below); or
- Full stops after Age, Volume or Page Number.
These are all merely data separators, and carry no data value.
Note also:
- Victorian handwriting used what looks to our 21st century eyes like "fs"
to represent "ss" - transcribe as "ss";
- Raised letters, with or without dots beneath, are typographical conventions -
just transcribe the letter;
- The case of a letter does not affect the meaning - transcribing the case (upper
or lower case) as seen is preferable but not critical.
- Alternatives or aliases (e.g. BONUS alias CHAPMAN) are normally transcribed as
two records; see here for more details.
Accented characters can be used in some fields (e.g. name fields). Here
is the standard character set, but
almost any known set can be used (see +INFO above).
Commas within fields are permitted so long as that's how they appear in
the source. Put the contents of the whole field in quotes, e.g. "St. Geo., Hanover Square".
Where there is a full stop at the end of a name that is not part of the row of
full stops that separates fields (in the later printed index) it should be transcribed.
Examples:
Smith John.......Aston,6d 999
Smith John J.....Aston,6d 999
In these there is NO full stop after the name and none should be transcribed.
Smith John. .....Aston,6d 999
Smith John J. ...Aston,6d 999
In these there is a full stop after the name and it should be transcribed.
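The quoting rule for commas inside a field is the same one used by ordinary comma delimited files, so a standard CSV library handles it. As a sketch, Python's `csv` module can both split and build such lines; the helper names are hypothetical and the entry used is made up for illustration, with only the district name taken from the text above.

```python
import csv
import io


def split_data_line(line: str) -> list:
    """Split one data line into fields, honouring quoted commas."""
    return next(csv.reader([line]))


def join_data_line(fields: list) -> str:
    """Build a data line, quoting only the fields that need it."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerow(fields)
    return buffer.getvalue().rstrip("\n")


if __name__ == "__main__":
    line = 'Smith,John,"St. Geo., Hanover Square",1a,123'
    fields = split_data_line(line)
    print(fields)
    print(join_data_line(fields))
```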
Start/End Of Page
This is used in a SEQUENCED dataset at the start and end of each
page and is used to assist in collation of entries, linking the transcription back
to its originating scan or fiche and to estimate completeness. At the start of a
transcription, at the end of the transcription and at each new page enter
+PAGE,PageNumber
where PageNumber is
| Source | PageNumber |
| Scan | the last part of the name of the scan file, which is numeric possibly followed by a letter A or B, e.g. for a scan file named 1840M2-L-0243a.gif the page would be 243a (see the notes below) |
| Fiche or film with sequential page numbers | sequential page number |
| Other | start with page number 1, increasing by 1 |
For scans, please note:
- ignore text such as "rescan" in the scan file name
- only put in a letter suffix ("a" in the above example) if it occurs in the scan file name
- if a scan is split across several images, normally suffixed w,x,y,z, transcribe the entries from all the images as one page with a page number without the suffix
- if the scan is of a double page each page should start with +PAGE, the second page being one greater than the first
It is important to include +PAGE,n at the beginning/end of each page. The
page number should be the number of the page that follows. If the +PAGE is
at the end of the dataset (or the complete volume), it should be one greater than
the last page number transcribed.
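For scans, the page number can usually be derived from the file name. The Python sketch below is one reading of the rules above, not a definitive implementation: the function name is hypothetical, it assumes the word "rescan" can simply be deleted wherever it occurs, and it assumes any trailing letter other than a or b (such as w, x, y, z) should be dropped.

```python
import re


def page_from_scan_name(filename: str) -> str:
    """Derive the +PAGE value from a scan file name.

    Takes the last run of digits in the name, drops leading zeros and
    keeps a trailing letter only when it is a or b.
    """
    stem = filename.rsplit(".", 1)[0]
    # Ignore text such as "rescan" in the scan file name.
    stem = re.sub(r"rescan", "", stem, flags=re.IGNORECASE)
    match = re.search(r"(\d+)([A-Za-z]?)[-_ ]*$", stem)
    if match is None:
        raise ValueError(f"no page number found in {filename!r}")
    digits, suffix = match.groups()
    if suffix.lower() not in ("a", "b"):
        suffix = ""  # e.g. w, x, y, z mark parts of one split page
    return str(int(digits)) + suffix


if __name__ == "__main__":
    print(page_from_scan_name("1840M2-L-0243a.gif"))  # 243a
```

The `+PAGE` line for that scan would then be `+PAGE,243a`.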
Comments
You may put in comments if you wish; there are three different types of comment
each of which has a different use.
- Line starts with #COMMENT
- Used to indicate that what has been transcribed differs in some way from what
is in the index, e.g. #COMMENT handwritten addendum says "see Mar 1887". This type of
comment will be accessible from the search results.
- Line starts with #THEORY
- Used to indicate that what has been transcribed is what is in the index
but there is reason to believe the index is wrong, e.g. #THEORY surname should probably be
Lane not Laine. This type of comment will be accessible from the search results.
- Line starts with # (not followed by COMMENT or
THEORY)
- Used to give information about the transcription, e.g. # scan got very faint at
this point. WinBMD and SpeedBMD use this type of comment to include information about
the transcription immediately after the header. This
type of comment will not be accessible from the search results.
Using #COMMENT
Note that the # must be the first character on the line and the
comment applies to the immediately preceding entry. For a comment that
applies to the immediately preceding entry, plus entries following, to a total of
N entries (including the one preceding) use the following form.
#COMMENT(N)
Using #THEORY
#THEORY (typed exactly as shown) is a special type of comment that you can use to
identify a record which you think might be wrong, that is, the entry is perfectly
readable but you think the original source is wrong. Using #THEORY makes it easy
for the record to be identified, and the comment is displayed with the information
about the entry.
For example, you might be transcribing a page of the name JONES, when you come across the following:
JONE, Albert, District, 1a, 123
after which the name JONES continues.
Following the rule of 'type what you see', you should type JONE, but then if you wish, you can insert in the row immediately following:
#THEORY Surname should be JONES.
If there is more than one record affected, say N, use the following form:
#THEORY(N)
So, continuing our previous example:
JONE, Albert, District, 1a, 123
#THEORY(3) Surname should be JONES.
JONE, Charles, District, 3a, 324
JONE, David, District, 11a, 642
JONES, Edward, District, 9a, 912
Note that the #THEORY(N) is put after the first record but N is the total number of records including the first.
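The three comment types and the optional (N) count can be recognised with a short regular expression. The following Python sketch returns the comment type, the number of entries it applies to, and its text. The function name and the label "NOTE" for plain # comments are inventions for this example, as is the choice of 0 as the count for plain comments.

```python
import re

# #COMMENT or #THEORY, an optional (N), then the comment text.
COMMENT_RE = re.compile(r"^#(COMMENT|THEORY)(?![A-Za-z])(?:\((\d+)\))?\s*(.*)$")


def parse_comment(line: str) -> tuple:
    """Return (kind, entry_count, text) for a comment line."""
    line = line.rstrip("\n")
    if not line.startswith("#"):
        raise ValueError("the # must be the first character on the line")
    match = COMMENT_RE.match(line)
    if match is None:
        # Plain comment about the transcription itself.
        return ("NOTE", 0, line[1:].strip())
    kind, count, text = match.groups()
    # Without (N) the comment applies to the one preceding entry.
    return (kind, int(count) if count else 1, text)


if __name__ == "__main__":
    print(parse_comment("#THEORY(3) Surname should be JONES."))
    print(parse_comment('#COMMENT handwritten addendum says "see Mar 1887"'))
    print(parse_comment("# scan got very faint at this point"))
```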
Data Breaks
In a SEQUENCED or ONENAME dataset, there
will be breaks in the data - that is, where there are entries in the index that
are not in the dataset you are submitting (where you broke off transcribing,
where part of your source was missing, or whatever). These are indicated with:
+BREAK
In a RANDOM dataset, there is implicitly a
+BREAK between each entry.
What should I call my Upload File
When you "Manage your files"
and click on "Upload new file" there are 2 boxes at the top:
- File name
- this is the name used to store your file in the FreeBMD
system and it must consist of alphanumeric or underscore characters (but must
not contain the characters "." or "-"). Typically the name you put in here
would be the same as the name of the file (see below) without the extension,
e.g. if the file is 1876B3A0041.BMD you would put 1876B3A0041 in here.
- Upload
- this is where you specify the file you want uploaded from
your computer. The name is not stored by the FreeBMD system.
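The naming rule for the File name box is easy to check before uploading. This Python sketch derives a name from the local file by dropping the extension and then verifies that only letters, digits and underscores remain. The function name is hypothetical, and reading "alphanumeric" as the ASCII letters and digits is an assumption.

```python
import re
from pathlib import PurePath


def stored_file_name(local_file: str) -> str:
    """Suggest the File name to enter for a local file, or raise ValueError."""
    stem = PurePath(local_file).stem  # file name without its last extension
    if not re.fullmatch(r"[A-Za-z0-9_]+", stem):
        raise ValueError(f"{stem!r} contains characters that are not allowed")
    return stem


if __name__ == "__main__":
    print(stored_file_name("1876B3A0041.BMD"))  # 1876B3A0041
```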
Changes in Entry Information
Over the years extra information was recorded in the indexes.
Births to June quarter 1911 and Marriages to December quarter 1911
Surname,GivenNames,District,Volume,Page
Births from September quarter 1911
Surname,GivenNames,MothersName,District,Volume,Page
Marriages from March quarter 1912
Surname,GivenNames,SpousesName,District,Volume,Page
Deaths to March Quarter 1969
Surname,GivenNames,AgeAtDeath,District,Volume,Page
Deaths to December quarter 1865 did not have an AgeAtDeath field - for
those, leave it blank.
Deaths from June Quarter 1969
Surname,GivenNames,DateOfBirth,District,Volume,Page
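The layouts above depend only on the record type and the quarter, so they can be expressed as a small lookup. The Python sketch below returns the expected field names for a given event, year and quarter. The function and table names are hypothetical, and the quarter is matched on its first three letters so that both "September" and "Sep" are accepted, which is an assumption.

```python
QUARTER_NUMBER = {"mar": 1, "jun": 2, "sep": 3, "dec": 4}


def entry_fields(record_type: str, year: int, quarter: str) -> list:
    """Return the data line field names for the given event and quarter."""
    when = (year, QUARTER_NUMBER[quarter[:3].lower()])
    if record_type == "BIRTHS":
        extra = ["MothersName"] if when >= (1911, 3) else []
    elif record_type == "MARRIAGES":
        extra = ["SpousesName"] if when >= (1912, 1) else []
    elif record_type == "DEATHS":
        # AgeAtDeath is left blank for quarters where the index has none.
        extra = ["DateOfBirth"] if when >= (1969, 2) else ["AgeAtDeath"]
    else:
        raise ValueError(f"unknown record type: {record_type!r}")
    return ["Surname", "GivenNames"] + extra + ["District", "Volume", "Page"]


if __name__ == "__main__":
    print(entry_fields("BIRTHS", 1911, "June"))
    print(entry_fields("BIRTHS", 1911, "September"))
```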
Uncertain character format
| Format | Meaning |
| _ (Underscore) | A single uncertain character. It could be anything but is definitely one character. It can be repeated for each uncertain character. |
| * (Asterisk) | Several adjacent uncertain characters. A single * is used when there are 1 or more adjacent uncertain characters. It is not used immediately before or after a _ or another *. Note: If it is clear there is a space, then * * is used to represent 2 words, neither of which can be read. |
| [abc] | A single character that could be any one of the contained characters and only those characters. There must be at least two characters between the brackets. For example, [79] would mean either a 7 or a 9, whereas [C_] would mean a C or some other character. |
| {min,max} | Repeat count - the preceding character occurs somewhere between min and max times. max may be omitted, meaning there is no upper limit. So _{1,} would be equivalent to *, and _{0,1} means that it is unclear if there is any character. Ensure the complete field is enclosed in quotes to avoid the comma being taken as a field separator, e.g. "williams{0,1}". |
| ? (Question mark) | Only used where it is unambiguous that the source data is actually missing from a column, e.g. a missing Volume. Note: If it is unclear whether the column is empty or not _{0,1} is used. |
Note: Using a single * is preferable to spending a long time trying to
decide the min and max values to use in the
_{min,max} format, which is more precise.
Technical note: Although this UCF format has many similarities to regular
expressions (e.g. Perl, Unix) it is not identical and in particular there is
no escape mechanism.
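To make the relationship with regular expressions concrete, the following Python sketch translates a UCF value into a pattern for Python's `re` module. It is an illustration under stated assumptions, not how FreeBMD itself processes UCF: the function name is hypothetical, [C_] is treated as "any character", the ? marker for a missing column is not handled specially (it is matched literally), and every other character is matched literally.

```python
import re


def ucf_to_regex(ucf: str) -> str:
    """Translate a UCF value into an equivalent Python regular expression."""
    out = []
    i = 0
    while i < len(ucf):
        ch = ucf[i]
        if ch == "_":
            out.append(".")          # exactly one uncertain character
        elif ch == "*":
            out.append(".+")         # one or more uncertain characters
        elif ch == "[":
            end = ucf.index("]", i)
            choices = ucf[i + 1:end]
            if "_" in choices:
                out.append(".")      # "C or some other character"
            else:
                out.append("[" + "".join(re.escape(c) for c in choices) + "]")
            i = end
        elif ch == "{":
            end = ucf.index("}", i)
            out.append(ucf[i:end + 1])  # {min,max} has the same meaning in re
            i = end
        else:
            out.append(re.escape(ch))   # everything else is literal
        i += 1
    return "".join(out)


if __name__ == "__main__":
    pattern = ucf_to_regex("Gidd_ns")
    print(pattern, bool(re.fullmatch(pattern, "Giddins")))
```

A pattern built this way should be used with `re.fullmatch`, so that the whole field has to match rather than just part of it.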
© 1998-2008 The Trustees of FreeBMD (Ben Laurie, Graham Hart, Camilla von Massenbach and David Mayall), a charity registered in England and Wales, Number 1096940.
We make no warranty whatsoever as to the accuracy and completeness of the FreeBMD data. Use of the FreeBMD website is conditional upon acceptance of the Terms and Conditions