Record linkage
Basics
Record linkage refers to the task of finding entries that refer to the same entity in two or more files. The initial idea goes back to Halbert L. Dunn ("Record Linkage" in: American Journal of Public Health, Vol. 36 (1946), 1412–1416). In the 1950s, Howard Borden Newcombe laid the probabilistic foundations of modern record linkage theory.
In 1969, Fellegi and Sunter formalized these ideas. Their pioneering work "A Theory For Record Linkage" remains the mathematical foundation of many record linkage applications.
Mathematical Model
In an application with two files, A and B, denote the rows (records) by <math>\alpha (a)</math> in file A and <math>\beta (b)</math> in file B. Assign <math>K</math> characteristics to each record. The set of record pairs that represent identical entities is defined by
<math> M = \{ (a,b); a = b, a \in A, b \in B \} </math>
and the complement of <math>M</math>, the set <math>U</math> of pairs that represent different entities, is defined as
<math> U = \{ (a,b); a \neq b, a \in A, b \in B \} </math>.
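The two sets can be illustrated with a minimal Python sketch. It assumes that each file is a short list of (entity identifier, record) pairs, where the identifier stands for the true underlying entity a or b. In a real application this identifier is unknown, which is why linkage is needed at all; the names file_a and file_b and the sample records are hypothetical.

```python
from itertools import product

# Hypothetical toy files: each entry is (true entity id, record).
# The entity id is assumed to be known here purely for illustration.
file_a = [(1, {"sex": "f", "age": 34}), (2, {"sex": "m", "age": 51})]
file_b = [(2, {"sex": "m", "age": 51}), (3, {"sex": "f", "age": 34})]

# All pairs of the product A x B, identified by their entity ids.
pairs = [(a, b) for (a, _), (b, _) in product(file_a, file_b)]

# M: pairs that represent the same entity; U: all remaining pairs.
M = {(a, b) for a, b in pairs if a == b}
U = {(a, b) for a, b in pairs if a != b}

print(sorted(M))
print(sorted(U))
```

In this toy data the pair (1, 3) lands in U even though both records carry the same values, which shows that agreement on the recorded characteristics does not by itself place a pair in M.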
A vector <math>\gamma</math> is defined that contains the coded agreements and disagreements on each characteristic:
<math> \gamma \left[ \alpha ( a ), \beta ( b ) \right] = \{ \gamma^{1} \left[ \alpha ( a ) , \beta ( b ) \right] , \ldots , \gamma^{K} \left[ \alpha ( a ), \beta ( b ) \right] \} </math>
where the superscripts <math>1, \ldots, K</math> index the characteristics (sex, age, marital status, etc.) in the files. The conditional probabilities of observing a specific vector <math>\gamma</math> given <math>(a, b) \in M</math> or <math>(a, b) \in U</math> are defined as
<math>
m(\gamma) = P \left\{ \gamma \left[ \alpha (a), \beta (b) \right] | (a,b) \in M \right\} =
\sum_{(a, b) \in M} P \left\{\gamma\left[ \alpha(a), \beta(b) \right] \right\} \cdot
P \left[ (a, b) | M\right]
</math>
and
<math>
u(\gamma) = P \left\{ \gamma \left[ \alpha (a), \beta (b) \right] | (a,b) \in U \right\} =
\sum_{(a, b) \in U} P \left\{\gamma\left[ \alpha(a), \beta(b) \right] \right\} \cdot
P \left[ (a, b) | U\right]
</math>, respectively.
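The following Python sketch shows how the comparison vector and the two conditional probabilities might be computed on toy data. It rests on several assumptions that are not part of the model above: the true match status of every pair is known, agreement is coded as 1 and disagreement as 0, the comparison is deterministic, and every pair within M (or within U) is equally likely. Under these assumptions the sums above reduce to relative frequencies. The field names, sample records and helper functions are hypothetical.

```python
from collections import Counter
from itertools import product

# Hypothetical toy files: (true entity id, record); ids assumed known.
FIELDS = ("sex", "age", "marital_status")
file_a = [
    (1, {"sex": "f", "age": 34, "marital_status": "single"}),
    (2, {"sex": "m", "age": 51, "marital_status": "married"}),
]
file_b = [
    (1, {"sex": "f", "age": 34, "marital_status": "married"}),
    (2, {"sex": "m", "age": 51, "marital_status": "married"}),
    (3, {"sex": "f", "age": 34, "marital_status": "single"}),
]

def gamma(rec_a, rec_b):
    """Comparison vector: 1 for agreement, 0 for disagreement, per field."""
    return tuple(int(rec_a[f] == rec_b[f]) for f in FIELDS)

def relative_frequencies(vectors):
    """Share of each observed comparison vector among the given pairs."""
    counts = Counter(vectors)
    total = len(vectors)
    return {g: c / total for g, c in counts.items()}

# Split the comparison vectors by (assumed known) match status.
matched, unmatched = [], []
for (a, rec_a), (b, rec_b) in product(file_a, file_b):
    (matched if a == b else unmatched).append(gamma(rec_a, rec_b))

m = relative_frequencies(matched)    # estimate of m(gamma)
u = relative_frequencies(unmatched)  # estimate of u(gamma)

for g in sorted(set(m) | set(u)):
    print(g, m.get(g, 0.0), u.get(g, 0.0))
```

In practice the match status is not observed, so these probabilities have to be estimated by other means; the sketch only makes the definitions concrete.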