FEDERAL: A Framework for Distance-Aware Privacy-Preserving Record Linkage
The aim of this project is to encode integer and string values securely before they are sent to the health ministry (TPA).
In the first step, termed blocking, candidate record pairs are formed so that the number of non-matching pairs is kept as small as possible. In the second step, termed matching, the distances between the pairs formed during the blocking step are calculated. Approximate matching lies at the core of record linkage, since values contained in records that are owned by different data custodians, but refer to the same real-world entity, usually exhibit variations, errors, misspellings, and typos.
Consider queries of the form ‘Report the number of patients who suffer from a certain disease, and whose age or specific biological measurements lie in a certain numerical range’. The government agency should be able to compute the aggregated result by relying solely on anonymized numerical values of the ages and biological measurements. We stress that the problem we tackle cannot be solved by treating numerical values as strings, because the two metric spaces use entirely different distance metrics. For example, although 99 and 100 are close as numerical values, they are distant when cast as strings and compared by edit distance.
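The 99-versus-100 example can be checked directly. The following is a minimal sketch, not part of the framework itself: `edit_distance` is a hypothetical helper implementing the textbook Levenshtein distance, used here only to contrast the string metric with the plain numerical difference.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))  # distance from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute ca -> cb
        prev = curr
    return prev[-1]


numeric_gap = abs(99 - 100)              # distance in the numerical space
string_gap = edit_distance("99", "100")  # distance in the string space
print(numeric_gap, string_gap)
```

The two values differ by 1 numerically, yet the strings share no characters, so turning "99" into "100" takes two substitutions and one insertion.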
We introduce a novel framework for privacy-preserving record linkage, which incorporates distance-aware encoding methods for anonymizing sensitive values in data records. Key components of our framework are two methods, called LCBF (Low-Cost Bloom Filters) and BV (Bit Vectors), which are used to anonymize string values and numerical values (including timestamps), respectively. The anonymization space, into which we embed the original data values, is the binary Hamming space, which allows for effective approximate matching. By doing so, we immediately gain the advantages of (a) low computational cost for performing distance computations, (b) reduced storage requirements, (c) low communication overhead, and (d) direct application of Locality-Sensitive Hashing (LSH) blocking. LSH blocking, combined with Bloom filters, is the only private technique that provides theoretical guarantees of the completeness of the results (in the anonymization space).
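To illustrate why the binary Hamming space is convenient, the sketch below encodes strings with a generic bigram Bloom filter, compares them by Hamming distance, and derives a bit-sampling LSH blocking key. It is an assumption-laden illustration, not the LCBF or BV method: the filter length `M`, the number of hash functions `K`, the SHA-256 based hashing, and all function names are hypothetical choices made for this example.

```python
import hashlib
import random

M = 64  # filter length in bits (illustrative assumption)
K = 2   # hash functions per bigram (illustrative assumption)


def bigrams(s: str) -> set:
    """Return the set of overlapping character pairs of s."""
    return {s[i:i + 2] for i in range(len(s) - 1)}


def bloom_encode(s: str, m: int = M, k: int = K) -> int:
    """Generic bigram Bloom filter (not LCBF), stored as a Python int."""
    bits = 0
    for gram in bigrams(s.lower()):
        for seed in range(k):
            digest = hashlib.sha256(f"{seed}:{gram}".encode()).digest()
            pos = int.from_bytes(digest[:8], "big") % m
            bits |= 1 << pos  # set the bit chosen by this hash function
    return bits


def hamming(x: int, y: int) -> int:
    """Number of bit positions in which x and y differ."""
    return bin(x ^ y).count("1")


def lsh_key(bits: int, positions: list) -> tuple:
    """Bit-sampling LSH: the key is the filter's bits at fixed positions."""
    return tuple((bits >> p) & 1 for p in positions)


rng = random.Random(0)               # fixed seed so the sample is repeatable
positions = rng.sample(range(M), 8)  # 8 distinct sampled bit positions

a = bloom_encode("jonathan")
b = bloom_encode("jonathon")
c = bloom_encode("elizabeth")

print(hamming(a, b), hamming(a, c))
print(lsh_key(a, positions) == lsh_key(b, positions))
```

Distance computation reduces to an XOR and a bit count, which is the source of the low computational cost. Records whose keys agree fall into the same block; in a typical LSH setup, several independently sampled keys are used so that similar records collide in at least one of them with high probability.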