A Correlation-Preserving Fingerprinting Technique for Categorical Data in Relational Databases
Abstract
Fingerprinting is a method of embedding a traceable mark into digital data, to verify the owner and identify the recipient a certain copy of a data set has been released to. This is crucial when releasing data to third parties, especially if it involves a fee, or if the data is of sensitive nature, due to which further sharing and leaks should be discouraged and deterred from. Fingerprinting and watermarking are well explored in the domain of multimedia content, such as images, video, or audio.The domain of relational databases is explored specifically for numerical data types, for which most state-of-art techniques are designed. However, many datasets also, or even exclusively, contain categorical data.We, therefore, propose a novel approach for fingerprinting categorical type of data, focusing on preserving the semantic relations between attributes, and thus limiting the perceptibility of marks, and the effects of the fingerprinting on the data quality and utility. We evaluate the utility, especially for machine learning tasks, as well as the robustness of the fingerprinting scheme, by experiments on benchmark data sets.
Origin | Files produced by the author(s) |
---|