Trivially but crucially, if 5 percent of the data in the average manually entered database is incorrect, then 95 percent is correct. Hence, statistically speaking, it is possible to cast the data cleaning problem as an outlier detection task. Consider a fl at, nonrelational database describing N cultural heritage objects using M columns. Each of the N × M database cells can be tested for an outlier value. To determine whether a particular value is an outlier, we exploit the frequent interdependencies between different database columns. For example, the style of an artifact (for example, “black-fi gure pottery”) may say something about its likely origin (“Greek”). Therefore it is often possible to predict the value of a database cell on the basis of the values of the other cells in that database row. Outliers are cases in which the cell value deviates from the predicted value.

« Data cleaning using machine learning »

A quote saved on Feb. 26, 2013.


Top related keywords - double-click to view: