Source | quotes.michelepasin.org

In total we have 3 quotes from this source:

Fortunately, an increasing body of research concerns itself with the automatic cleanup of large databases. We describe a generic approach that utilizes robust findings from this body of research, and that is applicable to object databases typically found in cultural heritage fields. We focus on the role of automatic database cleanup technology as a strong filter, making the data-cleaning task feasible for the human expert (researcher, curator) by highlighting potential errors and inconsistencies, which the expert can then manually check and correct.

#database #experts

Data cleaning using machine learning

Trivially but crucially, if 5 percent of the data in the average manually entered database is incorrect, then 95 percent is correct. Hence, statistically speaking, it is possible to cast the data cleaning problem as an outlier detection task. Consider a flat, nonrelational database describing N cultural heritage objects using M columns. Each of the N × M database cells can be tested for an outlier value. To determine whether a particular value is an outlier, we exploit the frequent interdependencies between different database columns. For example, the style of an artifact (for example, “black-figure pottery”) may say something about its likely origin (“Greek”). Therefore it is often possible to predict the value of a database cell on the basis of the values of the other cells in that database row. Outliers are cases in which the cell value deviates from the predicted value.

#database
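
The outlier-detection framing in the quote above can be illustrated with a small sketch. It is not the authors' method: where the quote speaks of predicting a cell from the other cells in its row, this example swaps in a plain majority vote over co-occurring values in one other column. The function name find_outliers, the thresholds min_support and min_confidence, and the sample rows are all hypothetical and chosen only for illustration.

```python
from collections import Counter, defaultdict


def find_outliers(rows, min_support=3, min_confidence=0.8):
    """Flag cells that disagree with a confident prediction from another column.

    rows is a list of dicts sharing the same keys (a flat, nonrelational table).
    Returns a list of (row_index, column, actual_value, predicted_value).
    """
    columns = list(rows[0].keys())
    flagged = []
    for target in columns:
        for predictor in columns:
            if predictor == target:
                continue
            # Count which target values co-occur with each predictor value.
            counts = defaultdict(Counter)
            for row in rows:
                counts[row[predictor]][row[target]] += 1
            for i, row in enumerate(rows):
                seen = counts[row[predictor]]
                total = sum(seen.values())
                predicted, freq = seen.most_common(1)[0]
                # Only trust predictions backed by enough, consistent evidence.
                confident = total >= min_support and freq / total >= min_confidence
                if confident and row[target] != predicted:
                    flagged.append((i, target, row[target], predicted))
    return flagged


# Hypothetical sample data: one suspicious origin among black-figure pottery.
rows = (
    [{"style": "black-figure pottery", "origin": "Greek"}] * 5
    + [{"style": "black-figure pottery", "origin": "Roman"}]
    + [{"style": "terra sigillata", "origin": "Roman"}] * 3
)

print(find_outliers(rows))
# [(5, 'origin', 'Roman', 'Greek')]
```

The flagged cell is a candidate for the expert to check, not a confirmed error, which matches the filtering role described in the first quote. A real system would likely combine all other columns in a trained classifier rather than voting on one column at a time.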

Errors in cultural heritage databases

Jonathan Maletic and Andrian Marcus estimate that about 5 percent or more of the information present in manually created databases is erroneous [5]. Many possible causes for errors exist. First, some errors are due to interpretation discrepancies: different people who enter data in a single database may have different interpretations of what type of information to enter in particular cells. This tends to hold true for many cultural heritage databases, where the database structure is typically created by the curators or researchers themselves, rather than by professional data managers. Consequently, such databases are often subject to limited quality control: that is, there are no strict (or enforced) guidelines of what information should go in different database cells or how the information should be represented or formatted. Even when the intended database structure is adhered to, database records may be corrupted by typos and copy-and-paste errors, or through optical character recognition errors if the digitization process of the source text was automatic. Also, when a database has evolved over time, the naming conventions may have changed, as often happens in zoological taxonomies, rendering some information outdated.

#database #information