

Data Profiling

When venturing into any data quality program, data profiling is an essential cornerstone. It provides a wealth of information about the data that you have. A master data repository must offer functionality to automatically identify data quality issues in more than one way. A common requirement for any profiling plug-in is the ability to drill through to the very entries causing the anomalies, and the same data must support KPI reports over those anomalies.

Profiling covers basic statistics, frequencies, selectability, data patterns, ranges and outliers.

Patterning the content of the attributes makes it possible to detect a variety of conditions. Say a postal code is mostly entered as two characters indicating the country, followed by a space and then four numeric characters indicating the postal area. Consider the amount of information that can be derived from patterning these values.

e.g.

Attribute Entry    Pattern
DK 8000            XX[_]9999
DK 9000            XX[_]9999
DK8000             XX9999
8000DENMARK        9999XXXXXXX
9200               9999
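The mapping from entry to pattern can be sketched in a few lines of Python. This is only an illustration: the function name to_pattern is hypothetical, and the rule for characters other than letters, digits and spaces (kept literally) is an assumption, since the table above does not cover them.

```python
def to_pattern(value: str) -> str:
    """Map letters to X, digits to 9 and spaces to [_], as in the table above."""
    parts = []
    for ch in value:
        if ch.isdigit():
            parts.append("9")
        elif ch.isalpha():
            parts.append("X")
        elif ch == " ":
            parts.append("[_]")
        else:
            # Assumption: anything else (punctuation etc.) is kept as-is.
            parts.append(ch)
    return "".join(parts)


entries = ["DK 8000", "DK 9000", "DK8000", "8000DENMARK", "9200"]
patterns = [to_pattern(entry) for entry in entries]

for entry, pattern in zip(entries, patterns):
    print(f"{entry:<14}{pattern}")
```

Keeping the original entry next to its pattern is what makes it possible to drill through from an unusual pattern to the rows behind it.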

This patterning makes the data easy to query. Combined with knowledge of how many times each pattern occurs, the count of spaces and NULL values, and the maximum, minimum and average length, it forms the foundation of any data analysis.
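A minimal sketch of such a per-attribute profile follows. The names profile and to_pattern, the dictionary keys and the sample values are all hypothetical, and NULL is represented by Python's None. The space count is taken here as the total number of space characters, which is one possible reading of "the count of spaces".

```python
from collections import Counter


def to_pattern(value):
    # Hypothetical helper: letters -> X, digits -> 9, space -> [_].
    return "".join(
        "9" if c.isdigit() else "X" if c.isalpha() else "[_]" if c == " " else c
        for c in value
    )


def profile(values):
    """Collect pattern frequencies and basic length statistics for one attribute."""
    non_null = [v for v in values if v is not None]
    lengths = [len(v) for v in non_null]
    return {
        "pattern_counts": Counter(to_pattern(v) for v in non_null),
        "null_count": len(values) - len(non_null),
        "space_count": sum(v.count(" ") for v in non_null),
        "min_length": min(lengths) if lengths else 0,
        "max_length": max(lengths) if lengths else 0,
        "avg_length": sum(lengths) / len(lengths) if lengths else 0.0,
    }


# Sample postal codes from the table, plus one NULL entry (an assumption).
values = ["DK 8000", "DK 9000", "DK8000", "8000DENMARK", "9200", None]
result = profile(values)
print(result)
```

The least frequent patterns in pattern_counts are usually the first place to look for anomalies.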

Looking into an attribute's selectability gives a strong indication of whether the attribute is a candidate for a unique business key.
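One common way to express this, assumed here rather than taken from any specific tool, is the ratio of distinct values to non-NULL values: a ratio of 1.0 with no NULLs suggests a key candidate. The function names and sample data below are hypothetical.

```python
def selectability(values):
    """Ratio of distinct values to non-NULL values (1.0 means every value is unique)."""
    non_null = [v for v in values if v is not None]
    if not non_null:
        return 0.0
    return len(set(non_null)) / len(non_null)


def is_key_candidate(values):
    # Assumption: a business key must be fully unique and never NULL.
    return None not in values and selectability(values) == 1.0


customer_numbers = [1001, 1002, 1003, 1004]
cities = ["Aarhus", "Aarhus", "Aalborg", "Aarhus"]

print(selectability(customer_numbers), is_key_candidate(customer_numbers))
print(selectability(cities), is_key_candidate(cities))
```

A high ratio only makes the attribute a candidate; whether it really is a business key remains a judgment about the domain.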

Numeric range analysis provides knowledge about utilization.
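As a rough sketch, range analysis on an integer attribute could report the observed minimum and maximum together with how densely that span is used. The interpretation of "utilization" as distinct values divided by the size of the observed span is an assumption, as are the function name and the sample numbers.

```python
def range_profile(values):
    """Observed range of an integer attribute and the share of that range in use."""
    numbers = [v for v in values if v is not None]
    if not numbers:
        return None
    low, high = min(numbers), max(numbers)
    span = high - low + 1  # number of integers between min and max, inclusive
    return {
        "min": low,
        "max": high,
        "span": span,
        # Assumption: utilization = distinct values / size of the observed span.
        "utilization": len(set(numbers)) / span,
    }


customer_numbers = [1001, 1002, 1003, 1010]
summary = range_profile(customer_numbers)
print(summary)
```

A low utilization here hints at gaps in a number series, and a minimum or maximum far from the bulk of the values is a first sign of outliers.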

Data patterns show which entries do not comply with a given mask.
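Checking entries against a mask could look like the sketch below. It assumes the mask notation used earlier for postal codes (X for a letter, 9 for a digit, [_] for a single space) and translates the mask into a regular expression. The function names are hypothetical.

```python
import re


def mask_to_regex(mask):
    """Translate a mask such as XX[_]9999 into a compiled regular expression."""
    parts = []
    i = 0
    while i < len(mask):
        if mask.startswith("[_]", i):
            parts.append(" ")
            i += 3
        elif mask[i] == "X":
            parts.append("[A-Za-z]")
            i += 1
        elif mask[i] == "9":
            parts.append("[0-9]")
            i += 1
        else:
            # Assumption: any other character must match literally.
            parts.append(re.escape(mask[i]))
            i += 1
    return re.compile("".join(parts))


def non_compliant(values, mask):
    """Return the entries that do not match the mask in full."""
    rx = mask_to_regex(mask)
    return [v for v in values if rx.fullmatch(v) is None]


entries = ["DK 8000", "DK 9000", "DK8000", "8000DENMARK", "9200"]
violations = non_compliant(entries, "XX[_]9999")
print(violations)
```

Returning the offending entries themselves, rather than just a count, keeps the drill-through to the underlying rows possible.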

Datatype selection

This example is boiled down; naturally there will be far more data types and attributes in a real-life scenario.

DataType    Attribute Name    Result
INT         CustomerNumber    100%
DATETIME    CustomerNumber    0%
INT         Name              0%
DATETIME    Name              0%
INT         PhoneNumber       80%
DATETIME    PhoneNumber       0%
INT         Birthdate         0%
DATETIME    Birthdate         100%
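Percentages like these can be produced by trying to parse every value of an attribute as each candidate data type. The sketch below is one way to do that, not a description of any particular profiling tool: the function names, the sample rows and the assumed date format (YYYY-MM-DD) are all hypothetical, chosen so the output lines up with the table.

```python
from datetime import datetime


def is_int(value):
    try:
        int(value)
        return True
    except ValueError:
        return False


def is_datetime(value, fmt="%Y-%m-%d"):
    # Assumption: dates are stored as text in a single known format.
    try:
        datetime.strptime(value, fmt)
        return True
    except ValueError:
        return False


def fit_percentage(values, check):
    """Percentage of values that parse successfully with the given check."""
    if not values:
        return 0.0
    return 100.0 * sum(1 for v in values if check(v)) / len(values)


# Hypothetical sample data; one phone number contains a space.
columns = {
    "CustomerNumber": ["1001", "1002", "1003", "1004", "1005"],
    "Name": ["Jens Hansen", "Mette Olsen", "Lars Berg", "Anne Holm", "Ole Friis"],
    "PhoneNumber": ["86123456", "86123457", "8612 3458", "86123459", "86123460"],
    "Birthdate": ["1980-05-17", "1975-11-02", "1990-01-30", "1968-07-09", "2001-12-24"],
}

results = {
    (type_name, attribute): fit_percentage(values, check)
    for attribute, values in columns.items()
    for type_name, check in (("INT", is_int), ("DATETIME", is_datetime))
}

for (type_name, attribute), pct in results.items():
    print(f"{type_name:<10}{attribute:<16}{pct:.0f}%")
```

An attribute such as PhoneNumber that fits a type for most but not all values is the interesting case: the failing entries are the ones worth drilling into.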