TDDA: Test-Driven Data Analysis

TDDA takes data inputs (such as NumPy arrays or pandas DataFrames) together with a set of constraints that are stored as a JSON file.

  • Reference Tests supports the creation of reference tests based on either unittest or pytest.

  • Constraints is used to discover constraints in a (pandas) DataFrame, write them out as JSON, and check whether records satisfy the constraints in the constraints file. It also supports tables in a variety of relational databases.

  • Rexpy is a tool for automatically deriving regular expressions from a column in a pandas DataFrame or from a (Python) list of examples.
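To give a flavour of the last point, here is a deliberately naive sketch of the underlying idea — classify runs of characters in an example and emit a pattern. The helper `naive_pattern` is hypothetical and only looks at a single example; Rexpy itself generalises over many examples and is far more sophisticated:

```python
import re


def naive_pattern(examples):
    """Tiny sketch of the idea behind Rexpy: classify each character
    (digit / letter / other) and collapse runs into a regex.
    Hypothetical and illustrative only -- not the Rexpy algorithm."""

    def classify(ch):
        if ch.isdigit():
            return r"\d"
        if ch.isalpha():
            return "[A-Za-z]"
        return re.escape(ch)

    pattern = ""
    prev, count = None, 0
    for ch in examples[0]:  # naive: only inspects the first example
        cls = classify(ch)
        if cls == prev:
            count += 1
        else:
            if prev:
                pattern += prev + (f"{{{count}}}" if count > 1 else "")
            prev, count = cls, 1
    pattern += prev + (f"{{{count}}}" if count > 1 else "")
    return pattern


pat = naive_pattern(["2017-02-24T08:10:53"])
print(pat)  # \d{4}\-\d{2}\-\d{2}[A-Za-z]\d{2}:\d{2}:\d{2}
print(bool(re.fullmatch(pat, "2018-12-01T23:59:59")))  # True
```

Rexpy's real output would be cleaner (for instance using a literal `T` instead of a letter class), but the principle — inferring structure from example strings — is the same.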

1. Imports

[1]:
import numpy as np
import pandas as pd

from tdda.constraints import discover_df, verify_df
[2]:
df = pd.read_csv(
    "https://raw.githubusercontent.com/kjam/data-cleaning-101/master/data/iot_example.csv"
)

2. Check data

With pandas.DataFrame.sample we display ten random rows:

[3]:
df.sample(10)
[3]:
timestamp username temperature heartrate build latest note
134749 2017-02-24T08:10:53 michellereed 11 77 24e04514-95c1-823b-1559-eda8a3eb7b35 0 update
53301 2017-01-22T19:17:53 donaldjohnson 28 84 256c0202-0696-6ecf-5979-13d1886a3120 0 user
79145 2017-02-02T03:45:05 derek53 25 63 8b03fa9d-5992-22f6-9977-b668b3ffc3c6 1 test
77111 2017-02-01T07:59:34 jason72 24 84 487965ae-c546-eef9-43ef-c09d4780dfdc 0 user
99493 2017-02-10T06:38:47 tonyahopkins 18 74 30287b8f-afbb-e2e8-7e8a-4ce35191fdaa 0 sleep
74375 2017-01-31T05:37:53 wholmes 24 75 52428d30-6f5d-2c95-b6a9-82f5a587d47a 0 interval
144199 2017-02-28T03:01:21 velazquezchristina 22 71 91b3a39a-26cc-1a22-9d92-946d6bbe8b12 0 interval
50757 2017-01-21T18:52:32 sdonovan 5 77 1b18d840-dc92-3f8b-85a1-8904a00c697c 0 test
127798 2017-02-21T13:21:39 mossdavid 14 62 b80db398-98d4-061a-d29a-284dac7edc90 0 interval
96068 2017-02-08T21:53:09 veronicaanderson 17 78 c344b7b5-b032-757b-21e2-cc9762fcd6d5 1 test

And with pandas.DataFrame.dtypes we display the data types for the individual columns:

[4]:
df.dtypes
[4]:
timestamp      object
username       object
temperature     int64
heartrate       int64
build          object
latest          int64
note           object
dtype: object

3. Creating a constraints object

With discover_df a constraints object can be created.

[5]:
constraints = discover_df(df)
[6]:
constraints
[6]:
<tdda.constraints.base.DatasetConstraints at 0x127584ec0>
[7]:
constraints.fields
[7]:
Fields([('timestamp', <tdda.constraints.base.FieldConstraints at 0x1275846e0>),
        ('username', <tdda.constraints.base.FieldConstraints at 0x115fe74d0>),
        ('temperature',
         <tdda.constraints.base.FieldConstraints at 0x115fe7d90>),
        ('heartrate', <tdda.constraints.base.FieldConstraints at 0x127594770>),
        ('build', <tdda.constraints.base.FieldConstraints at 0x127594b00>),
        ('latest', <tdda.constraints.base.FieldConstraints at 0x127598b90>),
        ('note', <tdda.constraints.base.FieldConstraints at 0x1160dd9d0>)])

4. Writing the constraints into a file

[8]:
with open("../../data/ignore-iot_constraints.tdda", "w") as f:
    f.write(constraints.to_json())

If we take a closer look at the file, we can see, for example, that a string of exactly 19 characters is expected for the timestamp column and that temperature expects integers with values from 5 to 29.

[9]:
!cat ../../data/ignore-iot_constraints.tdda
{
    "creation_metadata": {
        "local_time": "2024-11-02T19:34:53",
        "utc_time": "2024-11-02T18:34:53",
        "creator": "TDDA 2.2.05",
        "host": "fay.local",
        "user": "veit",
        "n_records": 146397,
        "n_selected": 146397
    },
    "fields": {
        "timestamp": {
            "type": "string",
            "min_length": 19,
            "max_length": 19,
            "max_nulls": 0,
            "no_duplicates": true
        },
        "username": {
            "type": "string",
            "min_length": 3,
            "max_length": 21,
            "max_nulls": 0
        },
        "temperature": {
            "type": "int",
            "min": 5,
            "max": 29,
            "sign": "positive",
            "max_nulls": 0
        },
        "heartrate": {
            "type": "int",
            "min": 60,
            "max": 89,
            "sign": "positive",
            "max_nulls": 0
        },
        "build": {
            "type": "string",
            "min_length": 36,
            "max_length": 36,
            "max_nulls": 0,
            "no_duplicates": true
        },
        "latest": {
            "type": "int",
            "min": 0,
            "max": 1,
            "sign": "non-negative",
            "max_nulls": 0
        },
        "note": {
            "type": "string",
            "min_length": 4,
            "max_length": 8,
            "allowed_values": [
                "interval",
                "sleep",
                "test",
                "update",
                "user",
                "wake"
            ]
        }
    }
}
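These bounds come straight from the data: discovery records the observed minima and maxima. On a hypothetical two-row miniature of the data, the same statistics can be computed by hand with pandas:

```python
import pandas as pd

# Hypothetical miniature of the IoT data (values taken from the sample above).
mini = pd.DataFrame(
    {
        "timestamp": ["2017-02-24T08:10:53", "2017-01-22T19:17:53"],
        "temperature": [11, 28],
    }
)

# min_length/max_length for a string column come from the observed lengths:
lengths = mini["timestamp"].str.len()
print(lengths.min(), lengths.max())  # 19 19

# min/max for an integer column come from the observed values:
print(mini["temperature"].min(), mini["temperature"].max())  # 11 28
```

This also explains why constraints discovered on one sample can fail on new data: the new data may simply fall outside the observed range.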

5. Checking data frames

To do this, we first read a new CSV file with pandas and then display ten random rows as a sample:

[10]:
new_df = pd.read_csv(
    "https://raw.githubusercontent.com/kjam/data-cleaning-101/master/data/iot_example_with_nulls.csv"
)
new_df.sample(10)
[10]:
timestamp username temperature heartrate build latest note
117767 2017-02-17T13:18:01 flynnkimberly NaN 80 NaN 0.0 NaN
85787 2017-02-04T19:29:23 djohns NaN 67 e2af7ab7-938e-57e7-975d-441f84052abe 1.0 NaN
128382 2017-02-21T18:59:14 bradley71 26.0 84 86e34cba-6af2-5ae6-bb43-5f58839ddfc4 1.0 NaN
126189 2017-02-20T21:55:00 vprice 12.0 64 780420ce-b902-fe52-d9f8-a46daa269dcc NaN interval
107327 2017-02-13T09:35:13 rachelrobinson 27.0 81 9c9cbbb2-22d1-29cf-c08e-ea5014fa22c7 1.0 wake
119362 2017-02-18T04:40:23 diana54 20.0 61 12b0bff3-7598-37f9-a4d9-79a6f58756a8 0.0 test
80993 2017-02-02T21:30:02 tamarafrost 16.0 85 NaN 1.0 wake
46022 2017-01-19T21:31:10 usellers NaN 78 98e08b68-6fe7-8345-a583-658516f4001b 0.0 NaN
132568 2017-02-23T11:17:09 jeffreyrodgers NaN 64 NaN NaN interval
113676 2017-02-15T22:17:25 mcculloughmichelle 7.0 77 caf5d3ca-734f-6683-0d55-24c07d0c9e33 0.0 test

We see several fields that contain NaN values. To analyse this systematically, we apply verify_df to our new DataFrame. Here, passes returns the number of passed constraints, and failures returns the number of failed constraints.

[11]:
v = verify_df(new_df, "../../data/ignore-iot_constraints.tdda")
[12]:
v
[12]:
<tdda.constraints.pd.constraints.PandasVerification at 0x127585e80>
[13]:
v.passes
[13]:
30
[14]:
v.failures
[14]:
3

We can also display which constraints passed and failed in which columns:

[15]:
print(str(v))
FIELDS:

timestamp: 0 failures  5 passes  type ✓  min_length ✓  max_length ✓  max_nulls ✓  no_duplicates ✓

username: 0 failures  4 passes  type ✓  min_length ✓  max_length ✓  max_nulls ✓

temperature: 1 failure  4 passes  type ✓  min ✓  max ✓  sign ✓  max_nulls ✗

heartrate: 0 failures  5 passes  type ✓  min ✓  max ✓  sign ✓  max_nulls ✓

build: 1 failure  4 passes  type ✓  min_length ✓  max_length ✓  max_nulls ✗  no_duplicates ✓

latest: 1 failure  4 passes  type ✓  min ✓  max ✓  sign ✓  max_nulls ✗

note: 0 failures  4 passes  type ✓  min_length ✓  max_length ✓  allowed_values ✓

SUMMARY:

Constraints passing: 30
Constraints failing: 3
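All three failures are max_nulls violations. On a hypothetical miniature of the second CSV, the columns responsible can be found with a plain pandas null count:

```python
import numpy as np
import pandas as pd

# Hypothetical miniature with nulls in the three columns that fail above.
mini = pd.DataFrame(
    {
        "temperature": [26.0, np.nan, 12.0],
        "heartrate": [84, 67, 64],
        "build": ["b1", None, "b3"],
        "latest": [1.0, np.nan, 0.0],
    }
)

null_counts = mini.isna().sum()
# Any column with at least one null violates "max_nulls": 0:
violating = null_counts[null_counts > 0].index.tolist()
print(violating)  # ['temperature', 'build', 'latest']
```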

Alternatively, we can also display these results in tabular form:

[16]:
v.to_frame()
[16]:
field failures passes type min min_length max max_length sign max_nulls no_duplicates allowed_values
0 timestamp 0 5 True NaN True NaN True NaN True True NaN
1 username 0 4 True NaN True NaN True NaN True NaN NaN
2 temperature 1 4 True True NaN True NaN True False NaN NaN
3 heartrate 0 5 True True NaN True NaN True True NaN NaN
4 build 1 4 True NaN True NaN True NaN False True NaN
5 latest 1 4 True True NaN True NaN True False NaN NaN
6 note 0 4 True NaN True NaN True NaN NaN NaN True
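Since to_frame returns an ordinary pandas DataFrame, the failing constraints can be filtered with normal pandas operations — sketched here on a hand-built excerpt of the table above:

```python
import pandas as pd

# Hand-built excerpt of the verification table shown above.
vf = pd.DataFrame(
    {
        "field": ["timestamp", "temperature", "build", "latest"],
        "failures": [0, 1, 1, 1],
        "passes": [5, 4, 4, 4],
        "max_nulls": [True, False, False, False],
    }
)

# Keep only the fields with at least one failed constraint:
failing = vf[vf["failures"] > 0]
print(failing["field"].tolist())  # ['temperature', 'build', 'latest']
```

This makes it easy to feed verification results into reports or CI checks.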