TDDA: Test-Driven Data Analysis

TDDA takes data inputs (such as NumPy arrays or pandas DataFrames) together with a set of constraints that are stored as a JSON file.

  • Reference Tests supports the creation of reference tests based on either unittest or pytest.

  • Constraints is used to discover constraints in a (pandas) DataFrame, write them out as JSON, and check whether records satisfy the constraints in the constraints file. It also supports tables in a variety of relational databases.

  • Rexpy is a tool for automatically deriving regular expressions from a column in a pandas DataFrame or from a (Python) list of examples.
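To give a flavour of the last point, here is a deliberately naive sketch of the underlying idea — classify runs of characters in an example and emit a pattern. The helper `naive_pattern` is hypothetical and only looks at a single example; Rexpy itself generalises over many examples and is far more sophisticated:

```python
import re


def naive_pattern(examples):
    """Tiny sketch of the idea behind Rexpy: classify each character
    (digit / letter / other) and collapse runs into a regex.
    Hypothetical and illustrative only -- not the Rexpy algorithm."""

    def classify(ch):
        if ch.isdigit():
            return r"\d"
        if ch.isalpha():
            return "[A-Za-z]"
        return re.escape(ch)

    pattern = ""
    prev, count = None, 0
    for ch in examples[0]:  # naive: only inspects the first example
        cls = classify(ch)
        if cls == prev:
            count += 1
        else:
            if prev:
                pattern += prev + (f"{{{count}}}" if count > 1 else "")
            prev, count = cls, 1
    pattern += prev + (f"{{{count}}}" if count > 1 else "")
    return pattern


pat = naive_pattern(["2017-02-24T08:10:53"])
print(pat)  # \d{4}\-\d{2}\-\d{2}[A-Za-z]\d{2}:\d{2}:\d{2}
print(bool(re.fullmatch(pat, "2018-12-01T23:59:59")))  # True
```

Rexpy's real output would be cleaner (for instance using a literal `T` instead of a letter class), but the principle — inferring structure from example strings — is the same.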

1. Imports

[1]:
import numpy as np
import pandas as pd

from tdda.constraints import discover_df, verify_df
[2]:
df = pd.read_csv(
    "https://raw.githubusercontent.com/kjam/data-cleaning-101/master/data/iot_example.csv"
)

2. Check data

With pandas.DataFrame.sample we display ten random rows:

[3]:
df.sample(10)
[3]:
timestamp username temperature heartrate build latest note
134749 2017-02-24T08:10:53 michellereed 11 77 24e04514-95c1-823b-1559-eda8a3eb7b35 0 update
53301 2017-01-22T19:17:53 donaldjohnson 28 84 256c0202-0696-6ecf-5979-13d1886a3120 0 user
79145 2017-02-02T03:45:05 derek53 25 63 8b03fa9d-5992-22f6-9977-b668b3ffc3c6 1 test
77111 2017-02-01T07:59:34 jason72 24 84 487965ae-c546-eef9-43ef-c09d4780dfdc 0 user
99493 2017-02-10T06:38:47 tonyahopkins 18 74 30287b8f-afbb-e2e8-7e8a-4ce35191fdaa 0 sleep
74375 2017-01-31T05:37:53 wholmes 24 75 52428d30-6f5d-2c95-b6a9-82f5a587d47a 0 interval
144199 2017-02-28T03:01:21 velazquezchristina 22 71 91b3a39a-26cc-1a22-9d92-946d6bbe8b12 0 interval
50757 2017-01-21T18:52:32 sdonovan 5 77 1b18d840-dc92-3f8b-85a1-8904a00c697c 0 test
127798 2017-02-21T13:21:39 mossdavid 14 62 b80db398-98d4-061a-d29a-284dac7edc90 0 interval
96068 2017-02-08T21:53:09 veronicaanderson 17 78 c344b7b5-b032-757b-21e2-cc9762fcd6d5 1 test

And with pandas.DataFrame.dtypes we display the data types for the individual columns:

[4]:
df.dtypes
[4]:
timestamp      object
username       object
temperature     int64
heartrate       int64
build          object
latest          int64
note           object
dtype: object

3. Creating a constraints object

With discover_df a constraints object can be created.

[5]:
constraints = discover_df(df)
[6]:
constraints
[6]:
<tdda.constraints.base.DatasetConstraints at 0x127584ec0>
[7]:
constraints.fields
[7]:
Fields([('timestamp', <tdda.constraints.base.FieldConstraints at 0x1275846e0>),
        ('username', <tdda.constraints.base.FieldConstraints at 0x115fe74d0>),
        ('temperature',
         <tdda.constraints.base.FieldConstraints at 0x115fe7d90>),
        ('heartrate', <tdda.constraints.base.FieldConstraints at 0x127594770>),
        ('build', <tdda.constraints.base.FieldConstraints at 0x127594b00>),
        ('latest', <tdda.constraints.base.FieldConstraints at 0x127598b90>),
        ('note', <tdda.constraints.base.FieldConstraints at 0x1160dd9d0>)])

4. Writing the constraints into a file

[8]:
with open("../../data/ignore-iot_constraints.tdda", "w") as f:
    f.write(constraints.to_json())

If we take a closer look at the file, we can see, for example, that a string of exactly 19 characters is expected for the timestamp column and that temperature expects integers with values from 5 to 29.

[9]:
!cat ../../data/ignore-iot_constraints.tdda
{
    "creation_metadata": {
        "local_time": "2024-11-02T19:34:53",
        "utc_time": "2024-11-02T18:34:53",
        "creator": "TDDA 2.2.05",
        "host": "fay.local",
        "user": "veit",
        "n_records": 146397,
        "n_selected": 146397
    },
    "fields": {
        "timestamp": {
            "type": "string",
            "min_length": 19,
            "max_length": 19,
            "max_nulls": 0,
            "no_duplicates": true
        },
        "username": {
            "type": "string",
            "min_length": 3,
            "max_length": 21,
            "max_nulls": 0
        },
        "temperature": {
            "type": "int",
            "min": 5,
            "max": 29,
            "sign": "positive",
            "max_nulls": 0
        },
        "heartrate": {
            "type": "int",
            "min": 60,
            "max": 89,
            "sign": "positive",
            "max_nulls": 0
        },
        "build": {
            "type": "string",
            "min_length": 36,
            "max_length": 36,
            "max_nulls": 0,
            "no_duplicates": true
        },
        "latest": {
            "type": "int",
            "min": 0,
            "max": 1,
            "sign": "non-negative",
            "max_nulls": 0
        },
        "note": {
            "type": "string",
            "min_length": 4,
            "max_length": 8,
            "allowed_values": [
                "interval",
                "sleep",
                "test",
                "update",
                "user",
                "wake"
            ]
        }
    }
}
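These bounds come straight from the data: discovery records the observed minima and maxima. On a hypothetical two-row miniature of the data, the same statistics can be computed by hand with pandas:

```python
import pandas as pd

# Hypothetical miniature of the IoT data (values taken from the sample above).
mini = pd.DataFrame(
    {
        "timestamp": ["2017-02-24T08:10:53", "2017-01-22T19:17:53"],
        "temperature": [11, 28],
    }
)

# min_length/max_length for a string column come from the observed lengths:
lengths = mini["timestamp"].str.len()
print(lengths.min(), lengths.max())  # 19 19

# min/max for an integer column come from the observed values:
print(mini["temperature"].min(), mini["temperature"].max())  # 11 28
```

This also explains why constraints discovered on one sample can fail on new data: the new data may simply fall outside the observed range.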

5. Checking data frames

To do this, we first read a new CSV file with pandas and then display ten random rows as a sample:

[10]:
new_df = pd.read_csv(
    "https://raw.githubusercontent.com/kjam/data-cleaning-101/master/data/iot_example_with_nulls.csv"
)
new_df.sample(10)
[10]:
timestamp username temperature heartrate build latest note
117767 2017-02-17T13:18:01 flynnkimberly NaN 80 NaN 0.0 NaN
85787 2017-02-04T19:29:23 djohns NaN 67 e2af7ab7-938e-57e7-975d-441f84052abe 1.0 NaN
128382 2017-02-21T18:59:14 bradley71 26.0 84 86e34cba-6af2-5ae6-bb43-5f58839ddfc4 1.0 NaN
126189 2017-02-20T21:55:00 vprice 12.0 64 780420ce-b902-fe52-d9f8-a46daa269dcc NaN interval
107327 2017-02-13T09:35:13 rachelrobinson 27.0 81 9c9cbbb2-22d1-29cf-c08e-ea5014fa22c7 1.0 wake
119362 2017-02-18T04:40:23 diana54 20.0 61 12b0bff3-7598-37f9-a4d9-79a6f58756a8 0.0 test
80993 2017-02-02T21:30:02 tamarafrost 16.0 85 NaN 1.0 wake
46022 2017-01-19T21:31:10 usellers NaN 78 98e08b68-6fe7-8345-a583-658516f4001b 0.0 NaN
132568 2017-02-23T11:17:09 jeffreyrodgers NaN 64 NaN NaN interval
113676 2017-02-15T22:17:25 mcculloughmichelle 7.0 77 caf5d3ca-734f-6683-0d55-24c07d0c9e33 0.0 test

We see several fields that contain NaN values. To analyse this systematically, we apply verify_df to our new DataFrame. Here, passes returns the number of passed constraints, and failures returns the number of failed constraints.

[11]:
v = verify_df(new_df, "../../data/ignore-iot_constraints.tdda")
[12]:
v
[12]:
<tdda.constraints.pd.constraints.PandasVerification at 0x127585e80>
[13]:
v.passes
[13]:
30
[14]:
v.failures
[14]:
3

We can also display which constraints passed and failed in which columns:

[15]:
print(str(v))
FIELDS:

timestamp: 0 failures  5 passes  type ✓  min_length ✓  max_length ✓  max_nulls ✓  no_duplicates ✓

username: 0 failures  4 passes  type ✓  min_length ✓  max_length ✓  max_nulls ✓

temperature: 1 failure  4 passes  type ✓  min ✓  max ✓  sign ✓  max_nulls ✗

heartrate: 0 failures  5 passes  type ✓  min ✓  max ✓  sign ✓  max_nulls ✓

build: 1 failure  4 passes  type ✓  min_length ✓  max_length ✓  max_nulls ✗  no_duplicates ✓

latest: 1 failure  4 passes  type ✓  min ✓  max ✓  sign ✓  max_nulls ✗

note: 0 failures  4 passes  type ✓  min_length ✓  max_length ✓  allowed_values ✓

SUMMARY:

Constraints passing: 30
Constraints failing: 3
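All three failures are max_nulls violations. On a hypothetical miniature of the second CSV, the columns responsible can be found with a plain pandas null count:

```python
import numpy as np
import pandas as pd

# Hypothetical miniature with nulls in the three columns that fail above.
mini = pd.DataFrame(
    {
        "temperature": [26.0, np.nan, 12.0],
        "heartrate": [84, 67, 64],
        "build": ["b1", None, "b3"],
        "latest": [1.0, np.nan, 0.0],
    }
)

null_counts = mini.isna().sum()
# Any column with at least one null violates "max_nulls": 0:
violating = null_counts[null_counts > 0].index.tolist()
print(violating)  # ['temperature', 'build', 'latest']
```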

Alternatively, we can also display these results in tabular form:

[16]:
v.to_frame()
[16]:
field failures passes type min min_length max max_length sign max_nulls no_duplicates allowed_values
0 timestamp 0 5 True NaN True NaN True NaN True True NaN
1 username 0 4 True NaN True NaN True NaN True NaN NaN
2 temperature 1 4 True True NaN True NaN True False NaN NaN
3 heartrate 0 5 True True NaN True NaN True True NaN NaN
4 build 1 4 True NaN True NaN True NaN False True NaN
5 latest 1 4 True True NaN True NaN True False NaN NaN
6 note 0 4 True NaN True NaN True NaN NaN NaN True
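Since to_frame returns an ordinary pandas DataFrame, the failing constraints can be filtered with normal pandas operations — sketched here on a hand-built excerpt of the table above:

```python
import pandas as pd

# Hand-built excerpt of the verification table shown above.
vf = pd.DataFrame(
    {
        "field": ["timestamp", "temperature", "build", "latest"],
        "failures": [0, 1, 1, 1],
        "passes": [5, 4, 4, 4],
        "max_nulls": [True, False, False, False],
    }
)

# Keep only the fields with at least one failed constraint:
failing = vf[vf["failures"] > 0]
print(failing["field"].tolist())  # ['temperature', 'build', 'latest']
```

This makes it easy to feed verification results into reports or CI checks.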