TDDA: Test-Driven Data Analysis¶
TDDA uses file inputs (such as NumPy arrays or Pandas DataFrames) and a set of constraints that are stored as a JSON file.
Reference Tests supports the creation of reference tests based on either unittest or pytest.
Constraints is used to retrieve constraints from a (pandas) DataFrame, write them out as JSON and check whether records satisfy the constraints in the constraints file. It also supports tables in a variety of relational databases.
Rexpy is a tool for automatically deriving regular expressions from a column in a pandas DataFrame or from a (Python) list of examples.
1. Imports¶
[1]:
import numpy as np
import pandas as pd
from tdda.constraints import discover_df, verify_df
[2]:
df = pd.read_csv(
"https://raw.githubusercontent.com/kjam/data-cleaning-101/master/data/iot_example.csv"
)
2. Check data¶
With pandas.DataFrame.sample we display ten random data sets:
[3]:
df.sample(10)
[3]:
| timestamp | username | temperature | heartrate | build | latest | note | |
|---|---|---|---|---|---|---|---|
| 134749 | 2017-02-24T08:10:53 | michellereed | 11 | 77 | 24e04514-95c1-823b-1559-eda8a3eb7b35 | 0 | update |
| 53301 | 2017-01-22T19:17:53 | donaldjohnson | 28 | 84 | 256c0202-0696-6ecf-5979-13d1886a3120 | 0 | user |
| 79145 | 2017-02-02T03:45:05 | derek53 | 25 | 63 | 8b03fa9d-5992-22f6-9977-b668b3ffc3c6 | 1 | test |
| 77111 | 2017-02-01T07:59:34 | jason72 | 24 | 84 | 487965ae-c546-eef9-43ef-c09d4780dfdc | 0 | user |
| 99493 | 2017-02-10T06:38:47 | tonyahopkins | 18 | 74 | 30287b8f-afbb-e2e8-7e8a-4ce35191fdaa | 0 | sleep |
| 74375 | 2017-01-31T05:37:53 | wholmes | 24 | 75 | 52428d30-6f5d-2c95-b6a9-82f5a587d47a | 0 | interval |
| 144199 | 2017-02-28T03:01:21 | velazquezchristina | 22 | 71 | 91b3a39a-26cc-1a22-9d92-946d6bbe8b12 | 0 | interval |
| 50757 | 2017-01-21T18:52:32 | sdonovan | 5 | 77 | 1b18d840-dc92-3f8b-85a1-8904a00c697c | 0 | test |
| 127798 | 2017-02-21T13:21:39 | mossdavid | 14 | 62 | b80db398-98d4-061a-d29a-284dac7edc90 | 0 | interval |
| 96068 | 2017-02-08T21:53:09 | veronicaanderson | 17 | 78 | c344b7b5-b032-757b-21e2-cc9762fcd6d5 | 1 | test |
And with pandas.DataFrame.dtypes we display the data types for the individual columns:
[4]:
df.dtypes
[4]:
timestamp object
username object
temperature int64
heartrate int64
build object
latest int64
note object
dtype: object
3. Creating a constraints object¶
With discover_constraints a constraints object can be created.
[5]:
constraints = discover_df(df)
[6]:
constraints
[6]:
<tdda.constraints.base.DatasetConstraints at 0x127584ec0>
[7]:
constraints.fields
[7]:
Fields([('timestamp', <tdda.constraints.base.FieldConstraints at 0x1275846e0>),
('username', <tdda.constraints.base.FieldConstraints at 0x115fe74d0>),
('temperature',
<tdda.constraints.base.FieldConstraints at 0x115fe7d90>),
('heartrate', <tdda.constraints.base.FieldConstraints at 0x127594770>),
('build', <tdda.constraints.base.FieldConstraints at 0x127594b00>),
('latest', <tdda.constraints.base.FieldConstraints at 0x127598b90>),
('note', <tdda.constraints.base.FieldConstraints at 0x1160dd9d0>)])
4. Writing the constraints into a file¶
[8]:
with open("../../data/ignore-iot_constraints.tdda", "w") as f:
f.write(constraints.to_json())
If we take a closer look at the file, we can see that, for example, a string with 19 characters is expected for the timestamp column and temperature expects integers with values from 5-29.
[9]:
!cat ../../data/ignore-iot_constraints.tdda
{
"creation_metadata": {
"local_time": "2024-11-02T19:34:53",
"utc_time": "2024-11-02T18:34:53",
"creator": "TDDA 2.2.05",
"host": "fay.local",
"user": "veit",
"n_records": 146397,
"n_selected": 146397
},
"fields": {
"timestamp": {
"type": "string",
"min_length": 19,
"max_length": 19,
"max_nulls": 0,
"no_duplicates": true
},
"username": {
"type": "string",
"min_length": 3,
"max_length": 21,
"max_nulls": 0
},
"temperature": {
"type": "int",
"min": 5,
"max": 29,
"sign": "positive",
"max_nulls": 0
},
"heartrate": {
"type": "int",
"min": 60,
"max": 89,
"sign": "positive",
"max_nulls": 0
},
"build": {
"type": "string",
"min_length": 36,
"max_length": 36,
"max_nulls": 0,
"no_duplicates": true
},
"latest": {
"type": "int",
"min": 0,
"max": 1,
"sign": "non-negative",
"max_nulls": 0
},
"note": {
"type": "string",
"min_length": 4,
"max_length": 8,
"allowed_values": [
"interval",
"sleep",
"test",
"update",
"user",
"wake"
]
}
}
}
5. Checking data frames¶
To do this, we first read in a new csv file with pandas and then have ten data records output as examples:
[10]:
new_df = pd.read_csv(
"https://raw.githubusercontent.com/kjam/data-cleaning-101/master/data/iot_example_with_nulls.csv"
)
new_df.sample(10)
[10]:
| timestamp | username | temperature | heartrate | build | latest | note | |
|---|---|---|---|---|---|---|---|
| 117767 | 2017-02-17T13:18:01 | flynnkimberly | NaN | 80 | NaN | 0.0 | NaN |
| 85787 | 2017-02-04T19:29:23 | djohns | NaN | 67 | e2af7ab7-938e-57e7-975d-441f84052abe | 1.0 | NaN |
| 128382 | 2017-02-21T18:59:14 | bradley71 | 26.0 | 84 | 86e34cba-6af2-5ae6-bb43-5f58839ddfc4 | 1.0 | NaN |
| 126189 | 2017-02-20T21:55:00 | vprice | 12.0 | 64 | 780420ce-b902-fe52-d9f8-a46daa269dcc | NaN | interval |
| 107327 | 2017-02-13T09:35:13 | rachelrobinson | 27.0 | 81 | 9c9cbbb2-22d1-29cf-c08e-ea5014fa22c7 | 1.0 | wake |
| 119362 | 2017-02-18T04:40:23 | diana54 | 20.0 | 61 | 12b0bff3-7598-37f9-a4d9-79a6f58756a8 | 0.0 | test |
| 80993 | 2017-02-02T21:30:02 | tamarafrost | 16.0 | 85 | NaN | 1.0 | wake |
| 46022 | 2017-01-19T21:31:10 | usellers | NaN | 78 | 98e08b68-6fe7-8345-a583-658516f4001b | 0.0 | NaN |
| 132568 | 2017-02-23T11:17:09 | jeffreyrodgers | NaN | 64 | NaN | NaN | interval |
| 113676 | 2017-02-15T22:17:25 | mcculloughmichelle | 7.0 | 77 | caf5d3ca-734f-6683-0d55-24c07d0c9e33 | 0.0 | test |
We see several fields that are output as NaN. Now, to analyse this systematically, we apply verify_df to our new DataFrame. Here, passes returns the number of passed constraints, and failures returns the number of failed constraints.
[11]:
v = verify_df(new_df, '../../data/ignore-iot_constraints.tdda')
[12]:
v
[12]:
<tdda.constraints.pd.constraints.PandasVerification at 0x127585e80>
[13]:
v.passes
[13]:
30
[14]:
v.failures
[14]:
3
We can also display which constraints passed and failed in which columns:
[15]:
print(str(v))
FIELDS:
timestamp: 0 failures 5 passes type ✓ min_length ✓ max_length ✓ max_nulls ✓ no_duplicates ✓
username: 0 failures 4 passes type ✓ min_length ✓ max_length ✓ max_nulls ✓
temperature: 1 failure 4 passes type ✓ min ✓ max ✓ sign ✓ max_nulls ✗
heartrate: 0 failures 5 passes type ✓ min ✓ max ✓ sign ✓ max_nulls ✓
build: 1 failure 4 passes type ✓ min_length ✓ max_length ✓ max_nulls ✗ no_duplicates ✓
latest: 1 failure 4 passes type ✓ min ✓ max ✓ sign ✓ max_nulls ✗
note: 0 failures 4 passes type ✓ min_length ✓ max_length ✓ allowed_values ✓
SUMMARY:
Constraints passing: 30
Constraints failing: 3
Alternatively, we can also display these results in tabular form:
[16]:
v.to_frame()
[16]:
| field | failures | passes | type | min | min_length | max | max_length | sign | max_nulls | no_duplicates | allowed_values | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | timestamp | 0 | 5 | True | NaN | True | NaN | True | NaN | True | True | NaN |
| 1 | username | 0 | 4 | True | NaN | True | NaN | True | NaN | True | NaN | NaN |
| 2 | temperature | 1 | 4 | True | True | NaN | True | NaN | True | False | NaN | NaN |
| 3 | heartrate | 0 | 5 | True | True | NaN | True | NaN | True | True | NaN | NaN |
| 4 | build | 1 | 4 | True | NaN | True | NaN | True | NaN | False | True | NaN |
| 5 | latest | 1 | 4 | True | True | NaN | True | NaN | True | False | NaN | NaN |
| 6 | note | 0 | 4 | True | NaN | True | NaN | True | NaN | NaN | NaN | True |