In many online studies we have a question like the one below:
S3. Do you have children below 18 living in your household?
1. Yes
2. No
3. Prefer not to answer
In the data this question is presented as a single variable holding the numbers 1, 2 and 3, representing each of the answers above. If a respondent answers "Yes" at S3, they are asked the next question:
S4. How many children below 18 do you have?
1. 1
2. 2
3. 3
4. 4 or more
5. Prefer not to answer
Analogously to the previous question, we have a single variable with values from 1 to 5. In this scenario, if the logic is applied strictly, we should have values from 1 to 5 in S4 whenever S3 has the value 1 for a given record. And vice versa, if S3 is equal to 2 or 3, S4 should hold no data /SnakeData applies the None type to blank spaces/. So let's begin by creating a new database:
As seen on the left, the create_db() function creates a database with the default folders described here /the same as in db1/. The name of the new database is typed inside the parentheses within quotation marks /quotation marks tell Python that what's between them is a string/. The newly created database is automatically selected and we can start working with it immediately. What we are going to do next is use the run_script() function... and yes, it executes a predefined Python script:
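A minimal sketch of the two calls /the database name here is hypothetical/:

    create_db('children_check')     # creates and selects the new database
    run_script('example.py')        # runs the script from the 'scripts' folder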
Similarly to select_db(), the script name is the only argument of run_script() and is typed within quotation marks inside the parentheses. It runs properly only if the script file is located in the 'scripts' folder. The script itself is based on the SnakeData library and its purpose is to "catch" the records with logic errors at S3 and S4. And below is example.py:
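A sketch of what example.py could look like, reconstructed from the walkthrough that follows. The import line, the module name and the file names are my assumptions /and S3 is assumed to always hold a value, as in this data/; the layout is arranged so the line numbers mentioned below match:

    from snakedata import read_all_data, export_to_csv   # module name is an assumption
    data = read_all_data('template.csv')                  # template file name is made up
    data['Error_flag'] = []                               # line 3: the error-flagging list
    for a, b, c, d in zip(data['Marker'], data['uid'], data['S3'], data['S4']):
        if c > 1 and d is not None:                       # line 5: S3 > 1 but S4 holds data
            print('Excessive data in S4:', a, b)
            data['Error_flag'].append(1)
        elif c == 1 and (d is None or d not in range(1, 6)):  # line 8: S3 = 1, S4 missing/invalid
            print('No or invalid data in S4:', a, b)
            data['Error_flag'].append(2)
        else:
            data['Error_flag'].append(0)                  # everything is fine with this record

    export_to_csv(data, 'export.csv')                     # line 14: write data + Error_flag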
The read_all_data() function comes from SnakeData and has just one argument - the name of the template file /simply said, the columns that will be extracted from the available data/. This time I created a template with all variables, because the variable set is quite small /just 6 data variables/ and selecting only the 3 variables needed would not speed things up significantly. The template file itself is a csv file with a single row containing the column names to be loaded:
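For example, a template covering all variables could be a single line like this /uid and S1-S4 appear in the text; S5 as the sixth variable is a guess/:

    uid,S1,S2,S3,S4,S5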
read_all_data() is an essential function /as is read_custom_data()/ and it extracts the data from all csv files in the "data" folder. The function in fact returns a dictionary where the variable names /the first row in each file/ become dictionary keys and the data are stored as lists in the dictionary values. In other words, the numbers in the column below S1 are stored in a list assigned as the value of the "S1" key in the dictionary, and the same goes for S2, S3, etc. The function adds 3 system variables to the dictionary: global_rec /stands for global record; holds the record number within the entire data/, Marker /holds the name of the data file the record comes from/ and local_rec /holds the record number within a single data file, i.e. when the records from one file end, the count for the next file starts again from 1/. These 3 variables are added at the beginning of the data and the actual data is loaded right after them. The difference between global and local rec:
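As an illustration /the file names, uids and values here are made up/, for two data files with 3 records each the returned dictionary would look roughly like this:

    {
        'global_rec': [1, 2, 3, 4, 5, 6],                  # keeps counting across files
        'Marker': ['data1.csv', 'data1.csv', 'data1.csv',
                   'data2.csv', 'data2.csv', 'data2.csv'], # source file of each record
        'local_rec': [1, 2, 3, 1, 2, 3],                   # restarts at 1 in each file
        'uid': [101, 102, 103, 104, 105, 106],
        'S1': [1, 2, 1, 3, 2, 1],
        # ...and the same for S2, S3, S4, etc.
    }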
On line 3 of the code I add the key "Error_flag" with an empty list assigned to it. Its job is to mark all records with logic errors, so the user can easily identify these cases after exporting the data. The next line initiates a for loop in order to iterate over all elements of the lists we need. I have also used one priceless built-in function here - zip(), which allows us to iterate over 2 or more iterables of the same length (in our case the iterables are lists). On this line "a" becomes an element of data['Marker'], "b" becomes an element of data['uid'], etc. Although Marker and uid are not needed for the data validation itself, we need them to identify where the wrong records are.

Next we move to the conditions, and in general we have two to check: if S3 is bigger than 1 we should have no data at the same position in S4, and vice versa, if S3=1 we should see integer values in the range 1 to 5. As mentioned before /or maybe not, my memory plays tricks on me sometimes.../, if read_all_data() detects a blank space it replaces the blank with None. On line 5 we check the straight logic - the S4 list should contain only None at the positions where S3 is bigger than 1. If that's not the case, we ask Python to print an error message and to append 1 at the same position in Error_flag. On line 8 we check the case when S3=1: if S4 is None or not in the range 1 to 5 at the same position, we print a different error message and append 2 to Error_flag in order to flag the other type of error. In any other case, i.e. when everything's correct with the data, we append 0 to Error_flag.

At the very end of the script there is yet another function whose name says enough - export_to_csv(). It has two arguments - the dictionary holding the data and the name of the export file. Line 14 generates a csv export file holding the data we already have + the 3 system variables at the beginning and one error flagging variable at the end:
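So the header row of the export would look something like this /again assuming the variable names from the template above/:

    global_rec,Marker,local_rec,uid,S1,S2,S3,S4,S5,Error_flag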
And if we filter out the zero values in Error_flag we will see the same 4 guys displayed in the terminal:
So, to summarize: there are four records with logic errors in the data. Two of them have excessive data in S4, and the other two are at the opposite end - S4 holds no data where it is supposed to.
And at the end I'd like to point out again something important - the data is completely fake, generated by me in Calc /LibreOffice/ to illustrate a potential application of SnakeData. It is definitely NOT real world data.