
Scrubbing User Data

We want to ensure that the data we analyze is scrubbed so that no traceable information can be linked to individual respondents.

Import Dependencies

# import dependencies
import numpy as np
import matplotlib.pyplot as plt
import scipy.optimize as sci
import pandas as pd
from pathlib import Path
import uuid
from IPython.display import display, HTML


def disp(df):
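    # render the full DataFrame as an HTML table in the notebook output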
    display(HTML(df.to_html()))

Load our raw CSV data into a DataFrame. The raw data should be located in ./raw-data/.

file_name = input("What is the name of the CSV file? ")
file_path = "./raw-data/" + file_name
df = pd.read_csv(file_path)
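If the file name is mistyped, pd.read_csv fails with a bare error deep in the call stack. A minimal guard, sketched here as an optional alternative to the cell above (the error message is our own):

# optional: confirm the file exists before handing it to pandas
raw_path = Path("./raw-data") / file_name
if not raw_path.is_file():
    raise FileNotFoundError(f"{raw_path} not found; place the CSV in ./raw-data/")
df = pd.read_csv(raw_path)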

Prune duplicate emails from the dataset. A repeated email is an easy way to tell that someone submitted multiple responses; we keep only their most recent one.

col_name = (
    "(Optional) Provide your email for a chance to win a $20.00 Tim Horton's gift card"
)

# keep the last submission per email; drop_duplicates returns a new frame,
# so the result must be assigned back. Rows with no email are all kept,
# since blank entries are not repeat submissions.
keep = df[col_name].isna() | ~df.duplicated(subset=[col_name], keep="last")
df = df[keep]
[DataFrame preview: timestamped responses with wait-time, locker-usage, and age-range answers, plus the optional email column; 2706 rows × 10 columns]
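To confirm the pruning worked, a one-line assertion (our own sanity check) verifies that no supplied email now appears more than once:

# every non-blank email should now be unique
assert not df[col_name].dropna().duplicated().any()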

Collect the emails and store them separately from the survey data.

emails = pd.DataFrame(df[col_name])

# rename columns for sanity
emails.rename(
    columns={col_name: "email"},
    inplace=True,
)
# drop duplicates (assign back, since drop_duplicates returns a new frame)
emails = emails.drop_duplicates()

# drop blanks
emails = emails.dropna()

# shuffle the emails so they are in random order
emails = emails.sample(frac=1).reset_index(drop=True)
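Before writing the list out, two assertions (our addition) document that it holds only unique, non-blank addresses:

# the reward list must contain unique, non-blank emails only
assert emails["email"].notna().all()
assert emails["email"].is_unique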

# write file to .csv
email_output_file_name = "emails_for_reward.csv"
filepath = Path("./output/" + email_output_file_name)
filepath.parent.mkdir(parents=True, exist_ok=True)
emails.to_csv(filepath)
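A quick read-back (our own spot check) confirms the file landed where expected:

# spot-check the saved reward list
saved = pd.read_csv(filepath, index_col=0)
print(f"wrote {len(saved)} emails to {filepath}")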

With our emails saved, we can now scrub the data and clean it up for analysis.

# shuffle the rows and re-index
df = df.sample(frac=1).reset_index(drop=True)

# drop columns containing traceable information
df = df.drop(
    columns=[
        col_name,
        "Email Address",
        "Timestamp",
        "How often do you usually visit pharmacies in a given year",
        "Think about the last time you went to the pharmacy. How would you best describe your experience?",
    ]
)
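As a guard against a renamed or overlooked column, we can scan the remaining headers for anything email-related; this defensive check is our own addition:

# fail loudly if any remaining column name still mentions email
leftover = [c for c in df.columns if "email" in c.lower()]
assert not leftover, f"identifying columns remain: {leftover}"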


# add a random UUID identifier as the first column of each row
df.insert(0, "id", [uuid.uuid4() for _ in range(len(df))])

Save the scrubbed data to ./output/ as patient-survey-data.csv.

# write file to .csv
data_file_name = "patient-survey-data.csv"
filepath = Path("./output/" + data_file_name)
filepath.parent.mkdir(parents=True, exist_ok=True)
df.to_csv(filepath)
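As a final check, we can re-read the scrubbed file and scan every cell for an email-shaped string. This is our own sketch; the regex is deliberately loose:

import re

# scan the scrubbed file for anything that still looks like an email
scrubbed = pd.read_csv(filepath, dtype=str)
email_like = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")
found = scrubbed.apply(lambda col: col.str.contains(email_like, na=False)).any().any()
assert not found, "email-like strings remain in the scrubbed data"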