Data Management Plan
I. Types of Data
The data created by this project will be extensive, ranging from self-reported Likert scales to high-resolution time series data. The full list of data types is as follows:
Electronic Health Record data will be stored in a relational database and include:
- Diagnoses (ICD10 codes)
- Tests & results
- Dates and Times of visits
- Visit Type (inpatient, outpatient, ED)
- Resource utilization statistics (medication usage, refill requests, charges, length of stay)
- Patient-provider interaction from patient portal (message times and dates)
Data from wearables
- Heart rate
- Sleep quality statistics
Data from phone
- Relative Location
- Self-reported standardized Likert scales (e.g., mood)
- De-identified vector representations of text from text messages and other social networking tools
- Patient-provider interaction using Phone App (message times and dates)
Data from off-body devices
- Air quality
- Movement (from passive infrared sensors)
Social Networking Data
- Size and complexity of the network
- Word and n-tuple frequency
- Word transition frequency
- Other vector space representations of social networking text
- Patient satisfaction data (from online reviews on social media)
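As an illustration of the word, n-tuple, and word transition frequencies listed above, the following is a minimal sketch only. The whitespace tokenizer, the function names, and the sample sentence are assumptions made for the example and are not part of the project's pipeline.

```python
from collections import Counter

def tokenize(text):
    """Lowercase and split on whitespace (a stand-in for a real tokenizer)."""
    return text.lower().split()

def ngram_counts(tokens, n):
    """Count contiguous n-tuples of tokens."""
    return Counter(zip(*(tokens[i:] for i in range(n))))

def transition_frequencies(tokens):
    """Estimate P(next word | current word) from adjacent-pair counts."""
    pair_counts = ngram_counts(tokens, 2)
    first_counts = Counter(tokens[:-1])  # how often each word has a successor
    return {(a, b): c / first_counts[a] for (a, b), c in pair_counts.items()}

# Hypothetical sample text, for illustration only
tokens = tokenize("feeling fine today feeling tired today")
word_freq = Counter(tokens)                   # word frequency
bigram_freq = ngram_counts(tokens, 2)         # n-tuple frequency with n = 2
transitions = transition_frequencies(tokens)  # word transition frequency
```

Only the aggregate counts are kept in this sketch; whether such counts are sufficient for de-identification depends on vocabulary size and context and would need to be assessed separately.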
II. Data and Metadata Standards
EHR data will be stored in CSV files and imported into a PostgreSQL relational database. A standard HL7-compliant schema will be used to represent the data, borrowing from the FHIR standard used to access the data. UMLS or subsumed equivalent ontologies (such as SNOMED, RxNorm and LOINC) will be used to represent each data type.
Time series and event data will be stored in a standard WFDB-compliant format (see www.physionet.org). In general the data are represented as signed, column-wise 16-bit integers. A baseline offset and gain in units per bit fully define the representation. An ASCII header file associated with each binary file describes the precise formatting and contents and can be transparently read over the web using the open source WFDB libraries. These libraries are able to accurately and consistently represent time series data with varying sample rates, uneven sampling, missing data and varying dimensions.
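The role of the baseline offset and gain can be sketched as follows. This is not the WFDB library itself. It assumes little-endian 16-bit samples interleaved across signals, and it assumes the WFDB header convention in which gain is given in digital units per physical unit, so that physical value = (sample - baseline) / gain. If the gain is instead expressed as physical units per bit, the division becomes a multiplication. The gain, baseline, and sample values are illustrative.

```python
import struct

def decode_frames(raw, n_signals):
    """Unpack little-endian signed 16-bit samples into one list per signal."""
    n_samples = len(raw) // 2
    values = struct.unpack("<%dh" % n_samples, raw[: n_samples * 2])
    # Samples are assumed interleaved: s0[0], s1[0], s0[1], s1[1], ...
    return [list(values[i::n_signals]) for i in range(n_signals)]

def to_physical(digital, gain, baseline):
    """Convert digital samples to physical units: (sample - baseline) / gain."""
    return [(d - baseline) / gain for d in digital]

# Two interleaved signals, three frames (illustrative values)
raw = struct.pack("<6h", 1024, -200, 1224, 0, 824, 200)
sig0, sig1 = decode_frames(raw, n_signals=2)
physical0 = to_physical(sig0, gain=200.0, baseline=1024)
```

In practice the gain, baseline, and number of signals would be read from the ASCII header file rather than hard-coded.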
Image data, where appropriate for storage, will be stored as compressed JPEGs (JPEG2000), JPEG-LS (lossless), or processed into a de-identified representation.
We will use HTML and PDF formats to publish the results and training materials online.
III. Policies for access and sharing and provisions for appropriate protection/privacy
Data will be stored both in the High Performance Computing facility at Georgia Tech and on Amazon S3. All data will be de-identified at source, and the key to map the data back to the individual will be retained by the clinical research teams.
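One way to picture de-identification at source with a separately retained key is the sketch below. It is an assumption-laden illustration, not the project's procedure: the field names (patient_id, study_id), the use of random hexadecimal study IDs, and the sample records are all hypothetical. It replaces only a direct identifier; dates, locations, free text, and other quasi-identifiers would need their own handling.

```python
import secrets

def deidentify(records, id_field="patient_id"):
    """Replace the identifier in each record with a random study ID.

    Returns (deidentified_records, key). The key maps study ID -> original ID
    and is meant to stay with the clinical research team, never with the data.
    """
    study_ids = {}  # original ID -> study ID
    out = []
    for rec in records:
        original = rec[id_field]
        if original not in study_ids:
            study_ids[original] = secrets.token_hex(8)  # random, not derived from the ID
        clean = {k: v for k, v in rec.items() if k != id_field}
        clean["study_id"] = study_ids[original]
        out.append(clean)
    key = {sid: orig for orig, sid in study_ids.items()}
    return out, key

# Hypothetical records, for illustration only
records = [
    {"patient_id": "MRN001", "heart_rate": 72},
    {"patient_id": "MRN002", "heart_rate": 80},
    {"patient_id": "MRN001", "heart_rate": 75},
]
shared, key = deidentify(records)
```

Because the study IDs are random rather than hashes of the original identifier, they cannot be reversed without the key.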
De-identified data will be shared either through direct application for access through the Hub, or via the NIH Commons model (https://datascience.nih.gov/commons).
A public website will be constructed by the research engineer, and results will be disseminated through this site. Access may also be requested through the site or through the Hub. There will be no charge for accessing data, although compute time, if requested, will be charged at a market rate by the data host.
There are no ethical or privacy issues, and the data are not ‘personal data’ in terms of the Data Protection Act 1998 (the DPA) or the equivalent HIPAA requirements, because the data will be de-identified at source. Appropriate IRB approval will be sought at study sites, and consent will be obtained through the phone app, an online form, or an in-person study representative. The data are not copyrighted and no licenses pertain to them.
IV. Policies and Provisions for Re-use, Re-distribution
There will be no permission restriction placed on the data. Health and human behavior researchers are the most likely consumers of this data. The intended or foreseeable uses / users of the data would be those seeking to improve health outcomes. Therefore, there are no reasons not to share or re-use data.
V. Plans for Archiving and Preservation of Access
We have a dual plan for archiving and preserving data. (1) Multiple public data storage facilities exist, including PhysioNet and the NIH Commons. (2) The use of a commercial partner (Amazon) creates the potential for a sustainable model for long-term storage. Commercial partners are likely to store the data if they can charge for compute time (and possibly data egress if large volumes of data are needed by a user). There are no transformations necessary to prepare the data for preservation or data sharing.
VI. Roles and Responsibilities
The PI will be responsible for all data management and policy enforcement. In the event of departure of the PI, data curation will be passed to PhysioNet, since the PI has a long-standing relationship with that resource.