Install and load necessary libraries for this example

library(regstudies)
library(tidyverse)
library(vroom)
library(lubridate)

We need to also create a new help variable icd which determines if code is ICD-9, ICD-10 or some other format by year:

d <- left_join(sample_cohort,sample_regdata)
## Joining, by = "personid"

Data wrangling for register data sets

Example data set

The regstudies package also provides useful tools for modifying and filtering the register data sets. Typically in register studies the data consists of events, which are stored in data as time stamps. In our example data set the time stamps are date of hospital visits. Also, in register studies the data are commonly studied with respect to index date, which is a date of interest that can vary by individuals.

Our example data is simulated data that represents a postal questionnaire survey study which has been extended to register data. There are two data sets, the sample_cohort is holding the study group, and sample_regdata holds the hospital visits for the study group members.

More precisely, the sample_cohort lists the individual id study numbers as id for whom the register data has been collected. It also list the postingdate holding the date of submitting the postal questionnaire. The sample_regdata contains also variable id which is necessary to be able to link the data sets. The sample_regdata also contains variables holding the diagnosis codes at CODE1 and times of hospital admission adm_date and hospital discharge disc_date.

head(sample_cohort)
## # A tibble: 6 x 3
##   personid gender postingdate
##      <int>  <dbl> <date>     
## 1     1101      2 2000-01-01 
## 2     1102      1 2000-01-01 
## 3     1103      1 2000-01-01 
## 4     1104      1 2000-01-01 
## 5     1105      2 2000-01-01 
## 6     1106      2 2000-01-01
head(sample_regdata)
## # A tibble: 6 x 5
##   personid CODE1 adm_date   disc_date  icd  
##      <int> <chr> <date>     <date>     <chr>
## 1     1101 196   1990-03-28 1990-05-20 icd9 
## 2     1101 I26   1990-04-25 1990-10-01 icd9 
## 3     1101 E222  1990-06-28 1990-08-07 icd9 
## 4     1101 39891 1990-09-19 1991-03-05 icd9 
## 5     1101 G10   1990-11-24 1991-01-04 icd9 
## 6     1101 I278  1991-11-01 1992-01-16 icd9

Filtering

Filtering by dates

Filter dataset with either adm_date or disc_date within two years around postingdate

filtered_d <- d %>%
  regstudies::filter_date_in_ival(date_var=adm_date,
                                  index_date=postingdate,
                                  time_before=years(2),
                                  time_after=days(0)
  )
head(filtered_d)
## # A tibble: 6 x 7
##   personid gender postingdate CODE1 adm_date   disc_date  icd  
##      <int>  <dbl> <date>      <chr> <date>     <date>     <chr>
## 1     1101      2 2000-01-01  2409  1999-04-30 1999-08-25 icd10
## 2     1101      2 2000-01-01  K257  1999-12-01 2000-04-27 icd10
## 3     1103      1 2000-01-01  2801  1998-11-23 1999-01-19 icd10
## 4     1103      1 2000-01-01  4168  1999-11-16 1999-12-05 icd10
## 5     1104      1 2000-01-01  2780  1998-10-25 1998-11-03 icd10
## 6     1104      1 2000-01-01  2780  1998-12-26 1999-05-09 icd10

Filtering by date intervals

Filter data set with occurrences of hospital visits. If person has been in hospital during the time period two years prior postingdate then those events are included.

filtered_d <- d %>%
  filter_ival_olap_ival(ival_var=lubridate::interval(adm_date,disc_date),
                        index_date = postingdate,
                        time_before=years(2),
                        time_after=days(0))
head(filtered_d)
## # A tibble: 6 x 7
##   personid gender postingdate CODE1 adm_date   disc_date  icd  
##      <int>  <dbl> <date>      <chr> <date>     <date>     <chr>
## 1     1101      2 2000-01-01  2409  1999-04-30 1999-08-25 icd10
## 2     1101      2 2000-01-01  K257  1999-12-01 2000-04-27 icd10
## 3     1103      1 2000-01-01  2409  1997-10-02 1998-02-25 icd10
## 4     1103      1 2000-01-01  2652  1997-12-13 1998-06-03 icd10
## 5     1103      1 2000-01-01  2801  1998-11-23 1999-01-19 icd10
## 6     1103      1 2000-01-01  4168  1999-11-16 1999-12-05 icd10

Complex filtering

Lets say that we want to find first occasions of disease codes such that we want to obtain first event for each individual in each class. We will use Elixhauser classes in this example.

elixh_d <- d %>%
  classify_elixhauser(CODE1)
## Joining, by = "label_elixhauser"
head(elixh_d)
## # A tibble: 6 x 11
##   personid gender postingdate CODE1 adm_date   disc_date  icd   class_elixhauser
##      <int>  <dbl> <date>      <chr> <date>     <date>     <chr> <chr>           
## 1     1101      2 2000-01-01  196   1990-03-28 1990-05-20 icd9  metacanc        
## 2     1101      2 2000-01-01  I26   1990-04-25 1990-10-01 icd9  <NA>            
## 3     1101      2 2000-01-01  E222  1990-06-28 1990-08-07 icd9  <NA>            
## 4     1101      2 2000-01-01  39891 1990-09-19 1991-03-05 icd9  chf             
## 5     1101      2 2000-01-01  G10   1990-11-24 1991-01-04 icd9  <NA>            
## 6     1101      2 2000-01-01  I278  1991-11-01 1992-01-16 icd9  <NA>            
## # … with 3 more variables: label_elixhauser <chr>, score_AHRQ <dbl>,
## #   score_van_Walraven <dbl>
library(regstudies)
first_events_d <- elixh_d %>%
  filter(!is.na(class_elixhauser)) %>%
  group_by(personid,class_elixhauser) %>%
  mutate(min_date=min(adm_date,disc_date)) %>%
  filter_date_first(min_date)
first_events_d
## # A tibble: 4,110 x 12
## # Groups:   personid, class_elixhauser [4,110]
##    personid gender postingdate CODE1 adm_date   disc_date  icd  
##       <int>  <dbl> <date>      <chr> <date>     <date>     <chr>
##  1     1101      2 2000-01-01  39891 1990-09-19 1991-03-05 icd9 
##  2     1101      2 2000-01-01  2409  1991-12-19 1992-05-27 icd9 
##  3     1101      2 2000-01-01  C81   1995-10-04 1996-01-16 icd10
##  4     1101      2 2000-01-01  196   1990-03-28 1990-05-20 icd9 
##  5     1101      2 2000-01-01  3341  1993-07-21 1993-11-13 icd9 
##  6     1101      2 2000-01-01  3341  1993-07-21 1993-11-13 icd9 
##  7     1101      2 2000-01-01  4150  1992-03-23 1992-08-13 icd9 
##  8     1101      2 2000-01-01  K257  1999-12-01 2000-04-27 icd10
##  9     1101      2 2000-01-01  0930  1994-12-18 1995-05-26 icd9 
## 10     1101      2 2000-01-01  140   1994-06-02 1994-08-17 icd9 
## # … with 4,100 more rows, and 5 more variables: class_elixhauser <chr>,
## #   label_elixhauser <chr>, score_AHRQ <dbl>, score_van_Walraven <dbl>,
## #   min_date <date>

Left join with filling missing values

In register studies a situation often occurs where the left_join creates missing data to to data. This can happen for example when joining sample_cohort to sample_regdata as left_join(sample_cohort,sample_regdata) when there are individuals in cohort who do not hold any records in the register data set. In sample data there are 100 such individuals. Usually this is just a nuisance and we want to fill those values with 0 or false or what ever suits our use case.

For this problem there is function left_join_replace_na for which user can give a list of default values what he or she wants to use.

d<-left_join(sample_cohort,sample_regdata) %>%
  arrange(personid)
## Joining, by = "personid"
##     personid        gender       postingdate            CODE1          
##  Min.   :1101   Min.   :1.000   Min.   :2000-01-01   Length:10100      
##  1st Qu.:1212   1st Qu.:1.000   1st Qu.:2000-01-01   Class :character  
##  Median :1305   Median :2.000   Median :2000-01-01   Mode  :character  
##  Mean   :1305   Mean   :1.539   Mean   :2000-01-01                     
##  3rd Qu.:1400   3rd Qu.:2.000   3rd Qu.:2000-01-01                     
##  Max.   :1500   Max.   :2.000   Max.   :2000-01-01                     
##                                                                        
##     adm_date            disc_date              icd           
##  Min.   :1990-01-01   Min.   :1990-01-18   Length:10100      
##  1st Qu.:1993-12-18   1st Qu.:1994-03-21   Class :character  
##  Median :1998-01-12   Median :1998-04-30   Mode  :character  
##  Mean   :1998-01-04   Mean   :1998-04-04                     
##  3rd Qu.:2001-12-27   3rd Qu.:2002-03-25                     
##  Max.   :2005-12-30   Max.   :2006-06-24                     
##  NA's   :100          NA's   :100

This time the missing values are filled based on the defaults entered to na_replace_list.

na_replace_list <- list(
 "logical"=FALSE,
 "character"="",
 "integer"=0L,
 "double"=0.0,
 "numeric"=0,
 "Date"=lubridate::dmy("01-01-1900")
)
d_filled<-left_join_replace_na(sample_cohort,sample_regdata,na_replace_list)
## Joining, by = "personid"
summary(d_filled)
##     personid        gender       postingdate       CODE1          
##  Min.   :1101   Min.   :1.000   Min.   :10957   Length:10100      
##  1st Qu.:1212   1st Qu.:1.000   1st Qu.:10957   Class :character  
##  Median :1305   Median :2.000   Median :10957   Mode  :character  
##  Mean   :1305   Mean   :1.539   Mean   :10957                     
##  3rd Qu.:1400   3rd Qu.:2.000   3rd Qu.:10957                     
##  Max.   :1500   Max.   :2.000   Max.   :10957                     
##     adm_date        disc_date          icd           
##  Min.   :-25567   Min.   :-25567   Length:10100      
##  1st Qu.:  8709   1st Qu.:  8802   Class :character  
##  Median : 10212   Median : 10310   Mode  :character  
##  Mean   :  9876   Mean   :  9965                     
##  3rd Qu.: 11667   3rd Qu.: 11754                     
##  Max.   : 13147   Max.   : 13323

Internally the filling of missing values with default values is executed by another function replace_na_by_type which is also available for users.

## Give list of default fill values overridding NA
d_filled <- d %>%
 mutate_all(replace_na_by_type,na_replace_list)
summary(d_filled)
##     personid        gender       postingdate       CODE1          
##  Min.   :1101   Min.   :1.000   Min.   :10957   Length:10100      
##  1st Qu.:1212   1st Qu.:1.000   1st Qu.:10957   Class :character  
##  Median :1305   Median :2.000   Median :10957   Mode  :character  
##  Mean   :1305   Mean   :1.539   Mean   :10957                     
##  3rd Qu.:1400   3rd Qu.:2.000   3rd Qu.:10957                     
##  Max.   :1500   Max.   :2.000   Max.   :10957                     
##     adm_date        disc_date          icd           
##  Min.   :-25567   Min.   :-25567   Length:10100      
##  1st Qu.:  8709   1st Qu.:  8802   Class :character  
##  Median : 10212   Median : 10310   Mode  :character  
##  Mean   :  9876   Mean   :  9965                     
##  3rd Qu.: 11667   3rd Qu.: 11754                     
##  Max.   : 13147   Max.   : 13323