Install and load necessary libraries for this example
We need to also create a new help variable icd
which determines if code is ICD-9, ICD-10 or some other format by year:
d <- left_join(sample_cohort,sample_regdata)
## Joining, by = "personid"
The regstudies package also provides useful tools for modifying and filtering the register data sets. Typically in register studies the data consists of events, which are stored in data as time stamps. In our example data set the time stamps are date of hospital visits. Also, in register studies the data are commonly studied with respect to index date, which is a date of interest that can vary by individuals.
Our example data is simulated data that represents a postal questionnaire survey study which has been extended to register data. There are two data sets, the sample_cohort
is holding the study group, and sample_regdata
holds the hospital visits for the study group members.
More precisely, the sample_cohort
lists the individual id study numbers as id
for whom the register data has been collected. It also list the postingdate
holding the date of submitting the postal questionnaire. The sample_regdata
contains also variable id
which is necessary to be able to link the data sets. The sample_regdata
also contains variables holding the diagnosis codes at CODE1
and times of hospital admission adm_date
and hospital discharge disc_date
.
head(sample_cohort)
## # A tibble: 6 x 3
## personid gender postingdate
## <int> <dbl> <date>
## 1 1101 2 2000-01-01
## 2 1102 1 2000-01-01
## 3 1103 1 2000-01-01
## 4 1104 1 2000-01-01
## 5 1105 2 2000-01-01
## 6 1106 2 2000-01-01
head(sample_regdata)
## # A tibble: 6 x 5
## personid CODE1 adm_date disc_date icd
## <int> <chr> <date> <date> <chr>
## 1 1101 196 1990-03-28 1990-05-20 icd9
## 2 1101 I26 1990-04-25 1990-10-01 icd9
## 3 1101 E222 1990-06-28 1990-08-07 icd9
## 4 1101 39891 1990-09-19 1991-03-05 icd9
## 5 1101 G10 1990-11-24 1991-01-04 icd9
## 6 1101 I278 1991-11-01 1992-01-16 icd9
Filter dataset with either adm_date
or disc_date
within two years around postingdate
filtered_d <- d %>% regstudies::filter_date_in_ival(date_var=adm_date, index_date=postingdate, time_before=years(2), time_after=days(0) ) head(filtered_d)
## # A tibble: 6 x 7
## personid gender postingdate CODE1 adm_date disc_date icd
## <int> <dbl> <date> <chr> <date> <date> <chr>
## 1 1101 2 2000-01-01 2409 1999-04-30 1999-08-25 icd10
## 2 1101 2 2000-01-01 K257 1999-12-01 2000-04-27 icd10
## 3 1103 1 2000-01-01 2801 1998-11-23 1999-01-19 icd10
## 4 1103 1 2000-01-01 4168 1999-11-16 1999-12-05 icd10
## 5 1104 1 2000-01-01 2780 1998-10-25 1998-11-03 icd10
## 6 1104 1 2000-01-01 2780 1998-12-26 1999-05-09 icd10
Filter data set with occurrences of hospital visits. If person has been in hospital during the time period two years prior postingdate
then those events are included.
filtered_d <- d %>% filter_ival_olap_ival(ival_var=lubridate::interval(adm_date,disc_date), index_date = postingdate, time_before=years(2), time_after=days(0)) head(filtered_d)
## # A tibble: 6 x 7
## personid gender postingdate CODE1 adm_date disc_date icd
## <int> <dbl> <date> <chr> <date> <date> <chr>
## 1 1101 2 2000-01-01 2409 1999-04-30 1999-08-25 icd10
## 2 1101 2 2000-01-01 K257 1999-12-01 2000-04-27 icd10
## 3 1103 1 2000-01-01 2409 1997-10-02 1998-02-25 icd10
## 4 1103 1 2000-01-01 2652 1997-12-13 1998-06-03 icd10
## 5 1103 1 2000-01-01 2801 1998-11-23 1999-01-19 icd10
## 6 1103 1 2000-01-01 4168 1999-11-16 1999-12-05 icd10
Lets say that we want to find first occasions of disease codes such that we want to obtain first event for each individual in each class. We will use Elixhauser classes in this example.
elixh_d <- d %>% classify_elixhauser(CODE1)
## Joining, by = "label_elixhauser"
head(elixh_d)
## # A tibble: 6 x 11
## personid gender postingdate CODE1 adm_date disc_date icd class_elixhauser
## <int> <dbl> <date> <chr> <date> <date> <chr> <chr>
## 1 1101 2 2000-01-01 196 1990-03-28 1990-05-20 icd9 metacanc
## 2 1101 2 2000-01-01 I26 1990-04-25 1990-10-01 icd9 <NA>
## 3 1101 2 2000-01-01 E222 1990-06-28 1990-08-07 icd9 <NA>
## 4 1101 2 2000-01-01 39891 1990-09-19 1991-03-05 icd9 chf
## 5 1101 2 2000-01-01 G10 1990-11-24 1991-01-04 icd9 <NA>
## 6 1101 2 2000-01-01 I278 1991-11-01 1992-01-16 icd9 <NA>
## # … with 3 more variables: label_elixhauser <chr>, score_AHRQ <dbl>,
## # score_van_Walraven <dbl>
library(regstudies) first_events_d <- elixh_d %>% filter(!is.na(class_elixhauser)) %>% group_by(personid,class_elixhauser) %>% mutate(min_date=min(adm_date,disc_date)) %>% filter_date_first(min_date) first_events_d
## # A tibble: 4,110 x 12
## # Groups: personid, class_elixhauser [4,110]
## personid gender postingdate CODE1 adm_date disc_date icd
## <int> <dbl> <date> <chr> <date> <date> <chr>
## 1 1101 2 2000-01-01 39891 1990-09-19 1991-03-05 icd9
## 2 1101 2 2000-01-01 2409 1991-12-19 1992-05-27 icd9
## 3 1101 2 2000-01-01 C81 1995-10-04 1996-01-16 icd10
## 4 1101 2 2000-01-01 196 1990-03-28 1990-05-20 icd9
## 5 1101 2 2000-01-01 3341 1993-07-21 1993-11-13 icd9
## 6 1101 2 2000-01-01 3341 1993-07-21 1993-11-13 icd9
## 7 1101 2 2000-01-01 4150 1992-03-23 1992-08-13 icd9
## 8 1101 2 2000-01-01 K257 1999-12-01 2000-04-27 icd10
## 9 1101 2 2000-01-01 0930 1994-12-18 1995-05-26 icd9
## 10 1101 2 2000-01-01 140 1994-06-02 1994-08-17 icd9
## # … with 4,100 more rows, and 5 more variables: class_elixhauser <chr>,
## # label_elixhauser <chr>, score_AHRQ <dbl>, score_van_Walraven <dbl>,
## # min_date <date>
In register studies a situation often occurs where the left_join
creates missing data to to data. This can happen for example when joining sample_cohort
to sample_regdata
as left_join(sample_cohort,sample_regdata)
when there are individuals in cohort who do not hold any records in the register data set. In sample data there are 100 such individuals. Usually this is just a nuisance and we want to fill those values with 0 or false or what ever suits our use case.
For this problem there is function left_join_replace_na
for which user can give a list of default values what he or she wants to use.
d<-left_join(sample_cohort,sample_regdata) %>% arrange(personid)
## Joining, by = "personid"
summary(d)
## personid gender postingdate CODE1
## Min. :1101 Min. :1.000 Min. :2000-01-01 Length:10100
## 1st Qu.:1212 1st Qu.:1.000 1st Qu.:2000-01-01 Class :character
## Median :1305 Median :2.000 Median :2000-01-01 Mode :character
## Mean :1305 Mean :1.539 Mean :2000-01-01
## 3rd Qu.:1400 3rd Qu.:2.000 3rd Qu.:2000-01-01
## Max. :1500 Max. :2.000 Max. :2000-01-01
##
## adm_date disc_date icd
## Min. :1990-01-01 Min. :1990-01-18 Length:10100
## 1st Qu.:1993-12-18 1st Qu.:1994-03-21 Class :character
## Median :1998-01-12 Median :1998-04-30 Mode :character
## Mean :1998-01-04 Mean :1998-04-04
## 3rd Qu.:2001-12-27 3rd Qu.:2002-03-25
## Max. :2005-12-30 Max. :2006-06-24
## NA's :100 NA's :100
This time the missing values are filled based on the defaults entered to na_replace_list
.
na_replace_list <- list( "logical"=FALSE, "character"="", "integer"=0L, "double"=0.0, "numeric"=0, "Date"=lubridate::dmy("01-01-1900") ) d_filled<-left_join_replace_na(sample_cohort,sample_regdata,na_replace_list)
## Joining, by = "personid"
summary(d_filled)
## personid gender postingdate CODE1
## Min. :1101 Min. :1.000 Min. :10957 Length:10100
## 1st Qu.:1212 1st Qu.:1.000 1st Qu.:10957 Class :character
## Median :1305 Median :2.000 Median :10957 Mode :character
## Mean :1305 Mean :1.539 Mean :10957
## 3rd Qu.:1400 3rd Qu.:2.000 3rd Qu.:10957
## Max. :1500 Max. :2.000 Max. :10957
## adm_date disc_date icd
## Min. :-25567 Min. :-25567 Length:10100
## 1st Qu.: 8709 1st Qu.: 8802 Class :character
## Median : 10212 Median : 10310 Mode :character
## Mean : 9876 Mean : 9965
## 3rd Qu.: 11667 3rd Qu.: 11754
## Max. : 13147 Max. : 13323
Internally the filling of missing values with default values is executed by another function replace_na_by_type
which is also available for users.
## Give list of default fill values overridding NA d_filled <- d %>% mutate_all(replace_na_by_type,na_replace_list) summary(d_filled)
## personid gender postingdate CODE1
## Min. :1101 Min. :1.000 Min. :10957 Length:10100
## 1st Qu.:1212 1st Qu.:1.000 1st Qu.:10957 Class :character
## Median :1305 Median :2.000 Median :10957 Mode :character
## Mean :1305 Mean :1.539 Mean :10957
## 3rd Qu.:1400 3rd Qu.:2.000 3rd Qu.:10957
## Max. :1500 Max. :2.000 Max. :10957
## adm_date disc_date icd
## Min. :-25567 Min. :-25567 Length:10100
## 1st Qu.: 8709 1st Qu.: 8802 Class :character
## Median : 10212 Median : 10310 Mode :character
## Mean : 9876 Mean : 9965
## 3rd Qu.: 11667 3rd Qu.: 11754
## Max. : 13147 Max. : 13323