Install and load necessary libraries for this example

library(regstudies)
library(tidyverse)
library(lubridate)

Example dataset

The regstudies package provides useful tools for modifying and filtering the register data sets. Typically in register studies the data consists of events, which are stored in data as time stamps. In our example dataset the time stamps are date of hospital visits. Also, in register studies the data are commonly studied with respect to index date, which is a date of interest that can vary by individuals.

Our example data is simulated data that represents a postal questionnaire survey study which has been extended to register data. There are two data sets, the sample_cohort is holding the study group, and sample_regdata holds the hospital visits for the study group members.

More precisely, the sample_cohort lists the individual id study numbers as id for whom the register data has been collected. It also list the postingdate holding the date of submitting the postal questionnaire.

The sample_regdata contains also variable id which is necessary to be able to link the data sets. The sample_regdata also contains variables holding the diagnosis codes at CODE1 and times of hospital admission adm_date and hospital discharge disc_date.

summary(sample_cohort)
##     personid        gender       postingdate        
##  Min.   :1101   Min.   :1.000   Min.   :2000-01-01  
##  1st Qu.:1201   1st Qu.:1.000   1st Qu.:2000-01-01  
##  Median :1300   Median :2.000   Median :2000-01-01  
##  Mean   :1300   Mean   :1.525   Mean   :2000-01-01  
##  3rd Qu.:1400   3rd Qu.:2.000   3rd Qu.:2000-01-01  
##  Max.   :1500   Max.   :2.000   Max.   :2000-01-01

Dataset sample_regdata is a dataset which has events for the selected population with ICD codes and date-variables:

summary(sample_regdata)
##     personid       CODE1              adm_date            disc_date         
##  Min.   :1101   Length:10000       Min.   :1990-01-01   Min.   :1990-01-18  
##  1st Qu.:1212   Class :character   1st Qu.:1993-12-18   1st Qu.:1994-03-21  
##  Median :1305   Mode  :character   Median :1998-01-12   Median :1998-04-30  
##  Mean   :1305                      Mean   :1998-01-04   Mean   :1998-04-04  
##  3rd Qu.:1400                      3rd Qu.:2001-12-27   3rd Qu.:2002-03-25  
##  Max.   :1500                      Max.   :2005-12-30   Max.   :2006-06-24  
##      icd           
##  Length:10000      
##  Class :character  
##  Mode  :character  
##                    
##                    
## 

Preparation for classifications

We use two sample datasets sample_regdata and sample_cohort, which are included in the package. Joining these two datasets we got a full dataset ready for analyzes:

# Join register data with population cohort data
d <- sample_regdata %>%
  left_join(sample_cohort)
## Joining, by = "personid"
head(d)
## # A tibble: 6 x 7
##   personid CODE1 adm_date   disc_date  icd   gender postingdate
##      <int> <chr> <date>     <date>     <chr>  <dbl> <date>     
## 1     1101 196   1990-03-28 1990-05-20 icd9       2 2000-01-01 
## 2     1101 I26   1990-04-25 1990-10-01 icd9       2 2000-01-01 
## 3     1101 E222  1990-06-28 1990-08-07 icd9       2 2000-01-01 
## 4     1101 39891 1990-09-19 1991-03-05 icd9       2 2000-01-01 
## 5     1101 G10   1990-11-24 1991-01-04 icd9       2 2000-01-01 
## 6     1101 I278  1991-11-01 1992-01-16 icd9       2 2000-01-01

Classification of codes can handle multiple code types on one occasion, but user needs to tell us which code is which type. This functionality allows user to classify data what contains both ICD-codes and pharmaceutical codes at one data. For example when using multiple versions of ICD-codes, we create a new variable called icd:

## create a new help variable `icd` which determines if code is ICD-8, ICD-9 or ICD-10 by year
d <- d %>%
  mutate(icd = case_when(
    year(disc_date) < 1987 ~ "icd8",
    year(disc_date) < 1996 & year(disc_date)>=1987 ~ "icd9",
    year(disc_date) >= 1996 ~ "icd10"
    ))

Elixhauser

With function classify_elixhauser() and sum_score() you can use Elixhauser classification and calculate scores. Check more info of classifications from classification tables-section.

regstudies:::elixhauser_classes
## # A tibble: 31 x 6
##    class_elixhauser label_elixhauser  icd9    icd10  score_AHRQ score_van_Walra…
##    <chr>            <chr>             <chr>   <chr>       <dbl>            <dbl>
##  1 chf              Congestive Heart… ^39891… ^I099…          9                7
##  2 carit            Cardiac arrhytmi… ^4260|… ^I441…          0                5
##  3 valv             Valvular disease  ^0932|… ^A520…          0               -1
##  4 pcd              Pulmonary circul… ^4150|… ^I26|…          6                4
##  5 pvd              Peripheral vascu… ^0930|… ^I70|…          3                2
##  6 hypunc           Hypertension, un… ^401    ^I10           -1                0
##  7 hypc             Hypertension, co… ^402|^… ^I11|…         -1                0
##  8 para             Paralysis         ^3341|… ^G041…          5                7
##  9 ond              Other neurologic… ^3319|… ^G10|…          5                6
## 10 cpd              Chronic pulmonar… ^4168|… ^I278…          3                3
## # … with 21 more rows
d %>%
  # filter(icd == "icd10") %>% 
  classify_elixhauser(icd_codes = CODE1)
## Joining, by = "label_elixhauser"
## # A tibble: 10,470 x 11
##    personid CODE1 adm_date   disc_date  icd   gender postingdate
##       <int> <chr> <date>     <date>     <chr>  <dbl> <date>     
##  1     1101 196   1990-03-28 1990-05-20 icd9       2 2000-01-01 
##  2     1101 I26   1990-04-25 1990-10-01 icd9       2 2000-01-01 
##  3     1101 E222  1990-06-28 1990-08-07 icd9       2 2000-01-01 
##  4     1101 39891 1990-09-19 1991-03-05 icd9       2 2000-01-01 
##  5     1101 G10   1990-11-24 1991-01-04 icd9       2 2000-01-01 
##  6     1101 I278  1991-11-01 1992-01-16 icd9       2 2000-01-01 
##  7     1101 2409  1991-12-19 1992-05-27 icd9       2 2000-01-01 
##  8     1101 F20   1992-02-14 1992-07-29 icd9       2 2000-01-01 
##  9     1101 F10   1992-02-19 1992-03-15 icd9       2 2000-01-01 
## 10     1101 4150  1992-03-23 1992-08-13 icd9       2 2000-01-01 
## # … with 10,460 more rows, and 4 more variables: class_elixhauser <chr>,
## #   label_elixhauser <chr>, score_AHRQ <dbl>, score_van_Walraven <dbl>

When using function argument verbose = TRUE, message “Element ‘regex.rm’ NOT in use. Exceptions omitted.” is normal when ……..

To calculate elixhauser scores for each individual you just need to add sum_score function. The premade classification table contains two alternative scores to be used score_AHRQ and score_van_Walraven both of which can be calculated at one function call.

# Calculate scores
d %>%
  classify_elixhauser(icd_codes = CODE1) %>%
  sum_score(score_AHRQ, score_van_Walraven)
## Joining, by = "label_elixhauser"
## # A tibble: 300 x 3
##    personid `sum(score_AHRQ)` `sum(score_van_Walraven)`
##       <int>             <dbl>                     <dbl>
##  1     1101                55                        51
##  2     1103                25                        21
##  3     1104                37                        41
##  4     1105                21                        22
##  5     1107                22                        35
##  6     1110                43                        32
##  7     1112                25                        39
##  8     1114                 0                        12
##  9     1116                 6                        13
## 10     1118                19                        29
## # … with 290 more rows

Function classifies all the rows which has classification code in regstudies:::elixhauser_classes. If the is no classification on certain ICD code, row will get NA values.

Historgram of sums…..

Charlson comorbidity

Classifying Charlson comorbidity classes give as output long formatted data as wel Elixhauser classification:

## Classifying Charlson comorbidity classes to long format
charlson <- d %>%
  classify_charlson(CODE1)
## Joining, by = "label_charlson"
head(charlson)
## # A tibble: 6 x 10
##   personid CODE1 adm_date   disc_date  icd   gender postingdate class_charlson
##      <int> <chr> <date>     <date>     <chr>  <dbl> <date>      <chr>         
## 1     1101 196   1990-03-28 1990-05-20 icd9       2 2000-01-01  metacanc      
## 2     1101 I26   1990-04-25 1990-10-01 icd9       2 2000-01-01  <NA>          
## 3     1101 E222  1990-06-28 1990-08-07 icd9       2 2000-01-01  <NA>          
## 4     1101 39891 1990-09-19 1991-03-05 icd9       2 2000-01-01  chf           
## 5     1101 G10   1990-11-24 1991-01-04 icd9       2 2000-01-01  <NA>          
## 6     1101 I278  1991-11-01 1992-01-16 icd9       2 2000-01-01  <NA>          
## # … with 2 more variables: label_charlson <chr>, score_charlson <dbl>

Long data to wide data

For example in modelling you might want to have dataset in wide format. To modify dataset to wide format, use dplyr::pivot_wider() verb:

charlson %>%
  mutate(class_charlson=paste0("cl_",class_charlson)) %>% # TODO: laita tämä ominaisuus classify-funktioon sisälle!
  mutate(score_charlson=as.integer(score_charlson>0)) %>%
  tidyr::pivot_wider(names_from="class_charlson",
                     values_from="score_charlson",
                     values_fill=0) %>%
  select(-all_of(c("cl_NA","label_charlson"))) -> charlson_wide
head(charlson_wide)
## # A tibble: 6 x 19
##   personid CODE1 adm_date   disc_date  icd   gender postingdate cl_metacanc
##      <int> <chr> <date>     <date>     <chr>  <dbl> <date>            <int>
## 1     1101 196   1990-03-28 1990-05-20 icd9       2 2000-01-01            1
## 2     1101 I26   1990-04-25 1990-10-01 icd9       2 2000-01-01            0
## 3     1101 E222  1990-06-28 1990-08-07 icd9       2 2000-01-01            0
## 4     1101 39891 1990-09-19 1991-03-05 icd9       2 2000-01-01            0
## 5     1101 G10   1990-11-24 1991-01-04 icd9       2 2000-01-01            0
## 6     1101 I278  1991-11-01 1992-01-16 icd9       2 2000-01-01            0
## # … with 11 more variables: cl_chf <int>, cl_hp <int>, cl_canc <int>,
## #   cl_pvd <int>, cl_pud <int>, cl_rend <int>, cl_mld <int>, cl_diab <int>,
## #   cl_diabwc <int>, cl_aids <int>, cl_copd <int>

Extra: labeling wide data

You can also add labels to variables in wide format data with code below

## calculating labels
regstudies:::charlson_classes %>%
  select(class_charlson,label_charlson) %>%
  mutate(class_charlson=paste0("cl_",class_charlson)) -> labels

# setting up the labels for wide data
for(i in 1:dim(labels)[1]) {
  l <- labels$class_charlson[i]
  if(!is.null(charlson_wide[[l]])) {
    attr(charlson_wide[[l]], "label") <- labels$label_charlson[i]
  }
}