Chapter 3 Data transformation

3.1 Overview

Below is an overview of the columns needed for this project, before cleaning. First, Date needs to be converted to class Date in the format “Year-Month-Day”. Second, Event Descriptor will be split into two new variables, so that each variable carries a single piece of information. Finally, all character variables will be converted to factors.

## tibble [881,617 x 14] (S3: tbl_df/tbl/data.frame)
##  $ Year                       : num [1:881617] 2019 2019 2019 2019 2019 ...
##  $ Crash Descriptor           : chr [1:881617] "Property Damage Accident" "Property Damage Accident" "Property Damage Accident" "Property Damage & Injury Accident" ...
##  $ Time                       : 'hms' num [1:881617] 17:14:00 22:08:00 14:50:00 22:50:00 ...
##   ..- attr(*, "units")= chr "secs"
##  $ Date                       : chr [1:881617] "12/31/2019" "12/31/2019" "12/31/2019" "12/31/2019" ...
##  $ Day of Week                : chr [1:881617] "Tuesday" "Tuesday" "Tuesday" "Tuesday" ...
##  $ Lighting Conditions        : chr [1:881617] "Dark-Road Unlighted" "Dark-Road Unlighted" "Daylight" "Dark-Road Lighted" ...
##  $ Collision Type Descriptor  : chr [1:881617] "OTHER" "SIDESWIPE" "OVERTAKING" "RIGHT ANGLE" ...
##  $ County Name                : chr [1:881617] "YATES" "JEFFERSON" "NASSAU" "SUFFOLK" ...
##  $ Road Descriptor            : chr [1:881617] "Straight and Level" "Straight and Level" "Straight and Level" "Straight and Level" ...
##  $ Weather Conditions         : chr [1:881617] "Snow" "Snow" "Cloudy" "Rain" ...
##  $ Traffic Control Device     : chr [1:881617] "None" "None" "Traffic Signal" "Stop Sign" ...
##  $ Road Surface Conditions    : chr [1:881617] "Wet" "Snow/Ice" "Dry" "Wet" ...
##  $ Event Descriptor           : chr [1:881617] "Deer" "Other Motor Vehicle, Collision With" "Other Motor Vehicle, Collision With" "Other Motor Vehicle, Collision With" ...
##  $ Number of Vehicles Involved: num [1:881617] 1 2 2 2 2 1 2 1 2 2 ...

3.2 Change Format of Variable Date

Here is the view of Date before the transformation.

## [1] "12/31/2019"
## [1] "character"

Here is the view of Date after the transformation.

## [1] "2019-12-31"
## [1] "Date"
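
A minimal sketch of this conversion in base R, assuming the raw strings are in month/day/year order as in the output above. The data frame name `crashes` and the toy rows are hypothetical stand-ins for the full table.

```r
# Toy data standing in for the crash table; only the Date column matters here.
# `crashes` is a hypothetical name, not taken from the original analysis.
crashes <- data.frame(
  Date = c("12/31/2019", "01/05/2019"),
  stringsAsFactors = FALSE
)

# Parse month/day/year strings into objects of class Date.
crashes$Date <- as.Date(crashes$Date, format = "%m/%d/%Y")

print(crashes$Date[1])      # [1] "2019-12-31"
print(class(crashes$Date))  # [1] "Date"
```

Passing `format` explicitly matters here: without it, `as.Date()` would try its default formats and fail or misread strings written as month/day/year.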

3.3 Extract the crash cause from Event Descriptor

3.3.1 Collision_Type

The current levels of Event Descriptor are listed below. To generalize them, a new variable Collision_Type is introduced to indicate the high-level collision type: “Collision With Fixed Object”, “Collision With subject” (shortened to “Collision With” in the table), and “Non-Collision”.

To generate the new variable Collision_Type, two changes to Event Descriptor are needed:
1. “Deer” can be included in “Animal, Collision With”.
2. “Other Pedestrian” can be included in “Other Pedestrian, Collision With”.

##  [1] "Animal, Collision With"                                      
##  [2] "Barrier, Collision With Fixed Object"                        
##  [3] "Bicyclist, Collision With"                                   
##  [4] "Bridge Structure, Collision With Fixed Object"               
##  [5] "Building/Wall, Collision With Fixed Object"                  
##  [6] "Crash Cushion, Collision With Fixed Object"                  
##  [7] "Culver/Head Wall, Collision With Fixed Object"               
##  [8] "Curbing, Collision With Fixed Object"                        
##  [9] "Deer"                                                        
## [10] "Earth Embankment/Rock Cut/Ditch, Collision With Fixed Object"
## [11] "Fence, Collision With Fixed Object"                          
## [12] "Fire Hydrant, Collision With Fixed Object"                   
## [13] "Fire/Explosion, Non-Collision"                               
## [14] "Guide Rail - End, Collision With Fixed Object"               
## [15] "Guide Rail - Not At End, Collision With Fixed Object"        
## [16] "In-Line Skater, Collision With"                              
## [17] "Light Support/Utility Pole, Collision With Fixed Object"     
## [18] "Median - End, Collision With Fixed Object"                   
## [19] "Median - Not At End, Collision With Fixed Object"            
## [20] "Other Fixed Object*, Collision With Fixed Object"            
## [21] "Other Motor Vehicle, Collision With"                         
## [22] "Other Object (Not Fixed)*, Collision With"                   
## [23] "Other Pedestrian"                                            
## [24] "Other*, Non-Collision"                                       
## [25] "Overturned, Non-Collision"                                   
## [26] "Pedestrian, Collision With"                                  
## [27] "Railroad Train, Collision With"                              
## [28] "Ran Off Roadway Only, Non-Collision"                         
## [29] "Sign Post, Collision With Fixed Object"                      
## [30] "Snow Embankment, Collision With Fixed Object"                
## [31] "Submersion, Non-Collision"                                   
## [32] "Tree, Collision With Fixed Object"

After cleaning and extracting keywords from Event Descriptor, here are the levels for Collision_Type:

## [1] " Collision With"              " Collision With Fixed Object"
## [3] " Non-Collision"
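
The recoding and extraction can be sketched in base R as follows. This is only an illustration under the assumption that every cleaned value has the shape "detail, type" with a single comma; the vector names `event` and `collision_type` are hypothetical, and the original analysis may have used a different approach. Note that the levels printed above keep a leading space, whereas this sketch removes it with `trimws()`.

```r
# A few example values of Event Descriptor (hypothetical sample).
event <- c("Deer",
           "Other Pedestrian",
           "Tree, Collision With Fixed Object",
           "Overturned, Non-Collision",
           "Other Motor Vehicle, Collision With")

# Apply the two recodes so that every value has the "<detail>, <type>" shape.
event[event == "Deer"] <- "Animal, Collision With"
event[event == "Other Pedestrian"] <- "Other Pedestrian, Collision With"

# Drop everything up to and including the comma; what remains is the
# high-level type. trimws() strips the space that followed the comma.
collision_type <- trimws(sub("^[^,]*,", "", event))

print(unique(collision_type))
```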

3.3.2 Collision_Detail

Besides the general collision type, Event Descriptor also gives the collision details, which will be stored in the new variable Collision_Detail. The following shows the collision details before data cleaning.

##  [1] "Animal"                          "Barrier"                        
##  [3] "Bicyclist"                       "Bridge Structure"               
##  [5] "Building/Wall"                   "Crash Cushion"                  
##  [7] "Culver/Head Wall"                "Curbing"                        
##  [9] "Earth Embankment/Rock Cut/Ditch" "Fence"                          
## [11] "Fire Hydrant"                    "Fire/Explosion"                 
## [13] "Guide Rail - End"                "Guide Rail - Not At End"        
## [15] "In-Line Skater"                  "Light Support/Utility Pole"     
## [17] "Median - End"                    "Median - Not At End"            
## [19] "Other Fixed Object*"             "Other Motor Vehicle"            
## [21] "Other Object (Not Fixed)*"       "Other Pedestrian"               
## [23] "Other*"                          "Overturned"                     
## [25] "Pedestrian"                      "Railroad Train"                 
## [27] "Ran Off Roadway Only"            "Sign Post"                      
## [29] "Snow Embankment"                 "Submersion"                     
## [31] "Tree"

Among the above categories, “Other Pedestrian” and “Pedestrian” represent the same collision type, so they can be combined into a single class, “Pedestrian”. The following shows the final 30 levels of Collision_Detail.

##  [1] "Animal"                          "Barrier"                        
##  [3] "Bicyclist"                       "Bridge Structure"               
##  [5] "Building/Wall"                   "Crash Cushion"                  
##  [7] "Culver/Head Wall"                "Curbing"                        
##  [9] "Earth Embankment/Rock Cut/Ditch" "Fence"                          
## [11] "Fire Hydrant"                    "Fire/Explosion"                 
## [13] "Guide Rail - End"                "Guide Rail - Not At End"        
## [15] "In-Line Skater"                  "Light Support/Utility Pole"     
## [17] "Median - End"                    "Median - Not At End"            
## [19] "Other Fixed Object*"             "Other Motor Vehicle"            
## [21] "Other Object (Not Fixed)*"       "Other*"                         
## [23] "Overturned"                      "Pedestrian"                     
## [25] "Railroad Train"                  "Ran Off Roadway Only"           
## [27] "Sign Post"                       "Snow Embankment"                
## [29] "Submersion"                      "Tree"
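
One way to sketch the detail extraction and the merge of the two pedestrian classes in base R is shown below. The sample values and the names `event` and `collision_detail` are hypothetical; the sketch assumes the detail is the text before the comma.

```r
# Example values of Event Descriptor after cleaning (hypothetical sample).
event <- c("Animal, Collision With",
           "Other Pedestrian, Collision With",
           "Pedestrian, Collision With",
           "Tree, Collision With Fixed Object")

# Keep the text before the comma as the collision detail.
collision_detail <- trimws(sub(",.*$", "", event))

# Fold "Other Pedestrian" into "Pedestrian" so both share one class.
collision_detail[collision_detail == "Other Pedestrian"] <- "Pedestrian"

print(sort(unique(collision_detail)))  # "Animal" "Pedestrian" "Tree"
```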

3.4 Convert to Factor Level

Finally, all character variables are converted to factors for further data analysis. Here is the table summary:

## tibble [881,617 x 16] (S3: tbl_df/tbl/data.frame)
##  $ Year                       : num [1:881617] 2019 2019 2019 2019 2019 ...
##  $ Crash Descriptor           : Factor w/ 4 levels "Fatal Accident",..: 4 4 4 3 3 4 4 3 4 4 ...
##  $ Time                       : 'hms' num [1:881617] 17:14:00 22:08:00 14:50:00 22:50:00 ...
##   ..- attr(*, "units")= chr "secs"
##  $ Date                       : Date[1:881617], format: "2019-12-31" "2019-12-31" ...
##  $ Day of Week                : Factor w/ 7 levels "Friday","Monday",..: 6 6 6 6 6 6 6 6 6 6 ...
##  $ Lighting Conditions        : Factor w/ 6 levels "Dark-Road Lighted",..: 2 2 4 1 4 4 4 2 1 4 ...
##  $ Collision Type Descriptor  : Factor w/ 11 levels "HEAD ON","LEFT TURN (0)",..: 4 10 5 7 6 4 6 4 7 6 ...
##  $ County Name                : Factor w/ 63 levels "ALBANY","ALLEGANY",..: 63 23 30 52 36 32 32 61 55 8 ...
##  $ Road Descriptor            : Factor w/ 7 levels "Curve and Grade",..: 5 5 5 5 5 5 5 1 4 5 ...
##  $ Weather Conditions         : Factor w/ 8 levels "Clear","Cloudy",..: 7 7 2 5 1 7 2 5 1 2 ...
##  $ Traffic Control Device     : Factor w/ 19 levels "Construction Work Area",..: 5 5 16 14 4 5 16 4 14 4 ...
##  $ Road Surface Conditions    : Factor w/ 8 levels "Dry","Flooded Water",..: 8 6 1 8 1 6 8 8 1 1 ...
##  $ Event Descriptor           : Factor w/ 31 levels "Animal, Collision With",..: 1 20 20 20 20 14 20 13 20 20 ...
##  $ Number of Vehicles Involved: num [1:881617] 1 2 2 2 2 1 2 1 2 2 ...
##  $ Collision_Type             : Factor w/ 3 levels " Collision With",..: 1 1 1 1 1 2 1 2 1 1 ...
##  $ Collision_Detail           : Factor w/ 30 levels "Animal","Barrier",..: 1 20 20 20 20 14 20 13 20 20 ...
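
A minimal base R sketch of the factor conversion, assuming that only columns of type character should change and that numeric columns such as Year stay as they are. The toy data frame `crashes` is a hypothetical stand-in for the full table.

```r
# Toy table with one numeric and two character columns (hypothetical data).
crashes <- data.frame(
  Year = c(2019, 2019),
  `Day of Week` = c("Tuesday", "Monday"),
  `Weather Conditions` = c("Snow", "Cloudy"),
  check.names = FALSE,       # keep the column names with spaces intact
  stringsAsFactors = FALSE
)

# Find the character columns and convert only those to factors.
is_chr <- vapply(crashes, is.character, logical(1))
crashes[is_chr] <- lapply(crashes[is_chr], factor)

str(crashes)
```

With dplyr loaded, `mutate(across(where(is.character), as.factor))` is a comparable idiom; the base R version is used here only to keep the sketch free of package dependencies.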