Stringr 4 ways

Four approaches to feature engineering with regular expressions in R

Shannon Pileggi
12-11-2018

Update May 22, 2019: Thanks to @hglanz for noting that I could have used pull instead of the . as a placeholder.


library(tidyverse) # general use
library(titanic)   # to get titanic data set

Overview

The Name variable in the titanic data set has all unique values. To get started, let’s visually inspect a few values.


titanic_train %>% 
  select(Name) %>% 
  head(10)

                                                  Name
1                              Braund, Mr. Owen Harris
2  Cumings, Mrs. John Bradley (Florence Briggs Thayer)
3                               Heikkinen, Miss. Laina
4         Futrelle, Mrs. Jacques Heath (Lily May Peel)
5                             Allen, Mr. William Henry
6                                     Moran, Mr. James
7                              McCarthy, Mr. Timothy J
8                       Palsson, Master. Gosta Leonard
9    Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)
10                 Nasser, Mrs. Nicholas (Adele Achem)

In this brief print out, each passenger’s title is consistently located between a comma and a period. Luckily, this holds true for all observations! With this consistency, we can somewhat easily extract a title from Name. Depending on your field, this operation may be referred to as data preparation or feature engineering.
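One way to check this kind of consistency, rather than eyeballing it, is to test every name against a pattern that requires a comma, a space, and then a period. The sketch below does this on three names copied from the print out above; the object names sample_names and has_title and the pattern itself are my own choices for illustration. On the full data you would pass titanic_train[["Name"]] instead.

```r
library(stringr)

# A few names copied from the print out above (illustrative sample only)
sample_names <- c(
  "Braund, Mr. Owen Harris",
  "Heikkinen, Miss. Laina",
  "Palsson, Master. Gosta Leonard"
)

# TRUE when a name has text, then ", ", then more text, then a period
has_title <- str_detect(sample_names, "^[^,]+, [^.]+\\.")

# TRUE only if every name in the sample follows the pattern
all(has_title)
```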

TL; DR

Approach   Function(s)              Regular expression(s)
1          str_locate + str_sub     "," + "\\."
2          str_match                "(.*)(, )(.*)(\\.)(.*)"
3          str_extract + str_sub    "([A-z]+)\\."
4          str_replace_all          "(.*, )|(\\..*)"

Read on for explanations!
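As a quick preview, the four rows of the table can be run side by side on a single name. This is only a sketch on one string taken from the print out above; the object names (x, title_1 through title_4) are placeholders I made up for the comparison.

```r
library(stringr)

x <- "Braund, Mr. Owen Harris"

# Approach 1: locate the comma and the period, then cut out what is between
comma_pos  <- str_locate(x, ",")[, 1]
period_pos <- str_locate(x, "\\.")[, 1]
title_1 <- str_sub(x, comma_pos + 2, period_pos - 1)

# Approach 2: the title is the third group, stored in the fourth column
title_2 <- str_match(x, "(.*)(, )(.*)(\\.)(.*)")[, 4]

# Approach 3: extract letters followed by a period, then drop the period
title_3 <- str_sub(str_extract(x, "([A-z]+)\\."), end = -2)

# Approach 4: delete everything that is not the title
title_4 <- str_replace_all(x, "(.*, )|(\\..*)", "")

c(title_1, title_2, title_3, title_4)
```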

Overview of regular expressions

You can think of regular expressions as a key to unlock string patterns: they describe the text you want to find inside character strings. The trick is to identify the right regular expression + function combination. Let’s demo four ways to tackle the challenge using functions from the stringr package; each method specifies a different string pattern to match.


library(stringr)

Extracting title from name

First approach

ICYMI, the double bracket in titanic_train[["Name"]] is used to extract a named variable vector from a data frame. Compared with the more commonly used dollar sign (i.e., titanic_train$Name), it matches the variable name exactly rather than partially, and it also accepts a name stored in a character variable. Now onward.
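To make the difference between the double bracket and the dollar sign concrete, here is a small sketch on a toy data frame. The objects df and col are made up for illustration and merely stand in for titanic_train; the behavior shown is for a plain data.frame.

```r
# Toy data frame standing in for titanic_train (illustration only)
df <- data.frame(
  Name = c("Braund, Mr. Owen Harris", "Heikkinen, Miss. Laina"),
  stringsAsFactors = FALSE
)

# With [[ ]] the column name can live in a variable
col <- "Name"
from_variable <- df[[col]]

# [[ ]] matches names exactly, so an abbreviated name returns NULL
exact_lookup <- df[["Na"]]

# $ allows partial matching on a plain data frame, so "Na" finds "Name"
# (depending on your options, R may warn when this happens)
partial_lookup <- suppressWarnings(df$Na)
```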

The str_locate function returns the starting and ending position of the first match of a specified pattern. If we consider the comma to be a pattern, we can figure out where it is located in each name. Here, the starting and ending values are the same because the comma is only one character.


titanic_train[["Name"]] %>% 
  str_locate(",") %>%
  head()

     start end
[1,]     7   7
[2,]     8   8
[3,]    10  10
[4,]     9   9
[5,]     6   6
[6,]     6   6
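For contrast, a pattern longer than one character gives different start and end values. The sketch below uses the first name from the data and a two-character pattern (comma plus space) that I picked purely for illustration.

```r
library(stringr)

x <- "Braund, Mr. Owen Harris"

# One-character pattern: start and end coincide (both 7)
one_char <- str_locate(x, ",")

# Two-character pattern (comma plus space): start is 7, end is 8
two_char <- str_locate(x, ", ")

two_char
```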

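Knowing this, we can identify the positions of the comma and the period and then extract the text in between. Some notes here: the period is a special character in regular expressions (unescaped, it matches any character), so it is escaped with a double backslash to match a literal period; and .[,1] keeps only the first column (start) of the matrix that str_locate returns, with the . standing in for the object being piped.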


comma_pos <- titanic_train[["Name"]] %>% 
  str_locate(",") %>% 
  .[,1]

period_pos <- titanic_train[["Name"]] %>% 
  str_locate("\\.") %>% 
  .[,1]
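To see why the period needs escaping, compare the escaped and unescaped patterns on a single name. This is a small sketch; fixed() is a stringr helper that treats the pattern as literal text, offered here as an alternative to escaping rather than as what the code above uses.

```r
library(stringr)

x <- "Braund, Mr. Owen Harris"

# Unescaped, "." matches any character, so the first match is position 1
unescaped <- str_locate(x, ".")[[1, "start"]]

# Escaped, "\\." matches a literal period, found at position 11
escaped <- str_locate(x, "\\.")[[1, "start"]]

# fixed() treats the pattern as plain text, so no escaping is needed
literal <- str_locate(x, fixed("."))[[1, "start"]]

c(unescaped = unescaped, escaped = escaped, literal = literal)
```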

Now we can use str_sub to extract substrings from the character vector based on their physical position. To exclude the punctuation and white space, we can add two to the comma position and subtract one from the period position to get the title only.


titanic_train[["Name"]] %>% 
  str_sub(comma_pos + 2, period_pos - 1) %>% 
  head()

[1] "Mr"   "Mrs"  "Miss" "Mrs"  "Mr"   "Mr"  
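If you expect to reuse this approach, the steps could be wrapped in a small function. The helper below is hypothetical (extract_title_locate is a name I made up, not part of stringr) and assumes every name contains a comma followed by a space and, later, a period.

```r
library(stringr)

# Hypothetical helper (not part of stringr): approach 1 as one function
extract_title_locate <- function(name) {
  comma_pos  <- str_locate(name, ",")[, "start"]
  period_pos <- str_locate(name, "\\.")[, "start"]
  # +2 skips the comma and the space; -1 stops just before the period
  str_sub(name, comma_pos + 2, period_pos - 1)
}

# For the first name: comma at 7, period at 11, so characters 9 to 10
titles <- extract_title_locate(
  c("Braund, Mr. Owen Harris", "Heikkinen, Miss. Laina")
)
titles
```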

Super!

Second approach

The str_match function returns a character matrix: the first column holds the complete match, and each remaining column holds one of the groups in the specified pattern. Here’s a quick example:


# ----------5 groups-->>>----1---2---3----4---5----                      
str_match("XXX, YYY. ZZZ", "(.*)(, )(.*)(\\.)(.*)")

     [,1]            [,2]  [,3] [,4]  [,5] [,6]  
[1,] "XXX, YYY. ZZZ" "XXX" ", " "YYY" "."  " ZZZ"

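Let’s break down this regular expression pattern. Each pair of parentheses defines a group. The first group, (.*), matches any run of characters before the comma; the second, (, ), matches the comma and the space; the third, (.*), matches the title; the fourth, (\\.), matches a literal period; and the fifth, (.*), matches whatever follows.

To execute this, we’ll grab the 4th column to catch our title: the 1st column holds the complete match, so the 3rd group lands in the 4th column.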


titanic_train[["Name"]] %>% 
  str_match("(.*)(, )(.*)(\\.)(.*)") %>%
  .[,4] %>% 
  head()

[1] "Mr"   "Mrs"  "Miss" "Mrs"  "Mr"   "Mr"  
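One thing to keep in mind with this pattern is that .* is greedy: it matches as much as it can. If a name contained a second period, the third group would run up to the last one. The sketch below shows this on a made-up name (not taken from the data) and compares it with the lazy quantifier .*?, which stops at the first period instead.

```r
library(stringr)

# Made-up name with two periods (for illustration only)
x <- "Doe, Dr. J. Smith"

# Greedy third group: runs up to the last period
greedy <- str_match(x, "(.*)(, )(.*)(\\.)(.*)")[, 4]

# Lazy third group (.*?): stops at the first period
lazy <- str_match(x, "(.*)(, )(.*?)(\\.)(.*)")[, 4]

c(greedy = greedy, lazy = lazy)
```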

All right, we got it again!

Third approach

Next, let’s use the str_extract function, which returns the first part of each string that matches the pattern. This seems like what we wanted to do all along!


titanic_train[["Name"]] %>% 
  str_extract("([A-z]+)\\.") %>%
  head()

[1] "Mr."   "Mrs."  "Miss." "Mrs."  "Mr."   "Mr."  

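Let’s break down this regular expression: [A-z]+ matches one or more consecutive characters in the range from A to z, and \\. requires a literal period immediately after them. The parentheses form a group, although str_extract returns the whole match regardless. One caution: the range A-z also covers the few punctuation characters that sit between Z and a in ASCII (such as [ and _), so [A-Za-z] is the stricter choice when you want letters only.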

This pattern is a bit more sophisticated to compose than the previous ones, but it gets right to the point! This result does end in a period, whereas the others do not. If we wanted to remove the period for consistency, we could use str_sub with end = -2, which keeps everything up to the second-to-last character.


titanic_train[["Name"]] %>% 
  str_extract("([A-z]+)\\.") %>%
  str_sub(end = -2) %>%
  head()

[1] "Mr"   "Mrs"  "Miss" "Mrs"  "Mr"   "Mr"  
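Two side notes on this approach, sketched below. First, a lookahead, (?=\\.), can require the period without including it in the match, which would make the str_sub step unnecessary. Second, because the pattern only matches letters, a multi-word title such as the Countess (which appears in the frequency table later in this post) is shortened to its last word, unlike with the fourth approach. The second name is typed in by hand for illustration and is only assumed to resemble how such a title appears in the data.

```r
library(stringr)

x <- c(
  "Braund, Mr. Owen Harris",
  "Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)"
)

# Lookahead: match letters only if a period follows, without keeping the period
with_lookahead <- str_extract(x, "[A-Za-z]+(?=\\.)")

# The replace-based pattern on the same names, for comparison
with_replace <- str_replace_all(x, "(.*, )|(\\..*)", "")

with_lookahead
with_replace
```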

Fourth approach

As a last approach, we can use str_replace_all to replace every match of a pattern with an empty string. Here, we specify the pattern and then the replacement string.


titanic_train[["Name"]] %>% 
  str_replace_all("(.*, )|(\\..*)", "") %>%
  head()

[1] "Mr"   "Mrs"  "Miss" "Mrs"  "Mr"   "Mr"  

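In this regular expression, the | means "or" and separates two patterns. The first, (.*, ), matches everything up to and including the comma and the space that follows it; the second, (\\..*), matches the period and everything after it. Replacing both matches with an empty string leaves only the title.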
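To see what each half of this pattern removes, it can help to apply the two alternatives one at a time. The sketch below does so on a single name; the object names are placeholders of mine, and str_remove_all is a stringr shorthand for replacing matches with an empty string.

```r
library(stringr)

x <- "Braund, Mr. Owen Harris"

# First alternative alone: removes everything through the comma and space
left_removed <- str_replace_all(x, "(.*, )", "")

# Second alternative alone: removes the period and everything after it
right_removed <- str_replace_all(x, "(\\..*)", "")

# Both alternatives joined with |: only the title is left
both_removed <- str_replace_all(x, "(.*, )|(\\..*)", "")

# str_remove_all(x, pattern) is shorthand for replacing with ""
same_result <- str_remove_all(x, "(.*, )|(\\..*)")

c(left_removed, right_removed, both_removed, same_result)
```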

Re-classifying entries

Now that we figured out how to extract the title, I’ll utilize the last method and assign title as a variable to the titanic_train data set using the mutate function.


titanic_train <- titanic_train %>%
  mutate(title = str_replace_all(Name, "(.*, )|(\\..*)", ""))

Now let’s use count to get a frequency table of the titles, with the sort = TRUE option to arrange the results in descending order.


titanic_train %>%
  count(title, sort = TRUE)

          title   n
1            Mr 517
2          Miss 182
3           Mrs 125
4        Master  40
5            Dr   7
6           Rev   6
7           Col   2
8         Major   2
9          Mlle   2
10         Capt   1
11          Don   1
12     Jonkheer   1
13         Lady   1
14          Mme   1
15           Ms   1
16          Sir   1
17 the Countess   1

We can see that there are several infrequent titles occurring only one or two times, and so we should re-classify them. If you want to squeeze the most juice out of your data, try to figure out the historical context and meaning of those titles to create a better classification for them. For now, let’s take the easy way out by just re-classifying them to an Other group.

Fortunately, the forcats package has an awesome function that lets us do this quickly: fct_lump. We’re using mutate again to re-classify title. The fct_lump function combines the least frequent values together in an Other group, and the n = 6 option keeps the 6 most common values (so the 7th value is Other).


titanic_train %>%
  mutate(title = fct_lump(title, n = 6)) %>%
  count(title, sort = TRUE)

   title   n
1     Mr 517
2   Miss 182
3    Mrs 125
4 Master  40
5  Other  14
6     Dr   7
7    Rev   6
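The lumping behavior is easy to see on a tiny example. The vector below is made up for illustration (it is not the titanic data), and n = 2 is an arbitrary choice.

```r
library(forcats)

# Toy title vector (made up): 5 Mr, 3 Miss, 2 Mrs, 1 Dr, 1 Rev
toy_titles <- c(rep("Mr", 5), rep("Miss", 3), rep("Mrs", 2), "Dr", "Rev")

# Keep the 2 most common titles; everything else becomes "Other"
lumped <- fct_lump(toy_titles, n = 2)

# Mr and Miss keep their counts; Mrs, Dr, and Rev are pooled into Other
table(lumped)
```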

If you wanted to explicitly re-code the infrequent titles to something more meaningful than Other, look into fct_recode.
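As a rough sketch of what that could look like: fct_recode takes pairs of the form new level = "old level". The mapping below (Mlle and Ms to Miss, Mme to Mrs) is only one possible choice and is my assumption, not a recommendation from the data; the toy vector is made up for illustration.

```r
library(forcats)

# Toy vector of titles (made up for illustration)
toy_titles <- c("Mr", "Mlle", "Mme", "Ms", "Miss", "Mrs")

# Syntax is new_level = "old_level"
# ASSUMPTION: treating Mlle/Ms as Miss and Mme as Mrs is one possible mapping
recoded <- fct_recode(toy_titles, Miss = "Mlle", Miss = "Ms", Mrs = "Mme")

as.character(recoded)
```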

Super, now title is ready to use for analysis!

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".

Citation

For attribution, please cite this work as

Pileggi (2018, Dec. 11). PIPING HOT DATA: Stringr 4 ways. Retrieved from https://www.pipinghotdata.com/posts/2018-12-11-stringr-4-ways/

BibTeX citation

@misc{pileggi2018stringr,
  author = {Pileggi, Shannon},
  title = {PIPING HOT DATA: Stringr 4 ways},
  url = {https://www.pipinghotdata.com/posts/2018-12-11-stringr-4-ways/},
  year = {2018}
}