Mapping branches - cleaning data
November 18, 2017
R dataI received some interesting data on branch locations from my counterpart on the Personal & Commercial Banking Strategic Business Analysis team. Now, the actual data I received is confidential data that the BMO Market Analytics team put together so I won’t be showing that here, but I found an open source intel option from Payments Canada that I can use instead.
Data
The dataset itself looks to be a very comprehensive list of all bank branches from what looks like every bank that operates in Canada… so lots of bonus points there as a data source. Unfortunately, the data is only available as a PDF so getting it into a usable format will either be very manual (e.g. cutting/pasting in Excel), or quite a bit of code will be dedicated to processing the data.
Here’s a snapshot of the data from the PDF:
Notwithstanding that the data file is a PDF, it is also not exactly the most machine readable data. The bulk of this post will be on getting this data into a clean, machine readable format.
First step is loading this PDF into R:
library(textreadr)
branch.pdf <- read_pdf("../../data/branches/mbrbnksn_2.pdf")
head(branch.pdf, 20) %>% knitr::kable() %>%
kable_styling(bootstrap_options = c("striped", "hover"))
page_id | element_id | text |
---|---|---|
1 | 1 | SECTION I NUMERIC LIST / LISTE NUMÉRIQUE 1 |
1 | 2 | ADS CANADIAN BANK 343 |
1 | 3 | Routing Numbers / |
1 | 4 | Numéros d’acheminement |
1 | 5 | Electronic Paper(MICR) |
1 | 6 | Électronique Papier(MICR) Postal Address - Addresse postale |
1 | 7 | 034300012 00012-343 20 Queen Street West - Suite 2600, Toronto, ON M5H 3R3 |
1 | 8 | 034300092 00092-343 20 Queen Street West - Suite 2600, Toronto, ON M5H 3R3 |
1 | 9 | AMEX BANK OF CANADA 303 |
1 | 10 | Routing Numbers / |
1 | 11 | Numéros d’acheminement |
1 | 12 | Electronic Paper(MICR) |
1 | 13 | Électronique Papier(MICR) Postal Address - Addresse postale |
1 | 14 | 030300012 00012-303 101 McNabb Street, Markham, ON L3R 4H8 |
1 | 15 | 030300022 00022-303 101 McNabb Street, Markham, ON L3R 4H8 |
1 | 16 | 030300032 00032-303 101 McNabb Street, 101 McNabb Street, Markham, ON L3R 4H8 |
1 | 17 | BANK OF AMERICA NATIONAL ASSOCIATION 241 |
1 | 18 | Routing Numbers / |
1 | 19 | Numéros d’acheminement |
1 | 20 | Electronic Paper(MICR) |
I like the read_pdf
function from the package textreadr
rather than the main pdftools
package because it places the text in a data frame. The next step would be to isolate the routing numbers:
branches.route <- str_extract(branch.pdf$text, "[0-9]{9}") %>%
data.frame(Routing = .)
branches.clean <- cbind(branch.pdf, branches.route)
head(branches.clean, 20) %>% knitr::kable() %>%
kable_styling(bootstrap_options = c("striped", "hover"))
page_id | element_id | text | Routing |
---|---|---|---|
1 | 1 | SECTION I NUMERIC LIST / LISTE NUMÉRIQUE 1 | NA |
1 | 2 | ADS CANADIAN BANK 343 | NA |
1 | 3 | Routing Numbers / | NA |
1 | 4 | Numéros d’acheminement | NA |
1 | 5 | Electronic Paper(MICR) | NA |
1 | 6 | Électronique Papier(MICR) Postal Address - Addresse postale | NA |
1 | 7 | 034300012 00012-343 20 Queen Street West - Suite 2600, Toronto, ON M5H 3R3 | 034300012 |
1 | 8 | 034300092 00092-343 20 Queen Street West - Suite 2600, Toronto, ON M5H 3R3 | 034300092 |
1 | 9 | AMEX BANK OF CANADA 303 | NA |
1 | 10 | Routing Numbers / | NA |
1 | 11 | Numéros d’acheminement | NA |
1 | 12 | Electronic Paper(MICR) | NA |
1 | 13 | Électronique Papier(MICR) Postal Address - Addresse postale | NA |
1 | 14 | 030300012 00012-303 101 McNabb Street, Markham, ON L3R 4H8 | 030300012 |
1 | 15 | 030300022 00022-303 101 McNabb Street, Markham, ON L3R 4H8 | 030300022 |
1 | 16 | 030300032 00032-303 101 McNabb Street, 101 McNabb Street, Markham, ON L3R 4H8 | 030300032 |
1 | 17 | BANK OF AMERICA NATIONAL ASSOCIATION 241 | NA |
1 | 18 | Routing Numbers / | NA |
1 | 19 | Numéros d’acheminement | NA |
1 | 20 | Electronic Paper(MICR) | NA |
From the routing numbers, we can extract the bank identification numbers:
branches.clean$Bank.nbr <- substr(branches.clean$Routing, 1, 4)
head(branches.clean, 20) %>% knitr::kable() %>%
kable_styling(bootstrap_options = c("striped", "hover"))
page_id | element_id | text | Routing | Bank.nbr |
---|---|---|---|---|
1 | 1 | SECTION I NUMERIC LIST / LISTE NUMÉRIQUE 1 | NA | NA |
1 | 2 | ADS CANADIAN BANK 343 | NA | NA |
1 | 3 | Routing Numbers / | NA | NA |
1 | 4 | Numéros d’acheminement | NA | NA |
1 | 5 | Electronic Paper(MICR) | NA | NA |
1 | 6 | Électronique Papier(MICR) Postal Address - Addresse postale | NA | NA |
1 | 7 | 034300012 00012-343 20 Queen Street West - Suite 2600, Toronto, ON M5H 3R3 | 034300012 | 0343 |
1 | 8 | 034300092 00092-343 20 Queen Street West - Suite 2600, Toronto, ON M5H 3R3 | 034300092 | 0343 |
1 | 9 | AMEX BANK OF CANADA 303 | NA | NA |
1 | 10 | Routing Numbers / | NA | NA |
1 | 11 | Numéros d’acheminement | NA | NA |
1 | 12 | Electronic Paper(MICR) | NA | NA |
1 | 13 | Électronique Papier(MICR) Postal Address - Addresse postale | NA | NA |
1 | 14 | 030300012 00012-303 101 McNabb Street, Markham, ON L3R 4H8 | 030300012 | 0303 |
1 | 15 | 030300022 00022-303 101 McNabb Street, Markham, ON L3R 4H8 | 030300022 | 0303 |
1 | 16 | 030300032 00032-303 101 McNabb Street, 101 McNabb Street, Markham, ON L3R 4H8 | 030300032 | 0303 |
1 | 17 | BANK OF AMERICA NATIONAL ASSOCIATION 241 | NA | NA |
1 | 18 | Routing Numbers / | NA | NA |
1 | 19 | Numéros d’acheminement | NA | NA |
1 | 20 | Electronic Paper(MICR) | NA | NA |
And we can now identify the banks that each branch belongs to. To make it easier to recognize the banks, I created a manual mapping of bank numbers to bank names:
bank.names <- data.frame(Bank.nbr = c("0001", "0010", "0003", "0002", "0004"),
Bank.name = c("BMO", "CIBC", "RBC", "BNS", "TD"))
bank.branches <- left_join(branches.clean, bank.names)
bank.branches[3161:3165,] %>% knitr::kable() %>%
kable_styling(bootstrap_options = c("striped", "hover"))
page_id | element_id | text | Routing | Bank.nbr | Bank.name | |
---|---|---|---|---|---|---|
3161 | 42 | 33 | 000251193 51193-002 Halifax Centralized Accounting Unit, 1465 Brenton St., 4th Floor, Halifax, NS B3J 3T2 | 000251193 | 0002 | BNS |
3162 | 42 | 34 | 000251219 51219-002 Rundlehorn Plaza, 3735 Rundlehorn Drive N. E., Calgary, AB T1Y 2K1 (Sub to 30049) | 000251219 | 0002 | BNS |
3163 | 42 | 35 | 000251268 51268-002 ISS Operations Control, Scotia Plaza, 40 King St. W. 25th Fl, Toronto, ON M5H 1H1 | 000251268 | 0002 | BNS |
3164 | 42 | 36 | 000251292 51292-002 Eglinton & Castle Knock, 438 Eglinton Avenue West, Toronto, ON M5N 1A2 | 000251292 | 0002 | BNS |
3165 | 42 | 37 | 000251326 51326-002 Barrhaven Centre, 3701 Strandherd Drive, Nepean, ON K2J 4G8 (Sub to 00166) | 000251326 | 0002 | BNS |
And now, the final bit of text extraction, we can pull out the postal codes of the branches to get their locations:
bank.branches$Postal <- str_extract(bank.branches$text,
"[A-Z][0-9][A-Z][:space:][0-9][A-z][0-9]")
head(bank.branches, 20) %>% knitr::kable() %>%
kable_styling(bootstrap_options = c("striped", "hover"))
page_id | element_id | text | Routing | Bank.nbr | Bank.name | Postal |
---|---|---|---|---|---|---|
1 | 1 | SECTION I NUMERIC LIST / LISTE NUMÉRIQUE 1 | NA | NA | NA | NA |
1 | 2 | ADS CANADIAN BANK 343 | NA | NA | NA | NA |
1 | 3 | Routing Numbers / | NA | NA | NA | NA |
1 | 4 | Numéros d’acheminement | NA | NA | NA | NA |
1 | 5 | Electronic Paper(MICR) | NA | NA | NA | NA |
1 | 6 | Électronique Papier(MICR) Postal Address - Addresse postale | NA | NA | NA | NA |
1 | 7 | 034300012 00012-343 20 Queen Street West - Suite 2600, Toronto, ON M5H 3R3 | 034300012 | 0343 | NA | M5H 3R3 |
1 | 8 | 034300092 00092-343 20 Queen Street West - Suite 2600, Toronto, ON M5H 3R3 | 034300092 | 0343 | NA | M5H 3R3 |
1 | 9 | AMEX BANK OF CANADA 303 | NA | NA | NA | NA |
1 | 10 | Routing Numbers / | NA | NA | NA | NA |
1 | 11 | Numéros d’acheminement | NA | NA | NA | NA |
1 | 12 | Electronic Paper(MICR) | NA | NA | NA | NA |
1 | 13 | Électronique Papier(MICR) Postal Address - Addresse postale | NA | NA | NA | NA |
1 | 14 | 030300012 00012-303 101 McNabb Street, Markham, ON L3R 4H8 | 030300012 | 0303 | NA | L3R 4H8 |
1 | 15 | 030300022 00022-303 101 McNabb Street, Markham, ON L3R 4H8 | 030300022 | 0303 | NA | L3R 4H8 |
1 | 16 | 030300032 00032-303 101 McNabb Street, 101 McNabb Street, Markham, ON L3R 4H8 | 030300032 | 0303 | NA | L3R 4H8 |
1 | 17 | BANK OF AMERICA NATIONAL ASSOCIATION 241 | NA | NA | NA | NA |
1 | 18 | Routing Numbers / | NA | NA | NA | NA |
1 | 19 | Numéros d’acheminement | NA | NA | NA | NA |
1 | 20 | Electronic Paper(MICR) | NA | NA | NA | NA |
That’s pretty much it in terms of cleaning the data to a point where we can then get the longitude/latitudes out of this dataset. Check out the corresponding map I made in the Projects section.