Mapping branches - cleaning data

November 18, 2017
R data

I received some interesting data on branch locations from my counterpart on the Personal & Commercial Banking Strategic Business Analysis team. Now, the actual data I received is confidential data that the BMO Market Analytics team put together so I won’t be showing that here, but I found an open source intel option from Payments Canada that I can use instead.

Data

The dataset itself looks to be a very comprehensive list of all bank branches from what looks like every bank that operates in Canada… so lots of bonus points there as a data source. Unfortunately, the data is only available as a PDF so getting it into a usable format will either be very manual (e.g. cutting/pasting in Excel), or quite a bit of code will be dedicated to processing the data.

Here’s a snapshot of the data from the PDF:

Payments Canada - Financial Institutions Branch Directory Sample Data

Payments Canada - Financial Institutions Branch Directory Sample Data

Notwithstanding that the data file is a PDF, it is also not exactly the most machine readable data. The bulk of this post will be on getting this data into a clean, machine readable format.

First step is loading this PDF into R:

library(textreadr)

branch.pdf <- read_pdf("../../data/branches/mbrbnksn_2.pdf")
head(branch.pdf, 20) %>% knitr::kable() %>%
        kable_styling(bootstrap_options = c("striped", "hover"))
page_id element_id text
1 1 SECTION I NUMERIC LIST / LISTE NUMÉRIQUE 1
1 2 ADS CANADIAN BANK 343
1 3 Routing Numbers /
1 4 Numéros d’acheminement
1 5 Electronic Paper(MICR)
1 6 Électronique Papier(MICR) Postal Address - Addresse postale
1 7 034300012 00012-343 20 Queen Street West - Suite 2600, Toronto, ON M5H 3R3
1 8 034300092 00092-343 20 Queen Street West - Suite 2600, Toronto, ON M5H 3R3
1 9 AMEX BANK OF CANADA 303
1 10 Routing Numbers /
1 11 Numéros d’acheminement
1 12 Electronic Paper(MICR)
1 13 Électronique Papier(MICR) Postal Address - Addresse postale
1 14 030300012 00012-303 101 McNabb Street, Markham, ON L3R 4H8
1 15 030300022 00022-303 101 McNabb Street, Markham, ON L3R 4H8
1 16 030300032 00032-303 101 McNabb Street, 101 McNabb Street, Markham, ON L3R 4H8
1 17 BANK OF AMERICA NATIONAL ASSOCIATION 241
1 18 Routing Numbers /
1 19 Numéros d’acheminement
1 20 Electronic Paper(MICR)


I like the read_pdf function from the package textreadr rather than the main pdftools package because it places the text in a data frame. The next step would be to isolate the routing numbers:

branches.route <- str_extract(branch.pdf$text, "[0-9]{9}") %>%
                        data.frame(Routing = .)
branches.clean <- cbind(branch.pdf, branches.route)

head(branches.clean, 20) %>% knitr::kable() %>%
        kable_styling(bootstrap_options = c("striped", "hover"))
page_id element_id text Routing
1 1 SECTION I NUMERIC LIST / LISTE NUMÉRIQUE 1 NA
1 2 ADS CANADIAN BANK 343 NA
1 3 Routing Numbers / NA
1 4 Numéros d’acheminement NA
1 5 Electronic Paper(MICR) NA
1 6 Électronique Papier(MICR) Postal Address - Addresse postale NA
1 7 034300012 00012-343 20 Queen Street West - Suite 2600, Toronto, ON M5H 3R3 034300012
1 8 034300092 00092-343 20 Queen Street West - Suite 2600, Toronto, ON M5H 3R3 034300092
1 9 AMEX BANK OF CANADA 303 NA
1 10 Routing Numbers / NA
1 11 Numéros d’acheminement NA
1 12 Electronic Paper(MICR) NA
1 13 Électronique Papier(MICR) Postal Address - Addresse postale NA
1 14 030300012 00012-303 101 McNabb Street, Markham, ON L3R 4H8 030300012
1 15 030300022 00022-303 101 McNabb Street, Markham, ON L3R 4H8 030300022
1 16 030300032 00032-303 101 McNabb Street, 101 McNabb Street, Markham, ON L3R 4H8 030300032
1 17 BANK OF AMERICA NATIONAL ASSOCIATION 241 NA
1 18 Routing Numbers / NA
1 19 Numéros d’acheminement NA
1 20 Electronic Paper(MICR) NA


From the routing numbers, we can extract the bank identification numbers:

branches.clean$Bank.nbr <- substr(branches.clean$Routing, 1, 4)
head(branches.clean, 20) %>% knitr::kable() %>%
        kable_styling(bootstrap_options = c("striped", "hover"))
page_id element_id text Routing Bank.nbr
1 1 SECTION I NUMERIC LIST / LISTE NUMÉRIQUE 1 NA NA
1 2 ADS CANADIAN BANK 343 NA NA
1 3 Routing Numbers / NA NA
1 4 Numéros d’acheminement NA NA
1 5 Electronic Paper(MICR) NA NA
1 6 Électronique Papier(MICR) Postal Address - Addresse postale NA NA
1 7 034300012 00012-343 20 Queen Street West - Suite 2600, Toronto, ON M5H 3R3 034300012 0343
1 8 034300092 00092-343 20 Queen Street West - Suite 2600, Toronto, ON M5H 3R3 034300092 0343
1 9 AMEX BANK OF CANADA 303 NA NA
1 10 Routing Numbers / NA NA
1 11 Numéros d’acheminement NA NA
1 12 Electronic Paper(MICR) NA NA
1 13 Électronique Papier(MICR) Postal Address - Addresse postale NA NA
1 14 030300012 00012-303 101 McNabb Street, Markham, ON L3R 4H8 030300012 0303
1 15 030300022 00022-303 101 McNabb Street, Markham, ON L3R 4H8 030300022 0303
1 16 030300032 00032-303 101 McNabb Street, 101 McNabb Street, Markham, ON L3R 4H8 030300032 0303
1 17 BANK OF AMERICA NATIONAL ASSOCIATION 241 NA NA
1 18 Routing Numbers / NA NA
1 19 Numéros d’acheminement NA NA
1 20 Electronic Paper(MICR) NA NA


And we can now identify the banks that each branch belongs to. To make it easier to recognize the banks, I created a manual mapping of bank numbers to bank names:

bank.names <- data.frame(Bank.nbr = c("0001", "0010", "0003", "0002", "0004"),
                         Bank.name = c("BMO", "CIBC", "RBC", "BNS", "TD"))

bank.branches <- left_join(branches.clean, bank.names)

bank.branches[3161:3165,] %>% knitr::kable() %>%
        kable_styling(bootstrap_options = c("striped", "hover"))
page_id element_id text Routing Bank.nbr Bank.name
3161 42 33 000251193 51193-002 Halifax Centralized Accounting Unit, 1465 Brenton St., 4th Floor, Halifax, NS B3J 3T2 000251193 0002 BNS
3162 42 34 000251219 51219-002 Rundlehorn Plaza, 3735 Rundlehorn Drive N. E., Calgary, AB T1Y 2K1 (Sub to 30049) 000251219 0002 BNS
3163 42 35 000251268 51268-002 ISS Operations Control, Scotia Plaza, 40 King St. W. 25th Fl, Toronto, ON M5H 1H1 000251268 0002 BNS
3164 42 36 000251292 51292-002 Eglinton & Castle Knock, 438 Eglinton Avenue West, Toronto, ON M5N 1A2 000251292 0002 BNS
3165 42 37 000251326 51326-002 Barrhaven Centre, 3701 Strandherd Drive, Nepean, ON K2J 4G8 (Sub to 00166) 000251326 0002 BNS


And now, the final bit of text extraction, we can pull out the postal codes of the branches to get their locations:

bank.branches$Postal <- str_extract(bank.branches$text, 
                               "[A-Z][0-9][A-Z][:space:][0-9][A-z][0-9]")

head(bank.branches, 20) %>% knitr::kable() %>%
        kable_styling(bootstrap_options = c("striped", "hover"))
page_id element_id text Routing Bank.nbr Bank.name Postal
1 1 SECTION I NUMERIC LIST / LISTE NUMÉRIQUE 1 NA NA NA NA
1 2 ADS CANADIAN BANK 343 NA NA NA NA
1 3 Routing Numbers / NA NA NA NA
1 4 Numéros d’acheminement NA NA NA NA
1 5 Electronic Paper(MICR) NA NA NA NA
1 6 Électronique Papier(MICR) Postal Address - Addresse postale NA NA NA NA
1 7 034300012 00012-343 20 Queen Street West - Suite 2600, Toronto, ON M5H 3R3 034300012 0343 NA M5H 3R3
1 8 034300092 00092-343 20 Queen Street West - Suite 2600, Toronto, ON M5H 3R3 034300092 0343 NA M5H 3R3
1 9 AMEX BANK OF CANADA 303 NA NA NA NA
1 10 Routing Numbers / NA NA NA NA
1 11 Numéros d’acheminement NA NA NA NA
1 12 Electronic Paper(MICR) NA NA NA NA
1 13 Électronique Papier(MICR) Postal Address - Addresse postale NA NA NA NA
1 14 030300012 00012-303 101 McNabb Street, Markham, ON L3R 4H8 030300012 0303 NA L3R 4H8
1 15 030300022 00022-303 101 McNabb Street, Markham, ON L3R 4H8 030300022 0303 NA L3R 4H8
1 16 030300032 00032-303 101 McNabb Street, 101 McNabb Street, Markham, ON L3R 4H8 030300032 0303 NA L3R 4H8
1 17 BANK OF AMERICA NATIONAL ASSOCIATION 241 NA NA NA NA
1 18 Routing Numbers / NA NA NA NA
1 19 Numéros d’acheminement NA NA NA NA
1 20 Electronic Paper(MICR) NA NA NA NA


That’s pretty much it in terms of cleaning the data to a point where we can then get the longitude/latitudes out of this dataset. Check out the corresponding map I made in the Projects section.

A look at teacher misconduct in Canada...

August 8, 2018
R data

RNN vs. Dense neural networks for time-series

June 27, 2018
R modelling

Auto neural networks vs. Manual Keras neural model

June 11, 2018
R modelling