I just love regular expressions. I first came into contact with them many years ago while doing Perl development. Anyone who has to pull data out of text documents or text input fields should become familiar with regex: the alternatives, linear token parsing or token searching, pale in comparison to the parsing capability of a regular expression.
Perl-compatible regular expression syntax (PCRE) has become a de facto cross-platform, cross-language standard, supported in whole or in part in Perl, PHP, Python, Ruby, C#, JavaScript, and ActionScript.
Phone Number
Parses a 10-digit phone number optionally followed by an extension.
(?i)^(?:1[-\s]?)?\(?(\d{3})[\)-]?\s?(\d{3})[-\s]?(\d{4})(?:(?:\s?(?:(?:(?:ext\.?)|[x+-])\s?)?)?(\d+))?$
Expanded and documented:
(?ix)                       # case-insensitive, permit whitespace and comments
^                           # Start of input
(?:1                        # optional "1" Prefix
  [-\s]?                    # followed by optional dash or space
)?                          # End of 1 Prefix
\(?                         # optional left paren
(\d{3})                     # Group 1 - 3-digit Area Code
[\)-]?                      # optional right paren or dash
\s?                         # optional space
(\d{3})                     # Group 2 - 3-digit Prefix
[-\s]?                      # optional dash or space
(\d{4})                     # Group 3 - 4-digit Suffix
(?:                         # optional Extension
  (?:                       # optional Extension separator
    \s?                     # optional space
    (?:                     # optional delimiter
      (?:(?:ext\.?)|[x+-])  # "ext", "ext.", "x", plus or dash
      \s?                   # followed by optional space
    )?                      # End of delimiter
  )?                        # End of Extension separator
  (\d+)                     # Group 4 - Extension
)?                          # End of Extension
$                           # End of input
These values will all be parsed successfully. The captures are the area code (group 1), prefix (group 2), suffix (group 3), and extension (group 4):
- 9999999999
- 1-999-999-9999
- (999) 999-9999
- 1(999) 999-9999 ext 9999
- (999) 999-9999 x 9999
- 999-999-9999-9999
- 1-999-999-9999+9999
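For readers who want to try the phone pattern, here is a minimal sketch in Python. It assumes Python's built-in `re` module, which is not PCRE itself but accepts all of the syntax this pattern uses. The `parse_phone` name and the sample list are hypothetical, chosen only for illustration, and the case-insensitive flag is passed as `re.IGNORECASE` instead of the inline `(?i)`.

```python
import re

# The compact phone pattern from above, split across two string literals
# only for line length. re.IGNORECASE stands in for the inline (?i).
PHONE_RE = re.compile(
    r"^(?:1[-\s]?)?\(?(\d{3})[\)-]?\s?(\d{3})[-\s]?(\d{4})"
    r"(?:(?:\s?(?:(?:(?:ext\.?)|[x+-])\s?)?)?(\d+))?$",
    re.IGNORECASE,
)


def parse_phone(text):
    """Return (area_code, prefix, suffix, extension), or None if no match.

    The extension is None when the input does not include one.
    """
    match = PHONE_RE.match(text)
    if match is None:
        return None
    return match.groups()


# Hypothetical sample inputs, written with plain ASCII hyphens.
samples = [
    "9999999999",
    "1-999-999-9999",
    "(999) 999-9999",
    "1(999) 999-9999 ext 9999",
    "(999) 999-9999 x 9999",
    "999-999-9999-9999",
    "1-999-999-9999+9999",
    "555-1234",  # too short, so this one should not match
]

for sample in samples:
    print(repr(sample), "->", parse_phone(sample))
```

Note that the pattern only recognizes the ASCII hyphen as a dash, so input pasted from a word processor may need its typographic dashes normalized first.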
City, State, Zip Code
Parses City, State, and Zip Code.
(?i)^\s*(.+?)[,\s]\s*(AL|AK|AZ|AR|CA|CO|CT|DC|DE|FL|GA|HI|ID|IL|IN|IA|KS|KY|LA|ME|MD|MA|MI|MN|MS|MO|MT|NE|NV|NH|NJ|NM|NY|NC|ND|OH|OK|OR|PA|RI|SC|SD|TN|TX|UT|VT|VA|WA|WV|WI|WY)[,\s]\s*(\d{5}(?:-\d{4})?)(?:[,\s]\s*(.+?)\s*)?$
Expanded and documented:
(?ix)       # case-insensitive, permit whitespace and comments
^           # Start of input
\s*         # optional whitespace
(.+?)       # Group 1 - City Name
[,\s]       # required comma or whitespace
\s*         # optional whitespace
            # Group 2 - US State Acronym
(AL|AK|AZ|AR|CA|CO|CT|DC|DE|FL|GA|HI|ID|IL|IN|IA|KS|KY|LA|ME|MD|MA|MI|MN|MS|MO|
 MT|NE|NV|NH|NJ|NM|NY|NC|ND|OH|OK|OR|PA|RI|SC|SD|TN|TX|UT|VT|VA|WA|WV|WI|WY)
[,\s]       # required comma or whitespace
\s*         # optional whitespace
(\d{5}      # Group 3 - Zip Code
  (?:       # optional Zip+4 extension
    -       # required dash
    \d{4}   # required 4 digits
  )?        # End of Zip+4 extension
)           # End of Zip Code
(?:         # optional trailing text
  [,\s]     # required comma or whitespace
  \s*       # optional whitespace
  (.+?)     # Group 4 - Trailing Text
  \s*       # optional whitespace
)?          # End of trailing text
$           # End of input
These values will all be parsed successfully. The captures are the city (group 1), state (group 2), zip code (group 3), and trailing text (group 4):
- Los Angeles, CA 90037
- Fort Atkinson, WI 53538-0901 USA
- New York NY 10001
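The expanded form of the city, state, and zip pattern can be used almost as written in languages with a verbose mode. The sketch below assumes Python's `re` module, with `re.VERBOSE` and `re.IGNORECASE` standing in for the inline `(?ix)`. The `parse_city_state_zip` name is hypothetical, and some of the comment lines are condensed.

```python
import re

# Expanded pattern: whitespace and "#" comments are ignored under re.VERBOSE.
CITY_STATE_ZIP_RE = re.compile(
    r"""
    ^\s*                    # start of input, optional leading whitespace
    (.+?)                   # Group 1 - city name (lazy)
    [,\s]\s*                # required comma or whitespace
    (AL|AK|AZ|AR|CA|CO|CT|DC|DE|FL|GA|HI|ID|IL|IN|IA|KS|KY|LA|ME|MD|MA|MI|
     MN|MS|MO|MT|NE|NV|NH|NJ|NM|NY|NC|ND|OH|OK|OR|PA|RI|SC|SD|TN|TX|UT|VT|
     VA|WA|WV|WI|WY)        # Group 2 - US state abbreviation
    [,\s]\s*                # required comma or whitespace
    (\d{5}(?:-\d{4})?)      # Group 3 - zip code with optional +4
    (?:[,\s]\s*(.+?)\s*)?   # Group 4 - optional trailing text
    $                       # end of input
    """,
    re.IGNORECASE | re.VERBOSE,
)


def parse_city_state_zip(text):
    """Return (city, state, zip_code, trailing), or None if no match.

    trailing is None when nothing follows the zip code.
    """
    match = CITY_STATE_ZIP_RE.match(text)
    if match is None:
        return None
    return match.groups()


for sample in [
    "Los Angeles, CA 90037",
    "Fort Atkinson, WI 53538-0901 USA",
    "New York NY 10001",
    "Springfield, ZZ 12345",  # "ZZ" is not in the state list, so no match
]:
    print(repr(sample), "->", parse_city_state_zip(sample))
```

The lazy `(.+?)` for the city is what lets a multi-word name such as "New York" work without a comma: the engine keeps extending the city one character at a time until the rest of the pattern (state, then zip) can match.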