Regular Expressions

Lately I was working on a small project and I had to use some regular expressions. I thought I should write a post on the regular expression syntax patterns. And here it is. So let’s go.

First of all, what are regular expressions? A regular expression is a sequence of characters that defines a pattern. You compare the pattern to a string to determine whether the string meets specified format requirements. You can also use regular expressions to extract portions of the text or to replace text. So let’s take a look at the syntax of regular expressions.
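To make the three uses concrete, here is a minimal sketch. It assumes Python's built-in re module, which the post itself does not name, and a made-up date pattern; the pieces of syntax it uses (\d and {n}) are covered in the sections that follow.

```python
import re

# Hypothetical pattern for illustration: a date written as YYYY-MM-DD.
date_pattern = re.compile(r"\d{4}-\d{2}-\d{2}")
text = "Posted on 2009-01-29 in General"

# 1. Validate: does the whole string have the required format?
iso_is_valid = date_pattern.fullmatch("2009-01-29") is not None
prose_is_valid = date_pattern.fullmatch("29 January 2009") is not None

# 2. Extract: pull the matching portion out of a longer text.
found = date_pattern.search(text)
extracted = found.group() if found else None

# 3. Replace: substitute every match with other text.
replaced = date_pattern.sub("<date>", text)

print(iso_is_valid, prose_is_valid, extracted, replaced)
```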

How to Match Simple Text

The simplest way to use regular expressions is to determine whether a string matches a pattern. For example, the regular expression “test” will match the strings “This is a test”, “test”, and “Welcome to test1” because each of them contains the text “test”.
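As a quick check of the example above, here is a small sketch that assumes Python's re module (my choice of tool, not something the post prescribes). The fourth sample string is made up to show a non-match.

```python
import re

pattern = re.compile(r"test")
samples = ["This is a test", "test", "Welcome to test1", "No match here"]

# search() looks for the pattern anywhere inside the string.
results = {s: pattern.search(s) is not None for s in samples}

for sample, matched in results.items():
    print(f"{sample!r}: {matched}")
```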

How to Match Text in Specific Locations

If you want to match text beginning at the first character of a string, start the regular expression with a “^” symbol. For example, the regular expression “^test” will match “test1” but not “this is a test”. To match text that is at the end of a string, use the “$” symbol. For example, the regular expression “test$” will match any string that ends with “test”.

When searching for words you can use word boundaries. A word boundary is specified by “\b” and a non-word boundary is specified by “\B”. For example, “test\b” will match “test” and “the test”; however, it will not match “test1”. Conversely, “test\B” matches “test1” but not “the test”.
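The anchor and word-boundary examples above can be verified with a short sketch. It assumes Python's re module, and the helper name matches is hypothetical.

```python
import re

def matches(pattern: str, text: str) -> bool:
    """Return True if the pattern is found anywhere in the text."""
    return re.search(pattern, text) is not None

checks = {
    "^test on 'test1'": matches(r"^test", "test1"),
    "^test on 'this is a test'": matches(r"^test", "this is a test"),
    "test$ on 'this is a test'": matches(r"test$", "this is a test"),
    "test\\b on 'the test'": matches(r"test\b", "the test"),
    "test\\b on 'test1'": matches(r"test\b", "test1"),
    "test\\B on 'test1'": matches(r"test\B", "test1"),
    "test\\B on 'the test'": matches(r"test\B", "the test"),
}

for label, result in checks.items():
    print(f"{label}: {result}")
```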

The following are the characters that you can use in regular expressions to specify location:

  • ^ – specifies that the match must begin at the first character of a string, or at the first character of a line in multi-line input.
  • $ – specifies that the match must end at the last character of a string, or the last character before \n at the end of the string, or the last character at the end of a line.
  • \A – specifies that the match must begin at the first character of a string and ignores multi-line.
  • \Z – specifies that the match must end at the last character of a string or the last character before \n before the end of the string and ignores multi-line.
  • \z – specifies that the match must end at the last character of a string.
  • \G – specifies that the match must occur at the point where the previous match ended.
  • \b – specifies that the match must occur on a word boundary.
  • \B – specifies that the match must not occur on a \b boundary.
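The difference between the line anchors and the string anchors shows up only with multi-line input. The sketch below assumes Python's re module, where support differs slightly from the list above: Python's \Z matches only at the very end of the string (the behaviour listed for \z), and \G is not supported by that module.

```python
import re

text = "first line\nsecond line\n"

# Without the multi-line flag, ^ matches only at the start of the string.
default_start = re.findall(r"^\w+", text)

# With re.MULTILINE, ^ also matches at the start of every line.
multi_start = re.findall(r"^\w+", text, re.MULTILINE)

# \A ignores the multi-line flag and matches only at the start of the string.
strict_start = re.findall(r"\A\w+", text, re.MULTILINE)

# $ matches at the end of the string or just before a final \n.
default_end = re.findall(r"line$", text)

# With re.MULTILINE, $ also matches at the end of every line.
multi_end = re.findall(r"line$", text, re.MULTILINE)

# Python's \Z does not tolerate the trailing \n, so nothing matches here.
strict_end = re.search(r"line\Z", text)

print(default_start, multi_start, strict_start)
print(default_end, multi_end, strict_end)
```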

How to Match Special Characters

You can match special characters in regular expressions. For example, \t is a tab while \n represents a newline. The following are special characters in regular expressions:

  • \a – matches a bell.
  • \b – denotes a word boundary.
  • \t – matches a tab.
  • \r – matches a carriage return.
  • \v – matches a vertical tab.
  • \f – matches a form feed.
  • \n – matches a new line.
  • \e – matches an escape.
  • \ – when followed by a character that is not recognized as an escaped character, matches that character. For example \* represents an asterisk while \\ represents a backslash “\”.
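Here is a short sketch of escapes in practice, assuming Python's re module (which does not support \e, so that one is left out). The sample strings are made up; re.escape is that module's helper for escaping a literal string so it can be used as a pattern.

```python
import re

# \t matches a tab, so it can be used to split tab-separated fields.
tab_fields = re.split(r"\t", "name\tage\tcity")

# \* matches a literal asterisk instead of acting as a repetition symbol.
star = re.search(r"\*", "2 * 3")

# \\ matches a single literal backslash.
backslash = re.search(r"\\", r"C:\temp")

# re.escape() adds the needed backslashes for you.
literal_found = re.search(re.escape("1+1=2"), "so 1+1=2 holds")

print(tab_fields, star.start(), backslash.start(), literal_found is not None)
```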

How to Match Text Using Wildcards

You can use regular expressions to match repeated characters. The “*” symbol means zero or more occurrences of the preceding character. For example, “go*d” matches “gd”, “god”, “good”, “goood”, and so on. The “+” symbol works similarly; however, it requires one or more occurrences of the character. Therefore “go+d” matches “god”, “good”, “goood” and so on, but does not match “gd” because “o” must occur at least once.

To match a specific number of repeated characters, use “{n}” where n is the number. Therefore “go{2}d” will only match “good”. To match a range of repeated characters, use “{min,max}” (written without a space after the comma). For example, “go{1,2}d” will match “god” and “good”. You can leave the second number blank to specify only a minimum. For example, “go{2,}d” will match “good”, “goood” and so on, but will not match “god”.
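The repetition examples above can be tried out with a small sketch, assuming Python's re module. The helper name matching is hypothetical, and fullmatch is used so that the whole word has to fit the pattern.

```python
import re

words = ["gd", "god", "good", "goood"]

def matching(pattern: str) -> list[str]:
    """Return the words that the pattern matches from start to end."""
    return [w for w in words if re.fullmatch(pattern, w)]

zero_or_more = matching(r"go*d")    # any number of "o", including none
one_or_more = matching(r"go+d")     # at least one "o"
exactly_two = matching(r"go{2}d")   # exactly two "o"
one_to_two = matching(r"go{1,2}d")  # one or two "o"
two_or_more = matching(r"go{2,}d")  # two or more "o"

for name, result in [("go*d", zero_or_more), ("go+d", one_or_more),
                     ("go{2}d", exactly_two), ("go{1,2}d", one_to_two),
                     ("go{2,}d", two_or_more)]:
    print(name, result)
```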

To specify an optional character, use the “?” symbol. For example, “goo?d” will match “god” or “good” only. The “.” symbol means any single character. This means that “.ad” will match “bad”, “sad”, “dad” and so on.

To match one of several characters you use the “[]” syntax. For example “Mar[kc]” can match “Mark” or “Marc” only.
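A brief sketch of the optional character, the dot, and the bracket syntax, assuming Python's re module. The helper name matching and the extra sample words ("goood", "ad", "Mary") are made up to show what does not match.

```python
import re

def matching(pattern: str, candidates: list[str]) -> list[str]:
    """Return the candidates that the pattern matches from start to end."""
    return [c for c in candidates if re.fullmatch(pattern, c)]

# "?" makes the second "o" optional.
optional = matching(r"goo?d", ["god", "good", "goood"])

# "." stands for exactly one character, so "ad" alone is too short.
any_char = matching(r".ad", ["bad", "sad", "dad", "ad"])

# "[kc]" accepts either "k" or "c" in that position.
one_of = matching(r"Mar[kc]", ["Mark", "Marc", "Mary"])

print(optional, any_char, one_of)
```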

The following are all the characters used to match multiple characters or a range of characters:

  • * – matches the preceding character or expression zero or more times.
  • + – matches the preceding character or expression one or more times.
  • ? – matches the preceding character or expression zero or one time.
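  • {n} – where n is a non-negative integer, matches the preceding character or expression exactly n times.
  • {n,} – where n is a non-negative integer, matches the preceding character or expression at least n times.
  • {n,m} – where n and m are non-negative integers, matches the preceding character or expression at least n times and at most m times.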
  • . – matches any character except “\n”.
  • x | y – matches either x or y.
  • [xyz] – matches any of the enclosed characters.
  • [a-z] – matches a range of characters; from a to z in this example.
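The last few entries of the list can be sketched as follows, assuming Python's re module. The re.DOTALL flag is specific to that module (other flavors have their own "single-line" option) and is shown only to contrast with the default behaviour of the dot.

```python
import re

# "x|y" matches either alternative.
pets = [w for w in ["cat", "dog", "cow"] if re.fullmatch(r"cat|dog", w)]

# "[a-z]" matches one lowercase letter; with "+" it matches runs of them.
lower_runs = re.findall(r"[a-z]+", "abc DEF ghi")

# "." does not match a newline by default...
dot_default = re.search(r"a.b", "a\nb")

# ...unless the flavor is told otherwise (re.DOTALL in Python).
dot_all = re.search(r"a.b", "a\nb", re.DOTALL)

print(pets, lower_runs, dot_default, dot_all is not None)
```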

Regular expressions also provide special characters to represent common character ranges. For example, “[0-9]” can be represented by “\d”. Here are the most common ones:

  • \d – matches a digit character.
  • \D – matches a non-digit character.
  • \s – matches any white space character including tab, space, and form feed.
  • \S – matches any non white space character.
  • \w – matches any word character, including underscore and numbers.
  • \W – matches any non word character.
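The shorthand classes are handy for extracting and cleaning text. Here is a minimal sketch, assuming Python's re module; the sample strings are made up for illustration.

```python
import re

# \d+ pulls out every run of digits.
numbers = re.findall(r"\d+", "Order 66 shipped on 2009-01-29")

# \w+ pulls out every run of word characters (letters, digits, underscore).
words = re.findall(r"\w+", "snake_case and CamelCase2")

# \s+ splits on any run of white space: spaces, tabs, newlines.
fields = re.split(r"\s+", "a  b\tc\nd")

# \D matches anything that is not a digit, so removing it keeps the digits.
digits_only = re.sub(r"\D", "", "(555) 010-9999")

print(numbers, words, fields, digits_only)
```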

So this was a basic introduction to regular expressions. The above will be enough for most needs; however, regular expressions also have more advanced features. Those interested can find them covered in depth in a good book on the subject.


  1. January 29, 2009 at 6:03 pm | #1

    Thank you. I read here lots of valuable sentences. Greetings from Poland.
