Source Code and Log Files

Regular expressions are an excellent solution for tokenizing input when constructing a parser for a custom file format or scripting language. This chapter mainly discusses recipes for matching syntactic elements that are commonly used in programming languages and other text-based file formats.

The chapter also shows how to extract information from log files.

Keywords

  • Problem

    Extract some reserved keywords such as "end", "in", "inline", "inherited", "item", and "object".

  • Solution

    /\b(?:end|in|inline|inherited|item|object)\b/i

  • Discussion

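    To keep the regex from matching keywords that appear inside single-quoted strings, let it match both keywords and strings, and capture only the keywords in a group: /\b(end|in|inline|inherited|item|object)\b|'[^'\r\n]*(?:''[^'\r\n]*)*'/i. A match whose capturing group did not participate is a string and can be skipped.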
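
As a rough illustration of the keyword-or-string approach, the sketch below ports the regex to Python's re module. The helper name find_keywords and the sample text are hypothetical, chosen only for this example.

```python
import re

# Keywords are captured in group 1; the second alternative consumes a
# single-quoted string in which '' stands for an escaped quote.
KEYWORD_OR_STRING = re.compile(
    r"\b(end|in|inline|inherited|item|object)\b|'[^'\r\n]*(?:''[^'\r\n]*)*'",
    re.IGNORECASE,
)

def find_keywords(source):
    """Return keywords found outside single-quoted strings (hypothetical helper)."""
    # group(1) is None when the match was a string literal, so it is skipped.
    return [m.group(1) for m in KEYWORD_OR_STRING.finditer(source) if m.group(1)]

sample = "item := 'end of input'; inline object End"  # made-up input
print(find_keywords(sample))  # ['item', 'inline', 'object', 'End']
```

The words "end" and "in" inside the quoted string are not reported, because the string alternative consumes them before the keyword alternative can see them.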

Identifiers

  • Problem

    How can you use a regular expression to match any identifier in source code?

  • Solution

    /\b[a-z][0-9a-z]{0,31}\b/i
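
Read literally, this regex accepts identifiers that start with a letter, continue with letters or digits, and are at most 32 characters long. A minimal Python sketch, with made-up sample input:

```python
import re

# One letter followed by up to 31 letters or digits, case-insensitive.
IDENTIFIER = re.compile(r"\b[a-z][0-9a-z]{0,31}\b", re.IGNORECASE)

sample = "total1 = count + 2nd"  # made-up input
print(IDENTIFIER.findall(sample))  # ['total1', 'count']
```

The token "2nd" is not matched: it starts with a digit, and there is no word boundary between "2" and "n" where a match could begin.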

Operators

  • Problem

    You may need a regular expression that matches any of the characters that can be used as operators in the programming language.

  • Solution

    /[-+*\/=<>$&^|!~?]/
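
In a character class, a hyphen between two characters denotes a range, so the sketch below places the hyphen first, where it is a literal. This is a Python port with made-up sample input; the forward slash needs no escaping in a Python pattern string.

```python
import re

# Hyphen first: a literal "-", not a range. "^" is literal because it is not first.
OPERATOR = re.compile(r"[-+*/=<>$&^|!~?]")

sample = "a = b + c * -d"  # made-up input
print(OPERATOR.findall(sample))  # ['=', '+', '*', '-']
```

The class matches one character at a time, so a two-character operator such as "<=" is reported as two separate matches.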

Single-line comments

  • Problem

    You may want to match a comment that starts with // and runs until the end of the line.

  • Solution

    /\/\/.*/

Multiline comments

  • Problem

    What if the comment you want to match is a multiline comment that starts with /* and ends with */?

  • Solution

    /\/\*[\s\S]*?\*\//

All comments

  • Problem

    Match both single-line and multiline comments with a single regular expression.

  • Solution

    /\/\/.*|\/\*[\s\S]*?\*\//
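
The combined comment regex can be tried out with a short Python sketch. The sample source text is made up, and the pattern is written without escaped slashes because Python pattern strings do not need them.

```python
import re

# "." does not match a line break by default, so a // comment ends with its line.
# [\s\S] matches any character, so a /* ... */ comment may span several lines;
# the lazy *? stops at the first closing */.
COMMENT = re.compile(r"//.*|/\*[\s\S]*?\*/")

sample = "x = 1; // set x\n/* multi\n   line */ y = 2;"  # made-up input
print(COMMENT.findall(sample))

# Stripping comments is then a single substitution.
stripped = COMMENT.sub("", sample)
print(stripped)
```

One caveat of this simple approach: a "//" or "/*" inside a string literal is treated as a comment start too, since the regex knows nothing about strings.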

Strings

  • Problem

    Match a string literal wrapped in double quotes (") or single quotes ('). A quote inside the string is written by doubling it.

  • Solution

    /"[^"\r\n]*(?:""[^"\r\n]*)*"|'[^'\r\n]*(?:''[^'\r\n]*)*'/

  • Discussion

    If strings can include line breaks, remove \r\n from the negated character classes: /"[^"]*(?:""[^"]*)*"|'[^']*(?:''[^']*)*'/.
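
The single-line string regex might be exercised in Python as follows. Splitting the pattern into two named pieces is only a convenience for quoting in Python, and the sample text is made up.

```python
import re

# Each alternative: opening quote, non-quote characters, then any number of
# doubled quotes each followed by more non-quote characters, closing quote.
DOUBLE_QUOTED = r'"[^"\r\n]*(?:""[^"\r\n]*)*"'
SINGLE_QUOTED = r"'[^'\r\n]*(?:''[^'\r\n]*)*'"
STRING = re.compile(DOUBLE_QUOTED + "|" + SINGLE_QUOTED)

sample = '''name = 'O''Brien'; msg = "a ""b"" c"'''  # made-up input
for literal in STRING.findall(sample):
    print(literal)
```

This prints 'O''Brien' and "a ""b"" c", each as one match, because the doubled quotes are consumed by the repeated group instead of ending the string.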

Strings with escapes

  • Problem

    A double or single quote can also be included in a string by escaping it with a backslash (\). How can the regular expression be changed to match such strings?

  • Solution

    /"[^"\\\r\n]*(?:\\.[^"\\\r\n]*)*"|'[^'\\\r\n]*(?:\\.[^'\\\r\n]*)*'/
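
A small Python sketch of backslash-escaped strings follows. It repeats the escape group with * so that strings with no escapes, or with several, all match; the sample text is made up.

```python
import re

# \\. consumes a backslash plus the character it escapes, so an escaped quote
# never terminates the string. The group is repeated zero or more times.
DOUBLE_QUOTED = r'"[^"\\\r\n]*(?:\\.[^"\\\r\n]*)*"'
SINGLE_QUOTED = r"'[^'\\\r\n]*(?:\\.[^'\\\r\n]*)*'"
ESCAPED_STRING = re.compile(DOUBLE_QUOTED + "|" + SINGLE_QUOTED)

sample = r'''a = "say \"hi\""; b = 'it\'s'; c = "plain"'''  # made-up input
for literal in ESCAPED_STRING.findall(sample):
    print(literal)
```

All three literals are found, including "plain", which contains no escape at all.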

Regex literals

  • Problem

    How can you use a regular expression to match regular expression literals in a string?

  • Solution

    /[=:(,]\s*(\/[^\/\\\r\n]*(?:\\.[^\/\\\r\n]*)*\/)/
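
The idea behind this recipe is that a slash starts a regex literal only when it follows a character after which a division operator cannot occur. The Python sketch below assumes that set is =, :, ( and , followed by optional whitespace; both that assumption and the sample text are illustrative rather than a complete rule for any particular language.

```python
import re

# Assumed context: "=", ":", "(" or "," then optional whitespace, then the literal.
# Group 1 captures the literal itself, including escaped slashes such as \/.
REGEX_LITERAL = re.compile(r"[=:(,]\s*(/[^/\\\r\n]*(?:\\.[^/\\\r\n]*)*/)")

sample = r"var re = /a\/b+/; x = total / count; s.replace(/\d+/, '')"  # made-up input
for literal in REGEX_LITERAL.findall(sample):
    print(literal)
```

The division in "total / count" is skipped because its slash follows an identifier, not one of the assumed context characters.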

Common log format

  • Problem

    You may need a regular expression that matches each line in the log files produced by a web server that uses the Common Log Format. For example, 127.0.0.1 - jg [27/Apr/2012:11:27:36 +0700] "GET /regexcookbook.html HTTP/1.1" 200 2326

  • Solution

    /^(\S+) \S+ (\S+) \[([^\]]+)\] "([A-Z]+) ([^ "]+)? HTTP\/[0-9.]+" ([0-9]{3}) ([0-9]+|-)/
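
To show how the captured fields might be used, here is a minimal Python sketch that parses the example line from the problem statement. It covers the fields up to the response size, and the group names (client, userid and so on) are hypothetical labels added for readability.

```python
import re

# Hypothetical group names; the pattern stops after the response size field.
CLF = re.compile(
    r'^(?P<client>\S+) \S+ (?P<userid>\S+) \[(?P<datetime>[^\]]+)\] '
    r'"(?P<method>[A-Z]+) (?P<request>[^ "]+)? HTTP/[0-9.]+" '
    r'(?P<status>[0-9]{3}) (?P<size>[0-9]+|-)'
)

line = ('127.0.0.1 - jg [27/Apr/2012:11:27:36 +0700] '
        '"GET /regexcookbook.html HTTP/1.1" 200 2326')

m = CLF.match(line)
fields = m.groupdict() if m else {}
print(fields)
```

Every value comes back as a string, so a caller that needs numbers would convert status and size itself, remembering that size can be "-".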

Combined log format

  • Problem

    You may need a regular expression that matches each line in the log files produced by a web server that uses the Combined Log Format. For example, 127.0.0.1 - jg [27/Apr/2012:11:27:36 +0700] "GET /regexcookbook.html HTTP/1.1" 200 2326 "http://www.regexcookbook.com/" "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)"

  • Solution

    /^(\S+) \S+ (\S+) \[([^\]]+)\] "([A-Z]+) ([^ "]+)? HTTP\/[0-9.]+" ([0-9]{3}) ([0-9]+|-) "([^"]*)" "([^"]*)"/
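
The Combined Log Format example line ends with two extra quoted fields, the referrer and the user agent. The Python sketch below parses that example line; the group names are hypothetical labels, not part of the format.

```python
import re

# Hypothetical group names; two quoted fields follow the response size.
COMBINED = re.compile(
    r'^(?P<client>\S+) \S+ (?P<userid>\S+) \[(?P<datetime>[^\]]+)\] '
    r'"(?P<method>[A-Z]+) (?P<request>[^ "]+)? HTTP/[0-9.]+" '
    r'(?P<status>[0-9]{3}) (?P<size>[0-9]+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<useragent>[^"]*)"'
)

line = ('127.0.0.1 - jg [27/Apr/2012:11:27:36 +0700] '
        '"GET /regexcookbook.html HTTP/1.1" 200 2326 '
        '"http://www.regexcookbook.com/" '
        '"Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)"')

m = COMBINED.match(line)
if m:
    print(m.group('referrer'))
    print(m.group('useragent'))
```

A line without the two trailing quoted fields does not match this pattern at all, which is one way to tell the two log formats apart.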

  • Problem

    You have a log for your website in the Combined Log Format. How can you check it for errors caused by broken links on your own website?

  • Solution

    /"(?:GET|POST) ([^#? "]+)(?:[#?][^ "]*)? HTTP\/[0-9.]+" 404 (?:[0-9]+|-) "(http:\/\/www\.yoursite\.com[^"]*)"/
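
A possible way to apply this regex is to scan the log line by line and collect each missing page together with the page that links to it. In the Python sketch below, the helper name broken_links and the three log lines are made up for illustration; www.yoursite.com is the placeholder domain from the regex itself.

```python
import re

# Group 1: requested path without query or fragment. Group 2: referring page on the site.
BROKEN_LINK = re.compile(
    r'"(?:GET|POST) ([^#? "]+)(?:[#?][^ "]*)? HTTP/[0-9.]+" 404 '
    r'(?:[0-9]+|-) "(http://www\.yoursite\.com[^"]*)"'
)

def broken_links(lines):
    """Return (referring page, missing path) pairs (hypothetical helper)."""
    found = []
    for line in lines:
        m = BROKEN_LINK.search(line)
        if m:
            found.append((m.group(2), m.group(1)))
    return found

# Made-up log lines: an internal broken link, a successful hit, an external referrer.
log_lines = [
    '127.0.0.1 - - [27/Apr/2012:11:27:36 +0700] "GET /missing.html?x=1 HTTP/1.1" 404 512 "http://www.yoursite.com/index.html" "Mozilla/5.0"',
    '127.0.0.1 - - [27/Apr/2012:11:27:37 +0700] "GET /index.html HTTP/1.1" 200 2326 "http://www.yoursite.com/" "Mozilla/5.0"',
    '127.0.0.1 - - [27/Apr/2012:11:27:38 +0700] "GET /gone.html HTTP/1.1" 404 512 "http://www.example.org/" "Mozilla/5.0"',
]

result = broken_links(log_lines)
print(result)
```

Only the first line is reported: the second is not a 404, and the third is a 404 whose referrer is an external site.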
