Markup and Data Formats
This chapter covers regular expressions for markup languages and data formats: HTML, XHTML, XML, CSV, and INI. Some of these formats are surprisingly complex to process and manipulate accurately, at least with regular expressions.
Basic rules for formats covered in this chapter
Hypertext Markup Language (HTML)
HTML is used to describe the structure, semantics, and appearance of billions of web pages and other documents. In this chapter, we'll concentrate only on the rules needed to process the key components of well-formed HTML: elements (and the attributes they contain), character references, comments, and document type declarations.
Extensible Hypertext Markup Language (XHTML)
XHTML was designed as the successor to HTML 4.01, and migrated HTML from its SGML heritage to an XML foundation. However, because development of HTML continued separately, XHTML5 is now being developed as part of the HTML5 specification, and will be the XML serialization of HTML5 rather than introducing new features of its own.
Extensible Markup Language (XML)
XML is a general-purpose language designed primarily for sharing structured data. It's used as the foundation for a wide array of markup languages, including XHTML.
Comma-Separated Values (CSV)
CSV is an old but still very common file format used for spreadsheet-like data.
Initialization files (INI)
The lightweight INI file format is commonly used for configuration files. It is poorly defined, and as a result, there is plenty of variation in how different programs and systems interpret the format.
Find XML-style tags
Problem
You want to match any HTML, XHTML, or XML tags in a string.
Solution
Quick and dirty:
/<[^>]*>/
Allow > in attribute values:
/<(?:[^>"']|"[^"]*"|'[^']*')*>/
(X)HTML tags (loose):
/<\/?([A-Za-z][^\s>\/]*)(?:=\s*(?:"[^"]*"|'[^']*'|[^\s>]+)|[^>])*(?:>|$)/
(X)HTML tags (strict):
/<(?:([A-Z][-:A-Z0-9]*)(?:\s+[A-Z][-:A-Z0-9]*(?:\s*=\s*(?:"[^"]*"|'[^']*'|[^"'`=<>\s]+))?)*\s*\/?|\/([A-Z][-:A-Z0-9]*)\s*)>/i
XML tags (strict):
/<(?:([_:A-Z][-.:\w]*)(?:\s+[_:A-Z][-.:\w]*\s*=\s*(?:"[^"]*"|'[^']*'))*\s*\/?|\/([_:A-Z][-.:\w]*)\s*)>/i
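As a minimal sketch of how the first two solutions differ, the snippet below lists the tags in a sample string with each of them. The helper name `findTags` and the sample string are made up for illustration.

```javascript
// Hypothetical helper: list every tag in a string using the
// "allow > in attribute values" regex from the solution above.
function findTags(subject) {
  var tagRegex = /<(?:[^>"']|"[^"]*"|'[^']*')*>/g;
  return subject.match(tagRegex) || [];
}

var sample = '<p title="a > b">Hi <em>there</em></p>';

console.log(findTags(sample));
// → [ '<p title="a > b">', '<em>', '</em>', '</p>' ]

// The quick-and-dirty regex stops at the first >, even inside a quoted value.
console.log(sample.match(/<[^>]*>/g));
// → [ '<p title="a >', '<em>', '</em>', '</p>' ]
```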
Discussion
If you want to sanitize HTML from untrusted sources because you're worried about specially crafted malicious HTML and cross-site scripting (XSS) attacks, your safest bet is to first convert all <, >, and & characters to their corresponding named character references (&lt;, &gt;, and &amp;), then bring back tags that are known to be safe (as long as they contain no attributes or use only approved attributes). For example, use the following code to bring back <p>, <em>, and <strong>:
subject = subject.replace(/&lt;(\/?)(p|em|strong)&gt;/g, '<$1$2>');
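The two passes described above (escape everything, then restore a short whitelist) could be combined roughly as follows. The function name `sanitizeHtml` is an assumption made for this sketch, and the whitelist is limited to the three attribute-free tags mentioned in the text.

```javascript
// Hypothetical helper: escape all markup, then restore a few safe tags.
function sanitizeHtml(subject) {
  // Escape & first so the references added below are not escaped again.
  var escaped = subject
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;');
  // Bring back <p>, <em>, and <strong> only when they carry no attributes.
  return escaped.replace(/&lt;(\/?)(p|em|strong)&gt;/g, '<$1$2>');
}

console.log(sanitizeHtml('<p>Hi <em>there</em><script>alert(1)</script></p>'));
// → <p>Hi <em>there</em>&lt;script&gt;alert(1)&lt;/script&gt;</p>
```

Because a tag with attributes never matches the restore pattern, something like <p onclick="..."> stays escaped and is displayed as text.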
Replace <b> tags with <strong>
Problem
You want to replace all <b> tags in a string with corresponding <strong> tags.
Solution
function replaceB(subject) { return subject.replace(/<(\/?)b\b((?:[^>"']|"[^"]*"|'[^']*')*)>/ig, '<$1strong$2>'); }
Discussion
If you want to match any tag from a list of tag names, you can use a regex like this: /<(\/?)([bi]|em|big)\b((?:[^>"']|"[^"]*"|'[^']*')*)>/ig.
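Once the tag name is captured in its own group, a replacement callback can map each name to a different target. The sketch below assumes a made-up function `modernizeTags` and a made-up mapping (b to strong, i to em); it is one possible way to use the list-of-names idea, not the only one.

```javascript
// Hypothetical helper: rename several tags in one pass.
function modernizeTags(subject) {
  var names = { b: 'strong', i: 'em' }; // illustrative mapping
  var tagRegex = /<(\/?)([bi])\b((?:[^>"']|"[^"]*"|'[^']*')*)>/ig;
  return subject.replace(tagRegex, function (match, slash, name, rest) {
    // Keep the optional slash and any attributes; swap only the tag name.
    return '<' + slash + names[name.toLowerCase()] + rest + '>';
  });
}

console.log(modernizeTags('<b class="x">bold</b> and <I>italic</I>, <br> kept'));
// → <strong class="x">bold</strong> and <em>italic</em>, <br> kept
```

The word boundary after the tag name is what keeps <br> from being treated as a <b> tag.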
Remove all XML-style tags except <em> and <strong>
Problem
You want to remove all tags in a string except <em> and <strong>.
Solution
Match tags except <em> and <strong>:
function replace(subject) { return subject.replace(/<\/?(?!(?:em|strong)\b)[a-z](?:[^>"']|"[^"]*"|'[^']*')*>/gi, ''); }
Match all tags except <em> and <strong>, but still match those two when they contain attributes:
function replace(subject) { return subject.replace(/<\/?(?!(?:em|strong)\s*>)[a-z](?:[^>"']|"[^"]*"|'[^']*')*>/gi, ''); }
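To see how the two solutions above treat an <em> tag that carries an attribute, here is a small comparison. The function `stripTags` and its `keepAttributes` flag are hypothetical names used only for this sketch.

```javascript
// Hypothetical helper: remove tags with either of the two regexes above.
function stripTags(subject, keepAttributes) {
  // Keeps every <em> and <strong> tag, with or without attributes.
  var loose = /<\/?(?!(?:em|strong)\b)[a-z](?:[^>"']|"[^"]*"|'[^']*')*>/gi;
  // Keeps <em> and <strong> only when they have no attributes.
  var strict = /<\/?(?!(?:em|strong)\s*>)[a-z](?:[^>"']|"[^"]*"|'[^']*')*>/gi;
  return subject.replace(keepAttributes ? loose : strict, '');
}

var html = '<div><em class="x">a</em> <strong>b</strong></div>';

console.log(stripTags(html, true));
// → <em class="x">a</em> <strong>b</strong>
console.log(stripTags(html, false));
// → a</em> <strong>b</strong>
```

Note that the stricter regex removes the opening <em class="x"> but leaves its closing </em> behind, so the result may need a further cleanup pass.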
Discussion
Consider the case where you need to match all tags except <a>, <em>, and <strong>, with two exceptions:
- Any <a> tags that have attributes other than href or title should be matched.
- Any <em> or <strong> tags with attributes should be matched.
function replace(subject) { return subject.replace(/<\/?(?!(?:em|strong|a(?:\s+(?:href|title)\s*=\s*(?:"[^"]*"|'[^']*'))*)\s*>)[a-z](?:[^>"']|"[^"]*"|'[^']*')*>/gi, ''); }
Convert plain text to HTML by adding <p> and <br> tags
Problem
Given a plain text string, you may want to convert it to an HTML fragment to display within a web page. Paragraphs, separated by two line breaks in a row, should be surrounded with <p>...</p>. Additional line breaks should be replaced with <br> tags.
Solution
Step 1: replace HTML special characters with named character references
subject = subject.replace(/&/g, '&amp;'); subject = subject.replace(/</g, '&lt;'); subject = subject.replace(/>/g, '&gt;');
Step 2: replace all line breaks with <br>
subject = subject.replace(/\r\n?|\n/g, '<br>');
Step 3: replace double <br> tags with </p><p>
subject = subject.replace(/<br>\s*<br>/g, '</p><p>');
Step 4: wrap the entire string with <p>...</p>
subject = '<p>' + subject + '</p>';
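The four steps above can be put together in one function. This is a sketch under the assumption that the input is untrusted plain text; the name `plainTextToHtml` is made up for illustration.

```javascript
// Hypothetical helper combining the four steps.
function plainTextToHtml(subject) {
  // Step 1: escape special characters (& first, to avoid double escaping).
  subject = subject
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;');
  // Step 2: turn every line break (CRLF, CR, or LF) into <br>.
  subject = subject.replace(/\r\n?|\n/g, '<br>');
  // Step 3: two <br> tags in a row mark a paragraph boundary.
  subject = subject.replace(/<br>\s*<br>/g, '</p><p>');
  // Step 4: wrap the whole fragment in a paragraph.
  return '<p>' + subject + '</p>';
}

console.log(plainTextToHtml('Tom & Jerry\nline two\n\nNew <paragraph>'));
// → <p>Tom &amp; Jerry<br>line two</p><p>New &lt;paragraph&gt;</p>
```

The order matters: escaping must happen before any tags are added, otherwise the generated <br> and <p> tags would be escaped too.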
Find a specific attribute in XML-style tags
Problem
You want to find tags that contain a specific attribute, such as id, within an (X)HTML or XML file.
Solution
Tags that contain an id attribute (quick and dirty):
/<[^>]+\sid\b[^>]*>/i
Tags that contain an id attribute (more reliable):
/<(?:[^>"']|"[^"]*"|'[^']*')+?\sid\s*=\s*("[^"]*"|'[^']*')(?:[^>"']|"[^"]*"|'[^']*')*>/i
<div> tags that contain an id attribute:
/<div\s(?:[^>"']|"[^"]*"|'[^']*')*?\bid\s*=\s*("[^"]*"|'[^']*')(?:[^>"']|"[^"]*"|'[^']*')*>/i
Tags that contain an id attribute with the value "my-id":
/<(?:[^>"']|"[^"]*"|'[^']*')+?\sid\s*=\s*(?:"my-id"|'my-id')(?:[^>"']|"[^"]*"|'[^']*')*>/i
Tags that contain "my-class" within their class attribute value:
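function match(subject) {
  // Step 1: find each tag that has a class attribute and capture the quoted value.
  var tagRegex = /<(?:[^>"']|"[^"]*"|'[^']*')+?\sclass\s*=\s*("[^"]*"|'[^']*')(?:[^>"']|"[^"]*"|'[^']*')*>/gi;
  var results = [];
  var tag;
  while (tag = tagRegex.exec(subject)) {
    // Step 2: test the captured value (quotes included) for the class name.
    if (/["'\s]my-class["'\s]/i.test(tag[1])) {
      results.push(tag[0]);
    }
  }
  return results;
}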
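Since the more reliable id regex captures the quoted attribute value in group 1, it can also be used to collect every id in a document. The sketch below assumes a hypothetical helper named `getIds`; the only change to the regex is the added g flag so that `exec` can iterate.

```javascript
// Hypothetical helper: collect the values of all quoted id attributes.
function getIds(subject) {
  var idRegex = /<(?:[^>"']|"[^"]*"|'[^']*')+?\sid\s*=\s*("[^"]*"|'[^']*')(?:[^>"']|"[^"]*"|'[^']*')*>/gi;
  var ids = [];
  var match;
  while ((match = idRegex.exec(subject)) !== null) {
    // Group 1 still includes the surrounding quotes, so strip them.
    ids.push(match[1].slice(1, -1));
  }
  return ids;
}

var html = '<div id="main"><p data-x="id=1">x</p><span class=\'a\' id=\'note\'></span></div>';
console.log(getIds(html));
// → [ 'main', 'note' ]
```

The text id=1 inside the data-x value is not reported, because quoted attribute values are always consumed as a whole.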
Add a cellspacing attribute to <table> tags that do not already include it
Problem
You may want to search through an (X)HTML file and add cellspacing="0" to all tables that do not already include a cellspacing attribute.
Solution
Simplistic:
function insert(subject) { return subject.replace(/<table\b(?![^>]*?\scellspacing\b)([^>]*)>/ig, '<table cellspacing="0"$1>'); }
More reliable:
function insert(subject) { return subject.replace(/<table\b(?!(?:[^>"']|"[^"]*"|'[^']*')*?\scellspacing\b)((?:[^>"']|"[^"]*"|'[^']*')*)>/ig, '<table cellspacing="0"$1>'); }
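The difference between the two solutions shows up when the word cellspacing appears inside another attribute's value. The following sketch runs both regexes on the same made-up input; the function names `addSimple` and `addReliable` are assumptions for illustration.

```javascript
// Hypothetical wrappers around the two regexes from the solution.
function addSimple(subject) {
  return subject.replace(/<table\b(?![^>]*?\scellspacing\b)([^>]*)>/ig,
    '<table cellspacing="0"$1>');
}

function addReliable(subject) {
  return subject.replace(
    /<table\b(?!(?:[^>"']|"[^"]*"|'[^']*')*?\scellspacing\b)((?:[^>"']|"[^"]*"|'[^']*')*)>/ig,
    '<table cellspacing="0"$1>');
}

var html = '<table summary="no cellspacing here"><TABLE cellspacing="2"><table>';

console.log(addSimple(html));
// → <table summary="no cellspacing here"><TABLE cellspacing="2"><table cellspacing="0">
console.log(addReliable(html));
// → <table cellspacing="0" summary="no cellspacing here"><TABLE cellspacing="2"><table cellspacing="0">
```

The simplistic version is fooled by the word inside the summary value and skips the first table; the more reliable version treats quoted values as opaque and adds the attribute.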
Remove XML-style comments
Problem
You may want to remove comments from an (X)HTML or XML document.
Solution
function removeComments(subject) { return subject.replace(/<!--[\s\S]*?-->/g, ''); }
Discussion
To match only valid XML comments, use /<!--[^-]*(?:-[^-]+)*--\s*>/ig. To match only valid HTML comments, use a different regex: /<!--(?!-?>)[^-]*(?:-[^-]+)*-->/ig.
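A short sketch of both ideas: stripping comments with the lazy regex from the solution, and checking a single comment against the XML-comment regex from the discussion. The names `stripComments` and `isValidXmlComment` are hypothetical, and the anchors ^ and $ are added here only so the whole string must be one comment.

```javascript
// Hypothetical helper: remove every comment, including multi-line ones.
function stripComments(subject) {
  return subject.replace(/<!--[\s\S]*?-->/g, '');
}

// Hypothetical helper: true when the whole string is one comment
// whose body contains no double hyphen.
function isValidXmlComment(comment) {
  return /^<!--[^-]*(?:-[^-]+)*--\s*>$/.test(comment);
}

console.log(stripComments('<p>a</p><!-- one\ntwo --><p>b</p><!-- c -->'));
// → <p>a</p><p>b</p>
console.log(isValidXmlComment('<!-- ok -->'));             // → true
console.log(isValidXmlComment('<!-- bad -- comment -->')); // → false
```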
Find words within XML-style comments
Problem
You want to find all occurrences of the word TODO within (X)HTML or XML comments.
Solution
/\bTODO\b(?=(?:(?!<!--)[\s\S])*?-->)/i
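The lookahead accepts a TODO only if a closing --> follows before any new <!-- begins. A minimal sketch of using it to report positions, with a g flag added for iteration and a made-up helper name `findTodoPositions`:

```javascript
// Hypothetical helper: return the index of each TODO found inside a comment.
function findTodoPositions(subject) {
  var todoRegex = /\bTODO\b(?=(?:(?!<!--)[\s\S])*?-->)/gi;
  var positions = [];
  var match;
  while ((match = todoRegex.exec(subject)) !== null) {
    positions.push(match.index);
  }
  return positions;
}

var sample = 'TODO outside <!-- TODO: fix --> text <!-- nothing --> todo';
console.log(findTodoPositions(sample));
// → [ 18 ]
```

The first TODO is rejected because a <!-- appears before the next -->, and the last one because no --> follows it at all.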
Change the delimiter used in CSV files
Problem
You want to change all field-delimiting commas in a CSV file to tabs.
Solution
function commas2Tabs(subject) {
  var regex = /(,|\r?\n|^)([^",\r\n]+|"(?:[^"]|"")*")?/g;
  var result = '';
  var match;
  while (match = regex.exec(subject)) {
    // Check the value of backreference 1.
    if (match[1] === ',') {
      // Add a tab (in place of the matched comma) and backreference 2 to the result.
      // If backreference 2 is undefined, use an empty string instead.
      result += '\t' + (match[2] || '');
    } else {
      // Add the entire match to the result.
      result += match[0];
    }
    // If there is an empty match, prevent some browsers from getting stuck in an infinite loop.
    if (match.index === regex.lastIndex) {
      regex.lastIndex++;
    }
  }
  return result;
}
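As an alternative to the explicit loop, the same regex can be passed to `String.prototype.replace` with a callback, which handles empty matches on its own. This is a variation written for illustration (the name `commasToTabs` is made up), not the approach shown above.

```javascript
// Hypothetical variation: the same regex, applied through a replace callback.
function commasToTabs(csv) {
  var fieldRegex = /(,|\r?\n|^)([^",\r\n]+|"(?:[^"]|"")*")?/g;
  return csv.replace(fieldRegex, function (match, delimiter, field) {
    // Only a comma delimiter becomes a tab; line breaks are kept as they are.
    // field is undefined for an empty field, so fall back to ''.
    return (delimiter === ',' ? '\t' : delimiter) + (field || '');
  });
}

console.log(JSON.stringify(commasToTabs('a,"b,c",d\n1,,3')));
// → "a\t\"b,c\"\td\n1\t\t3"
```

The comma inside the quoted field "b,c" is left alone because the quoted field is matched as a single unit.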
Extract CSV fields from a specific column
Problem
You want to extract every field from the third column of a CSV file.
Solution
function getCSVColumn(csv, index) {
  // index is zero-based, so the third column is index 2.
  var regex = /(,|\r?\n|^)([^",\r\n]+|"(?:[^"]|"")*")?/g;
  var result = [];
  var columnIndex = 0;
  var match;
  while (match = regex.exec(csv)) {
    // Check the value of backreference 1. If it's a comma,
    // increment columnIndex. Otherwise, reset it to zero.
    if (match[1] === ',') {
      columnIndex++;
    } else {
      columnIndex = 0;
    }
    if (columnIndex === index) {
      // Add the field (backreference 2) at the end of the result array.
      result.push(match[2]);
    }
    // If there is an empty match, prevent some browsers from getting stuck in an infinite loop.
    if (match.index === regex.lastIndex) {
      regex.lastIndex++;
    }
  }
  return result;
}
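Fields captured this way still carry their surrounding quotes and doubled inner quotes, and empty fields come back as undefined. The sketch below shows one way to normalize them while extracting a column; `unquoteField` and `getColumnValues` are hypothetical helpers, and the column index is assumed to be zero-based.

```javascript
// Hypothetical helper: strip enclosing quotes and unescape "" to ".
function unquoteField(field) {
  if (field === undefined) {
    return ''; // an empty field did not participate in the match
  }
  var quoted = /^"([\s\S]*)"$/.exec(field);
  return quoted ? quoted[1].replace(/""/g, '"') : field;
}

// Hypothetical helper: extract one column (zero-based) as clean values.
function getColumnValues(csv, index) {
  var regex = /(,|\r?\n|^)([^",\r\n]+|"(?:[^"]|"")*")?/g;
  var values = [];
  var columnIndex = 0;
  var match;
  while ((match = regex.exec(csv)) !== null) {
    columnIndex = match[1] === ',' ? columnIndex + 1 : 0;
    if (columnIndex === index) {
      values.push(unquoteField(match[2]));
    }
    if (match.index === regex.lastIndex) {
      regex.lastIndex++; // step past an empty match
    }
  }
  return values;
}

console.log(getColumnValues('id,name,quote\n1,Ann,"said ""hi"""\n2,Bob,', 2));
// → [ 'quote', 'said "hi"', '' ]
```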
Discussion
Although using code to iterate over a string one CSV field at a time allows for extra flexibility, if you're using a text editor to get the job done, you may be limited to search-and-replace. Replacing each match of the following regexes with backreference 1 leaves only the chosen column:
Match a CSV record and capture the field in column 1 to backreference 1:
function replace(subject) { return subject.replace(/^([^",\r\n]+|"(?:[^"]|"")*")?(?:,(?:[^",\r\n]+|"(?:[^"]|"")*")?)*/gm, '$1'); }
Match a CSV record and capture the field in column 2 to backreference 1:
function replace(subject) { return subject.replace(/^(?:[^",\r\n]+|"(?:[^"]|"")*")?,([^",\r\n]+|"(?:[^"]|"")*")?(?:,(?:[^",\r\n]+|"(?:[^"]|"")*")?)*/gm, '$1'); }
Match a CSV record and capture the field in column 3 to backreference 1 (for a higher column N, change {1} to {N-2}):
function replace(subject) { return subject.replace(/^(?:[^",\r\n]+|"(?:[^"]|"")*")?(?:,(?:[^",\r\n]+|"(?:[^"]|"")*")?){1},([^",\r\n]+|"(?:[^"]|"")*")?(?:,(?:[^",\r\n]+|"(?:[^"]|"")*")?)*/gm, '$1'); }
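The three regexes above follow one pattern, so the regex for any column can be assembled from the same field sub-pattern. The sketch below is an assumption-laden illustration: `columnRegex` and `keepColumn` are made-up names, columns are counted from 1, and the g and m flags are used so every record is processed.

```javascript
// Hypothetical helper: build the record regex for column n (1-based).
function columnRegex(n) {
  var inner = '[^",\\r\\n]+|"(?:[^"]|"")*"';   // one unquoted or quoted field
  var field = '(?:' + inner + ')?';            // optional, non-capturing
  // Skip the n-1 fields before the target column, commas included.
  var before = n > 1 ? field + '(?:,' + field + '){' + (n - 2) + '},' : '';
  return new RegExp('^' + before + '(' + inner + ')?(?:,' + field + ')*', 'gm');
}

// Hypothetical helper: reduce every record to the field in column n.
function keepColumn(csv, n) {
  return csv.replace(columnRegex(n), '$1');
}

console.log(JSON.stringify(keepColumn('a,b,c,d\n1,2,3,4', 3)));
// → "c\n3"
```

A record with fewer than n columns does not match and is left unchanged, so short rows may need separate handling.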
Match INI section headers
Problem
You may want to match all section headers in an INI file.
Solution
/^\[[^\]\r\n]+]/m
Discussion
If you only want to find a specific section header, that's even easier:
/^\[Section1]/m
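To list the section names rather than just detect headers, the header regex can be given a capturing group around the name. The sketch below is a small variation for illustration; the helper name `getSectionNames` and the sample INI text are made up.

```javascript
// Hypothetical helper: return the name of every section header.
function getSectionNames(ini) {
  // Same header pattern, with the name captured and the g flag added.
  var headerRegex = /^\[([^\]\r\n]+)]/gm;
  var names = [];
  var match;
  while ((match = headerRegex.exec(ini)) !== null) {
    names.push(match[1]);
  }
  return names;
}

var ini = '[General]\nname=app\n\n[Paths]\nroot=/tmp\n; [NotASection]';
console.log(getSectionNames(ini));
// → [ 'General', 'Paths' ]
```

The bracketed text on the last line is ignored because the line starts with a semicolon, so ^ followed by [ cannot match there.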
Match INI section blocks
Problem
You may need to match each complete INI section block in order to split up an INI file or process each block separately.
Solution
/^\[[^\]\r\n]+](?:\r?\n(?:[^[\r\n].*)?)*/m
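A minimal sketch of splitting a file into blocks, assuming the section-block pattern has the form "a header line, followed by any number of lines that do not start with [". The regex below is written out under that assumption, and `splitSections` is a hypothetical helper name.

```javascript
// Hypothetical helper: return each section block as one string.
function splitSections(ini) {
  // Header line, then every following line that does not begin with "[".
  var blockRegex = /^\[[^\]\r\n]+](?:\r?\n(?:[^[\r\n].*)?)*/gm;
  return ini.match(blockRegex) || [];
}

var blocks = splitSections('[A]\nx=1\n\n[B]\ny=2');
console.log(JSON.stringify(blocks));
// → ["[A]\nx=1\n\n","[B]\ny=2"]
```

Each block keeps the line breaks that follow it, including the one just before the next header, so trimming may be useful before further processing.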
Match INI name-value pairs
Problem
You may want to match INI parameter name-value pairs (e.g., item1=value1), separating each match into two parts using capturing groups.
Solution
/^([^=;\r\n]+)=([^;\r\n]*)/m
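The two capturing groups make it straightforward to collect the pairs into an object. In the sketch below, `parsePairs` is a made-up name, and trimming whitespace around names and values is an assumption of this example rather than part of the regex.

```javascript
// Hypothetical helper: collect name-value pairs into a plain object.
function parsePairs(ini) {
  var pairRegex = /^([^=;\r\n]+)=([^;\r\n]*)/gm;
  var pairs = {};
  var match;
  while ((match = pairRegex.exec(ini)) !== null) {
    // Group 1 is the name, group 2 the value (up to any ; comment).
    pairs[match[1].trim()] = match[2].trim();
  }
  return pairs;
}

var ini = '[General]\nname = app\n; comment=ignored\nport=8080 ; inline comment\nempty=';
console.log(parsePairs(ini));
// → { name: 'app', port: '8080', empty: '' }
```

Lines that begin with a semicolon are skipped, and a value stops at the first semicolon, so inline comments are dropped.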