CommonMark tables: Specification Proposal

mh@tin-pot.net
2015-11-08

Introduction

The following is my attempt to come up with a set of rules for CommonMark tables,

  1. defining the syntax of the pertinent CommonMark “table blocks”, and
  2. describing the transformation of these blocks into a generic table model,

which can be mapped into the output document type after parsing.

I think the rules are simple enough to understand and apply (at least I hope so; the wording could surely be improved a bit) and reasonably easy to implement, and they allow for both “nice” CommonMark text and more “terse” writing styles.

I think this “CommonMark tables” specification could provide a foundation for an actual extension of the CommonMark specification, and comments are highly appreciated so we can move forward to (finally) bring tables into the CommonMark spec.

Table model

The table specification here assumes that the table element in the target document type

  1. has a distinct, but optional, “table header” element; and

  2. encompasses the special case of a “degenerate” table consisting of a single cell only (ie no table header, only one row and column); and

  3. allows block content (paragraphs, lists, etc) inside the table data cells.

This is in fact the case for the W3C HTML 4.01 <table> content model:

<!ELEMENT (TH|TD)  - O (%flow;)*  -- table header cell, table data cell-->
<!ENTITY % flow "%block; | %inline;">

and for the ISO/IEC 15445:2000 HTML <table> element type:

<!ELEMENT (TH|TD)     - O  %table.content; >
<!ENTITY % table.content   "(%block; | %text;)*" >

and for the DocBook 3.1 CALS <Table> as well, where the table data cell is the <entry> element, whose content model is too complex to quote here:

<!ELEMENT entry %a-whole-lot-of-stuff; >

and for the DocBook 5.1 CALS table too (but DocBook 5 also has an HTML table, compatible with XHTML).

And even the <tbl> element of the sample DTD for “general documents” in Annex E of ISO 8879:1986 could be used as a target, as each cell can contain paragraphs and lists etc:

<!ELEMENT  c   0 0  %m.pseq; -- Cell in body row -->
<!ENTITY % m.pseq  "(p, ((%s.p.d;)|(%ps.zz;))*)" -- Paragraph sequence -->

So there should be no problem mapping our (simplistic) CommonMark table into the desired target document type (the same should hold for LaTeX, RTF, etc.).
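The three assumptions about the target table element can be summarised in a small data model. The following Python sketch is only an illustration: the names `Cell`, `Row`, `Table` and `is_degenerate` are hypothetical and not part of the proposal, and block content is kept as plain strings for brevity.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Cell:
    # Block content of the cell (paragraphs, lists, ...), one entry per
    # block. Strings are an illustrative simplification.
    blocks: List[str] = field(default_factory=list)


@dataclass
class Row:
    cells: List[Cell] = field(default_factory=list)


@dataclass
class Table:
    body: List[Row] = field(default_factory=list)
    header: Optional[Row] = None  # the table header is optional

    def is_degenerate(self) -> bool:
        # No header, exactly one row, exactly one column.
        return (self.header is None
                and len(self.body) == 1
                and len(self.body[0].cells) == 1)


single = Table(body=[Row([Cell(["Hi!"])])])
print(single.is_degenerate())  # prints True
```

Any target type that can represent such a structure (optional header row, body rows, cells with block content) should be usable as an output format.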

Overview

A table is obviously a container block, bearing some similarity to block quotes. The language used here therefore paraphrases the block quote description of the CommonMark specification in some places, and a natural place for it in the specification would be a new sub-section 5.4, at the end of the section describing “container blocks”.

The rules are intended to be general enough to allow the “abuse” of tables for things like poetry verses (see examples below):

|
And what I really want to know is this:
are things getting better
or are they getting worse?

would (or rather: should) be transformed into an HTML table (for example) like this:

<table><tbody>
<tr><td>And what I really want to know is this:<br>
are things getting better<br>
or are they getting worse?<br></td></tr>
</tbody></table>

This use (or misuse?) of tables was discussed here, and in fact provided the impetus for this proposal.

nbsp/1907/18

There is obvious room for enhancement: indicating the horizontal (and vertical?) alignment of columns (or rows, or even single cells?) is certainly useful, and PHP Markdown Extra uses colons in “table rules” for this:

|:-----|:------:|------:|
| left | center | right |

However, I feel it would be premature to include something like this here, as there are too many open questions: should we also allow a similar syntax to specify alignment of section headings?:

##: Centered heading text :##

If not, why not? How would such alignment prescriptions map into target documents? What about vertical alignment?

A more urgent extension IMO would be a syntax to specify column spans and row spans, ie cells which extend over multiple adjacent columns and/or rows in the output table.

Prior art

As far as I can tell, the syntax rules given here also encompass the basic PHP Markdown Extra syntax for tables, as implemented for example in the discount parser by David Parsons. This CommonMark proposal does not yet define any syntax for alignment, row spans, or column spans, and its semantics are (intentionally) slightly different:

This block in PHP Markdown Extra

First Header  | Second Header
------------- | -------------
Content Cell  | Content Cell
Content Cell  | Content Cell

generates a table with a <thead> containing the first row given; the proposed CommonMark specification here would write

First Header  | Second Header
============= | =============
Content Cell  | Content Cell
Content Cell  | Content Cell

to achieve the same result. The above example using “-” places all rows into the <tbody> element, which is the same element structure one gets from

------------- | -------------
First Header  | Second Header
------------- | -------------
Content Cell  | Content Cell
------------- | -------------
Content Cell  | Content Cell
------------- | -------------

(only visually different—depending on the output document format and renderer) or even

| First Header | Second Header
Content Cell | Content Cell
Content Cell | Content Cell

Proposed specification

5.4   Tables

The table marker “|” VERTICAL LINE U+007C is used to designate a block of lines as a table, and to delimit CommonMark text destined for different columns in this table. Each row of the table consists of a sequence of table data cells.

5.4.1   Table block

A block of lines without intervening blank lines where

is transformed into a table in the output document. [OR: “… is a table” / “specifies a table” ?]

A line having the second form is called a table rule.

5.4.2   Table columns

Content between “|” in lines which are not table rules is split and distributed into the output table columns from left to right; the leading and the trailing “|” characters are optional.

Leading and trailing white space between column content and the “|” is discarded in the output.

A table may have only one column.
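The column rules above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `split_columns` is hypothetical, and the sketch does not deal with “|” characters inside code spans or with escaping, which the proposal does not address. It also discards the indentation that the examples later use to join lines.

```python
from typing import List


def split_columns(line: str) -> List[str]:
    """Split one non-rule table line into its column contents."""
    text = line.strip()
    # The leading and the trailing "|" are optional.
    if text.startswith("|"):
        text = text[1:]
    if text.endswith("|"):
        text = text[:-1]
    # White space between column content and "|" is discarded.
    return [part.strip() for part in text.split("|")]


print(split_columns("| left | center | right |"))
print(split_columns("Content Cell  | Content Cell"))
print(split_columns("|Hi!"))
```

The last call shows the single-column case: the result is a list with one entry.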

5.4.3   Table rows

The CommonMark text for each column is split into table data cells by introducing row breaks across all columns, so that each column has the same number of cells:

  1. A table rule introduces a row break across all columns

    except that leading and trailing table rules are ignored (they are only there to make the CommonMark typescript nicer).

    A table rule containing “=” separates the table header row from the table body, if there is only one (non-rule) line above it; otherwise it is treated like a table rule with “-”.

    A table rule containing “-” introduces a “visible” break between table rows in a multi-column table (depending on the output document type and renderer).

    A table rule without “=” or “-” introduces a “normal” break between table rows in a multi-column table.

  2. Other “row breaks” are introduced only if there is more than one column (in any one of the lines in the table block), and only if each of the plain text fragments in all columns allows it simultaneously:

  3. Content in table data cells emanating from multiple lines in the CommonMark source will be separated by “hard line breaks” in the output table.
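The three kinds of table rule and the treatment of leading and trailing rules can be sketched as follows. The names `rule_kind` and `trim_outer_rules` are hypothetical. The test for what counts as a table rule is an assumption inferred from the examples in this proposal (at least one “|”, otherwise only “-”, “=” and white space), as is the choice to let “=” win when a rule contains both “=” and “-”.

```python
import re
from typing import List, Optional

# ASSUMPTION: a table rule contains at least one "|" and otherwise only
# "-", "=" and white space (inferred from the examples, not quoted).
_RULE = re.compile(r"^[ \t=|-]*\|[ \t=|-]*$")


def rule_kind(line: str) -> Optional[str]:
    """Return "=", "-" or "" for a table rule, None for any other line."""
    if not _RULE.match(line):
        return None
    if "=" in line:
        return "="  # may separate the header row from the body
    if "-" in line:
        return "-"  # "visible" row break
    return ""       # "normal" break


def trim_outer_rules(lines: List[str]) -> List[str]:
    """Drop leading and trailing table rules, which are ignored."""
    start, end = 0, len(lines)
    while start < end and rule_kind(lines[start]) is not None:
        start += 1
    while end > start and rule_kind(lines[end - 1]) is not None:
        end -= 1
    return lines[start:end]


print(rule_kind("---|----"), rule_kind("|=====") , rule_kind("| One"))
```

Whether an “=” rule really starts a header still depends on how many non-rule lines precede it, so a caller would have to check that separately.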

5.4.4   Examples

The minimal table has only one cell:

|Hi!

Accordingly, it produces a table containing a single table data cell:

<table><tbody>
<tr><td>Hi!</td></tr>
</tbody></table>

Because the HTML <td> can contain both block and inline elements, the <td> has just character data content in this case.

Single-column tables are not split into cells automatically, so this example

|Hello,
|there!

would reproduce the line break, but result in only one table data cell, too:

<table><tbody>
<tr><td>Hello,<br>there!</td></tr>
</tbody></table>

Because line breaks are taken “literally”, and paragraph structure is preserved in single-column tables, this can be used to format, for example, verses and lyrics:

And what I really want to know is this:
are things getting better
or are they getting worse?
|
Can we start all over again?

Note that in this example, the line containing just the “|” suffices to mark this block of lines as a table: it is a table rule line, but lacking “=” or “-” it will not introduce a new row into a single-column table.

The table generated from this example has a single table data cell as well, but this time it contains two paragraphs (similar to a block quote containing a blank line):

<table><tbody>
<tr><td><p>And what I really want to know is this:<br>
are things getting better<br>
or are they getting worse?</p>
<p>Can we start all over again?</p></td></tr>
</tbody></table>

This could be written in a more elaborate style, but would produce the exact same result:

| And what I really want to know is this: |
| are things getting better               |
| or are they getting worse?              |
|                                         |
| Can we start all over again?            |

A single-column table can be broken into cells explicitly, using table rule lines containing “-” or “=”:

| One
|--
| Two
| and
|--
| Three

Now we get three table body rows, each with one table data cell, but still no table header row:

<table><tbody>
<tr><td>One</td></tr>
<tr><td>Two<br>and</td></tr>
<tr><td>Three</td></tr>
</tbody></table>

To produce a table heading row (a <thead> element in HTML), one has to use a table rule line with “=”:

| One
|=====
| Two
| and
|-----
| Three

Now the first line ends up as the content of the (single-cell) table heading row:

<table><thead>
<tr><td>One</td></tr></thead>
<tbody><tr><td>Two<br>and</td></tr>
<tr><td>Three</td></tr>
</tbody></table>
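The explicit row breaks in the single-column examples above can be traced with a short Python sketch. It is an illustration under stated assumptions only: the names `cell_text` and `single_column_rows` are hypothetical, the table rule test is inferred from the examples, and rules without “=” or “-” are simply skipped, so the paragraph structure shown in the verse example is not reproduced.

```python
import re
from typing import List, Optional, Tuple

# ASSUMPTION: a table rule has at least one "|" and otherwise only
# "-", "=" and white space (inferred from the examples).
_RULE = re.compile(r"^[ \t=|-]*\|[ \t=|-]*$")


def cell_text(line: str) -> str:
    """Strip the optional leading/trailing "|" and surrounding space."""
    text = line.strip()
    if text.startswith("|"):
        text = text[1:]
    if text.endswith("|"):
        text = text[:-1]
    return text.strip()


def single_column_rows(
    lines: List[str],
) -> Tuple[Optional[List[str]], List[List[str]]]:
    """Return (header, body rows); each row is a list of source lines."""
    header: Optional[List[str]] = None
    rows: List[List[str]] = [[]]
    for line in lines:
        if not _RULE.match(line):
            rows[-1].append(cell_text(line))
        elif ("=" in line and header is None
              and len(rows) == 1 and len(rows[0]) == 1):
            # "=" rule with exactly one non-rule line above: header row.
            header, rows = rows[0], [[]]
        elif "=" in line or "-" in line:
            # Explicit row break; a leading rule adds no empty row.
            if rows[-1]:
                rows.append([])
        # A rule without "=" or "-" adds no row in a single column.
    return header, [row for row in rows if row]


example = ["| One", "|=====", "| Two", "| and", "|-----", "| Three"]
print(single_column_rows(example))
# prints (['One'], [['Two', 'and'], ['Three']])
```

Each inner list holds the source lines of one cell; a renderer would join them with hard line breaks, as in the `<br>` of the HTML above.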

Multi-column tables are usually split into cells line by line:

| A1 | A2
| B1 | B2

which can be written somewhat terser as:

| A1 | A2
B1 | B2

or in the equivalent syntax (using a table rule line):

---|----
A1 | A2
B1 | B2

or (nicer?)

A1 | A2
---|----
B1 | B2

They all produce the same element structure information:

<table><tbody>
<tr><td>A1</td><td>A2</td></tr>
<tr><td>B1</td><td>B2</td></tr>
</tbody></table>
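That the four spellings above yield the same rows can be checked with a small Python sketch. It covers the regular case only, where every non-rule line holds plain, unindented text and therefore becomes one row; lists, block quotes and indented continuation lines are not handled. The names `split_columns` and `simple_rows` are hypothetical, and the table rule test is an assumption inferred from the examples.

```python
import re
from typing import List

# ASSUMPTION: a table rule has at least one "|" and otherwise only
# "-", "=" and white space (inferred from the examples).
_RULE = re.compile(r"^[ \t=|-]*\|[ \t=|-]*$")


def split_columns(line: str) -> List[str]:
    """Split a non-rule line at "|"; outer "|" are optional."""
    text = line.strip()
    if text.startswith("|"):
        text = text[1:]
    if text.endswith("|"):
        text = text[:-1]
    return [part.strip() for part in text.split("|")]


def simple_rows(lines: List[str]) -> List[List[str]]:
    """Regular case only: every non-rule line becomes one table row."""
    return [split_columns(line) for line in lines if not _RULE.match(line)]


variants = [
    ["| A1 | A2", "| B1 | B2"],
    ["| A1 | A2", "B1 | B2"],
    ["---|----", "A1 | A2", "B1 | B2"],
    ["A1 | A2", "---|----", "B1 | B2"],
]
for variant in variants:
    print(simple_rows(variant))  # each prints [['A1', 'A2'], ['B1', 'B2']]
```

In this regular case the rule lines carry no extra structural information, since a row break between the two lines is introduced anyway.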

Lists and blockquotes will not be split into adjacent cells in different rows: for example

------|-----
- A1a | A2a
- A1b | A2b
- B1  | B2

will only produce one row, because the left column contains a single unordered list.

But the first column here

------|-----
- A1a | A2a
- A1b | A2b
B1    | B2

has only a two-item list, followed by a new paragraph: this allows a row break below - A1b and above B1, and below A2b anyway, and we get:

<table><tbody>
<tr><td><ul><li>A1a</li><li>A1b</li></ul></td>
    <td>A2a<br>A2b</td></tr>
<tr><td>B1</td><td>B2</td></tr>
</tbody></table>

Note that the “line break” between A2a and A2b is reproduced again using a <br> in the upper-right table data cell.

Without the unordered list, we would get three rows out of the “regular” table

------|-----
A1a   | A2a
A1b   | A2b
B1    | B2

To keep “lines” in a “paragraph” (of the content fragments in a column) together, one can indent the following lines a bit (using 0 to 3 spaces, relative to the preceding “|” or line start). This “joins” the first line of a paragraph with the subsequent lines, and prevents row breaks from being inserted between them:

------|-----
A1a   | A2a
 A1b  | A2b
B1    | B2

will “join” the A1a and A1b in the left column, and similarly

------|-----
A1a   | A2a
A1b   |  A2b
B1    | B2

would “join” the A2a and A2b in the right column.

Both of these table blocks transform into the exact same element structure:

<table><tbody>
<tr><td>A1a<br>A1b</td>
    <td>A2a<br>A2b</td></tr>
<tr><td>B1</td><td>B2</td></tr>
</tbody></table>

Here the line breaks in both cells in the upper row are reproduced, and the resulting table structure shows a close similarity to the one in the example above using an unordered list—as do the CommonMark input texts for both examples.


© 2015 tin-pot.net. CC BY-SA 4.0 license applies.

$Date: 2015-11-08 13:08:13 +0100 (So, 08 Nov 2015) $