RSS 2.0 is an incredibly useful and widely adopted standard. Unfortunately, the RSS 2.0 Specification is significantly underspecified in a number of areas. One such area is the form and meaning of the <description> element. The <description> element may appear in two different contexts within an RSS document: as a channel property or as an item property. In the context of a channel property, the specification says only that the description element contains a "Phrase or sentence describing the channel."
In the context of an item property, the specification says: A channel may contain any number of <item>s. An item may represent a "story"—much like a story in a newspaper or magazine; if so its description is a synopsis of the story, and the link points to the full story. An item may also be complete in itself, if so, the description contains the text (entity-encoded HTML is allowed), and the link and title may be omitted. All elements of an item are optional, however at least one of title or description must be present.
The specification is unclear or ambiguous in several ways. Specific ambiguities are addressed below, along with recommended practices designed to avoid, or at least minimize, interoperability problems.

Issue: Does a description element contain HTML markup?

It is difficult to answer this question automatically (in XSL, for example). Consider this channel description:

<description>Best Practices for Use of &lt;BR&gt; Tags in RSS 2.0 Descriptions</description>

One could reasonably interpret this, according to the specification, as being a "phrase or sentence describing the channel" about the use of <BR> tags. In this interpretation, the content does not contain HTML markup. One could just as reasonably interpret this as being a description of a channel about the use of tags, in general, where the description contains a line break in the middle of the text. In this interpretation, the content does contain HTML markup. Ideally, the specification will one day eliminate this ambiguity, perhaps with the addition of a description attribute such as content=html|text. Until this issue is addressed formally, interoperability problems will continue to surface.

Recommendation: Feed authors (or generators) should adhere to the following guidelines:

- Descriptions that contain HTML markup SHOULD contain at least one HTML tag.
- Descriptions that do not contain HTML markup SHOULD NOT contain a less-than character ('<').
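The check that these two guidelines make possible can be sketched in a few lines. The example below uses Python rather than XSL, and the feed fragment and the function name `description_has_markup` are hypothetical, chosen only for illustration. It assumes the feed author follows the guidelines above; it is not a general-purpose HTML detector.

```python
import xml.etree.ElementTree as ET


def description_has_markup(description_text: str) -> bool:
    """Return True if the parsed description text contains a '<' character."""
    return "<" in description_text


# Hypothetical feed fragment, invented for this example.
FEED = """<rss version="2.0"><channel>
<title>Example</title>
<description>Plain text about RSS</description>
<item><description>First line&lt;br&gt;second line</description></item>
<item><description>Fish &amp; chips, no markup here</description></item>
</channel></rss>"""

channel = ET.fromstring(FEED).find("channel")

# The XML parser has already decoded &lt; and &amp;, so the test runs on
# the parsed character values, not on the raw bytes of the feed.
channel_has_markup = description_has_markup(
    channel.findtext("description", default="")
)
item_results = [
    description_has_markup(item.findtext("description", default=""))
    for item in channel.findall("item")
]

print(channel_has_markup, item_results)
```

The test is applied after XML parsing, so an entity-encoded tag such as `&lt;br&gt;` is seen as `<br>` and counts as markup, while an escaped ampersand alone does not.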
These guidelines allow a simple test for the presence of markup (one that can be easily performed in XSL): does the (parsed) description content contain a '<' character? If so, the description contains markup. If not, it is plain text.

Issue: Can a channel description contain entity-encoded HTML?

The specification does not specifically indicate that it can, but many actual RSS feeds do carry HTML in the channel description.

Recommendation: Assume the content model for <description> is the same in all contexts. In other words, assume that channel descriptions, like item descriptions, might contain entity-encoded HTML markup.

Issue: How should non-ASCII characters be encoded?

The specification is silent on this question. The representation of a non-ASCII character in both HTML and RSS (i.e., XML) is a function of the character set in effect for the document. For example, the character code 147 decimal is a curly left double quote in the Windows character set (Windows-1252). This character is encoded as a single byte value when using ISO-8859-1. It is encoded as a two-byte sequence when using UTF-8. Problems arise when you combine these two facts:

- RSS documents are not limited to a particular character set; and
- RSS descriptions contain entity-encoded HTML.
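The byte-level mismatch behind the first fact can be checked directly. The following is an illustrative sketch in Python, not part of any RSS tooling; the codec names (`cp1252`, `iso-8859-1`, `utf-8`) are Python's, and the variable names are invented for the example.

```python
# The curly left double quote (U+201C) occupies position 147 in Windows-1252.
LEFT_QUOTE = "\u201c"
windows_bytes = LEFT_QUOTE.encode("cp1252")

# Character code 147 is one byte in ISO-8859-1 but two bytes in UTF-8.
latin1_bytes = chr(147).encode("iso-8859-1")
utf8_bytes = chr(147).encode("utf-8")

# Bytes written under a single-byte character set are not valid UTF-8,
# so a consumer that assumes the wrong character set cannot read them.
try:
    latin1_bytes.decode("utf-8")
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True

print(list(windows_bytes), list(latin1_bytes), list(utf8_bytes), decode_failed)
# prints: [147] [147] [194, 147] True
```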
Specifically, this combination makes it very difficult to use XSL reliably to generate HTML output from an RSS description. Because the description content already contains HTML markup as a string value, not as XHTML elements, the XSLT processor must be prevented from HTML-escaping the description content; otherwise the markup would be displayed as literal text. Unfortunately, disabling escaping also prevents the XSLT processor from correctly translating non-ASCII characters in cases where the RSS document and the HTML output document use different character sets (e.g., an ISO-8859-1 RSS document being used to generate a UTF-8 HTML document).

Recommendation: Descriptions that contain HTML markup SHOULD NOT contain non-ASCII byte values. An RSS author (or generator) can always avoid the use of non-ASCII characters by using character entities instead. For example, in the case of the non-ASCII character 147 decimal mentioned earlier, an RSS description should use the character entity &#147; instead of the literal byte value. Character entities are always expressible using ASCII characters, thereby avoiding the encoding issues.

Note for RSS Authors: Remember that an HTML entity such as &#147; or &mdash; must be properly escaped according to XML syntax when creating the RSS document. Specifically, the ampersand character in each entity must be escaped: &amp;#147; and &amp;mdash;, respectively.
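A generator could apply this recommendation in two steps: first replace every non-ASCII character in the HTML fragment with a numeric character reference, then escape the result for XML. The sketch below assumes Python; the function names `to_ascii_html` and `to_rss_description` and the sample fragment are hypothetical. Python's `xmlcharrefreplace` handler emits Unicode code points (for example `&#8220;`) rather than the Windows-specific `&#147;` used in the text above.

```python
from xml.sax.saxutils import escape


def to_ascii_html(html_fragment: str) -> str:
    """Replace each non-ASCII character with a numeric character reference."""
    return html_fragment.encode("ascii", "xmlcharrefreplace").decode("ascii")


def to_rss_description(html_fragment: str) -> str:
    """Build an ASCII-only <description> element holding entity-encoded HTML."""
    # escape() turns '&', '<' and '>' into '&amp;', '&lt;' and '&gt;', so the
    # ampersand of every character reference is escaped as the note requires.
    return "<description>" + escape(to_ascii_html(html_fragment)) + "</description>"


# Hypothetical HTML fragment containing curly quotes and a line break.
html_fragment = "He said \u201chello\u201d<br>and left"

ascii_html = to_ascii_html(html_fragment)
rss_description = to_rss_description(html_fragment)

print(ascii_html)
print(rss_description)
```

Because the output contains only ASCII bytes, it reads the same whether the RSS document is declared as ISO-8859-1 or as UTF-8.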