I am planning to merge a modified implementation of the message
formatting code that processes messages that are about to be displayed in
the chat conversation window. This code improves on the following aspects:
1. A different approach to handling (mixed) plain text and html
messages. I process the plain text message once, strictly leveraging the
fact that the code knows exactly when text is plain text and when it
gets turned into HTML (i.e. at least partially formatted for display).
2. Not using <plaintext> tags anymore, since these cancel any active
formatting, such as light-grey from system messages, headings and such.
3. Strict HTML escaping for any text that should be displayed as is,
this includes names of chat rooms, members and subjects. This is
important since IRC channel names allow for almost any character, so you
can create a small HTML document as a chat room name, and we'd like to
prevent HTML injection.
4. Now that <plaintext> tags are gone, I've taken a different approach
to processing messages based on the plain text parts of the message:
this approach searches for text in between the HTML tags for processing
and escapes it automatically such that you can process based on the
original content, instead of the HTML-escaped content.
5. Separate implementations, based on a public interface for the
preprocessing steps that are executed on all incoming messages. There is
an interface called 'Replacer' for which a number of implementations
exist that were previously available as individual methods. (A separate
package is created for these implementations.)
6. Fixed processing of URLs in HTML text. (Previously, hyperlinks were
only discovered in plain text messages.)
7. Fixed ordering of processing keywords vs. hyperlinks: previously if a
keyword was highlighted, it would break the URL and the URL wouldn't be
recognized as such any more.
8. And finally, I've improved the regexp that searches for plain text
content to recognize both tags (by <> brackets) and attribute values (by
quotes ") such that we can quite reliably distinguish parts of HTML
tags from actual text content.
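
To make points 3, 4, 5 and 8 a bit more concrete, here is a minimal
sketch in Java. It is not the code I intend to merge: the class name
ReplacerSketch, the single-method signature of Replacer, and the helpers
escapeHtml and processTextBetweenTags are all assumptions made up for
this illustration, and the regexp is a simplified stand-in for the real
one. The sketch only shows the idea of skipping over tags (including
quoted attribute values) and handing the text in between to a Replacer.
It leaves out unescaping and re-escaping of that text.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ReplacerSketch {

    /** Hypothetical shape of the interface; the real signature may differ. */
    interface Replacer {
        String replace(String text);
    }

    /** Escapes the characters that have a special meaning in HTML. */
    static String escapeHtml(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        for (char c : text.toCharArray()) {
            switch (c) {
                case '&':  sb.append("&amp;");  break;
                case '<':  sb.append("&lt;");   break;
                case '>':  sb.append("&gt;");   break;
                case '"':  sb.append("&quot;"); break;
                case '\'': sb.append("&#39;");  break;
                default:   sb.append(c);
            }
        }
        return sb.toString();
    }

    // Matches one tag. Quoted attribute values are consumed as a whole,
    // so a '>' inside quotes does not end the tag prematurely.
    // (Simplified pattern, assumed for this sketch only.)
    private static final Pattern TAG =
        Pattern.compile("<(?:\"[^\"]*\"|'[^']*'|[^'\">])*>");

    /** Applies the replacer to the text between tags; tags are copied as is. */
    static String processTextBetweenTags(String html, Replacer replacer) {
        StringBuilder out = new StringBuilder();
        Matcher m = TAG.matcher(html);
        int last = 0;
        while (m.find()) {
            out.append(replacer.replace(html.substring(last, m.start())));
            out.append(m.group());
            last = m.end();
        }
        out.append(replacer.replace(html.substring(last)));
        return out.toString();
    }
}
```

For example, escapeHtml("<b>#room</b>") yields
"&lt;b&gt;#room&lt;/b&gt;", so a chat room name containing markup is
shown literally instead of being rendered. Keeping each processing step
behind one small interface is also what lets keyword highlighting and
hyperlink detection be ordered explicitly, as described in point 7.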
I plan to push these changes by tomorrow evening if there are no
objections. I'd like to give you all a chance to respond beforehand,
because this touches some quite significant code. If I need to wait to
do some more testing, I'm also fine with that.
You can find the code at
Any feedback is welcome!