09 Dec 07

HTML-Aware Truncate Text

When building a large custom PHP CMS for DigitalPeach, I ran into a very difficult issue: truncating text while keeping nested HTML tags correctly balanced. Specifically, we were looking to break up large articles composed using FCKEditor into separate pages after a certain character threshold. One can easily see the problem:

<p>This is a test <strong>with some bold in here</strong>.</p>

Now imagine having to truncate this text to 30 characters, and you end up with this:

<p>This is a test <strong>with

While this example isn’t severe (at worst it would make the rest of the text within the block-level element bold), truncation that is blind to HTML can clearly cause serious problems. Furthermore, even if it doesn’t have much practical significance, you are breaking XHTML.
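To make the problem concrete, here is a minimal sketch in plain Ruby that cuts the sample markup at 30 characters and lists the tags left open. The unclosed_tags helper is hypothetical and exists only for this illustration; it is a rough regex-based check that does not handle self-closing tags, not a real HTML parser.

```ruby
html = "<p>This is a test <strong>with some bold in here</strong>.</p>"

# Naive truncation: counts markup characters and ignores tag boundaries.
naive = html[0, 30]

# Hypothetical helper: returns the names of tags that were opened but
# never closed. Illustration only, not a substitute for a parser.
def unclosed_tags(fragment)
  stack = []
  fragment.scan(%r{<(/?)([a-zA-Z][a-zA-Z0-9]*)[^>]*>}) do |closing, name|
    if closing.empty?
      stack.push(name)                  # opening tag
    else
      stack.pop if stack.last == name   # matching closing tag
    end
  end
  stack
end

puts naive                          # <p>This is a test <strong>with
puts unclosed_tags(naive).inspect   # ["p", "strong"]
puts unclosed_tags(html).inspect    # []
```

Both the paragraph and the strong element are left dangling, which is exactly what an HTML-aware truncation has to repair.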

I was lucky to stumble across an excellent article by Mike Burns, who describes a Ruby method using REXML’s pull parser that can accomplish this. His example extended the String class, so I modified it to work as a Rails helper, all in one method:

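require 'rexml/parsers/pullparser'

# Truncates input to at most len characters of text (markup is not counted),
# closing any tags left open at the cut. The input must be well-formed XML.
def truncate_html(input, len = 30, extension = "...")
  # Serializes an attribute hash as ' name="value" ...', with a leading space.
  def attrs_to_s(attrs)
    return '' if attrs.empty?
    ' ' + attrs.to_a.map { |attr| %{#{attr[0]}="#{attr[1]}"} }.join(' ')
  end

  p = REXML::Parsers::PullParser.new(input)
  tags = []
  new_len = len
  results = ''
  truncated = false
  while p.has_next?
    p_e = p.pull
    case p_e.event_type
    when :start_element
      tags.push p_e[0]
      results << "<#{tags.last}#{attrs_to_s(p_e[1])}>"
    when :end_element
      results << "</#{tags.pop}>"
    when :text
      text = p_e[0]
      if text.length > new_len
        # This text node does not fit: keep what fits and stop parsing.
        results << text[0, new_len]
        truncated = true
        break
      end
      results << text
      new_len -= text.length
    else
      # Any other event is kept as a comment so it is visible when debugging.
      results << "<!-- #{p_e.inspect} -->"
    end
  end

  # Close the tags still open at the cut, innermost first.
  tags.reverse.each do |tag|
    results << "</#{tag}>"
  end

  # Append the extension only if text was actually cut off.
  results + (truncated ? extension : '')
end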

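Note that the nested method definition above is valid Ruby, though not widely used. Be aware that it is not private to truncate_html: the inner def runs each time truncate_html is called and defines attrs_to_s as an ordinary method in the same class or module.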
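For readers who have not seen a nested def before, the following sketch shows how Ruby treats it. The Formatter class and the outer/inner method names are made up for this illustration and are not part of the helper above.

```ruby
# Hypothetical class used only to demonstrate nested method definitions.
class Formatter
  def outer
    # This def is executed when outer runs; it defines inner on Formatter.
    def inner
      "defined by outer"
    end
    inner
  end
end

f = Formatter.new
before = Formatter.method_defined?(:inner)   # inner does not exist yet
result = f.outer                             # running outer defines inner
after  = Formatter.method_defined?(:inner)   # inner is now a regular method

puts before   # false
puts result   # defined by outer
puts after    # true
```

In other words, the inner method is not scoped to the outer one; it simply comes into existence the first time the outer method is called, and is redefined on every later call.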

And now look at what it can do:

truncate_html("<p>Test <strong>bold</strong> done.</p>", 9)
# => "<p>Test <strong>bold</strong></p>..."
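When choosing a value for len, keep in mind that the method counts the limit against text nodes only, not against the markup. A quick way to see the difference for the sample string, using a simple regex to strip tags (an approximation for illustration, not how the helper itself works):

```ruby
html = "<p>Test <strong>bold</strong> done.</p>"

# Approximate the visible text by removing anything that looks like a tag.
text_only = html.gsub(/<[^>]*>/, '')

puts html.length        # 39 (text plus markup)
puts text_only          # Test bold done.
puts text_only.length   # 15 (text alone)
```

So a limit of 30 leaves this particular string's text untouched, even though the raw string is longer than 30 characters.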

  • Ben Hughes

    I'm a freelance developer working with Ruby and other modern tools to build web applications, based currently out of Rochester, NY. I love to learn about new technologies and am always trying to achieve elegance and beauty through code.

    When I'm not writing software, I like to play tennis, dabble in jazz piano, and ponder economics. I'm a big fan of: world travel and cultures, jazz music, Korean food, coffee, and having interesting conversations.
