Steve Fenton

Advanced Ruby gsub with regular expressions

This post is really about the Ruby language gsub string method. It does contain a tiny bit of Jekyll hooks, but they are important to me and perhaps not to you. If you just want to know how to extract a match in gsub and use it in the output, scroll down to the bottom for the “final revelation”.

Let’s set the scene for the problem. I’m processing a custom Markdown block on a Jekyll site during a hook that fires after conversion, but before the result is written to disk.

The custom Markdown block looks like this:

:::hint

Some content to be shown in a hint box.

:::

By the time the text is handed to me, it’s already been mostly processed by Jekyll’s Markdown parser, so what we’re dealing with is something like:

<p>:::hint</p>

<p>Some content to be shown in a hint box.</p>

<p>:::</p>

However, what we really want is for the ::: syntax to trigger a container with a class called “hint” (or whatever text has been added by the author), like this:

<div class="hint">

<p>Some content to be shown in a hint box.</p>

</div>

Jekyll hooks

We’re running inside a Jekyll hook, so there is a file named custom_html.rb inside my _plugins directory with a simple hook defined…

Jekyll::Hooks.register :pages, :post_convert do |item|
    # Do something with the item
end

This is where item comes from in the examples below and I’ll leave out the hook-specific code to keep the examples short.

gsub basics

You can do a simple replace using strings and gsub, like this:

content = "
<p>:::hint</p>

<p>Test</p>

<p>:::</p>"

content = content.gsub(':::', '<div>')

puts content

Basic gsub usage looks for the first string, and replaces it with the second one.

You can see from the output that this replaces the ::: strings, but this isn’t enough to solve our requirement just yet.

<p><div>hint</p>

<p>Test</p>

<p><div></p>

Our problems are:

  • We can’t tell the difference between opening and closing tags if we just use ‘:::’
  • We still have those pesky paragraph tags that shouldn’t be there
  • We are missing the class name and the text for it is now content

We can use our problem to explore some more advanced use cases for gsub.

gsub with regular expressions

We can tell the difference between a start and end tag using a regular expression. Don’t shudder, it’s not going to be that bad. The syntax for using a regular expression is shown below.

We use one gsub to find the opening tag, including the surplus paragraphs, and one to find the closing tag, replacing them as appropriate.

content = "
<p>:::hint</p>

<p>Test</p>

<p>:::</p>"

content = content
    .gsub(/<p>:::[a-z]+<\/p>/, '<div>')
    .gsub(/<p>:::<\/p>/, '</div>')

puts content

The key part of the regular expression is that [a-z]+ part, which explains that we expect to find some extra text on the opening tag that isn’t there on the closing tag.

Here’s the output.

content = "
<p>:::hint</p>

<p>Test</p>

<p>:::</p>

"

content = content
    .gsub(/<p>:::[a-z]+<\/p>/, '<div>')
    .gsub(/<p>:::<\/p>/, '</div>')

puts content

Our output is now valid HTML, but our class name is still missing. We’ll tackle that next.

Using a match from the regular expression in the output

We just need to fine tune our regular expression now to get hold of that class name, so we can use it in the output.

The first part of the change is to wrap parentheses around the text match, to say we want to capture the text that is found. To put it another way [a-z]+ already finds the text we want, but ([a-z]+) will keep hold of it for later use. Who knew brackets were so meaningful.

The second part of our update is to use the text we found in the output. The syntax for this is \1. Where you have multiple matchers, they are all numbered, so the next one is \2 and so on.

As we want to use the text as the class name, we’ll use '<div class="\1">'

content = "
<p>:::hint</p>

<p>Test</p>

<p>:::</p>

"

content = content
    .gsub(/<p>:::([a-z]+)<\/p>/, '<div class="\1">')
    .gsub(/<p>:::<\/p>/, '</div>')

puts content

Our output is now exactly what we want. We’re converting a markdown block into an HTML block with the appropriate class name.

<div class="hint">

<p>Test</p>

</div>

The key parts

To summarise, here’s the line of code with important bits called out:

content
    .gsub(/<p>:::([a-z]+)<\/p>/, '<div class="\1">')
#         ^ / starts and ends the regular expression
#                ^ brackets create the capture group
#                                             ^ \1 uses the first match in the output

Written by Steve Fenton on