Irregular Expressions
I’m not a big fan of the Regular Expression methods that come with ColdFusion. I find them cumbersome, at best. They’re great if you just want to check a string for a pattern, but if you’re replacing data using any sort of algorithm you’re hosed.
As an example, I’m working on a very simple content editor for a client’s website. A feature of the system is that they can type link names as plain text. Thus, if they want to link to a page that they call "About Us", they simply click the link icon in TinyMCE and type in the English words, "About Us". When the content form is submitted, the system translates this into a more meaningful URL, "index.cfm?event=About-Us".
The problem is that the site only receives the HTML markup which might look like this:
<p>Click here to read <a href="About Us" />about us</a>. </p>
<p>And, as an example, here’s <a href="Another Link" />another link to see</a>. </p>
<p>Lastly, here’s a real URL to <a href="http://www.anothersite.com" />another site</a>. </p>
So, the idea is, I loop over all the matches I can find in the HTML and check the href value to see if it starts with "https://" or "http://", or contains "index.cfm". If so, I ignore it. Otherwise, I manipulate the href value so that I insert index.cfm?event= and replace all non-alphanumeric characters with hyphens.
So, the first thing I always do when I run into problems like this is try to use ReReplace. Frankly, I’m no regular expression wizard. I can usually slink by, but if there’s any solution to this problem with just regular expressions I sure don’t know.
The problem is that I need to make decisions conditionally and ReReplace statically replaces instances of a pattern in a string with another string or a limited pattern.
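Here’s a quick sketch of what I mean. The sample string and the example.com URL are made up for illustration; the point is only that the replacement is a fixed template with backreferences, applied the same way to every match.

```cfml
<!--- Hypothetical sample input, not the client's real markup --->
<cfset sample = '<a href="About Us">about us</a> and <a href="http://example.com">a real URL</a>' />

<!--- The replacement is a fixed template: \1 is whatever the group captured --->
<cfset result = ReReplace(sample, 'href="([^"]+)"', 'href="index.cfm?event=\1"', "all") />

<!--- Both links get rewritten, including the one that was already a full URL --->
<cfoutput>#HTMLEditFormat(result)#</cfoutput>
```

There is no place in that call to say "skip this one" or "swap the spaces for hyphens first".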
So, to do what I want I typically write a loop tag that loops until I break out of it. On each iteration of the loop I look for the pattern with ReFind. I parse the heck out of the results and then rebuild the string I’m searching over. Lastly, I figure out where the next search should begin.
Gawd do I despise this technique. So, I decided to work out a better solution.
Wouldn’t it be nice if there was a way to call a handler function on each match that simply had to parse that one match and return the result? Well, here’s a very simple CFC that does just that:
<cfcomponent>
<cffunction name="parse" access="public" hint="I am a function that can be used to more easily parse a set of matches in a particular way using regular expressions" output="false" returntype="string">
<cfargument name="string" hint="I am the string to parse" required="yes" type="string" />
<cfargument name="regex" hint="I am the regex to use" required="yes" type="string" />
<cfargument name="callbackFunction" hint="I am the call back function to use. This must accept one argument, an array of elements in one match." required="yes" type="any" />
<cfset var location = 1 />
<cfset var match = 0 />
<cfset var full = 0 />
<cfset var detail = 0 />
<cfset var result = 0 />
<cfset var x = 0 />
<cfset var left = 0 />
<cfset var right = 0 />
<cfset var change = 0 />
<cfloop condition="true">
<cfset detail = ArrayNew(1) />
<cfset match = ReFind(arguments.regex, arguments.string, location, true) />
<cfif match.len[1] IS 0>
<cfbreak />
</cfif>
<!--- parse the match into chunks --->
<cfloop from="1" to="#ArrayLen(match.len)#" index="x">
<cfset detail[x] = StructNew() />
<cfif match.len[x]>
<cfset detail[x].string = Mid(arguments.string, match.pos[x], match.len[x]) />
<cfelse>
<cfset detail[x].string = "" />
</cfif>
<cfset detail[x].pos = match.pos[x] />
<cfset detail[x].len = match.len[x] />
</cfloop>
<cfif match.len[1]>
<cfset left = left(arguments.string, match.pos[1] - 1) />
<cfset right = right(arguments.string, Len(arguments.string) - (match.pos[1] + match.len[1]) + 1) />
<cfset change = arguments.callbackFunction(detail) />
<cfset arguments.string = left & change & right />
</cfif>
<cfset location = len(left) + len(change) + 1 />
</cfloop>
<cfreturn arguments.string />
</cffunction>
</cfcomponent>
This CFC has one method, parse, which accepts a string, a regular expression and a reference to a function that will handle matches. The handler receives an array of structs: one for the whole match and one for each parenthesized subexpression. Each struct has the keys "string", "len" and "pos", which hold the matched text, its length and its position in the source string.
Let’s say I used this regular expression…
(<a href="(.+?)".*?>)(.+?)(<\/a>)
I would end up with five elements in my array: the entire link, the opening tag, the value of the href attribute, the text in the link, and the closing tag. Each of these elements would have the matched string, i.e., element 4 would be "about us" for the first match in the HTML example above.
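If you want to see those five pieces for yourself, something like this rough sketch should do it. It runs ReFind directly on the first line of the sample HTML and prints each subexpression; the variable names html, match and i are just placeholders I picked for the example.

```cfml
<!--- First line of the sample HTML from above --->
<cfset html = '<p>Click here to read <a href="About Us" />about us</a>. </p>' />

<!--- Passing true as the fourth argument returns pos and len arrays for each subexpression --->
<cfset match = ReFind('(<a href="(.+?)".*?>)(.+?)(<\/a>)', html, 1, true) />

<!--- Element 1 is the whole match, elements 2 to 5 are the four groups --->
<cfloop from="1" to="#ArrayLen(match.len)#" index="i">
	<cfoutput>#i#: #HTMLEditFormat(Mid(html, match.pos[i], match.len[i]))#<br /></cfoutput>
</cfloop>
```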
So, I can write a handler method like this:
<cffunction name="fixLinks" output="false" returntype="string">
<cfargument name="match" />
<cfset var return = "" />
<cfset var link = "" />
<!--- parse any string that doesn’t start with http://, https:// or contain index.cfm --->
<cfif Left(arguments.match[3].string, 7) IS NOT "http://" AND Left(arguments.match[3].string, 8) IS NOT "https://" AND NOT FindNoCase("index.cfm", arguments.match[3].string)>
<!--- open the link tag --->
<cfset return = arguments.match[2].string />
<!--- fix the href in the link tag --->
<cfset link = "index.cfm?event=" & ReReplace(Trim(arguments.match[3].string), "\W", "-", "all") />
<cfset return = replace(return, arguments.match[3].string, link) />
<!--- add the link text --->
<cfset return = return & arguments.match[4].string />
<!--- close the tag --->
<cfset return = return & arguments.match[5].string />
<cfelse>
<cfset return = arguments.match[1].string />
</cfif>
<cfreturn return />
</cffunction>
This accepts the match and parses it. It simply checks whether the href is an http, https or index.cfm link; if not, it rewrites the href into the standard index.cfm?event= form.
Making this work is as simple as two remaining lines of code:
<cfset var RegEx = CreateObject("Component", "model.regex.Regex") />
<cfset content = RegEx.parse(content, '(<a href="(.+?)".*?>)(.+?)(<\/a>)', fixLinks) />
Assuming we use the HTML above, the output from this method is:
<p>Click here to read <a href="index.cfm?event=About-Us" />about us</a>. </p>
<p>And, as an example, here’s <a href="index.cfm?event=Another-Link" />another link to see</a>. </p>
<p>Lastly, here’s a real URL to <a href="http://www.anothersite.com" />another site</a>. </p>
How handy is that? Isn’t it cool what can happen when you start coding to interfaces?

What about something like this?
Is there something I’ve missed in it?
Oops. Here’s a better version that doesn’t hide in the HTML:
<cfset content = rereplacenocase(content,"(<a href="")(\w+)(\s+)(\w+)("">)","\1index.cfm?event=\2-\4\5","all")>
Check for a link when the user enters the text, and if it’s not a known protocol wrap it in brackets.
Then, for the CF you can just do this:
<cfset content = REReplace(content,'href="\[([^\]]*)\]"','href="index.cfm?event=\1"','all')/>
Sammy / Peter – I figured someone would submit one line of code to invalidate my examples!
Anyhow, the thing I like about my Regex CFC is that I can handle every match on a case-by-case basis. Thus, I could do something like search for any HTML tag and decide what to do with each one individually.
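As a rough sketch of that idea, here’s a second handler that leaves most links alone and only touches the external ones. The handler name markExternalLinks, the "external" class and the sample content are all made up for illustration, and I’m assuming the CFC lives at model.regex.Regex as in the post.

```cfml
<!--- Hypothetical handler: tags links that start with http:// or https://, leaves the rest alone --->
<cffunction name="markExternalLinks" output="false" returntype="string">
	<cfargument name="match" />
	<cfif REFindNoCase("^https?://", arguments.match[3].string)>
		<!--- external link: add a class attribute to the opening tag --->
		<cfreturn Replace(arguments.match[1].string, "<a ", '<a class="external" ', "one") />
	</cfif>
	<!--- anything else is returned untouched --->
	<cfreturn arguments.match[1].string />
</cffunction>

<!--- Made-up sample content with one internal and one external link --->
<cfset content = '<a href="index.cfm?event=About-Us">about us</a> <a href="http://www.anothersite.com">another site</a>' />
<cfset RegEx = CreateObject("component", "model.regex.Regex") />
<cfset content = RegEx.parse(content, '(<a href="(.+?)".*?>)(.+?)(<\/a>)', markExternalLinks) />
```

Same parse call, same regular expression, but a different decision per match.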
Doug
Well, I do like that about it too! =)