It depends on your regex engine. The author here is concatenating all of the paths into a single pattern, an automata based engine would ideally compile away the disjunction and offer performance linear with the input path length.
It's funny that people are coming up with these libraries while comparing performance to the Symfony routing component, yet Symfony does exactly what you describe (dumping the routes to more or less a trie structure) if you activate caching, which is done in the standard framework distribution.
Not that I'm slagging off their efforts, in fact I plan to use FastRoute in one of my projects.
Now that I read the article, seems like they're merging all the regexps into one, which really could be O(n). However, the trie will not be O(log n). You have to at least look at every character in the input, so there's a hard O(n) lower bound.
You're confusing n size of route set, with k length of the path to be matched. With a trie the worst case running time to match a route grows O(log n) with an array of regexps it's O(n). The length of a path string to be matched is independent of how many routes are in the route set.
In theory, the combined regex could be O(1) in the size of the route set.
Of course, in practice it's not. PCRE is backtracking-based rather than automaton-based, so increasing the complexity of the regex does have an effect on runtime.
Which probably means we should be using a NFA-based engine for routing, rather than PCRE, because I don't think "don't backtrack" is a very serious restriction for routing. (Yes, of the hundreds of developers who may read this, one of them may have used backtracking in a routing expression. But in general I don't think it's that serious a restriction.)
It's a shame that the regex API doesn't seem to provide a way to indicate which part of a (?:foo|bar|...) matched. The underlying regex code must know this. A large part of the article is coming up with workarounds to solve this problem.
PCRE actually provides this information. There's a feature called backtracking control verbs, and one of those is (* MARK:NAME). You could put a (* MARK:A), (* MARK:B), etc at the end of every path and PCRE would tell you which of the paths was taken. However PHP does not expose this information - just like it does not expose the (?C) callout feature :(
Maybe I should implement those two things.
Note: There is no whitespace between * and MARK, but if I remove it HN formats it as an italic string.
It could probably be done by actually doing a replacement. Make all but the first group in each route non-capturing, and then replace it with something like " $1 $2 $3 .... $n", and count the spaces. You'd have to then do another regex just for that route with all groups capturing. So for 100 routes with 9 placeholders each you would do 1 regex replace that worst-case had to populate 100 groups, and then a second replace for the identified route to populate 9 capturing groups. Does that make sense?
Not sure I understand this comment. Your code snippet uses named subpatterns, but not within a (?| group. The interesting question is whether you can combined those two. Without (?| you can use named captures in PCRE as well (there's just little point in this context as a non-named capture would do equally well and be more efficient.)
Oh, hmm, yeah, fair point. I was mistakenly assuming that using named captures would obviate the creation of N spurious capture groups but a quick benchmark proves that horribly wrong. Sorry all!
Perl 5.10 from 2007 supports named captures. (Google confirmed that.)
Edit: nikic in the parallel comment is correct, named captures are affected the same way in Perl's regexp engine with ?|. (Google perlre and search for ?|.)
> It's a shame that the regex API doesn't seem to provide a way to indicate which part of a (?:foo|bar|...) matched. The underlying regex code must know this.
As lmm points out (I think), one can get a big advantage precisely by not requiring the regex code to know this. (Regexes are 'really' DFAs, and a DFA that makes it to its start state not just does not but can not know how it got there.) The more power one gives regexes, the less, well, regular they are, and the slower they become to match.
A DFA can give you more than just a yes/no decision, though sometimes they're narrowly defined as only machines that give yes/no decisions. But if you have multiple accept states and number or label the states, a DFA can report not only 'accept' but 'accept' plus the label of the accepting state (this could be seen as a special case of a Moore machine, in which you only care about the last output).
You'd have to compile the DFA differently, of course. A really simple version would just compile each alternation into its own DFA separately, and then chain them together with "try each in sequence". That's still a DFA, so it's not a matter of a different class of computations, though it's often slower by a constant factor (due to not being able to share any information between the alternations when matching them).
Good point: a DFA can do exactly what joosters asked, namely, distinguish between "matched regex '(a|b)' to input 'a'" and "matched regex '(a|b)' to input 'b'". However, it may be worth noting—as I'm sure you know, but others may not—that there is no way for even a juiced-up DFA to say, for example, how many a's were actually in the input when the regex is 'a * ' (spaces to fool the Markdown processor).
P.S. I might as well mention that I obviously meant to refer to a DFA reaching its end, rather than start, state.
You can compile the DFA exactly as you would with a single regex as input, just labeling each alternative's accept state differently. The subset construction would put a set of these labels on each DFA accepting state. When I coded this I had it break these ties by picking the minimum label in each set, so earlier alternatives win:
Thanks; I hadn't heard of this. To save others hunting down the link, here's a StackOverflow question with a (very little) discussion, a link, and related code:
Then just prepend 'xxx' to the front of uri in the call to preg_match. Now the second entry in matches tells you which handler, so for example for uri = '/user/nikic/42' you pass 'xxx/user/nikic/42' and get back 'xxx/user/nikic/42', 'x', 'nikic', and '42' where that 'x' tells you it's the first handler.
I'm not an expert in how php does REs but I imagine there may be a way to use preg_replace or similar to replace the first '/' in the uri with 'x', 'xx', or 'xxx' and still get a the groups returned for the matches for user and id say. This could save a potentially costly string creation.
You can use a lexer, yes - but any fast lexer implementation in a high-level language will actually use the very same technique described here. Generating a character-comparison + goto based FSM works great in C, but has bad performance in PHP (and other dynamic languages).
There's a common misconception that regular expressions are slow, but in many cases they offer much better performance than a manual implementation with lots of strpos etc. In the end PCRE does the same thing, but in C and using a JIT compiler. Apart from very simple cases you don't have a chance to keep up with that using PHP code.
Fast until you hit the catastrophic backtracking problem, and then your routing code spins indefinitely. I've hit a production issue of this form.
For this kind of problem you need to use actual regular expressions, not Perl's bastardization of the concept. Or produce your own DFA - it's not as hard as it looks, and gives you a much better idea what's going on.
You won't hit catastrophic backtracking problems with the types of regular expressions you need for routing.
Even if you do use some very weird expression, the worst that can happen is that you hit the backtracking limit and a route fails to match because of that. There won't be any code spinning indefinitely.
its interesting to see this. i don't often use regular expressions but for the most general case of this problem they are a good fit - however the specific example given, and most real world cases i imagine can be optimised further by doing away with regex altogether and using more traditional lookup and transformation approaches.
the example given can determine the route required by everything following /user/ - in the case that the 7th character of the string is a digit then we have the last case - all that remains to distinguish the other two is to count the number of slashes - if you want to distinguish bad matches then validating the extracted values will be necessary too. of course this kind of logic fails to scale to the general case - but for real world optimisation i can imagine it being much more fruitful (possibly you could do a match like i described faster than the regex function will parse its own input string before doing any matching at all!)
there is the smell of a very classical problem here though - "guess optimising". optimisation should be driven by measurement - identify the slow part then optimise it. timings that show the performance of the solutions overall are interesting and prove the point - but what would be more interesting imo would be 'this was the hot spot (point at timing), therefore i was right (point at new timing)'.
The chunking seems like a missed opportunity. It still uses a loop, so it's still O(n) in the number of routes. That is, 200 routes will take twice as long. 1000 routes will take 10 times as long. But maybe 100 routes is all we ever see in reality?
Regardless, it's a small leap from that to using a tree. I would like to have seen timings for that, and for larger numbers of routes.
Building a trie from a list of strings is trivial. Building one from a list of regexes is not. But you don't need the complexity of inspecting or interpreting the regexes.
With a regular binary tree, you'll need to compile multiple regexes. The root is a regex with all urls in only two groups: (a|b|c|d)|(e|f|g|h). If the first group is non-empty, then you move to the next node: (a|b)|(c|d). If the second group is non-empty, then you try (c)|(d). If the first group of that is non-empty, then the url was c. That should be O(k * log n) where k is the length of the url and n in the number of urls.
The grouping method requires 10 regex matches for 100 routes. 2^10 == 1024, so the tree method can do 10 times more routes in about the same amount of work. A million routes with a tree is only twice as hard as a thousand, or twice as hard as a hundred with chunking. A million routes with chunking in 10,000 times harder than a hundred.
On the other hand, the tree requires O(n * log n) memory, where n chunks of constant size requires only O(n) memory. Perhaps that's significant.
edit: misplaced asterisks triggered italics.
edit2: Actually, you don't even need to add or check groups. At each node you only need to combine half the regexes into one with no outer group. For example, the top node regex from above could be just a|b|c|d. If it matches, go to the left node (which is just a|b). If it doesn't match, go to the right node (which is just e|f). As an edge case, the right-most tip node will all need to have a regex to distinguish between it and non-matching urls. This also makes it easier to dynamically add routes without recompiling the whole tree. Just add the route to the right side of each node, which requires no work until you get to the rightmost tip. Occasionally rebalance. Is dynamically modifying the routes something that people do?
Or how about this: a single regex, but with groups that you can walk like a tree:
(((a)|(b))|((c)|(d))) | (((e)|(f))|((g)|(h)))
The number of groups will be O(n * log n). So 10-20 times the number of urls at most. It's certainly not quadratic.
The groups will in depth-first order. It's slightly complicated by the fact that the regexes contain capturing groups themselves, but you can precompute an offset table that accounts for them.
If you use names like that, walking the tree is easy. Just concatenate "0" or "1" to the current node node to find the child nodes. Also, the final matching group name is a binary number corresponding to matching route's index into the list of routes. For example, url "e" has the name "100". That's a binary 4, which is url "e"'s position in the list.
I built a router like this for Rails back in the 2.3 era. It was ridiculously fast for small route sets, but as the sets grew, the standard approach was actually faster. If your route set is small and routing is a bottleneck, though, this approach might work for you. :)
(I don't know how it would stand up against Journey, the Rails 4 router business. I imagine it's only gotten faster so there may be next to no gain)
Did you read on to the end? I also had the problem that performance degraded with large route sets, but resolved it by matching in chunks of ten routes.
Oh that is so neat how you bundle the whole thing up and then extract which one was hit! Bravo at exploiting the significant regex compiling technology to the max. That's one trick that I will remember.
It's not like you're writing the regular expressions and lookup tables per hand. You write the routes in whatever format you deem nice and it's automatically translated from there ;)
Because of regular expressions? I've seen a few frameworks do routing that way and it seemed clean to me. Are you saying nightmare because regexes are nightmarish, or something?