Developing regular expressions in an ad hoc sandbox


Regular expressions are little computer programs, so it is characteristic of regex searches that they must be written while studying the target data, and tested to achieve their potential precision and thoroughness. However, only a few of these intensive searches are technically able to run at a time against the database.[1] A sandbox minimizes your footprint, and guarantees that you will never run an untested regexp on every namespace in the wiki, even if your default search would let you do that.

A normal search runs quickly even when it targets the entire wiki, but a regexp search runs quickly only when filters restrict it to as few pages as possible. A filter is all or part of a database query. Filters include:

  • word(s) or phrase
  • intitle:
  • incategory:
  • hastemplate:
  • prefix: (always at the end)
  • linksto:
  • namespace: (always at the beginning)
  • insource:"word1 word2"
  • insource:word

Apart from namespace: (first) and prefix: (last), order is not important because the software optimizes the search before it is run.
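As a rough illustration of how the filters above combine into one search string, here is a minimal Python sketch. The function name `build_query`, its parameters, and the English page name used in the usage line are hypothetical, chosen only for this example; the sketch assumes nothing beyond the filter syntax listed above, and it places prefix: last as that list requires.

```python
def build_query(regex, prefix=None, filters=()):
    """Compose a search string: plain filters, then the regex, then prefix: last.

    A "/" inside the regex must already be written as "\\/" by the caller,
    because "/" delimits the pattern in insource:/.../ (hypothetical helper,
    it does no escaping of its own).
    """
    parts = list(filters)                # e.g. hastemplate:val, incategory:...
    parts.append(f"insource:/{regex}/")  # the expensive regexp part
    if prefix is not None:
        parts.append(f"prefix:{prefix}")  # prefix: always goes at the end
    return " ".join(parts)


# Illustrative usage with an assumed English page name.
query = build_query(r"ul?=ft", prefix="Help:Searching/Regex/Sandboxing",
                    filters=["hastemplate:val"])
print(query)
```

Keeping the regexp and the prefix in one helper makes it harder to run the pattern without a filter by accident.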

To target just one page while experimenting with or developing a regex search, target a fullpagename. From the search box use the filter prefix:fullpagename. From the edit box (of any section of the page with the target data), you can always just write prefix:{{FULLPAGENAME}} and it will "expand" for you to the fullpagename. Although you can edit a history page, technically a "history page" is not a page (in the database), and so {{FULLPAGENAME}} there will point to the database version (not its own rendering). For the same reason, you cannot search for the wikitext on a page that is not already saved (to the database), although you can certainly change the search parameters again and again with no need to save them.

Fullpagename is namespace:pagename. Knowing this you can adjust your prefix parameter. Although prefix can filter down to one page, it can filter up to a namespace, and it also accepts the beginning letter(s) of a set of pagenames if you want to reduce the namespace search domain.
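The way a prefix can be widened from one page up to a whole namespace can be sketched in Python. The generator name `widening_prefixes` is hypothetical, the example title is an assumed English page name, and the sketch assumes a title of the form namespace:pagename with "/" separating subpage levels.

```python
def widening_prefixes(fullpagename):
    """Yield prefix: values from the single page up to its whole namespace.

    Assumes the title contains a namespace, i.e. has the form
    "namespace:pagename" (hypothetical helper for illustration).
    """
    namespace, _, pagename = fullpagename.partition(":")
    yield fullpagename                       # narrowest: this page only
    parts = pagename.split("/")
    # Drop one subpage level at a time, keeping the trailing slash.
    for depth in range(len(parts) - 1, 0, -1):
        yield f"{namespace}:{'/'.join(parts[:depth])}/"
    yield f"{namespace}:"                    # widest: the whole namespace


for value in widening_prefixes("Help:Searching/Regex/Sandboxing"):
    print("prefix:" + value)
```

Each step enlarges the search domain, so a regexp would normally be tested at one level before moving to the next.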

Regex sandboxing uses an ad hoc sandbox made by editing any page containing the target data and using it as a "sandbox" (editing it without saving it). Development then proceeds by adding a search link that includes insource:/regexp/, with the filter prefix:{{FULLPAGENAME}} alongside.

Use of a sandbox enables the smallest possible footprint by using filters to limit the search domain. Once your regexp pattern is honed, you increase the search domain. A regex search is best run with filters, not alone, even if it is a polished regexp.

Sandboxing procedure


Entering an equals sign, a pipe character, and "quotes around phrases" in the search box is a straightforward matter, but it is still easiest to use a regex-based search-link template, {{regex}} or {{tlusage}}, on the page with the sample data, because then you can focus on the target data there and on writing the regexp pattern. It is easier, that is, if you already understand how templates "escape" the pipe character and the equals sign. See Help:Template#Parameters for other important details.
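The escaping mentioned here can be sketched in Python: inside a template call, a literal pipe is written {{!}} and a literal equals sign is written {{=}}, as the examples on this page do. The function names `escape_for_template` and `expand_escapes` are hypothetical, and the sketch covers only these two characters.

```python
def escape_for_template(regex):
    """Rewrite a regexp so it can be passed as an unnamed template parameter."""
    # "|" would otherwise end the parameter; "=" would turn it into a named one.
    return regex.replace("|", "{{!}}").replace("=", "{{=}}")


def expand_escapes(wikitext):
    """Undo the escaping, giving the pattern as the search receives it."""
    return wikitext.replace("{{!}}", "|").replace("{{=}}", "=")


raw = r"ul? *= *(ft|m)"
escaped = escape_for_template(raw)
print(escaped)
print(expand_escapes(escaped) == raw)
```

Passing the pattern through a named parameter such as pattern = avoids the need for {{=}}, but the pipe still has to be written {{!}}.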

The procedure here is an iterative, read-evaluate-modify cycle. Regex development requires that you study the target data while writing and rewriting its pattern.

  1. Navigate to a page with the wikitext instances you are interested in mining. Or create one yourself, and save it to the database so the query will find it.
  2. Open the wikitext, and enter a {{regex}} or {{tlusage}}.
  3. Show preview, and activate the search link. On the search results page, note the bold text in each match.
  4. Go back in your browser. Modify the regexp, and cycle until done. (Or don't go back, you may want to modify the query at the search box.)
  5. Expand the search domain, and test the accuracy of those results. You can trim or expand the number of the results using prefix:.

Caveat emptor: if you change the target data for immediate retesting, you'll have to save and purge the page first; if you change only the regexp, you need do neither.


As an ad hoc sandbox, you can show the wikitext of a section like this, (already saved in the database), modify some of the patterns in the regex-search-link template calls on this page, do a Show Preview, and see what matches when you click on the newly formed regex search-link, all quite safely, and without changing a thing in the database.

The template calls that produce "ft/s, 2 sq ft, 3 m/s, 4 m*s-2, 5 ft.s-2, 6 °C/J, and J/C" appear in the wikitext of this section like this:

  1. {{val|1|ul=ft/s|fmt = commas}}
  2. {{val|2|u=ft2}}
  3. {{val|3|u=m/s| fmt =commas }}
  4. {{val|4|u=m*s-2}}
  5. {{val|5|u=ft.s-2}}
  6. {{val|6|u=C/J}}
  7. {{val|7|ul=J/C}}

Note how the above targets are |numbered|, then click on the links below.

Query Search link Answer
Q1 Using {{search link}}, does this page employ template Val? {{sl|hastemplate: Val}}

hastemplate: Val

A. No, because this page is in the Help namespace, not the article namespace that {{search link}} searches by default. 1300 search results.
Q2 Using {{search link}} responsibly, does this page use Val's fmt parameter? {{sl|insource:/\{[Vv]al\{{!}}[^}]*fmt/ prefix:{{FULLPAGENAME}}}}

insource:/\{[Vv]al\|[^}]*fmt/ prefix:సహాయం:Searching/Regex/Sandboxing

A2.1. Look for 1 and 3 in the search results in bold text. (Adds an appropriate filter.)
Using {{regex}} instead... {{slre|\{[Vv]al\{{!}}[^}]*fmt}}

insource:/\{[Vv]al\|[^}]*fmt/ prefix:సహాయం:Searching/Regex/Sandboxing

A2.2 Less typing than {{search link}}.
Using {{template usage}} instead... {{tlre|Val|pattern=fmt}}

Testing fmt on this page

A2.3 Easiest for templates.
Q3. Who uses u=ft OR ul=ft? (one-letter differs) {{regex|ul?=ft}}

insource:/ul?=ft/ prefix:సహాయం:Searching/Regex/Sandboxing

A. Look for 1, 2, and 5 in bold text.
Using {{template usage}}... {{tlre|val|pattern = ul?=ft}}

Testing ul?=ft on this page

Finds same pattern, but only inside a Val template.
Q4. AND of these, who also uses fmt=commas after that? {{slre|ul?=ft.*commas}}

insource:/regexp/ prefix:సహాయం:Searching/Regex/Sandboxing

A. No context shown, but article title is shown. A half a Bug?
Who has one space before the word "commas"? {{slre|. commas}}

insource:/. commas/ prefix:సహాయం:Searching/Regex/Sandboxing

A. 1 but not 3.
Q5. Who uses either u or ul with "ft" OR uses "fmt=commas". {{slre|(ul? *= *ft{{!}}fmt *= *commas)}}

insource:/regexp/ prefix:సహాయం:Searching/Regex/Sandboxing

A. 1, 2, 3, and 5. (The pattern matches all possible spacing.)
Q6. Who uses ft or m, in |u= or |ul=? {{slre|ul? *{{=}} *(ft{{!}}m)}}

insource:/ul? *= *(ft|m)/ prefix:సహాయం:Searching/Regex/Sandboxing

A. 1, 2, 3, 4, and 5.

Used {{!}} for the alternation metacharacter. Used {{=}}. (Could have used named 1 = or nicely named pattern = .)

Q7. Who uses . or * in the unit code? {{tlre|val|pattern = u *= *(\.{{!}}\*)/}}

Testing u *= *(\.|\*)/ on this page

A. 4 and 5.
Who uses a pipe? {{regex|\|}}

insource:/\/ prefix:సహాయం:Searching/Regex/Sandboxing

A. All of them.
Q8. Who uses / or - within the |u= or |ul= parameter? {{tlre|val|ul? *= *[^{{!}}}]+(\/{{!}}-)}}

Testing ul? *= *[^|}]+(\/|-) on this page

A. 1,3,4,5,6 and 7.
Q9. Where is Val used in the template namespace for numbers only (no u, ul, up, or upl parameters)? {{tlre|val|pattern = ~(u[lp].)|prefix = 10}}

hastemplate:"val" insource:/\{\{ *[Vv]al *\|[^}]*~(u[lp].)/ prefix:మూస:

A. In the 30 or so templates listed.
Q10. Which articles use {{Convert}}'s and(-) option? {{tlre|convert|pattern=and\(-\)| prefix=0}}

hastemplate:"convert" insource:/\{\{ *[Cc]onvert *\|[^}]*and\(-\)/ prefix::

A. Coast Range Arc and Skipjack shad.
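Several of the answers above can be predicted offline before any search is run. The following Python sketch applies the patterns from Q2, Q3, Q5, Q6 and Q8 to the seven numbered template calls, one call at a time. It rests on two assumptions: Python's `re` module is only a stand-in for the search engine's own regex syntax (Q9's ~ operator, for example, has no `re` equivalent), and `re.search` is taken to mimic the way insource: matches anywhere in the text. The names `SAMPLES`, `PATTERNS` and `matching_items` are hypothetical.

```python
import re

# The seven numbered template calls from this section.
SAMPLES = {
    1: "{{val|1|ul=ft/s|fmt = commas}}",
    2: "{{val|2|u=ft2}}",
    3: "{{val|3|u=m/s| fmt =commas }}",
    4: "{{val|4|u=m*s-2}}",
    5: "{{val|5|u=ft.s-2}}",
    6: "{{val|6|u=C/J}}",
    7: "{{val|7|ul=J/C}}",
}

# Patterns as the search receives them, after {{!}} and {{=}} are expanded.
PATTERNS = {
    "Q2": r"\{[Vv]al\|[^}]*fmt",
    "Q3": r"ul?=ft",
    "Q5": r"(ul? *= *ft|fmt *= *commas)",
    "Q6": r"ul? *= *(ft|m)",
    "Q8": r"ul? *= *[^|}]+(\/|-)",
}


def matching_items(pattern):
    """Return the numbers of the sample calls in which the pattern is found."""
    rx = re.compile(pattern)
    return [n for n, text in SAMPLES.items() if rx.search(text)]


for label, pattern in PATTERNS.items():
    print(label, matching_items(pattern))
```

A local check like this costs the wiki nothing, but it does not replace the sandbox test, because the live search runs against the whole page source rather than isolated lines.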

In Q2, notice how the MediaWiki software ignores the spaces around parameters, whereas in Q4 it processes the spaces inside parameters. Q2 might have been solved with a plain insource:val fmt search, because "fmt" and "val" are whole words and fmt is rarely seen outside Val. How about hastemplate:val insource:fmt?