Regular expression (PCRE) for URL matching - php

The input: we get some plain text as an input string and we have to highlight all URLs in it with <a href={url}>{url}</a>.
For some time I've used a regex taken from http://flanders.co.nz/2009/11/08/a-good-url-regular-expression-repost/, which I modified several times, but it was built for a different problem: checking whether the whole input string is a URL or not.
So, what regex do you use in such issues?
UPD: it would be nice if answers were related to php :-[

Take a look at a couple of modules available on CPAN:
URI::Find
URI::Find::Schemeless
where the latter is a little more forgiving. The regular expressions are available in the source code (the latter's, for example).
For example:
#! /usr/bin/perl
use warnings;
use strict;
use URI::Find::Schemeless;
my $text = "http://stackoverflow.com/users/251311/zerkms is swell!\n";
URI::Find::Schemeless
  ->new(sub { qq[<a href="$_[0]">$_[1]</a>] })
  ->find(\$text);
print $text;
Output:
<a href="http://stackoverflow.com/users/251311/zerkms">http://stackoverflow.com/users/251311/zerkms</a> is swell!

For Perl, I usually use one of the modules defining common regex, Regexp::Common::URI::*. You might find a good regexp for you in the sources of those modules.
http://search.cpan.org/search?query=Regexp%3A%3ACommon%3A%3AURI&mode=module
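If pulling in a module is not an option, a single regex pass is often enough for highlighting. Here is a minimal sketch in Python rather than PHP; the pattern only uses syntax that PCRE also accepts, so it should carry over to preg_replace_callback. The pattern, the name linkify and the trailing-punctuation rule are my own assumptions for illustration, not a complete URL grammar.

```python
import html
import re

# Assumed heuristic: a URL starts with http:// or https:// and runs until
# whitespace, a quote, or an angle bracket.
URL_RE = re.compile(r"\bhttps?://[^\s<>\"']+", re.IGNORECASE)

def linkify(text):
    """Escape plain text for HTML and wrap each URL in an <a> element."""
    parts = []
    pos = 0
    for m in URL_RE.finditer(text):
        # Trailing punctuation usually belongs to the sentence, not the URL.
        url = m.group(0).rstrip(".,;:!?)")
        parts.append(html.escape(text[pos:m.start()]))
        safe = html.escape(url)
        parts.append(f'<a href="{safe}">{safe}</a>')
        pos = m.start() + len(url)
    parts.append(html.escape(text[pos:]))
    return "".join(parts)

print(linkify("See http://stackoverflow.com/users/251311/zerkms, it is swell!"))
```

Escaping the surrounding text matters here because the input is plain text that ends up inside HTML.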


Using php regex to translate output buffer, but not within HTML tags

I have an array with strings to translate ($translation), and I want to use it to translate the output buffer. However, it should not replace within html tags. I have tried using php DOM, but this is too slow and probably too complex for what I want to do.
I currently use this code, but this of course also translates between tags.
$output = ob_get_clean();
foreach ($translation as $original => $translated) {
    $output = str_replace($original, utf8_encode($translated), $output);
}
I guess I should use a regular expression to replace not within HTML tags, but I can't seem to find the correct expression to do this. Can anyone help? Thanks.
Aside from opinions on the original idea:
I would not use a regexp for that, for performance reasons. You could use strpos($html, '<') and strpos($html, '>') in combination with substr to extract the text between tags string by string.
But if somebody (including you) ever has to change the results at another point, then I suggest you go the extra mile and implement a 'proper' translation.
My recommendation:
look into gettext
filter out the strings like mentioned above and generate a .mo -file
encapsulate the strings between the tags with the gettext-functions (like here)
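For the narrower goal of leaving tags untouched, the idea of cutting the page at tag boundaries can be sketched like this. It is written in Python as an illustration only; in PHP, preg_split with PREG_SPLIT_DELIM_CAPTURE gives the same kind of alternating list. The function name and the sample data are made up, and the tag pattern is a deliberate simplification.

```python
import re

# Anything from "<" to the next ">" is treated as a tag. Assumed
# simplification: this breaks if an attribute value contains ">".
TAG_RE = re.compile(r"(<[^>]*>)")

def translate_outside_tags(html_text, translation):
    # With a capturing group, re.split keeps the tags in the result:
    # even indexes hold text, odd indexes hold tags.
    pieces = TAG_RE.split(html_text)
    for i in range(0, len(pieces), 2):
        for original, translated in translation.items():
            pieces[i] = pieces[i].replace(original, translated)
    return "".join(pieces)

page = '<a title="Home" href="/">Home</a> <b>Contact</b>'
print(translate_outside_tags(page, {"Home": "Start", "Contact": "Kontakt"}))
```

The title attribute keeps its original "Home" because it sits inside a tag, while the link text is translated.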

Writing A Regular Expression For Search And Replace Function In Dreamweaver

I am trying to update hundreds of lines of comments in my php files. My editor allows me to use regular expressions to perform a search and replace. However, I don't know enough about regular expressions to write one. Please refer to the example below.
Dump($Data1, 'Library_reports.php - Get_Filtered_InventoryReport() - $Data1');
Dump($Data2, 'Library_reports.php - Get_Filtered2InventoryReport() - $Data2');
Dump($Data3, 'Library_reports.php - GetFilteredInventoryReport() - $Data3');
to be replaced with
Dump($Data1, __METHOD__.' - $Data1');
Dump($Data2, __METHOD__.' - $Data2');
Dump($Data3, __METHOD__.' - $Data3');
So basically, I want to search for
'Some_Alphanumeric_string()
and then replace it with a
__METHOD__.'
Give it a try: [A-Za-z0-9_]() it's nothing complicated here.
Edit:
[A-Za-z0-9_]+\(\)
StackOverflow eats my backslashes :)
Search with:
([a-zA-Z0-9]+)\(\)
Replace with:
^ intentionally left blank
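Note that the desired result also drops the opening quote and the file name, which a pattern matching only the function name would leave in place. A sketch of a search that covers the whole prefix, tried out in Python here; the pattern is my own assumption based on the three sample lines, and whether a given editor's dialog accepts the lazy quantifier *? should be checked before a mass replace.

```python
import re

# An opening quote, then anything that is not a quote, up to the first "()".
PATTERN = re.compile(r"'[^']*?\(\)")

def use_method_constant(line):
    """Swap the quoted 'file - function()' prefix for __METHOD__.'"""
    return PATTERN.sub("__METHOD__.'", line)

before = "Dump($Data1, 'Library_reports.php - Get_Filtered_InventoryReport() - $Data1');"
print(use_method_constant(before))
```

In a search-and-replace dialog the equivalent would be searching for '[^']*?\(\) and replacing with __METHOD__.'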
Based on your description, this search regex will do the trick:
\b[a-z0-9_]+\b\(\)
...assuming you do case insensitive search. (It's an option in the Dreamweaver search/replace tool).
Otherwise:
\b[A-Za-z0-9_]+\b\(\)
Note: I've included the underscore in the character class based on your use of them in:
"Some_Alphanumeric_string()"

Counterpart to PHP’s preg_match in Python

I am planning to move one of my scrapers to Python. I am comfortable using preg_match and preg_match_all in PHP. I am not finding a suitable function in Python similar to preg_match. Could anyone please help me in doing so?
For example, if I want to get the content between <a class="title" and </a>, I use the following function in PHP:
preg_match_all('/a class="title"(.*?)<\/a>/si',$input,$output);
Whereas in Python I am not able to figure out a similar function.
You are looking for Python's re module.
Take a look at re.findall and re.search.
And since you mention that you are trying to parse HTML, use an HTML parser for that. There are a couple of options available in Python, such as lxml or BeautifulSoup.
Take a look at this: Why you should not parse html with regex
I think you need something like this:
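import re
# html_text holds the page source
output = re.search(r'a class="title"(.*?)</a>', html_text, flags=re.IGNORECASE)
if output is not None:
    output = output.group(0)  # whole match; group(1) is the captured part only
    print(output)
You can add (?s) at the start of the regex to enable dot-all mode, so that . also matches newlines (the equivalent of the /s flag in PHP):
output = re.search(r'(?s)a class="title"(.*?)</a>', html_text, flags=re.IGNORECASE)
if output is not None:
    output = output.group(0)
    print(output)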
You might be interested in reading about Python Regular Expression Operations
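For the preg_match_all case specifically, the closest counterpart is probably re.findall (or re.finditer if you need match objects). A minimal sketch, with a made-up sample string and function name:

```python
import re

def find_titles(html_text):
    # re.S lets "." match newlines and re.I ignores case, like /si in PHP.
    # With one capturing group, findall returns just the captured text.
    return re.findall(r'a class="title"(.*?)</a>', html_text, flags=re.S | re.I)

sample = '<a class="title" href="/one">First\nitem</a> <A CLASS="title" href="/two">Second</A>'
print(find_titles(sample))
```

Unlike preg_match_all, the result is the return value rather than an output parameter.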

Apply a list of regular expressions in PHP

I have a long list of regular expressions in an ignore.txt, and another long list in an include.txt file. What would be the quickest way to apply these using PHP against data contained in a sample.html file such that any matches found in include are captured, but then anything matching in ignore.txt is excluded?
If your include.txt and ignore.txt files are setup so that they're only regular expressions, and there's one expression per line, you can use PHP's file() function. That will load the files into an array where each individual line is an element in the array. You can use file_get_contents() to load the sample.html file in as a string.
preg_match() or preg_match_all() do not actually take arrays as input, like preg_replace() does. You will need to loop over your array of expressions using something like foreach and applying an individual call to one of the matching functions to get your results.
I think preg_match_all() will suit your needs best, because it sounds like you're wanting to pull all of the matches out of the entire file, not just the first one. Once you have your full list of matches, then you'd apply your filter using the data from ignore.txt in a similar manner.
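The collect-then-filter logic described above can be sketched as follows. This is in Python purely to show the control flow; the PHP version would use file(), preg_match_all() and preg_match() in the same places. The helper names and sample patterns are invented, and the sketch assumes each line holds a bare pattern without PHP-style delimiters.

```python
import re

def load_patterns(lines):
    """Compile one regex per non-empty line."""
    return [re.compile(line.strip()) for line in lines if line.strip()]

def filter_matches(text, include, ignore):
    found = []
    for pattern in include:
        # group(0) is the whole match, regardless of any capture groups.
        found.extend(m.group(0) for m in pattern.finditer(text))
    # Drop every captured string that any ignore pattern matches.
    return [s for s in found if not any(p.search(s) for p in ignore)]

# In real use the lists would come from open("include.txt") and open("ignore.txt").
include = load_patterns([r"\b\w+@\w+\.\w+\b", ""])
ignore = load_patterns([r"@example\.com$"])
sample = "Write to ann@site.org or bob@example.com today."
print(filter_matches(sample, include, ignore))
```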
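The quickest way would be to let the shell do the job. Note that egrep filters whole lines, so this yields the lines of sample.html that match include.txt and do not match ignore.txt, rather than the individual matches: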
$result = `cat sample.html | egrep -f include.txt | egrep -vf ignore.txt`;

Replace all "\" characters which are *not* inside "<code>" tags

First things first: Neither this, this, this nor this answered my question. So I'll open a new one.
Please read
Okay okay. I know that regexes are not the way to parse general HTML. Please take note that the created documents are written using a limited, controlled HTML subset. And people writing the docs know what they're doing. They are all IT professionals!
Given the controlled syntax it is possible to parse the documents I have here using regexes.
I am not trying to download arbitrary documents from the web and parse them!
And if the parsing does fail, the document is edited, so it'll parse. The problem I am addressing here is more general than that (i.e. not replace patterns inside two other patterns).
A little bit of background (you can skip this...)
In our office we are supposed to "pretty print" our documentation. Hence why some came up with putting it all into Word documents. So far we're thankfully not quite there yet. And, if I get this done, we might not need to.
The current state (... and this)
The main part of the docs are stored in a TikiWiki database. I've created a daft PHP script which converts the documents from HTML (via LaTeX) to PDF. One of the must-have features of the selected Wiki-System was a WYSIWYG editor, which, as expected, leaves us with documents with a less than formal DOM.
Consequently, I am transliterating the document using "simple" regexes. It all works (mostly) fine so far, but I encountered one problem I haven't figured out on my own yet.
The problem
Some special characters need to be replaced by LaTeX markup. For example, the \ character should be replaced by $\backslash$ (unless someone knows another solution?).
Except while in a verbatim block!
I do replace <code> tags with verbatim sections. But if this code block contains backslashes (as is the case for Windows folder names), the script still replaces these backslashes.
I reckon I could solve this using negative LookBehinds and/or LookAheads. But my attempts did not work.
Granted, I would be better off with a real parser. In fact, it is something on my "in-brain-roadmap", but it is currently out of the scope. The script works well enough for our limited knowledge domain. Creating a parser would require me to start pretty much from scratch.
My attempt
Example Input
The Hello \ World document is located in:
<code>C:\documents\hello_world.txt</code>
Expected output
The Hello $\backslash$ World document is located in:
\begin{verbatim}C:\documents\hello_world.txt\end{verbatim}
This is the best I could come up with so far:
<?php
$patterns = array(
    "special_chars2" => array( '/(?<!<code[^>]*>.*)\\\\[^$](?!.*<\/code>)/U', '$\\backslash$'),
);
foreach( $patterns as $name => $p ){
    $tex_input = preg_replace( $p[0], $p[1], $tex_input );
}
?>
Note that this is only an excerpt, and the [^$] is another LaTeX requirement.
Another attempt which seemed to work:
<?php
$patterns = array(
    "special_chars2" => array( '/\\\\[^$](?!.*<\/code>)/U', '$\\backslash$'),
);
foreach( $patterns as $name => $p ){
    $tex_input = preg_replace( $p[0], $p[1], $tex_input );
}
?>
... in other words: leaving out the negative lookbehind.
But this looks more error-prone than with both lookbehind and lookahead.
A related question
As you may have noticed, the pattern is ungreedy (/.../U). So will this match only as little possible inside a <code> block? Considering the look-arounds?
If it were me, I would try to find an HTML parser and do it with that.
Another option is to split the string into <code>.*?</code> chunks and the remaining parts, update only the remaining parts, and then recombine everything.
$x="The Hello \ World document is located in:\n<br>
<code>C:\documents\hello_world.txt</code>";
$r=preg_split("/(<code>.*?<\/code>)/", $x,-1,PREG_SPLIT_DELIM_CAPTURE);
// even indexes hold the text outside <code> blocks
for($i=0;$i<count($r);$i+=2)
    $r[$i]=str_replace("\\","$\\backslash$",$r[$i]);
$x=implode($r);
echo $x;
Here are the results.
The Hello $\backslash$ World document is located in:
C:\documents\hello_world.txt
Sorry if my approach is not suitable for you.
"I reckon I could solve this using negative LookBehinds and/or LookAheads."
You reckon wrong. Regular expressions are not a replacement for a parser.
I would suggest that you pipe the HTML through htmltidy, then read it with a DOM parser and then transform the DOM to your target output format. Is there anything preventing you from taking this route?
Parser FTW, ok. But if you can't use a parser, and you can be certain that <code> tags are never nested, you could try the following:
1. Find <code>.*?</code> sections of your file (you probably need to turn on dot-matches-newlines mode).
2. Replace all backslashes inside that section with something unique like #?#?#?#
3. Replace the section found in 1 with that new section
4. Replace all backslashes with $\backslash$
5. Replace all <code> with \begin{verbatim} and all </code> with \end{verbatim}
6. Replace #?#?#?# with \
FYI, regexes in PHP don't support variable-length lookbehind. So that makes this conditional matching between two boundaries difficult.
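The placeholder steps above could look roughly like this. It is a sketch in Python, not the script from the question; the function name is made up, and it assumes that the marker #?#?#?# never occurs in the real documents and that <code> tags carry no attributes and are never nested.

```python
import re

PLACEHOLDER = "#?#?#?#"  # assumed never to occur in the real documents
CODE_RE = re.compile(r"<code>.*?</code>", re.DOTALL)  # DOTALL: "." matches newlines

def html_to_tex(text):
    # Hide the backslashes inside every <code> section behind the placeholder.
    # A function is used as replacement so nothing is treated as an escape.
    text = CODE_RE.sub(lambda m: m.group(0).replace("\\", PLACEHOLDER), text)
    # The backslashes that are left are all outside code blocks.
    text = text.replace("\\", r"$\backslash$")
    # Switch the tags to a verbatim environment.
    text = text.replace("<code>", r"\begin{verbatim}").replace("</code>", r"\end{verbatim}")
    # Restore the hidden backslashes.
    return text.replace(PLACEHOLDER, "\\")

source = "The Hello \\ World document is located in:\n<code>C:\\documents\\hello_world.txt</code>"
print(html_to_tex(source))
```

The order matters: the verbatim markers are inserted only after the general backslash replacement, otherwise their own backslashes would be escaped too.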
Pandoc? Pandoc converts between a bunch of formats. You can also concatenate a bunch of files together and then convert them. Maybe a few shell scripts combined with your PHP scraping scripts?
With your "expected input" and the command pandoc -o text.tex test.html the output is:
The Hello \textbackslash{} World document is located in:
\verb!C:\documents\hello_world.txt!
pandoc can read from stdin, write to stdout or pipe right into a file.
Provided that your <code> blocks are not nested, this regex would find a backslash after ^ start-of-string or </code> with no <code> in between.
((?:^|</code>)(?:(?!<code>).)+?)\\
 |            |                 |
 |            |                 \-- backslash
 |            \-- least amount of anything not followed by <code>
 \-- start-of-string or </code>
And replace it with:
$1$\backslash$
You'd have to run this regex in "singleline" mode, so . matches newlines. You'd also have to run it multiple times, specifying global replacement is not enough. Each replacement will only replace the first eligible backslash after start-of-string or </code>.
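A sketch of the repeat-until-stable loop, in Python for illustration. One thing to watch: the replacement $\backslash$ itself contains a backslash, so a naive loop would keep matching its own output forever. The extra lookahead (?!backslash\$) below is my addition to skip backslashes that were already converted; it assumes the source text never contains a literal \backslash$ of its own. The function name is hypothetical.

```python
import re

# The regex from above plus a guard against re-matching earlier replacements.
# re.DOTALL is the "singleline" mode: "." also matches newlines.
PATTERN = re.compile(
    r"((?:^|</code>)(?:(?!<code>).)+?)\\(?!backslash\$)",
    re.DOTALL,
)

def escape_backslashes(text):
    while True:
        # Each pass converts at most one backslash per start-of-string or
        # </code> anchor, so keep going until a pass changes nothing.
        text, count = PATTERN.subn(lambda m: m.group(1) + r"$\backslash$", text)
        if count == 0:
            return text

sample = "The Hello \\ World \\ doc:\n<code>C:\\documents\\hello_world.txt</code> and \\ after"
print(escape_backslashes(sample))
```

Because the group uses +?, a backslash that is the very first character of the input is not matched; that follows from the pattern as given.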
Write a parser based on an HTML or XML parser like DOMDocument. Traverse the parsed DOM and replace the \ on every text node that is not a descendent of a code node with $\backslash$ and every node that is a code node with \begin{verbatim} … \end{verbatim}.
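To give an idea of how small such a converter can be, here is a sketch using Python's built-in html.parser instead of DOMDocument. It is event-based rather than a DOM traversal, so it is a stand-in for the approach, not the same API. The class and function names are invented, tags other than <code> are simply passed through, and character references in text are decoded by the parser.

```python
from html.parser import HTMLParser

class TexConverter(HTMLParser):
    def __init__(self):
        super().__init__()
        self.out = []
        self.code_depth = 0  # greater than 0 while inside a <code> element

    def handle_starttag(self, tag, attrs):
        if tag == "code":
            self.code_depth += 1
            self.out.append(r"\begin{verbatim}")
        else:
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are copied through unchanged.
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "code":
            self.code_depth -= 1
            self.out.append(r"\end{verbatim}")
        else:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        # Only text outside <code> gets its backslashes replaced.
        if self.code_depth == 0:
            data = data.replace("\\", r"$\backslash$")
        self.out.append(data)

def convert_with_parser(source):
    parser = TexConverter()
    parser.feed(source)
    parser.close()
    return "".join(parser.out)

doc = "The Hello \\ World document is located in:\n<code>C:\\documents\\hello_world.txt</code>"
print(convert_with_parser(doc))
```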
