I am still relatively new to regular expressions and feel my code is being too greedy. I am trying to add an id attribute to existing links in a piece of code. My function looks like this:
function addClassHref($str) {
//$str = stripslashes($str);
$preg = "/<[\s]*a[\s]*href=[\s]*[\"\']?([\w.-]*)[\"\']?[^>]*>(.*?)<\/a>/i";
preg_match_all($preg, $str, $match);
foreach ($match[1] as $key => $val) {
$pattern[] = '/' . preg_quote($match[0][$key], '/') . '/';
$replace[] = "<a id='buttonRed' href='$val'>{$match[2][$key]}</a>";
}
return preg_replace($pattern, $replace, $str);
}
This adds the id attribute like I want, but it breaks the hyperlink. For example:
If the original code is: <a href="http://www.google.com">Link</a>
Instead of <a id="class" href="http://www.google.com">Link</a>
It is giving
<a id="class" href="http">Link</a>
Any suggestions or thoughts?
Do not use regular expressions to parse XML or HTML.
$doc = new DOMDocument();
$doc->loadHTML($html);
$all_a = $doc->getElementsByTagName('a');
$firsta = $all_a->item(0);
$firsta->setAttribute('id', 'idvalue');
echo $doc->saveHTML($firsta);
You've got some overcomplications in your regex :)
Also, there's no need for the loop as preg_replace() will hit all the instances of the search pattern in the relevant string. The first regex below will take everything in the a tag and simply add the id attribute on at the end.
$str = 'Link' . "\n" .
'Link' . "\n" .
'Link';
$p = "{<\s*a\s*(href=[^>]*)>([^<]*)</a>}i";
$r = "<a $1 id=\"class\">$2</a>";
echo preg_replace($p, $r, $str);
If you only want to capture the href attribute you could do the following:
$p = '{<\s*a\s*href=["\']([^"\']*)["\'][^>]*>([^<]*)</a>}i';
$r = "<a href='$1' id='class'>$2</a>";
Your first subpattern ([\w.-]*) doesn't match :, thus it stops at "http".
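To make that diagnosis concrete, here is a minimal sketch. It assumes a sample link to http://www.google.com as in the question; the variable names $html, $narrow and $wide are only illustrative. It compares the original character class with one that accepts everything up to the closing quote:

```php
<?php
// Sample input (assumed for illustration).
$html = '<a href="http://www.google.com">Link</a>';

// Original class: word characters, '.' and '-' only, so matching stops at ':'.
preg_match('/href=["\']?([\w.-]*)/i', $html, $narrow);

// Wider class: anything up to the closing quote.
preg_match('/href=["\']([^"\']*)["\']/i', $html, $wide);

echo $narrow[1], "\n"; // http
echo $wide[1], "\n";   // http://www.google.com
```

The second form still assumes the href value is quoted, which is one more reason the DOM-based answer is the more robust route.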
Couldn't you just use a simple str_replace() for this? Regex seems like overkill if this is all you're doing.
$str = str_replace('<a ', '<a id="someID" ', $str);
Related
How can I write a regex expression that will convert any absolute URLs to relative paths. For example:
src="http://www.test.localhost/sites/"
would become
src="/sites/"
The domains are not static.
I can't use parse_url (as per this answer) because the URL is part of a larger string that contains non-URL data as well.
Solution
You can use the following regex:
/https?:\/{2}[^\/]+/
Which would match the following:
http://www.test.localhost/sites/
http://www.domain.localhost/sites/
http://domain.localhost/sites/
So it would be:
$domain = preg_replace('/https?:\/{2}[^\/]+/', '', $domain);
Explanation
http: Look for 'http'
s?: Look for an 's' after the 'http' if there's one
: : Look for the ':' character
\/{2}: Look for the '//'
[^\/]+: Go for anything that is not a slash (/)
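As a rough sketch of how this behaves on a larger string that mixes markup and ordinary text (the sample input and the names $text and $relative are assumptions for illustration):

```php
<?php
// Assumed sample: URLs embedded in other, non-URL content.
$text = 'Intro text <img src="http://www.test.localhost/sites/"> and '
      . '<script src="https://cdn.example.localhost/js/app.js"></script>';

// Remove the scheme and host; everything from the first path slash onward stays.
$relative = preg_replace('/https?:\/{2}[^\/]+/', '', $text);

echo $relative, "\n";
// Intro text <img src="/sites/"> and <script src="/js/app.js"></script>
```

One caveat: for a URL without any path (for example src="http://host"), [^\/]+ keeps consuming until the next slash anywhere in the text. Excluding quotes as well, e.g. [^\/"']+, is one way to limit that.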
My guess is that maybe this expression or an improved version of that might work to some extent:
^\s*src=["']\s*https?:\/\/(?:[^\/]+)([^"']+?)\s*["']$
The expression is explained on the top right panel of this demo if you wish to explore/simplify/modify it.
Test
$re = '/^\s*src=["\']\s*https?:\/\/(?:[^\/]+)([^"\']+?)\s*["\']$/m';
$str = 'src=" http://www.test.localhost/sites/ "
src=" https://www.test.localhost/sites/"
src=" http://test.localhost/sites/ "
src="https://test.localhost/sites/ "
src="https://localhost/sites/ "
src=\'https://localhost/ \'
src=\'http://www.test1.test2.test3.test4localhost/sites1/sites2/sites3/ \'';
$subst = 'src="$1"';
var_export(preg_replace($re, $subst, $str));
Output
src="/sites/"
src="/sites/"
src="/sites/"
src="/sites/"
src="/sites/"
src="/"
src="/sites1/sites2/sites3/"
$dom = new DOMDocument;
$dom->loadHTML($yourHTML);
$xp = new DOMXPath($dom);
// Select every src attribute node in the document
foreach ($xp->query('//@src') as $attr) {
$url = parse_url($attr->nodeValue);
// Skip values that do not start with an http(s) scheme
if ( !isset($url['scheme']) || stripos($url['scheme'], 'http') !== 0 )
continue;
$src = $url['path']
. ( isset($url['query']) ? '?' . $url['query'] : '' )
. ( isset($url['fragment']) ? '#' . $url['fragment'] : '' );
$attr->parentNode->setAttribute('src', $src);
}
$result = $dom->saveHTML();
I added an if condition to skip cases where it isn't possible to tell whether the beginning of the src attribute is a domain or the beginning of a path. Depending on what you are trying to do, you can remove this test.
If you are working with part of an HTML document (i.e. not a full document), you have to replace $result = $dom->saveHTML() with something like:
$result = '';
foreach ($dom->getElementsByTagName('body')->item(0)->childNodes as $childNode) {
$result .= $dom->saveHTML($childNode);
}
Here is the html:
<td width="551">
<p><strong>Full Time Faculty<br>
<strong></strong>Assistant Professor</strong></p>Doctorate of Business Administration<br><br>
<strong>Phone</strong>: +88 01756567676<br>
<strong>Email</strong>: frank.wade@email.com<br>
<strong>Office</strong>: NAC739<br>
<br><p><b>Curriculum Vitae</b></p></td>
The output I want is:
+88 01756567676
frank.wade@email.com
NAC739
I used simple_html_dom to parse the data.
Here's the code I wrote. It works if the contact info part is wrapped in a paragraph tag (<p></p>).
$contact = $facultyData->find('strong[plaintext^=Phone]');
$contact = $contact[0]->parent();
$element = explode("\n", strip_tags($contact->plaintext));
$regex = '/Phone:(.*)/';
if (preg_match($regex, $element[0], $match))
$phone = $match[1];
$regex = '/Email:(.*)/';
if (preg_match($regex, $element[1], $match))
$email = $match[1];
$regex = '/Office:(.*)/';
if (preg_match($regex, $element[2], $match))
$office = $match[1];
Is there any way to get those 3 lines by matching on the tags?
Maybe you could use an XPath query like:
$xml = new SimpleXMLElement($DomAsString);
$theText = $xml->xpath('//strong[. ="Phone"]/following-sibling::text()');
You would still need some trimming to remove the leading ': ', and of course you would first have to fix the DOM structure so that it parses as XML.
Or just use straight regex:
// The parentheses capture the phone number so that it is available as $m[1]
preg_match('|Phone</strong>: ([^<]+)|', $str, $m) or die('no phone');
$phone = $m[1];
You really don't need to parse this as HTML or deal with DOM tree. You can explode your HTML string into pieces, then remove what is extra in each piece to get what you want:
<?php
$str = <<<str
<td width="551">
<p><strong>Full Time Faculty<br>
<strong></strong>Assistant Professor</strong></p>Doctorate of Business Administration<br><br>
<strong>Phone</strong>: +88 01756567676<br>
<strong>Email</strong>: frank.wade@email.com<br>
<strong>Office</strong>: NAC739<br>
<br><p><b>Curriculum Vitae</b></p></td>
str;
// We explode $str and use '</strong>' as delimiter and get only the part of result that we need
$lines = array_slice(explode('</strong>', $str), 3, 3);
// Define a function to remove extra text from left and right of our so called lines
function stripLine($line) {
// ltrim ' :' characters and remove everything after (and including) '<br>'
return preg_replace('/<br>.*/is', '', ltrim($line, ' :'));
}
$lines = array_map('stripLine', $lines);
print_r($lines);
See code output here.
I have the following regex :
$string = preg_replace("/([\w]+:\/\/[\w-?&;#~=\.\/\#]+[\w\/])/i","<a target=\"_blank\" href=\"$1\">$1</A>",$string);
Using it to parse this string : http://www.ttt.com.ar/hello_world
Produces this new string :
<a target="_blank" href="http://www.ttt.com.ar/hello_world">http://www.ttt.com.ar/hello_world</A>
So far, so good. What I want is for the link text to be a substring of $1, producing output like:
<a target="_blank" href="http://www.ttt.com.ar/hello_world">http://www.ttt.com.ar/...</A>
Pseudocode of what I mean:
$string = preg_replace("/([\w]+:\/\/[\w-?&;#~=\.\/\#]+[\w\/])/i","<a target=\"_blank\" href=\"$1\">substring($1,0,24)..</A>",$string);
Is this even possible? I'm probably just doing it all wrong :)
Thanks in advance.
Check out preg_replace_callback():
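$string = 'http://www.ttt.com.ar/hello_world';
$string = preg_replace_callback(
// The hyphen is escaped (\-): PCRE2, used since PHP 7.3, rejects an unescaped "-" right after \w in a character class
"/([\w]+:\/\/[\w\-?&;#~=\.\/\#]+[\w\/])/i",
function($matches) {
$link = $matches[1];
// Keep the first 24 characters and append '..', as in the question's pseudocode
$substring = substr($link, 0, 24) . '..';
return "<a target=\"_blank\" href=\"$link\">$substring</a>";
},
$string
);
var_dump($string);
// <a target="_blank" href="http://www.ttt.com.ar/hello_world">http://www.ttt.com.ar/he..</a>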
Note that you could also use the e modifier in PHP to execute functions in your preg_replace(). It was deprecated in PHP 5.5.0 and removed in PHP 7.0.0 in favor of preg_replace_callback().
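You can use a capturing group inside of a lookahead like this:
preg_replace(
"/((?=(.{24}))[\w]+:\/\/[\w\-?&;#~=\.\/\#]+[\w\/])/i",
"<a target=\"_blank\" href=\"$1\">$2..</A>",
$string);
This will capture the entire URL in group 1, and it will also capture the 24 characters starting at the beginning of the URL in group 2. Note that the lookahead only succeeds when at least 24 characters remain from that position: a shorter URL at the end of the string is not matched at all, and a short URL followed by other text gets some of that text in group 2. The hyphen in the character class is escaped because PCRE2 (PHP 7.3 and later) rejects an unescaped "-" right after \w.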
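A small sketch of the lookahead approach on two sample inputs. The inputs and the names $pattern, $replace, $long and $short are assumptions for illustration, and the pattern is an equivalent, slightly simplified form of the one above with the hyphen escaped:

```php
<?php
// Group 1: the whole URL. Group 2: the 24 characters seen by the lookahead.
$pattern = '/((?=(.{24}))\w+:\/\/[\w\-?&;#~=.\/]+[\w\/])/i';
$replace = '<a target="_blank" href="$1">$2..</a>';

$long  = preg_replace($pattern, $replace, 'http://www.ttt.com.ar/hello_world');
$short = preg_replace($pattern, $replace, 'http://a.io/x');

echo $long, "\n";  // link text is the first 24 characters plus '..'
echo $short, "\n"; // unchanged: fewer than 24 characters are available
```

If short URLs must be linked as well, the preg_replace_callback() version is the safer choice, since substr() simply returns the whole string when it is shorter than the requested length.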
This is bad practice. Regular expressions should not be used to parse or modify XML/HTML content in application code.
Suggestions:
Use a DOM parser to read and modify the value
Use parse_url() to get the protocol and domain name
Example:
$doc = new DOMDocument();
$doc->loadHTML(
'<a target="_blank" href="http://www.ttt.com.ar/hello_world">http://www.ttt.com.ar/hello_world</A>'
);
$link = $doc->getElementsByTagName('a')->item(0);
$url = parse_url($link->nodeValue);
$link->nodeValue = $url['scheme'] . '://' . $url['host'] . '/...';
echo $doc->saveHTML();
I need to perform a recursive str_replace on a portion of HTML (by recursive I mean inner nodes first), so I wrote:
$str = //get HTML;
$pttOpen = '(\w+) *([^<]{1,100}?)';
$pttClose = '\w+';
$pttHtml = '(?:(?!(?:<x-)).+)';
while (preg_match("%<x-(?:$pttOpen)>($pttHtml)*</x-($pttClose)>%m", $str, $match)) {
list($outerHtml, $open, $attributes, $innerHtml, $close) = $match;
$newHtml = //some work....
$str = str_replace($outerHtml, $newHtml, $str);
}
The idea is to first replace non-nested x-tags.
But it only works if innerHtml is on the same line as the opening tag (so I guess I misunderstood what the /m modifier does). I don't want to use a DOM library, because I just need simple string replacement. Any help?
Try this regex:
%<x-(?P<open>\w+)\s*(?P<attributes>[^>]*)>(?P<innerHtml>.*)</x-(?P=open)>%s
Demo
http://regex101.com/r/nA2zO5
Sample code
$str = // get HTML
$pattern = '%<x-(?P<open>\w+)\s*(?P<attributes>[^>]*)>(?P<innerHtml>.*)</x-(?P=open)>%s';
while (preg_match($pattern, $str, $matches)) {
$newHtml = sprintf('<ns:%1$s>%2$s</ns:%1$s>', $matches['open'], $matches['innerHtml']);
$str = str_replace($matches[0], $newHtml, $str);
}
echo htmlspecialchars($str);
Output
Initially, $str contained this text:
<x-foo>
sdfgsdfgsd
<x-bar>
sdfgsdfg
</x-bar>
<x-baz attr1='5'>
sdfgsdfg
</x-baz>
sdfgsdfgs
</x-foo>
It ends up with:
<ns:foo>
sdfgsdfgsd
<ns:bar>
sdfgsdfg
</ns:bar>
<ns:baz>
sdfgsdfg
</ns:baz>
sdfgsdfgs
</ns:foo>
Since I didn't know what work is done on $newHtml, I mimicked it by replacing x- with ns: and removing any attributes.
Thanks to @Alex I came up with this:
%<x-(?P<open>\w+)\s*(?P<attributes>[^>]*?)>(?P<innerHtml>((?!<x-).)*)</x-(?P=open)>%is
Without the ((?!<x-).)* part in the innerHtml pattern, it won't work with nested tags (it will match the outer ones first, which isn't what I wanted). This way the innermost ones are matched first. Hope this helps.
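Here is a short sketch of that innermost-first behaviour. The sample input, the $order bookkeeping array and the x- to ns: rename are assumptions for illustration; the rename stands in for whatever real work produces $newHtml:

```php
<?php
// Assumed sample with one level of nesting.
$str = "<x-foo>\n  outer\n  <x-bar a='1'>inner</x-bar>\n</x-foo>";

$pattern = '%<x-(?P<open>\w+)\s*(?P<attributes>[^>]*?)>(?P<innerHtml>((?!<x-).)*)</x-(?P=open)>%is';

$order = []; // records which tag was matched on each pass
while (preg_match($pattern, $str, $m)) {
    $order[] = $m['open'];
    // Placeholder transformation: rename x-NAME to ns:NAME and drop attributes.
    $newHtml = sprintf('<ns:%1$s>%2$s</ns:%1$s>', $m['open'], $m['innerHtml']);
    $str = str_replace($m[0], $newHtml, $str);
}

print_r($order); // bar first, then foo
echo $str, "\n";
```

The loop only terminates because each replacement removes the x- prefix; a transformation that leaves an x- tag in place would need a different stop condition.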
I don't know exactly what kind of changes you are trying to make; however, this is how I would proceed:
$pattern = <<<'EOD'
~
<x-(?<tagName>\w++) (?<attributes>[^>]*+) >
(?<content>(?>[^<]++|<(?!/?x-))*) #by far more efficient than (?:(?!</?x-).)*
</x-\k<tagName>> # \k<tagName> is a backreference; \g<tagName> would be a subroutine call
~x
EOD;
function callback($m) { // example function
return '<n-' . $m['tagName'] . $m['attributes'] . '>' . $m['content']
. '</n-' . $m['tagName'] . '>';
};
do {
$code = preg_replace_callback($pattern, 'callback', $code, -1, $count);
} while ($count);
echo htmlspecialchars(print_r($code, true));
I'm not sure how to explain this, so I'll show it on my code.
First and
Second and
Third
How can I delete the opening and closing tags of the link with class "delete", but keep the rest?
I'm asking specifically about preg_replace(); I'm not looking for DOMDocument or other methods. I just want to see an example using preg_replace().
How can this be achieved?
Only pick the groups you want to preserve:
$pattern = '~(<a[^>]*class="delete"[^>]*>)([^<]*)(</a>)~';
// group 1: opening tag, group 2: link text, group 3: closing tag
$result = preg_replace($pattern, '$2', $subject);
You find more examples on the preg_replace manual page.
Since you asked me in the comments to show any method of doing this, here it is.
$html =<<<HTML
First and
Second and
Third
HTML;
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$elems = $xpath->query("//a[@class='delete']");
foreach ($elems as $elem) {
$elem->parentNode->removeChild($elem);
}
echo $dom->saveHTML();
Note that saveHTML() saves a complete document even if you only parsed a fragment.
As of PHP 5.3.6 you can pass a $node parameter to specify the fragment it should return; something like $xpath->query("/*/body")->item(0) would work.
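For completeness, a sketch of a DOM variant that drops only the <a> tags and keeps their contents, then serializes just the fragment. The sample markup and the names $a, $body and $result are assumptions for illustration; the class name "delete" follows the XPath query above:

```php
<?php
// Assumed sample markup.
$html = '<p>First and <a class="delete" href="url2.html">Second</a> and Third</p>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Copy the result list to an array before changing the tree inside the loop.
foreach (iterator_to_array($xpath->query("//a[@class='delete']")) as $a) {
    // Move every child of the <a> in front of it, then remove the empty <a>.
    while ($a->firstChild) {
        $a->parentNode->insertBefore($a->firstChild, $a);
    }
    $a->parentNode->removeChild($a);
}

// Serialize only the children of <body>, not the generated full document.
$body = $dom->getElementsByTagName('body')->item(0);
$result = '';
foreach ($body->childNodes as $child) {
    $result .= $dom->saveHTML($child);
}
echo $result, "\n";
```

Unlike the removeChild()-only loop above, this keeps the link text in place, which is closer to what the question asks for.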
$pattern = '/<a (.*?)href=[\"\'](.*?)\/\/(.*?)[\"\'](.*?)>(.*?)<\/a>/i';
$new_content = preg_replace($pattern, '$5', $content);
$pattern = '/<a[^<>]*?class="delete"[^<>]*?>(.*?)<\/a>/';
$test = 'First and Second and Third';
echo preg_replace($pattern, '$1', $test)."\n";
$test = 'First and <b class="delete">seriously</b> and Third';
echo preg_replace($pattern, '$1', $test)."\n";
$test = 'First and <a class="delete" href="url2.html">Second</a> and Third';
echo preg_replace($pattern, '$1', $test)."\n";
preg_replace('#(.+?)#', '$1', $html_string);
It is important to understand this is not an ideal solution. First, it requires markup in this exact format. Second, if there were, say, a nested anchor tag (albeit unlikely) this would fail. These are some of the many reasons why Regular Expressions should not be used for parsing/manipulating HTML.