What do these regular expressions mean in this code? - php

I was trying to change some parts of a joomla plugin, when I faced this part of it and I have no idea what it's doing.
Can someone please explain to me what these regular expressions and those ${4} do?
$comStart = '';
$comEnd = '';
$output = JResponse::getBody();
$output = preg_replace('/\<meta name=\"og\:/', '<meta property="og:', $output);
$output = preg_replace('/\<meta name=\"fb:admins/', '<meta property="fb:admins', $output);
$output = preg_replace('/<(\w+) (\w+)="(\w+):(\w+)" (\w+)="([a-zA-Z0-9\ \_\-\:\.\&\/\,\=\!\?]*)" \/>/i', $comStart.'<${1} ${2}="${3}:${4}" ${5}="${6}" >'.$comEnd, $output);
FYI: This plugin is for displaying facebook and opengraph tags inside articles.

SERIOUS NOTE!
The use of regular expressions to parse/match HTML/XML is highly
discouraged. Seriously, don't do it.
Basically, it's a regular expression to parse/match HTML, which may have the slight side effects of not working, being hard to maintain, and insanity.
The ${N} ones are called back-references; they refer to whatever the Nth pair of brackets in the regular expression matched.
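To make the back-reference idea concrete, here is a minimal sketch. The sample tag and the replacement text are made up for illustration and are not taken from the plugin.

```php
<?php
// Hypothetical sample input, not taken from the plugin.
$tag = '<meta property="og:title" />';

// Three pairs of brackets, so three groups: (1) meta, (2) og, (3) title.
$pattern = '/<(\w+) property="(\w+):(\w+)" \/>/';

// ${N} in the replacement is filled with whatever group N matched.
$result = preg_replace($pattern, 'tag=${1} prefix=${2} key=${3}', $tag);

echo $result, "\n"; // tag=meta prefix=og key=title
```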
If you need to manipulate HTML strings in PHP, you should use the DOMDocument class, which was made for exactly this.
Example
<?php
$html_string = <<<HTML
<!DOCTYPE HTML>
<html lang="en-US">
<head>
<meta charset="UTF-8">
<title></title>
</head>
<body>
<div id="target">
This is the target DIV! <span>This span will change texts!</span>
</div>
</body>
</html>
HTML;
$dom = new DOMDocument();
// Loading HTML from string...
$dom->loadHTML($html_string);
//Retrieve target and span elements
$target = $dom->getElementById("target");
$span = $target->getElementsByTagName("span")->item(0);
//Remove text, firstChild is the text node.
$span->removeChild($span->firstChild);
//Append new text
$span->appendChild(new DOMText("This is the new text!"));
//Change an attribute
$span->setAttribute("class", "spanny");
//Save HTML to string
$html_string = $dom->saveHTML();
echo $html_string;
Regular expressions aren't bad, evil, or scary; they are simply the wrong tool for this job. You don't drive a nail with a jackhammer, do you?

$output = preg_replace('/\<meta name=\"og\:/', '<meta property="og:', $output);
Replace the string <meta name="og: with <meta property="og:. Kind of pointless - regex is not needed here.
$output = preg_replace('/\<meta name=\"fb:admins/', '<meta property="fb:admins', $output);
Replace <meta name="fb:admins with <meta property="fb:admins. Just as pointless - regex is not needed here.
$output = preg_replace('/<(\w+) (\w+)="(\w+):(\w+)" (\w+)="([a-zA-Z0-9\ \_\-\:\.\&\/\,\=\!\?]*)" \/>/i', $comStart.'<${1} ${2}="${3}:${4}" ${5}="${6}" >'.$comEnd, $output);
Replace a string like <word1 word2="word3:word4" word5="word6withspecialcharacterslike-:.etc." /> with <word1 word2="word3:word4" word5="word6withspecialcharacterslike-:.etc." >. So it only removes a trailing slash before the closing >. Very suspect and Voodoo-like use of regex.
Also, all those regexes are highly inelegant (lots of pointless escapes, for example) and show that whoever wrote those doesn't know much about regexes. Letting something like this loose on HTML is asking for trouble.
AVOID! AVOID! AVOID!
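Since the first two replacements are literal string swaps, a plain str_replace call should be enough. The following is a minimal sketch; the sample markup is an assumption standing in for whatever JResponse::getBody() returns.

```php
<?php
// Hypothetical page fragment standing in for JResponse::getBody().
$output = '<meta name="og:title" content="Hi" /><meta name="fb:admins" content="123" />';

// Literal search/replace pairs; no pattern matching involved.
$output = str_replace(
    ['<meta name="og:', '<meta name="fb:admins'],
    ['<meta property="og:', '<meta property="fb:admins'],
    $output
);

echo $output, "\n";
```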

Each (\w+) says find a word and store it. So you are doing this (in pseudocode):
find /<(word1) (word2)="(word3):(word4)" (word5)="(manypossiblechars6)" \/>/ ignoring case
replace the match with $comStart.<word1 word2="word3:word4" word5="manypossiblechars6" >.$comEnd

The first one tries to replace tags of the form <meta name="og:... with <meta property="og:...
The second similarly replaces tags starting <meta name="fb:admins... with <meta property="fb:admins...
Finally, the third seems to take tags of the form <word word="word:word" word="something" /> and wrap them with $comStart and $comEnd.
This is done by matching the parts of the tag (placing () around them) and then using backreferences such as ${4} to refer to the 4th matched part.
Here $comStart and $comEnd are set to '' so that seems a little pointless. It also manages to get rid of the closing slash for the tag at the same time, though who knows if that is intentional!
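Running the third replacement on a single tag shows both effects described above. This is only a sketch: the pattern and replacement are copied from the question, while the input tag is a made-up example.

```php
<?php
$comStart = '';
$comEnd = '';

// Hypothetical input tag; the plugin would pass the whole page body here.
$output = '<meta property="og:title" content="My article" />';

$output = preg_replace(
    '/<(\w+) (\w+)="(\w+):(\w+)" (\w+)="([a-zA-Z0-9\ \_\-\:\.\&\/\,\=\!\?]*)" \/>/i',
    $comStart.'<${1} ${2}="${3}:${4}" ${5}="${6}" >'.$comEnd,
    $output
);

// Here ${4} was "title"; the only visible change is the dropped slash.
echo $output, "\n"; // <meta property="og:title" content="My article" >
```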

Those expressions attempt to fix the document head code by:
rewriting <meta name="og:*" to <meta property="og:*"
rewriting <meta name="fb:admins" to <meta property="fb:admins"
rewriting meta tags with a dangling slash to ones without it (assuming the tag always has exactly two attributes).
This is just horrendous code, and as long as your templates don't have
those "mistakes" in them, you can throw this crap away.
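If the name-to-property rewrite is still needed, it could be done on the parsed document instead of the raw string. The sketch below assumes the same intent as the plugin; the sample head markup is hypothetical, and it matches fb:admins exactly rather than as a prefix.

```php
<?php
// Hypothetical document head; in the plugin this would come from JResponse::getBody().
$html = '<!DOCTYPE html><html><head><title>T</title>'
      . '<meta name="og:title" content="My article">'
      . '<meta name="fb:admins" content="123">'
      . '<meta name="description" content="Untouched">'
      . '</head><body></body></html>';

$dom = new DOMDocument();
libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$dom->loadHTML($html);

foreach ($dom->getElementsByTagName('meta') as $meta) {
    $name = $meta->getAttribute('name');
    // Only Open Graph and fb:admins tags are switched from name= to property=.
    if (strpos($name, 'og:') === 0 || $name === 'fb:admins') {
        $meta->setAttribute('property', $name);
        $meta->removeAttribute('name');
    }
}

echo $dom->saveHTML();
```

Note that the attribute order in the serialized output may differ from the input, which is harmless for HTML.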

Related

PHP get text from tag with regex

I want to get all the text from the tag below and put it into an array using regex:
<div class="titr2">TEXT </div>
TEXT is UTF-8 and I cannot get it using regex.
<meta charset='UTF-8' />
<?php
error_reporting(1);
$handle='http://www.namefa.ir/Names.asp?pn=3&sx=F&fc=%D8%A8';
$handle = file_get_contents($handle);
preg_match_all('<div class="titr2" href=".*">(.*)</div>)siU', $string, $matching_data);
print_r($matching_data);
?>
Try to use this regexp:
preg_match_all('/<div[^>]+class="titr2"[^>]*>\s*<a[^>]+>(.*?)<\/a>\s*<\/div>/si', $handle, $matching_data);
You shouldn't use regex to parse HTML: RegEx match open tags except XHTML self-contained tags
You should really use an HTML parser instead.
If this really is a one-time thing, limited to this case only, in a small HTML file that never changes, your regex is wrong:
<div class="titr2">(.+?)</div>
would be closer, and you should check out Victor's solution.
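For the HTML parser route, a minimal sketch with DOMDocument and DOMXPath might look like the following. The markup is a made-up stand-in for the real page, and the XML prolog prefix is a common workaround (an assumption here, not something from the answers) to make loadHTML() read the input as UTF-8.

```php
<?php
// Hypothetical markup; the real page structure is an assumption.
$html = '<div class="titr2"><a href="#">بهار</a></div>'
      . '<div class="other">skip</div>'
      . '<div class="titr2"> بنفشه </div>';

$dom = new DOMDocument();
libxml_use_internal_errors(true); // collect parser warnings instead of printing them
// Prefixing an XML encoding declaration hints that the input is UTF-8.
$dom->loadHTML('<?xml encoding="UTF-8">' . $html);

$xpath = new DOMXPath($dom);
$names = [];
foreach ($xpath->query('//div[@class="titr2"]') as $div) {
    // textContent includes the text of nested elements such as <a>.
    $names[] = trim($div->textContent);
}

print_r($names);
```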

php to strip and replace html meta tags

I'm not very familiar with PHP overall, but I'm trying to find a way to change some parts of the following documents:
index.html
hidden.html
Both pages are using the same header include file containing <meta name="robots" content="index, follow">
but I want to use PHP for some pages to strip and replace the entire line, and replace with something like <meta name="robots" content="none">
Can anyone provide an example of just stripping such a thing, and also strip+replace?
Would be greatly appreciated.
You can use str_replace to find and replace strings. It takes the search string, the replacement, and the string to search in. See http://php.net/manual/en/function.str-replace.php
$result = str_replace("I want to replace this", "with this", $subject);
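Applied to the robots tag from the question, both cases could be sketched as below. The $html string is a hypothetical stand-in for the page source after the header include has been rendered.

```php
<?php
// Hypothetical page source containing the shared header line.
$html = "<head>\n<meta name=\"robots\" content=\"index, follow\">\n</head>";
$old  = '<meta name="robots" content="index, follow">';

// Strip and replace: swap the old tag for the new one.
$replaced = str_replace($old, '<meta name="robots" content="none">', $html);

// Strip only: replace the tag with an empty string.
$stripped = str_replace($old, '', $html);

echo $replaced, "\n", $stripped, "\n";
```

This only works if the tag in the page matches $old character for character; any difference in spacing or attribute order would need a parser or a looser match.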

Regular expression to get page title

There are lots of answers to this question, but not a single complete one:
Using a single regular expression, how do you extract the page title from <title>Page title</title>?
Title tags can also be written in several other ways, such as:
<TITLE>Page title</TITLE>
<title>
Page title</title>
<title>
Page title
</title>
<title lang="en-US">Page title</title>
...or any combination of above.
And it can be on its own line or in between other tags:
<head>
<title>Page title</title>
</head>
<head><title>Page title</title></head>
Thanks for help in advance.
UPDATE: So, the regex approach might not be the best solution to this. Which PHP-based HTML parser could handle all scenarios, whether the HTML is well formed or not?
UPDATE 2: sp00m's regex (https://stackoverflow.com/a/13510307/1844607) seems to be working in all cases. I'll get back to this if needed.
Use an HTML parser instead. But if you must use a regex:
<title[^>]*>(.*?)</title>
Demo
Use the DOMDocument class:
$doc = new DOMDocument();
$doc->loadHTML($html);
$titles = $doc->getElementsByTagName("title");
echo $titles->item(0)->nodeValue;
Use this regex:
<title>[\s\S]*?</title>
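If a regex is used anyway, the variants listed in the question mostly come down to flags and trimming. Here is a minimal sketch built on the <title[^>]*>(.*?)</title> pattern above; extract_title is a hypothetical helper name, and the i and s flags are my addition.

```php
<?php
// Returns the trimmed title text, or null when no <title> element is found.
function extract_title(string $html): ?string
{
    // i: case-insensitive tag name; s: let . match newlines; .*? is non-greedy.
    if (preg_match('~<title[^>]*>(.*?)</title>~is', $html, $m)) {
        return trim($m[1]);
    }
    return null;
}

echo extract_title("<head><TITLE lang=\"en-US\">\nPage title\n</TITLE></head>"), "\n"; // Page title
```

It still inherits the usual limits of regex on HTML, for example a commented-out title element would be matched too.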

PHP scraper - regular expressions

I'm trying to follow a tutorial for web scraping with php.
I understand roughly what's going on, but I don't get how to filter what has been scraped to get exactly what I want. For example:
<?php
$file_string = file_get_contents('page_to_scrape.html');
preg_match('/<title>(.*)<\/title>/i', $file_string, $title);
$title_out = $title[1];
?>
I see that the (.*) will retrieve everything in between the title tags. Can I use regular expressions to get specific info? Say the title was Welcome visitor #100: how would I get the number that comes after the hash?
Or do I have to retrieve everything between the tags then manipulate it later?
Given the title "Welcome visitor #100" and the fact a <title> tag occurs no more than once, the expression should be:
preg_match('~<title>Welcome visitor #(\d+)</title>~', ...);
A lot of people on SO would argue to never use regular expressions to parse (X)HTML; for this task, however, the above should suffice.
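Filled in with the variables from the question, the call could look like this sketch. The inline HTML string is a hypothetical replacement for the file_get_contents() result.

```php
<?php
// Hypothetical page content standing in for file_get_contents('page_to_scrape.html').
$file_string = '<html><head><title>Welcome visitor #100</title></head></html>';

$visitor = null;
// Group 1 captures only the digits after the hash.
if (preg_match('~<title>Welcome visitor #(\d+)</title>~', $file_string, $m)) {
    $visitor = (int) $m[1];
}

echo $visitor, "\n"; // 100
```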
Although - as mentioned before - a <title> tag (should) occur no more than once, the pattern
<title>(.*)</title>
would as well match this:
<title>Welcome visitor <title>#<title>100blafoobar</title>
(.*) being the part allowing this. As soon as the page you're scraping your data from changes, the regex might stop working.
EDIT: A method to correctly sift out multiple elements and their attributes:
$dom = new DomDocument;
$dom->loadHTML($page_content);
$elements = $dom->getElementsByTagName('a');
for ($n = 0; $n < $elements->length; $n++) {
$item = $elements->item($n);
$href = $item->getAttribute('href');
}
You would just need to change the regex to match whatever you need. If you are going to use the title more than once, it's better to save the whole title and manipulate it later; otherwise just get what you need.
/<title>.*((?<=#)\d*).*<\/title>/i
Would specifically match a number after a hash. It would not match a number without a hash.
There are many ways to write regex, it depends on how general or specific you want to be.
You could also write like this to get any number:
/<title>.*?(\d+).*<\/title>/i
I would first fetch the title tag and then process the title further. The other answers contain perfectly valid solutions for this task.
Some further notes:
Please use DOMDocument for such things, since it is much safer (your regular expression might break on some specific HTML pages)
Please use the non-greedy version of .*: .*?, otherwise you will run into funny things like:
<html>
<head>
<title>a</title>
</head>
<body>
<title>test</title> <!-- not allowed in HTML, but since when do web pages actually care about that? -->
</body>
</html>
You will now match everything between <title>a</title>... up to <title>test</title>, including everything in between.
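The difference is easy to see by running both versions on a condensed form of that document. This is a small sketch; the one-line HTML string is my shortened version of the example above.

```php
<?php
// Condensed version of the document above, with a second (invalid) title in the body.
$html = '<head><title>a</title></head><body><title>test</title></body>';

// Greedy: .* runs to the LAST </title> in the input.
preg_match('/<title>(.*)<\/title>/is', $html, $greedy);

// Non-greedy: .*? stops at the FIRST </title>.
preg_match('/<title>(.*?)<\/title>/is', $html, $lazy);

echo $greedy[1], "\n"; // a</title></head><body><title>test
echo $lazy[1], "\n";   // a
```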

html to text with domdocument class

How can I get the source of an HTML page without the HTML tags?
For example:
<meta http-equiv="content-type" content="text/html; charset=utf-8" />
<meta http-equiv="content-language" content="hu"/>
<title>this is the page title</title>
<meta name="description" content="this is the description" />
<meta name="keywords" content="k1, k2, k3, k4" />
start the body content
<!-- <div>this is comment</div> -->
open
End now one noframes tag.
<noframes><span>text</span></noframes>
<select name="select" id="select"><option>ttttt</option></select>
<div class="robots-nocontent"><span>something</span></div>
<img src="url.png" alt="this is alt attribute" />
I need this result:
this is the page title this is the description k1, k2, k3, k4 start the body content this is title attribute open End now one noframes tag. text ttttt something this is alt attribute
I also need the title and alt attributes.
Any ideas?
You could do it with a regex.
$regex = '/<[^>]*>/';
would be a very simple start to remove anything with < and > around it. But in order to do this, you're going to have to pull in the HTML as a file_get_contents() or some other function that will turn the code into text.
Addendum:
If you want individual attributes pulled as well, you're going to have to write a more complex regex to pull that text out. For instance:
$regex2 = '/\<.(?<=(title))(\=\").(?=\")/';
Would pull out (I think... I'm still learning RegEx) any text between < and title=", assuming you had no other matching expressions before title. Again, this would be a pretty complicated regex process.
This cannot be done in an automated way. PHP cannot know which node attributes you want to omit. You'd either have to write code that iterates over all attributes and text nodes, driven by a map that defines when to use a node's content, or pick what you want with XPath one by one.
An alternative would be to use XMLReader. It allows you to iterate over the entire document and define callbacks for the element names. This way, you can define what to do with what element. See
http://www.ibm.com/developerworks/library/x-pullparsingphp.html
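The XPath route mentioned above could be sketched as follows. Which attributes count as text (here alt, title, and the content of named meta tags) is my assumption based on the desired output in the question, and the sample markup is a shortened, hypothetical version of it.

```php
<?php
// Shortened, hypothetical version of the markup from the question.
$html = '<html><head><title>Page title</title>'
      . '<meta name="description" content="The description"></head>'
      . '<body><p title="A tooltip">Body text</p>'
      . '<!-- a comment --><img src="url.png" alt="Alt text"></body></html>';

$dom = new DOMDocument();
libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$dom->loadHTML($html);

$xpath = new DOMXPath($dom);
// Non-empty text nodes plus the attributes we decided to keep; comments are not text nodes.
$nodes = $xpath->query('//text()[normalize-space()] | //meta[@name]/@content | //@title | //@alt');

$parts = [];
foreach ($nodes as $node) {
    $parts[] = trim($node->nodeValue);
}

echo implode(' ', $parts), "\n";
```

Text inside script or style elements would be picked up as well, so a real page may need those excluded in the XPath expression.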
My solution is a bit more complicated, but it worked fine for me.
If you are sure that you have XHTML, you can simply consider the code as XML (but you have to put everything in a proper wrapping).
Then with XSLT you can define some basic templates that do what you need.
