I am making a web crawler and I need to extract the meta tag that contains the description. This is what I did:
$html = file_get_contents('http://www.google.com');
preg_match('/<meta name="description" content="(.*)"/>\i', $html, $description);
$description_out = $description;
var_dump($description_out);
and I get this error
Warning: preg_match(): Unknown modifier '>' in
C:\xampp\htdocs\webcrawler\php-web-crawler\index.php on line 21
What is the correct regular expression?
Your pattern is incorrect. It starts with a / delimiter but then contains an unescaped / inside the pattern; that slash ends the pattern, and everything after it is read as modifiers.
Your closing delimiter was also the wrong way round: it was \ but should be /.
'/<meta name="description" content="(.*)"\/>/i',
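As a minimal sketch of the corrected pattern in use, assuming a small hard-coded HTML string in place of the fetched page: it uses ~ as the delimiter so the / in /> needs no escaping, and a non-greedy (.*?) so the capture stops at the first closing quote. Both are choices made for this illustration, not part of the answer above.

```php
<?php
// Sample markup standing in for the fetched page (an assumption for this sketch).
$html = '<head><meta name="description" content="Search the web" /><meta name="x" content="y" /></head>';

// "~" as the delimiter means "/" needs no escaping; "\s*" tolerates a space before "/>".
$found = preg_match('~<meta name="description" content="(.*?)"\s*/?>~i', $html, $description);

// $description[0] is the whole match, $description[1] the captured content value.
$description_out = $found === 1 ? $description[1] : null;
var_dump($description_out);
```

Note that preg_match fills $description with the whole match at index 0 and the captured group at index 1, so the description text is $description[1].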
As an alternative, instead of using a regex you might use DOMDocument and DOMXPath with the XPath expression /html/head/meta[@name="description"]/@content to get the content attribute.
$document = new DOMDocument();
$document->loadHTML($html);
$xpath = new DOMXPath($document);
$items = $xpath->query('/html/head/meta[@name="description"]/@content');
foreach ($items as $item) {
echo $item->value . "<br>";
}
The $items are of type DOMNodeList which you could loop using for example a foreach. The $item is of type DOMAttr from which you can get the value.
I would like to get back the number which is between span HTML tags. The number may change!
<span class="topic-count">
::before
"
24
"
::after
</span>
I've tried the following code:
preg_match_all("#<span class=\"topic-count\">(.*?)</span>#", $source, $nombre[$i]);
But it doesn't work.
Entire code:
$result=array();
$page = 201;
while ($page>=1) {
$source = file_get_contents ("http://www.jeuxvideo.com/forums/0-27047-0-1-0-".$page."-0-counter-strike-global-offensive.htm");
preg_match_all("#<span class=\"topic-count\">(.*?)</span>#", $source, $nombre[$i]);
$result = array_merge($result, $nombre[$i][1]);
print("Page : ".$page ."\n");
$page-=25;
}
print_r ($nombre);
You can do this with
preg_match_all(
'#<span class="topic-count">[^\d]*(\d+)[^\d]*?</span>#s',
$html,
$matches
);
which would capture any digits before the end of the span.
However, note that this regex will only work for exactly this piece of html. If there is a slight variation in the markup, for instance, another class or another attribute, the pattern will not work anymore. Writing reliable regexes for HTML is hard.
Hence the recommendation to use a DOM parser instead, e.g.
libxml_use_internal_errors(true);
$dom = new DOMDocument;
$dom->loadHTMLFile('http://www.jeuxvideo.com/forums/0-27047-0-1-0-1-0-counter-strike-global-offensive.htm');
libxml_use_internal_errors(false);
$xpath = new DOMXPath($dom);
foreach ($xpath->evaluate('//span[contains(@class, "topic-count")]') as $node) {
if (preg_match_all('#\d+#s', $node->nodeValue, $topics)) {
echo $topics[0][0], PHP_EOL;
}
}
DOM will parse the entire page into a tree of nodes, which you can then query conveniently via XPath. Note the expression
//span[contains(@class, "topic-count")]
which will give you all the span elements with a class attribute containing the string topic-count. Then if any of these nodes contain a digit, echo it.
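One caveat: contains() does a substring test, so it would also match a class such as topic-count-total. A common XPath 1.0 idiom for matching a whole class token is sketched below; the markup is made up for the illustration.

```php
<?php
// Hypothetical markup: the second span only shares a prefix with the wanted class.
$html = '<div><span class="topic-count"> 24 </span><span class="topic-count-total">99</span></div>';

$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Pad the class list with spaces so " topic-count " only matches a whole class token.
$query = '//span[contains(concat(" ", normalize-space(@class), " "), " topic-count ")]';

$counts = array();
foreach ($xpath->query($query) as $node) {
    // Pull the first run of digits out of the span's text content.
    if (preg_match('#\d+#', $node->nodeValue, $m)) {
        $counts[] = (int) $m[0];
    }
}
print_r($counts);
```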
Is it possible to search a DOMDocument object with fn:contains and return true on only an exact match for a word?
I have a text replacement snippet that I did not write myself that does internal link replacements for keywords. But as written it also replaces partial words instead of only the full word.
Here is the snippet:
$autolinks = $this->config->get('autolinks');
if (isset($autolinks) && (strpos($this->data['description'], 'iframe') == false)
&& (strpos($this->data['description'], 'object') == false)):
$xdescription = mb_convert_encoding(html_entity_decode($this->data['description'], ENT_COMPAT, "UTF-8"), 'HTML-ENTITIES', "UTF-8");
libxml_use_internal_errors(true);
$dom = new DOMDocument;
$dom->loadHTML('<div>'.$xdescription.'</div>');
libxml_use_internal_errors(false);
$xpath = new DOMXPath($dom);
foreach ($autolinks as $autolink):
$keyword = $autolink['keyword'];
$xlink = mb_convert_encoding(html_entity_decode($autolink['link'], ENT_COMPAT, "UTF-8"), 'HTML-ENTITIES', "UTF-8");
$target = $autolink['target'];
$tooltip = isset($autolink['tooltip']);
$pTexts = $xpath->query(
sprintf('//text()[contains(., "%s")]', $keyword)
);
foreach ($pTexts as $pText):
$this->parseText($pText, $keyword, $dom, $xlink, $target, $tooltip);
endforeach;
endforeach;
$this->data['description'] = $dom->saveXML($dom->documentElement);
endif;
For example:
If my keyword is "massage", the word "massager" is partially matched and converted to a link, when only the whole word "massage" should be converted.
You should use fn:matches instead of fn:contains. This allows you to match with regular expressions, so you can include word boundaries with \b. Note that matches() is an XPath 2.0 function, while PHP's DOMXPath implements XPath 1.0 only, so this requires an XPath 2.0 processor.
sprintf('//text()[matches(., "\b%s\b")]', $keyword)
Note that this does not affect whatever your function parseText is doing. So while <Tagname>This is a sentence containing the word massager.</Tagname> will no longer be selected, I make no guarantee about what will happen to <Tagname>The massager gives the customer a massage.</Tagname>. To make sure that this is handled properly, your parseText function will need to be modified, possibly in a similar manner as above.
Note also that the modifications you might need to make to parseText could make the above change unnecessary.
Text manipulation in XSLT 1.0 is very limited, but if you can't move to 2.0 (why not?) then translate() often comes to the rescue. Use translate() to replace all common punctuation characters by spaces, use concat() to add a space fore and aft, and then test for contains(' massage ') (note the spaces).
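Since PHP's DOMXPath speaks XPath 1.0, the translate()/concat() trick can be tried directly there. The following is only a sketch: the sample markup and the punctuation list are assumptions, the match is case-sensitive, and the keyword must not contain a double quote because it is pasted into the expression.

```php
<?php
// Hypothetical sample text; only the second sentence has "massage" as a whole word.
$html = '<div><p>A massager is a device.</p><p>Book a massage, today.</p></div>';
$keyword = 'massage';

$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Punctuation to treat as word separators (an assumed, incomplete list).
$punctuation = '.,;:!?()';

// translate() maps each punctuation character to a space and concat() pads both ends,
// so a whole word is always surrounded by spaces in the normalized string.
$query = sprintf(
    '//text()[contains(concat(" ", translate(., "%s", "%s"), " "), " %s ")]',
    $punctuation,
    str_repeat(' ', strlen($punctuation)),
    $keyword
);

$hits = array();
foreach ($xpath->query($query) as $textNode) {
    $hits[] = $textNode->nodeValue;
}
print_r($hits);
```

This only decides which text nodes are selected; whatever replaces the keyword inside a selected node still needs its own whole-word check.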
When matches() and ends-with() are not supported, you can work around it with starts-with() and string-length().
Example:
[starts-with(.,'$var') and string-length(.)=string-length('$var')]
This tests that the whole string equals $var, which is what matches() would do with a pattern anchored at both ends.
This actually turned out to be incredibly simple: I just added a space onto the end of the $keyword variable, so now it only returns true when the entire word is found.
foreach ($autolinks as $autolink):
$keyword = trim($autolink['keyword']) . ' ';
$xlink = mb_convert_encoding(html_entity_decode($autolink['link'], ENT_COMPAT, "UTF-8"), 'HTML-ENTITIES', "UTF-8");
$target = $autolink['target'];
$tooltip = isset($autolink['tooltip']);
$pTexts = $xpath->query(
sprintf('//text()[contains(., "%s")]', $keyword)
);
foreach ($pTexts as $pText):
$this->parseText($pText, $keyword, $dom, $xlink, $target, $tooltip);
endforeach;
endforeach;
Thank you to everyone who tried to help.
I would like to get the URLs from a webpage that start with "../category/" from the tags below:
<a href="../category/product/pc.html">PC</a><br>
<a href="../category/product/carpet.html">Carpet</a><br>
Any suggestion would be very much appreciated.
Thanks!
No regular expression is required. A simple XPath query with DOM will suffice:
$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$nodes = $xpath->query('//a[starts-with(@href, "../category/")]');
foreach ($nodes as $node) {
echo $node->nodeValue.' = '.$node->getAttribute('href').PHP_EOL;
}
Will print:
PC = ../category/product/pc.html
Carpet = ../category/product/carpet.html
This regex searches for your ../category/ string:
preg_match_all('#......="(\.\./category/.*?)"#', $test, $matches);
Literal text in the pattern matches itself. You can replace the leading dots with something more specific. Only the dots that must match literally need escaping, as \. in the pattern. The .*? lazily matches a variable-length string, and the () captures the matched path name so that it appears in $matches. The manual explains the rest of the syntax: http://www.php.net/manual/en/book.pcre.php
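For illustration, here is a sketch with the placeholder dots replaced by a literal href, run against assumed sample markup modeled on the question (the "Other" link is made up to show a non-match).

```php
<?php
// Sample markup (assumed); the middle link must not be captured.
$test = '<a href="../category/product/pc.html">PC</a><br>'
      . '<a href="../other/page.html">Other</a><br>'
      . '<a href="../category/product/carpet.html">Carpet</a><br>';

// "\.\." matches two literal dots; ".*?" lazily matches up to the next double quote.
preg_match_all('#href="(\.\./category/.*?)"#', $test, $matches);

// $matches[0] holds the full matches, $matches[1] the captured paths.
print_r($matches[1]);
```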
I would like to remove any HTML tag that is empty or contains only whitespace.
That is, to get from:
$string = "<b>text</b><b><span> </span></b><p> <br/></p><b></b><font size='4'></font>";
to:
$string = "<b>text</b>";
Here is an approach with DOM:
// init the document
$dom = new DOMDocument;
$dom->loadHTML($string);
// fetch all the wanted nodes
$xp = new DOMXPath($dom);
foreach($xp->query('//*[not(node()) or normalize-space() = ""]') as $node) {
$node->parentNode->removeChild($node);
}
// output the cleaned markup
echo $dom->saveXml(
$dom->getElementsByTagName('body')->item(0)
);
This would output something like
<body><b>text</b></body>
XML documents require a root element, so there is no way to omit that. You can str_replace it though. The above can handle broken HTML.
If you want to selectively remove specific nodes, adjust the XPath query.
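As a rough sketch of such an adjustment, the query below removes only b and span elements whose text content is empty or whitespace, and leaves everything else alone. The choice of those two element names is just an assumption for the example.

```php
<?php
$string = "<b>text</b><b><span> </span></b><p> <br/></p><b></b><font size='4'></font>";

$dom = new DOMDocument;
$dom->loadHTML($string);
$xp = new DOMXPath($dom);

// Select only <b> and <span> elements with no non-whitespace text content;
// the <p> and <font> elements are not selected.
foreach ($xp->query('//*[(self::b or self::span) and normalize-space() = ""]') as $node) {
    if ($node->parentNode !== null) {
        $node->parentNode->removeChild($node);
    }
}

// Serialize the <body> element that loadHTML() wrapped around the fragment.
$cleaned = $dom->saveXML($dom->getElementsByTagName('body')->item(0));
echo $cleaned;
```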
Also see
How do you parse and process HTML/XML in PHP?
Locating the node by value containing whitespaces using XPath
function stripEmptyTags ($result)
{
$regexps = array (
'~<(\w+)\b[^\>]*>\s*</\\1>~',
'~<\w+\s*/>~'
);
do
{
$string = $result;
$result = preg_replace ($regexps, '', $string);
}
while ($result != $string);
return $result;
}
$string = "<b>text</b><b><span> </span></b><p> <br/></p><b></b><font size='4'></font>";
echo stripEmptyTags ($string);
You will need to run the replacement multiple times in order to do this with regular expressions alone.
The regex that does this is (written for preg_replace, which replaces all matches by default, so no g flag is needed):
/<(?:(\w+)(?: [^>]*)?> *<\/\1>)|(?:<\w+(?: [^>]*)?\/>)/
On your string, for example, you have to run it at least twice: the first pass removes the <br/>, and the second removes the remaining <p> </p>.
I have a script that returns the following in a variable called $content
<body>
<p><span class=\"c-sc\">dgdfgdf</span></p>
</body>
I however need to place everything between the body tag inside an array called matches
I do the following to match the stuff between the body tag
preg_match('/<body>(.*)<\/body>/',$content,$matches);
but the $matches array is empty. How can I get it to return everything inside the body tag?
Don't try to process html with regular expressions! Use PHP's builtin parser instead:
$dom = new DOMDocument;
$dom->loadHTML($content);
$bodies = $dom->getElementsByTagName('body');
assert($bodies->length === 1);
$body = $bodies->item(0);
// Serialize each child node of <body> and collect the markup in $matches
$matches = array();
foreach ($body->childNodes as $child) {
    $matches[] = $dom->saveHTML($child);
}
You should not use regular expressions to parse HTML.
Your particular problem in this case is you need to add the DOTALL modifier so that the dot matches newlines.
preg_match('/<body>(.*)<\/body>/s', $content, $matches);
But seriously, use an HTML parser instead. There are so many ways that the above regular expression can break.
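To see the effect of the modifier, here is a small sketch that runs the same pattern with and without /s on a string modeled on the sample from the question (the exact string is an assumption).

```php
<?php
// <body> and </body> sit on different lines, as in the question.
$content = "<body>\n<p><span class=\"c-sc\">dgdfgdf</span></p>\n</body>";

// Without /s the dot does not match newlines, so the pattern cannot span the lines.
$without = preg_match('/<body>(.*)<\/body>/', $content, $m1);

// With /s (PCRE_DOTALL) the dot also matches newlines.
$with = preg_match('/<body>(.*)<\/body>/s', $content, $m2);

// trim() drops the newlines captured just inside the body tags.
$inner = trim($m2[1]);
var_dump($without, $with, $inner);
```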
If for some reason you don't have DOMDocument installed, try this
Step 1. Download simple_html_dom
Step 2. Read the documentation about how to use its selectors
require_once("simple_html_dom.php");
$doc = new simple_html_dom();
$doc->load($someHtmlString);
// find() returns an array unless an index is given; 0 selects the first <body>
$body = $doc->find("body", 0)->innertext;