Hay guys i need help on a regex.
I'm using file_get_contents() to get the source of a page, i want to then loop through the source and find all the and extract all the HREF values into an array.
Thanks
You should better use a real parser like SimpleXML or DOMDocument than regular expressions. Here’s an example with DOMDocument that will give you an array of A elements:
$doc = new DOMDocument();
$doc->loadHTML($str);
$aElements = $doc->getElementsByTagName("a");
foreach ($aElements as $aElement) {
if ($aElement->hasAttribute("href")) {
// link; use $aElement->getAttribute("href") to retrieve the value
} else {
// not a link
}
}
Related
Given the php code:
$xml = <<<EOF
<articles>
<article>
This is a link
<link>Title</link>
with some text following it.
</article>
</articles>
EOF;
function traverse($xml) {
$result = "";
foreach($xml->children() as $x) {
if ($x->count()) {
$result .= traverse($x);
}
else {
$result .= $x;
}
}
return $result;
}
$parser = new SimpleXMLElement($xml);
traverse($parser);
I expected the function traverse() to return:
This is a link Title with some text following it.
However, it returns only:
Title
Is there a way to get the expected result using simpleXML (obviously for the purpose of consuming the data rather than just returning it as in this simple example)?
There might be ways to achieve what you want using only SimpleXML, but in this case, the simplest way to do it is to use DOM. The good news is if you're already using SimpleXML, you don't have to change anything as DOM and SimpleXML are basically interchangeable:
// either
$articles = simplexml_load_string($xml);
echo dom_import_simplexml($articles)->textContent;
// or
$dom = new DOMDocument;
$dom->loadXML($xml);
echo $dom->documentElement->textContent;
Assuming your task is to iterate over each <article/> and get its content, your code will look like
$articles = simplexml_load_string($xml);
foreach ($articles->article as $article)
{
$articleText = dom_import_simplexml($article)->textContent;
}
node->asXML();// It's the simple solution i think !!
So, the simple answer to my question was: Simplexml can't process this kind of XML. Use DomDocument instead.
This example shows how to traverse the entire XML. It seems that DomDocument will work with any XML whereas SimpleXML requires the XML to be simple.
function attrs($list) {
$result = "";
foreach ($list as $attr) {
$result .= " $attr->name='$attr->value'";
}
return $result;
}
function parseTree($xml) {
$result = "";
foreach ($xml->childNodes AS $item) {
if ($item->nodeType == 1) {
$result .= "<$item->nodeName" . attrs($item->attributes) . ">" . parseTree($item) . "</$item->nodeName>";
}
else {
$result .= $item->nodeValue;
}
}
return $result;
}
$xmlDoc = new DOMDocument();
$xmlDoc->loadXML($xml);
print parseTree($xmlDoc->documentElement);
You could also load the xml using simpleXML and then convert it to DOM using dom_import_simplexml() as Josh said. This would be useful, if you are using simpleXml to filter nodes for parsing, e.g. using XPath.
However, I don't actually use simpleXML, so for me that would be taking the long way around.
$simpleXml = new SimpleXMLElement($xml);
$xmlDom = dom_import_simplexml($simpleXml);
print parseTree($xmlDom);
Thank you for all the help!
You can get the text node of a DOM element with simplexml just by treating it like a string:
foreach($xml->children() as $x) {
$result .= "$x"
However, this prints out:
This is a link
with some text following it.
TitleTitle
..because the text node is treated as one block and there is no way to tell where the child fits in inside the text node. The child node is also added twice because of the other else {}, but you can just take that out.
Sorry if I didn't help much, but I don't think there's any way to find out where the child node fits in the text node unless the xml is consistent (but then, why not use tags). If you know what element you want to strip the text out of, strip_tags() will work great.
This has already been answered, but CASTING TO STRING ( i.e. $sString = (string) oSimpleXMLNode->TagName) always worked for me.
Try this:
$parser = new SimpleXMLElement($xml);
echo html_entity_decode(strip_tags($parser->asXML()));
That's pretty much equivalent to:
$parser = simplexml_load_string($xml);
echo dom_import_simplexml($parser)->textContent;
Like #tandu said, it's not possible, but if you can modify your XML, this will work:
$xml = <<<EOF
<articles>
<article>
This is a link
</article>
<link>Title</link>
<article>
with some text following it.
</article>
</articles>
I'm using preg_match_all method to get the urls from inside the anchor tag on a page. It works but when i'm getting them, before adding them to the array i would like to wrap them with '(like this 'url'):
preg_match_all('!<a href="(.*?)">!', $anchors, $urls);
Is there a way to do that? If yes, can you point me towards the right direction and the proper way that this could be done?
Thank you! :D
Instead of using a regex to parse html you could use DOMDocument and getElementsByTagName
$dom = new DOMDocument;
$dom->loadHTMLFile("yourfile.html");
$anchors= $dom->getElementsByTagName("a");
$hrefs = [];
foreach ($anchors as $anchor) {
if ($anchor->hasAttribute("href")) {
$hrefs[] = "'{$anchor->getAttribute('href')}'";
}
}
Say I have the following string:
<a name="anchor" title="anchor title">
Currently I can extract name and title with strpos and substr, but I want to do it right. How can I do this with regex? And what if I wanted to extract from many of these tags within a block of text?
I've tried this regex:
/name="([A-Z,a-z])\w+/g
But it gets the name=" part as well, I just want the value.
The regex (\S+)=["']?((?:.(?!["']?\s+(?:\S+)=|[>"']))+.)["']? can be used to extract all attributes
DOMDocument example:
<?php
$titles = array();
$doc = new DOMDocument();
$doc->loadHTML("<html><body>Test<br><a name="anchor" title="anchor title"></body></html>");
$links = $doc->getElementsByTagName('a');
if ($links->length!=0) {
foreach ($links as $a) {
$titles[] = $a->getAttribute('title');
}
}
?>
You commented: "I'm actually parsing the data before the page is rendered so DOM is not possible, right?"
We're working with the scraped HTML, so we construct a DOM with these functions and parse like XML.
Good examples in the comments here: http://php.net/manual/en/domdocument.getelementsbytagname.php
i want get all link in page by class "page1" in php.
the same code in jquery
$("a#page1").echo(function()
{
});
can do that in php?
$pattern = '`.*?((http|ftp)://[\w#$&+,\/:;=?#%.-]+)[^\w#$&+,\/:;=?#%.-]*?`i';
preg_match_all($pattern,$page_g,$matches);
this code get all href in the $page_g but its not work for class="page1".
i want only all href in $page_g by class="page1"
can help me for optimize reqular ex or other way?
for example
$page_g="the <strong>office</strong> us s01 05 xvid mu asd";
i want return only /?s=cache:16001429:office+s01e02
tnx
You lack the expertise to use a regular expression for that. Hencewhy using DOMdocument is the advisable solution here. If you want to have a simpler API then use the jQuery-lookalikes phpQuery or QueryPath:
$link = qp($html)->find("a#page1")->attr("href");
print $link;
Edit Edited since you clarified the question.
To get all <a> links with the class .page1:
// Load the HTML from a file
$your_HTML_string = file_get_contents("html_filename.html");
$doc = new DOMDocument();
$doc->loadHTML($your_HTML_string);
// Then select all <a> tags under #page1
$a_links = $doc->getElementsByTagName("a");
foreach ($a_links as $link) {
// If they have more than one class,
// you'll need to use (strpos($link->getAttribute("class"), "page1") >=0)
// instead of == "page1"
if ($link->getAttribute("class") == "page1") {
// do something
}
}
Use DomDocument to parse HTML page, here's a tutorial:
Tutorial
DOM is preferred to be used here, as regex is difficult to maintain if underlying HTML changes, besides, DOM can deal with invalid HTML and provides you access to other HTML parsing related tools.
So, assuming that have a file that contains HTML, and you are searching for classes, this could be the way to go:
$doc = new DOMDocument;
$doc->load(PATH_TO_YOUR_FILE);
//we will use Xpath to find all a containing your class, as a tag can have more than one class and it's just easier to do it with Xpath.
$xpath = new DOMXpath($doc);
$list = $xpath->query("//a[contains(#class, 'page1')]");
foreach ($list as $a_tag) {
$href = $a_tag->getAttribute('href');
//do something
}
Given the php code:
$xml = <<<EOF
<articles>
<article>
This is a link
<link>Title</link>
with some text following it.
</article>
</articles>
EOF;
function traverse($xml) {
$result = "";
foreach($xml->children() as $x) {
if ($x->count()) {
$result .= traverse($x);
}
else {
$result .= $x;
}
}
return $result;
}
$parser = new SimpleXMLElement($xml);
traverse($parser);
I expected the function traverse() to return:
This is a link Title with some text following it.
However, it returns only:
Title
Is there a way to get the expected result using simpleXML (obviously for the purpose of consuming the data rather than just returning it as in this simple example)?
There might be ways to achieve what you want using only SimpleXML, but in this case, the simplest way to do it is to use DOM. The good news is if you're already using SimpleXML, you don't have to change anything as DOM and SimpleXML are basically interchangeable:
// either
$articles = simplexml_load_string($xml);
echo dom_import_simplexml($articles)->textContent;
// or
$dom = new DOMDocument;
$dom->loadXML($xml);
echo $dom->documentElement->textContent;
Assuming your task is to iterate over each <article/> and get its content, your code will look like
$articles = simplexml_load_string($xml);
foreach ($articles->article as $article)
{
$articleText = dom_import_simplexml($article)->textContent;
}
node->asXML();// It's the simple solution i think !!
So, the simple answer to my question was: Simplexml can't process this kind of XML. Use DomDocument instead.
This example shows how to traverse the entire XML. It seems that DomDocument will work with any XML whereas SimpleXML requires the XML to be simple.
function attrs($list) {
$result = "";
foreach ($list as $attr) {
$result .= " $attr->name='$attr->value'";
}
return $result;
}
function parseTree($xml) {
$result = "";
foreach ($xml->childNodes AS $item) {
if ($item->nodeType == 1) {
$result .= "<$item->nodeName" . attrs($item->attributes) . ">" . parseTree($item) . "</$item->nodeName>";
}
else {
$result .= $item->nodeValue;
}
}
return $result;
}
$xmlDoc = new DOMDocument();
$xmlDoc->loadXML($xml);
print parseTree($xmlDoc->documentElement);
You could also load the xml using simpleXML and then convert it to DOM using dom_import_simplexml() as Josh said. This would be useful, if you are using simpleXml to filter nodes for parsing, e.g. using XPath.
However, I don't actually use simpleXML, so for me that would be taking the long way around.
$simpleXml = new SimpleXMLElement($xml);
$xmlDom = dom_import_simplexml($simpleXml);
print parseTree($xmlDom);
Thank you for all the help!
You can get the text node of a DOM element with simplexml just by treating it like a string:
foreach($xml->children() as $x) {
$result .= "$x"
However, this prints out:
This is a link
with some text following it.
TitleTitle
..because the text node is treated as one block and there is no way to tell where the child fits in inside the text node. The child node is also added twice because of the other else {}, but you can just take that out.
Sorry if I didn't help much, but I don't think there's any way to find out where the child node fits in the text node unless the xml is consistent (but then, why not use tags). If you know what element you want to strip the text out of, strip_tags() will work great.
This has already been answered, but CASTING TO STRING ( i.e. $sString = (string) oSimpleXMLNode->TagName) always worked for me.
Try this:
$parser = new SimpleXMLElement($xml);
echo html_entity_decode(strip_tags($parser->asXML()));
That's pretty much equivalent to:
$parser = simplexml_load_string($xml);
echo dom_import_simplexml($parser)->textContent;
Like #tandu said, it's not possible, but if you can modify your XML, this will work:
$xml = <<<EOF
<articles>
<article>
This is a link
</article>
<link>Title</link>
<article>
with some text following it.
</article>
</articles>