I'm trying to parse data like this:
<vin:layout name="Page" xmlns:vin="http://www.example.com/vin">
<header>
{someText}
<div>
<!-- some invalid xml code -->
<aas>
<nav class="main">
<vin:show section="Menu" />
</nav>
</div>
</header>
</vin:layout>
How can I parse data like this in PHP?
I tried DOM, but it does not work because of the malformed XML inside the root element. Can I tell the parser that everything without the vin namespace is text?
I would probably throw a sort of tag-soup parser at it: something that can read your format, which apart from those deficiencies looks pretty cleanly written. Nothing in the text stands in the way of a simple scanner based on regular expressions. I called mine Tagsoup, and it has just the four node types you've got: Starttag, Endtag, Text and Comment. For the tags you need to know their tag name and namespace prefix. These names only resemble XML/HTML for convenience; in fact this is all "roll your own", so do not stretch the terms to any standard.
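To give an idea of how little such a scanner needs, here is a minimal sketch. It is not the Tagsoup code from the gist: the function name sketchScan and the array layout are made up for illustration, and it only assumes that tags start with a letter and contain no ">" inside attribute values.

```php
<?php
// Hypothetical sketch, not the Tagsoup code from the gist.
// Splits input into start tags, end tags, comments and text.
function sketchScan(string $input): array
{
    // Capture comments and tags as delimiters; what lies between them is text.
    $parts = preg_split(
        '~(<!--.*?-->|</?[A-Za-z][^>]*>)~s',
        $input,
        -1,
        PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY
    );

    $nodes = [];
    foreach ($parts as $part) {
        $node = ['type' => 'text', 'prefix' => null, 'name' => null, 'raw' => $part];
        if (strncmp($part, '<!--', 4) === 0) {
            $node['type'] = 'comment';
        } elseif (substr($part, -1) === '>'
            && preg_match('~^<(/?)(?:([A-Za-z][\w.-]*):)?([A-Za-z][\w.-]*)~', $part, $m)) {
            $node['type']   = $m[1] === '/' ? 'endtag' : 'starttag';
            $node['prefix'] = $m[2] === '' ? null : $m[2];
            $node['name']   = $m[3];
        }
        $nodes[] = $node;
    }
    return $nodes;
}

foreach (sketchScan('<vin:layout><header>{someText}</header><!-- note --></vin:layout>') as $node) {
    echo $node['type'], ': ', $node['raw'], "\n";
}
```

Everything that is neither a comment nor a tag simply falls through as text, which is what lets the malformed parts pass unharmed.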
Escaping every tag (start or end) that does not carry the namespace prefix could look like this ($string contains the data from your question):
$scanner = new TagsoupIterator($string);
$nsPrefix = 'vin';
foreach ($scanner as $node) {
$isTag = $node instanceof TagsoupTag;
$isOfNs = $isTag && $node->getTagNsPrefix() === $nsPrefix;
if ($isTag && !$isOfNs) {
$node = strtr($node, ['&' => '&amp;', '<' => '&lt;']);
}
echo $node;
}
Output:
<vin:layout name="Page" xmlns:vin="http://www.example.com/vin">
&lt;header>
{someText}
&lt;div>
<!-- some invalid xml code -->
&lt;aas>
&lt;nav class="main">
<vin:show section="Menu" />
&lt;/nav>
&lt;/div>
&lt;/header>
</vin:layout>
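If you do not want a scanner class at all, roughly the same escaping can be approximated with a single regular expression. This is only a sketch under assumptions: sketchEscapeForeignTags is a made-up name, tags are assumed to start with a letter, and unlike the strtr() call above it does not touch "&" inside the tags.

```php
<?php
// Hypothetical helper: escapes the "<" of every tag that lacks the given prefix.
function sketchEscapeForeignTags(string $input, string $nsPrefix): string
{
    $prefix = preg_quote($nsPrefix, '~');
    // "<" followed by an optional "/" and a letter starts a tag; comments ("<!--") are skipped.
    // The second lookahead leaves tags with the wanted prefix untouched.
    return preg_replace('~<(?=/?[A-Za-z])(?!/?' . $prefix . ':)~', '&lt;', $input);
}

echo sketchEscapeForeignTags('<vin:layout><header>text</header><!-- c --></vin:layout>', 'vin');
```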
A usage to extract everything inside a certain tag of a namespace could look like:
$scanner = new TagsoupIterator($string);
$parser = new TagsoupForwardNavigator($scanner);
$startTagWithNsPrefix = function ($namespace) {
return function (TagsoupNode $node) use ($namespace) {
/* @var $node TagsoupTag */
return $node->getType() === Tagsoup::NODETYPE_STARTTAG
&& $node->getTagNsPrefix() === $namespace;
};
};
$start = $parser->nextCondition($startTagWithNsPrefix('vin'));
$tag = $start->getTagName();
$parser->next();
echo $html = implode($parser->getUntilEndTag($tag));
Output:
<header>
{someText}
<div>
<!-- some invalid xml code -->
<aas>
<nav class="main">
<vin:show section="Menu" />
</nav>
</div>
</header>
The next part is to replace that part of $string. As Tagsoup offers binary offsets and lengths, this is easy (I take a slightly dirty shortcut via SimpleXML):
$xml = substr($string, 0, $start->getEnd()) . substr($string, $parser->getOffset());
$doc = new SimpleXMLElement($xml);
$doc[0] = $html;
echo $doc->asXML();
Output:
<vin:layout xmlns:vin="http://www.example.com/vin" name="Page">
&lt;header&gt;
{someText}
&lt;div&gt;
&lt;!-- some invalid xml code --&gt;
&lt;aas&gt;
&lt;nav class="main"&gt;
&lt;vin:show section="Menu" /&gt;
&lt;/nav&gt;
&lt;/div&gt;
&lt;/header&gt;
</vin:layout>
Depending on your concrete needs, the implementation would have to change. For example, this one does not support nesting tags of the same name inside each other. It does not fail on them, but it does not handle them either. I have no idea whether you have that case; if so, you would need to add an open/close counter. The navigator class could easily be extended for that, even to offer two kinds of end-tag finding methods.
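Such an open/close counter could be sketched as follows. This is a standalone illustration on plain strings, not an extension of the navigator class: sketchFindMatchingEnd is a hypothetical name, and $offset is assumed to point just behind the start tag whose end you are looking for.

```php
<?php
// Hypothetical helper, not part of the Tagsoup gist.
// Returns the offset of the end tag closing the element whose content starts at $offset, or null.
function sketchFindMatchingEnd(string $input, string $tag, int $offset): ?int
{
    $pattern = '~<(/?)' . preg_quote($tag, '~') . '(?=[\s/>])[^>]*>~';
    preg_match_all($pattern, $input, $matches, PREG_SET_ORDER | PREG_OFFSET_CAPTURE, $offset);

    $depth = 1; // the start tag before $offset is already open
    foreach ($matches as $match) {
        [$raw, $position] = $match[0];
        if ($match[1][0] === '/') {
            if (--$depth === 0) {
                return $position;
            }
        } elseif (substr($raw, -2) !== '/>') {
            $depth++; // a nested start tag; self-closing tags do not count
        }
    }
    return null;
}

$string = '<div><div>x</div>y</div>z';
$start  = strlen('<div>');
$end    = sketchFindMatchingEnd($string, 'div', $start);
echo substr($string, $start, $end - $start); // inner content of the outer div
```

The counter goes up on every nested start tag of the same name and down on every end tag, so the search only stops at the end tag that brings it back to zero.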
The examples given here are using the Tagsoup which you can see at this gist: https://gist.github.com/4415105
Related
I'm trying to learn how to curl/scrape and echo text with PHP. So far I've learned how to do it with tags and unique divs. For example, the code below successfully scrapes and echoes text using the div with class="market":
<?php
$doc = new DOMDocument;
// We don't want to bother with white spaces
$doc->preserveWhiteSpace = false;
// Most HTML Developers are chimps and produce invalid markup...
$doc->strictErrorChecking = false;
$doc->recover = true;
$doc->loadHTMLFile('http://www.nbcnews.com/business');
$xpath = new DOMXPath($doc);
$query = "//div[@class='market']";
$entries = $xpath->query($query);
foreach ($entries as $entry) {
echo trim($entry->textContent); // use `trim` to eliminate spaces
}
However, I'd like to get more precise. In the situation below, the a class and the various div tags are used many times throughout that website, and the only unique aspect of the code is its title, which in this case is "10-year yield". Is it possible to adjust my existing PHP code to scrape using a title identifier? Otherwise, I'm not sure how to grab something like this without grabbing everything else that uses similar tags. Thank you for any thoughts! (In the case below I'm trying to echo the "2.20%".)
<!-- BEGIN: Quote -->
<li class="row">
<a class="quote" href="/data/bonds/index.html">
<span class="column quote-name" title="10-year yield">10-year
yield</span>
<span class="column quote-col"><span class="pre-currency-symbol">
</span><span stream="last_572094" class="quote-dollar" title="10-year
yield">2.20</span><span class="post-currency-symbol">%</span></span>
<span stream="changePct_572094" class="column quote-change"><span
class="posData">+0.00</span></span>
</a>
</li>
<!-- END: Quote -->
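One way this might be approached is to put the title into the XPath predicate. The sketch below assumes markup shaped like the fragment above, shortened into a string of my own ($html is not the real page); normalize-space() is used because the title attribute in the fragment is wrapped across lines.

```php
<?php
// Shortened stand-in for the quote fragment; the real page would be loaded with loadHTMLFile().
$html = '<ul><li class="row"><a class="quote" href="/data/bonds/index.html">'
      . '<span class="column quote-name" title="10-year yield">10-year yield</span>'
      . '<span class="column quote-col"><span class="quote-dollar" title="10-year yield">2.20</span>'
      . '<span class="post-currency-symbol">%</span></span></a></li></ul>';

libxml_use_internal_errors(true); // keep warnings about sloppy markup quiet
$doc = new DOMDocument;
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$query = "//span[@class='quote-dollar'][normalize-space(@title)='10-year yield']";
$value = '';
foreach ($xpath->query($query) as $span) {
    // The percent sign lives in a sibling span, so fetch it relative to the match.
    $symbol = $xpath->evaluate("string(following-sibling::span[@class='post-currency-symbol'])", $span);
    $value  = trim($span->textContent) . trim($symbol);
}
echo $value, "\n";
```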
I have a page that contains several hyperlinks. The ones I want to get are of the format:
<html>
<body>
<div id="diva">
<a href="/123" >text2</a>
</div>
<div id="divb">
<a href="/345" >text1</a>
<a href="/678" >text2</a>
</div>
</body>
</html>
I want to extract the three hrefs: 123, 345, and 678.
I know how to get all the hyperlinks using $gm = $xpath->query("//a") and then loop through them to get the href attribute.
Is there some sort of regexp to get only the attributes in the above format (i.e. "/digits")?
Thanks
XPath 1.0, which is the version supported by DOMXPath, has no regex functionality. You can, however, easily write your own PHP function that executes a regular expression and call it from DOMXPath if you need one, as mentioned in this other answer.
There is an XPath 1.0 way to test whether an attribute value is a number, which you can apply to the part of the href attribute value after the / character to test whether it follows the pattern /digits:
//a[number(substring-after(@href,'/')) = substring-after(@href,'/')]
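Plugged into DOMXPath, the expression could be used like this. It is a small sketch: the markup is the one from the question plus one non-numeric link (/about) that I added to show what gets filtered out.

```php
<?php
$html = '<div id="diva"><a href="/123">text2</a></div>'
      . '<div id="divb"><a href="/345">text1</a><a href="/678">text2</a>'
      . '<a href="/about">not numeric</a></div>';

$doc = new DOMDocument;
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$hrefs = [];
// number() yields NaN for non-numeric text, and NaN never equals anything.
foreach ($xpath->query("//a[number(substring-after(@href,'/')) = substring-after(@href,'/')]") as $a) {
    $hrefs[] = $a->getAttribute('href');
}
echo implode("\n", $hrefs), "\n";
```

Note that this tests for a number rather than for digits only, so a value such as /1.5 would pass as well.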
UPDATE :
For the sake of completeness, here is a working example of calling PHP function preg_match from DOMXPath::query() to accomplish the same task :
$raw_data = <<<XML
<html>
<body>
<div id="diva">
<a href="/123" >text2</a>
</div>
<div id="divb">
<a href="/345" >text1</a>
<a href="/678" >text2</a>
</div>
</body>
</html>
XML;
$doc = new DOMDocument;
$doc->loadXML($raw_data);
$xpath = new DOMXPath($doc);
$xpath->registerNamespace("php", "http://php.net/xpath");
$xpath->registerPHPFunctions("preg_match");
// php:function's parameters below are :
// parameter 1: PHP function name
// parameter 2: PHP function's 1st parameter, the pattern
// parameter 3: PHP function's 2nd parameter, the string
$gm = $xpath->query("//a[php:function('preg_match', '~^/\d+$~', string(@href))]");
foreach ($gm as $a) {
echo $a->getAttribute("href") . "\n";
}
I am working on a project that requires the use of PHP Simple HTML DOM Parser, and I need a way to add a custom attribute to a number of elements based on class name.
I am able to loop through the elements with a foreach loop, and it would be easy to set a standard attribute such as href, but I can't find a way to add a custom attribute.
The closest I can guess is something like:
foreach($html -> find(".myelems") as $element) {
$element->myattr="customvalue";
}
but this doesn't work.
I have seen a number of other questions on similar topics, but they all suggest using an alternative method for parsing html (domDocument etc.). In my case this is not an option, as I must use Simple HTML DOM Parser.
Did you try it? Try this example (Sample: adding data tags).
include 'simple_html_dom.php';
$html_string = '
<style>.myelems{color:green}</style>
<div>
<p class="myelems">text inside 1</p>
<p class="myelems">text inside 2</p>
<p class="myelems">text inside 3</p>
<p>simple text 1</p>
<p>simple text 2</p>
</div>
';
$html = str_get_html($html_string);
foreach($html->find('div p[class="myelems"]') as $key => $p_tags) {
$p_tags->{'data-index'} = $key;
}
echo htmlentities($html);
Output:
<style>.myelems{color:green}</style>
<div>
<p class="myelems" data-index="0">text inside 1</p>
<p class="myelems" data-index="1">text inside 2</p>
<p class="myelems" data-index="2">text inside 3</p>
<p>simple text 1</p>
<p>simple text 2</p>
</div>
Well, I think this is a rather old post, but I still think it will help somebody like me :)
In my case I added a custom attribute to an image tag:
$markup = file_get_contents('pathtohtmlfile');
//Create a new DOM document
$dom = new DOMDocument;
//Parse the HTML. The @ is used to suppress any parsing errors
//that will be thrown if the $markup string isn't valid XHTML.
@$dom->loadHTML($markup);
//Get all images tags
$imgs = $dom->getElementsByTagName('img');
//Iterate over the extracted images
foreach ($imgs as $img)
{
$img->setAttribute('customAttr', 'customAttrVal');
}
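To check the result you can serialize the changed nodes again. A self-contained sketch of the same idea, with a made-up $markup string instead of a file and a data-index attribute instead of customAttr:

```php
<?php
$markup = '<p><img src="a.png"><img src="b.png"></p>';

$dom = new DOMDocument;
$dom->loadHTML($markup);

$tags  = [];
$index = 0;
foreach ($dom->getElementsByTagName('img') as $img) {
    $img->setAttribute('data-index', (string) $index++);
    // saveHTML() with a node argument serializes only that node
    $tags[] = $dom->saveHTML($img);
}
echo implode("\n", $tags), "\n";
```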
How would I get content from HTML between h3 tags inside an element that has class pricebox? For example, the following string fragment
<!-- snip a lot of other html content -->
<div class="pricebox">
<div class="misc_info">Some misc info</div>
<h3>599.99</h3>
</div>
<!-- snip a lot of other html content -->
The catch is 599.99 has to be the first match returned, that is if the function call is
preg_match_all($regex,$string,$matches)
the 599.99 has to be in $matches[0][1] (because I use the same script to get numbers from dissimilar looking strings with different $regex - the script looks for the first match).
Try using XPath; definitely NOT RegEx.
Code :
$html = new DOMDocument();
@$html->loadHtmlFile('http://www.path.to/your_html_file_html');
$xpath = new DOMXPath( $html );
$nodes = $xpath->query("//div[@class='pricebox']/h3");
foreach ($nodes as $node)
{
echo $node->nodeValue . "\n";
}
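If you only need the first match, as in the question, you do not even have to loop. A sketch on a string of my own that mimics the fragment (the extra "other" div is added only to show that it is ignored):

```php
<?php
$string = '<div class="other"><h3>1.00</h3></div>'
        . '<div class="pricebox"><div class="misc_info">Some misc info</div><h3>599.99</h3></div>';

$doc = new DOMDocument;
$doc->loadHTML($string);
$xpath = new DOMXPath($doc);

// string() takes the first node of the node set in document order
$price = $xpath->evaluate("string(//div[@class='pricebox']/h3)");
echo $price, "\n";
```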
Let's say I have the following web page:
<html>
<body>
<div class="transform">
<span>1</span>
</div>
<div class="transform">
<span>2</span>
</div>
<div class="transform">
<span>3</span>
</div>
</body>
</html>
I would like to find all div elements that have the class transform and fetch the text in each div element.
I know I can do that easily with regular expressions, but I would like to know how to do it without them, by parsing the XML and finding the nodes I need.
Update:
I know that in this example I can just iterate through all the divs, but this is just an example to illustrate what I need.
In this example I need to query for divs that have the attribute class=transform.
Thanks!
You could use SimpleXML; see the example below:
$string = "<?xml version='1.0'?>
<html>
<body>
<div class='transform'>
<span>1</span>
</div>
<div>
<span>2</span>
</div>
<div class='transform'>
<span>3</span>
</div>
</body>
</html>";
$xml = simplexml_load_string($string);
$result = $xml->xpath("//div[#class = 'transform']");
foreach($result as $node) {
echo "span " . $node->span . "<br />";
}
Updated it with xpath...
You can use xpath to address the items. For that particular query, you'd use:
div[contains(concat(" ",@class," "), concat(" ","transform"," "))]
Full PHP example:
<?php
$document = new DomDocument();
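// $html holds the markup of the page shown in the question
$document->loadHtml($html);
$xpath = new DomXPath($document);
// The leading // searches the whole document; a bare "div" would only match children of the context node
foreach ($xpath->query('//div[contains(concat(" ",@class," "), concat(" ","transform"," "))]') as $div) {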
var_dump($div);
}
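The concat() dance may look odd, so here is a small comparison of three ways to test the class. The markup is made up for the comparison; only the third query treats the class attribute as a space-separated list of names.

```php
<?php
$html = '<div class="transform">1</div>'
      . '<div class="box transform">2</div>'
      . '<div class="transformer">3</div>';

$document = new DOMDocument;
$document->loadHTML($html);
$xpath = new DOMXPath($document);

// Exact comparison misses elements that carry more than one class.
$exact = $xpath->query("//div[@class='transform']")->length;
// A plain substring test also matches "transformer".
$naive = $xpath->query("//div[contains(@class,'transform')]")->length;
// Padding both sides with spaces matches the class name as a whole word.
$token = $xpath->query("//div[contains(concat(' ',@class,' '),' transform ')]")->length;

echo "exact: $exact, naive: $naive, token: $token\n";
```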
If you know CSS, here's a handy CSS-selector to XPath-expression mapping: http://plasmasturm.org/log/444/ -- You can find the above example listed there, as well as other common queries.
If you use it a lot, you might find my csslib library handy. It offers a wrapper csslib_DomCssQuery, which is similar to DomXPath, but using CSS-selectors instead.
OK, what I wanted can easily be achieved using PHP XPath.
Example:
http://ditio.net/2008/12/01/php-xpath-tutorial-advanced-xml-part-1/