I have a HUGE HTML document that I need to parse.
The document is a list of <p> elements all (direct) children of the body tag.
The difference is the class name. The structure is like this:
<p class="first-level"></p>
<p class="second-level"></p>
<p class="third-level"></p>
<p class="third-level"></p>
<p class="nth-levels just-for-demo-1"></p>
<p class="nth-levels just-for-demo-1"></p>
<p class="third-level"></p>
<p class="second-level"></p>
<p class="third-level"></p>
<p class="nth-levels just-for-demo-2"></p>
<p class="first-level"></p>
<p class="second-level"></p>
<p class="second-level"></p>
<p class="third-level"></p>
And so on. nth-level can be any class name that isn't first-level, second-level or third-level.
Basically it's a multi-level <ul> element very poorly marked-up.
What I want to do is parse it and obtain all <p> elements (including tag, not just innerHTML) that are between one of the class names above.
In the example above, I want to get:
<p class="nth-levels just-for-demo-1"></p>
<p class="nth-levels just-for-demo-1"></p>
and
<p class="nth-levels just-for-demo-2"></p>
How the heck can I do that please?
Thank you.
Using XPath:
//p[not(@class='first-level')][not(@class='second-level')][not(@class='third-level')]
This selects the paragraphs that carry none of the three level classes. You can then use this answer to get the outerHTML of the nodes.
Additionally, if you're familiar with jQuery, try a jQuery port to PHP. It gives you a powerful set of tools for matching elements in a document: selectors as you know them from jQuery, along with hierarchy, attribute filters, child filters, and so on (Reference).
$doc = new DOMDocument;
$doc->loadHTML(...);
$query = '//p[contains(@class, "just-for-demo-")]';
$xpath = new DOMXPath($doc);
$entries = $xpath->query($query);
foreach ($entries as $entry)
{
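// not the best solution yet: the opening tag is rebuilt by hand
$attribute = '';
foreach ($entry->attributes as $attr)
{
// the leading space separates the tag name and successive attributes
$attribute .= " {$attr->name}=\"{$attr->value}\"";
}
echo "<{$entry->nodeName}{$attribute}>{$entry->nodeValue}</{$entry->nodeName}>";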
}
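As a rough sketch of the XPath approach above, the snippet below collects the outer HTML of every paragraph whose class is not one of the three level classes. The function name findNthLevelParagraphs and the sample markup are made up for illustration. It assumes the level paragraphs carry exactly one class each (the comparison is an exact match on the whole attribute), and it relies on DOMDocument::saveHTML() accepting a node argument, which needs PHP 5.3.6 or later.

```php
<?php
// Hypothetical helper: returns the outer HTML of every <p> whose class
// attribute is not exactly one of the three "level" classes.
function findNthLevelParagraphs(string $html): array
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);   // collect parse warnings instead of printing them
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $nodes = $xpath->query(
        "//p[not(@class='first-level')]"
        . "[not(@class='second-level')]"
        . "[not(@class='third-level')]"
    );

    $result = [];
    foreach ($nodes as $node) {
        // saveHTML($node) serializes the node itself, tag included
        $result[] = $doc->saveHTML($node);
    }
    return $result;
}

// Sample input modelled on the structure in the question.
$sample = '<body>'
    . '<p class="first-level"></p>'
    . '<p class="second-level"></p>'
    . '<p class="third-level"></p>'
    . '<p class="nth-levels just-for-demo-1"></p>'
    . '<p class="third-level"></p>'
    . '<p class="nth-levels just-for-demo-2"></p>'
    . '</body>';

$found = findNthLevelParagraphs($sample);
print_r($found);
```

Note that a paragraph with no class attribute at all would also be returned, since none of the three comparisons can be true for it.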
You could open the file (with fopen or something similar) and read one line at a time. Then just check if the required string is in the line (for example with strstr) and if yes, then add it to an array or do what you need with the line.
Note: this only works if the paragraphs are on different lines each.
fopen documentation
strstr documentation
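A minimal sketch of the line-by-line idea, assuming each paragraph sits on its own line as noted above. The helper name linesContaining and the temporary file are invented for the demo, and strpos is used here in place of strstr because only a yes/no check is needed.

```php
<?php
// Hypothetical helper: collects lines containing $needle from a file,
// reading one line at a time so the whole document never sits in memory.
function linesContaining(string $path, string $needle): array
{
    $matches = [];
    $handle = fopen($path, 'r');
    if ($handle === false) {
        return $matches;   // file could not be opened
    }
    while (($line = fgets($handle)) !== false) {
        if (strpos($line, $needle) !== false) {
            $matches[] = rtrim($line, "\r\n");
        }
    }
    fclose($handle);
    return $matches;
}

// Demo with a temporary file standing in for the real document.
$tmp = tempnam(sys_get_temp_dir(), 'html');
file_put_contents(
    $tmp,
    "<p class=\"first-level\"></p>\n<p class=\"nth-levels just-for-demo-1\"></p>\n"
);
$lines = linesContaining($tmp, 'just-for-demo-');
unlink($tmp);
print_r($lines);
```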
I am attempting to get various elements inside of an li as shown below. I am pretty new to this so I may not be using the most efficient methods but this is where I have started...
EXAMPLE CODE SIMPLIFIED....
<li id='entry_0' title='09879879'>
<div ....>
<h2> The title text would go here </h2>
<span class='entrySize' ....> 20oz </span>
<span class='entryPrice' ....> $32.09 </span>
<span class='anotherEntry' ....> More Data I need To Grab </span>
.......
</div>
</li>
<li> .... With same structure as above .... 100's of entries like this </li>
I know how to pull individual parts separately, but I am having trouble grasping how to do it grouped within a portion of the HTML.
$filename = "directory/file.html";
$html = file_get_html($filename);
for($i=0; $i<=count(entryNumber);$i++)
{
$li_id = "entry_".$i;
foreach($html->find('li[id='.$li_id.']') as $li) {
echo $li->innertext;
}
}
So this gets me the content of the list item tag, with the id number as the unique attribute. I would like to grab the h2 text, entrySize, entryPrice, etc. as I iterate through the list item tags. What I don't understand is how, once I have the list item's content, I can parse its inner tags and attributes. Other parts of the full HTML document may have tags with the same id or class, so I am breaking this down into portions and then looking to parse one section at a time.
I would also like to pull the title attribute out of the li tag.
I hope my explanation makes sense.
You should probably use a DOM parser. PHP comes bundled with one, and there are many others you could use.
http://php.net/dom
PHP Simple HTML DOM Parser
<?php
$html = file_get_contents($page);
$doc = new DOMDocument();
$doc->loadHTML($html);
// now find what you need
$items = $doc->getElementsByTagName('li');
foreach ($items as $item) {
$id = $item->getAttribute('id');
// the ids in the question look like "entry_0", "entry_1", ...
if (strpos($id, 'entry_') === 0) {
// found matching li, grab its children
}
}
Use this as a baseline, we can't write all the code for you. Check out the PHP docs to finish this :) From what I have so far, you need to follow the docs to make it grab the child values, and handle them.
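For what it's worth, here is one way the "grab its children" step could look, sketched with DOMXPath rather than Simple HTML DOM. The sample markup is a cut-down copy of the question's example, and the array keys (name, size, price) are invented for illustration. The second argument to evaluate() makes each query relative to the current li, so tags with the same class elsewhere in the document are not picked up.

```php
<?php
// Cut-down sample based on the markup in the question (nowdoc, so "$" stays literal).
$html = <<<'HTML'
<ul>
  <li id="entry_0" title="09879879">
    <div>
      <h2> The title text would go here </h2>
      <span class="entrySize"> 20oz </span>
      <span class="entryPrice"> $32.09 </span>
    </div>
  </li>
</ul>
HTML;

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // keep parse warnings out of the output
$doc->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($doc);

$entries = [];
foreach ($xpath->query('//li[starts-with(@id, "entry_")]') as $li) {
    // Queries starting with "." are evaluated relative to this <li>.
    $entries[] = [
        'id'    => $li->getAttribute('id'),
        'title' => $li->getAttribute('title'),
        'name'  => trim($xpath->evaluate('string(.//h2)', $li)),
        'size'  => trim($xpath->evaluate('string(.//span[@class="entrySize"])', $li)),
        'price' => trim($xpath->evaluate('string(.//span[@class="entryPrice"])', $li)),
    ];
}
print_r($entries);
```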
I am working on a project that requires the use of PHP Simple HTML DOM Parser, and I need a way to add a custom attribute to a number of elements based on class name.
I am able to loop through the elements with a foreach loop, and it would be easy to set a standard attribute such as href, but I can't find a way to add a custom attribute.
The closest I can guess is something like:
foreach($html -> find(".myelems") as $element) {
$element->myattr="customvalue";
}
but this doesn't work.
I have seen a number of other questions on similar topics, but they all suggest using an alternative method for parsing html (domDocument etc.). In my case this is not an option, as I must use Simple HTML DOM Parser.
Did you try it? Try this example (Sample: adding data tags).
include 'simple_html_dom.php';
$html_string = '
<style>.myelems{color:green}</style>
<div>
<p class="myelems">text inside 1</p>
<p class="myelems">text inside 2</p>
<p class="myelems">text inside 3</p>
<p>simple text 1</p>
<p>simple text 2</p>
</div>
';
$html = str_get_html($html_string);
foreach($html->find('div p[class="myelems"]') as $key => $p_tags) {
$p_tags->{'data-index'} = $key;
}
echo htmlentities($html);
Output:
<style>.myelems{color:green}</style>
<div>
<p class="myelems" data-index="0">text inside 1</p>
<p class="myelems" data-index="1">text inside 2</p>
<p class="myelems" data-index="2">text inside 3</p>
<p>simple text 1</p>
<p>simple text 2</p>
</div>
Well, I know this is an old post, but I still think this will help somebody like me :)
So in my case I added custom attribute to an image tag
$markup = file_get_contents('pathtohtmlfile');
//Create a new DOM document
$dom = new DOMDocument;
//Parse the HTML. The @ is used to suppress any parsing errors
//that will be thrown if the $markup string isn't valid XHTML.
@$dom->loadHTML($markup);
//Get all images tags
$imgs = $dom->getElementsByTagName('img');
//Iterate over the extracted images
foreach ($imgs as $img)
{
$img->setAttribute('customAttr', 'customAttrVal');
}
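The snippet above changes the DOM but never prints it. A small self-contained sketch of the full round trip might look like this. The sample markup and the attribute name data-custom are made up for the demo, and libxml_use_internal_errors() is used here as an alternative to the @ operator.

```php
<?php
// Made-up sample markup standing in for the contents of the HTML file.
$markup = '<div><img src="a.png"><img src="b.png"></div>';

$dom = new DOMDocument();
libxml_use_internal_errors(true);   // alternative to suppressing warnings with @
$dom->loadHTML($markup);
libxml_clear_errors();

// Add the custom attribute to every image.
foreach ($dom->getElementsByTagName('img') as $img) {
    $img->setAttribute('data-custom', 'customAttrVal');
}

// Serialize only the <div>, not the html/body wrapper that loadHTML adds.
$div = $dom->getElementsByTagName('div')->item(0);
$output = $dom->saveHTML($div);
echo $output;
```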
I am after a specific value from a webpage: the product name that is in the h1 tag:
<div id="extendinfo_container">
<h1><strong>Product Name</strong></h1>
<div style="font-size:0;height:4px;"></div>
<p class="text_breadcrumbs">
<img src="arrow_091.gif" align="absmiddle"/>
Product Name<img src="arrow_091.gif" align="absmiddle"/>
<strong>Product Name</strong>
<div class="dotted_line_blue">
<img src="theme_shim.gif" height="1" width="100%" alt=" " />
</div>
</div>
This is a poorly structured website with more than one h1, so I cannot simply do getElementsByTagName('h1').
I want to be as specific as possible in which element I get and this is the code I have:
$doc = new DOMDocument();
@$doc->loadHTML(file_get_contents('http://url/to/website'));
// locate <div id="extendinfo_container"><a><h1><strong>(.*)</strong></h1></a> as product name
$x = new DOMXPath($doc);
$pName = $x->query('//div[@id="extendinfo_container"]/a/h1/strong');
var_dump($pName->nodeValue);
This returns null. What query do I need to use to get the content I want?
query() returns a DOMNodeList, which doesn't have a nodeValue property. You have to select one element (i.e. the first):
$pName = $x->query('//div[@id="extendinfo_container"]/a/h1/strong')->item(0);
Or iterate over it:
foreach( $pName as $el) {
var_dump( $el->nodeValue);
}
Either one of these will give you access to a DOMNode, which is what you're looking for.
PHP's DOM is picky about the HTML you load into it. It emits warnings on malformed markup, and a badly broken document may not parse the way you expect.
Turn off error suppression (@$doc->loadHTML, remove the @) and make sure that it's not puking on this page you're trying to analyze. Otherwise, your XPath query looks fine, and if the document does get loaded/parsed properly, it SHOULD work.
The query works fine. I was accessing the value wrong. Here is the correct way to access the value:
var_dump($pName->item(0)->nodeValue);
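Putting the answers together, a sketch along these lines collects parse warnings with libxml_use_internal_errors() instead of hiding them with @, and checks the node list before reading from it. The inline sample markup is an assumption based on the snippet in the question. It has no a element around the h1, so the /a step is dropped from the path here.

```php
<?php
// Assumed sample markup, trimmed from the question; note there is no <a> wrapper.
$html = '<div id="extendinfo_container"><h1><strong>Product Name</strong></h1>'
      . '<p class="text_breadcrumbs">Product Name</p></div>'
      . '<h1>Another heading</h1>';

$doc = new DOMDocument();
$previous = libxml_use_internal_errors(true);  // collect warnings instead of emitting them
$doc->loadHTML($html);
$errors = libxml_get_errors();                 // array of LibXMLError objects, possibly empty
libxml_clear_errors();
libxml_use_internal_errors($previous);         // restore the earlier setting

$x = new DOMXPath($doc);
$list = $x->query('//div[@id="extendinfo_container"]/h1/strong');

// query() returns a DOMNodeList; pick the first node only if one matched.
$productName = $list->length > 0 ? $list->item(0)->nodeValue : null;
var_dump($productName);
```

Inspecting $errors is a quick way to see whether the parser complained about the page without flooding the output.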
Let's say I have the following web page:
<html>
<body>
<div class="transform">
<span>1</span>
</div>
<div class="transform">
<span>2</span>
</div>
<div class="transform">
<span>3</span>
</div>
</body>
</html>
I would like to find all div elements that have the class transform and fetch the text in each div element.
I know I can do that easily with regular expressions, but I would like to know how to do it without them, by parsing the XML and finding the nodes I need.
Update:
I know that in this example I can just iterate through all the divs, but this is only an example to illustrate what I need.
In this example I need to query for divs that have the attribute class=transform.
Thanks!
You could use SimpleXML; see the example below:
$string = "<?xml version='1.0'?>
<html>
<body>
<div class='transform'>
<span>1</span>
</div>
<div>
<span>2</span>
</div>
<div class='transform'>
<span>3</span>
</div>
</body>
</html>";
$xml = simplexml_load_string($string);
$result = $xml->xpath("//div[@class = 'transform']");
foreach($result as $node) {
echo "span " . $node->span . "<br />";
}
Updated it with xpath...
You can use xpath to address the items. For that particular query, you'd use:
//div[contains(concat(" ",@class," "), concat(" ","transform"," "))]
Full PHP example:
<?php
$document = new DomDocument();
$document->loadHtml($html);
$xpath = new DomXPath($document);
foreach ($xpath->query('//div[contains(concat(" ",@class," "), concat(" ","transform"," "))]') as $div) {
var_dump($div);
}
If you know CSS, here's a handy CSS-selector to XPath-expression mapping: http://plasmasturm.org/log/444/ -- You can find the above example listed there, as well as other common queries.
If you use it a lot, you might find my csslib library handy. It offers a wrapper csslib_DomCssQuery, which is similar to DomXPath, but using CSS-selectors instead.
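To illustrate why the concat() trick is worth the extra typing, here is a small comparison of a plain substring test against the whole-word test. The sample markup and the helper name divTexts are invented for the demo, and normalize-space() is an optional addition that tolerates tabs or newlines inside the class attribute.

```php
<?php
// Made-up sample: the third div has a class that merely starts with "transform".
$html = '<div class="transform"><span>1</span></div>'
      . '<div class="big transform"><span>2</span></div>'
      . '<div class="transformed"><span>3</span></div>';

$document = new DOMDocument();
libxml_use_internal_errors(true);
$document->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($document);

// Hypothetical helper: returns the trimmed text of every div the query matches.
function divTexts(DOMXPath $xpath, string $query): array
{
    $texts = [];
    foreach ($xpath->query($query) as $div) {
        $texts[] = trim($div->textContent);
    }
    return $texts;
}

// Substring test: also matches class="transformed".
$substring = divTexts($xpath, '//div[contains(@class, "transform")]');

// Whole-word test: pads both sides with spaces so only the exact class matches.
$token = divTexts(
    $xpath,
    '//div[contains(concat(" ", normalize-space(@class), " "), " transform ")]'
);

print_r($substring);
print_r($token);
```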
OK, what I wanted can be easily achieved using PHP XPath.
Example:
http://ditio.net/2008/12/01/php-xpath-tutorial-advanced-xml-part-1/
Given the following string in PHP:
$html = "<div>
<p><span class='test1 test2 test3'>text 1</span></p>
<p><span class='test1 test2'>text 2</span></p>
<p><span class='test1'>text 3</span></p>
<p><span class='test1 test3 test2'>text 4</span></p>
</div>";
I just want to either empty or remove any class that has "test2" in it, so the result would be this:
<div>
<p><span class=''>text 1</span></p>
<p><span class=''>text 2</span></p>
<p><span class='test1'>text 3</span></p>
<p><span class=''>text 4</span></p>
</div>
or, if you're removing the element:
<div>
<p>text 1</p>
<p>text 2</p>
<p><span class='test1'>text 3</span></p>
<p>text 4</p>
</div>
I'm happy to use a regular expression or something like PHP Simple HTML DOM Parser, but I have no clue how to use it. With regex, I know how to find the element, but not the specific attribute associated with it, especially if there are multiple attributes as in my example above. Any ideas?
The DOMDocument class is a straightforward, easy-to-understand interface for working with your data in a DOM-like fashion. Querying the DOM with XPath selectors makes the task all the more trivial:
Clear All Classes
// Build our DOMDocument, and load our HTML
$doc = new DOMDocument();
$doc->loadHTML($html);
// Preserve a reference to our DIV container
$div = $doc->getElementsByTagName("div")->item(0);
// New-up an instance of our DOMXPath class
$xpath = new DOMXPath($doc);
// Find all elements whose class attribute has test2
$elements = $xpath->query("//*[contains(@class,'test2')]");
// Cycle over each, remove attribute 'class'
foreach ($elements as $element) {
// Empty out the class attribute value
$element->attributes->getNamedItem("class")->nodeValue = '';
// Or remove the attribute entirely
// $element->removeAttribute("class");
}
// Output the HTML of our container
echo $doc->saveHTML($div);
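For the second variant in the question (dropping the span but keeping its text), a possible DOMDocument sketch is to move each matching span's children up to its parent and then remove the empty span. This is only one way to do it. The shortened sample markup is an assumption, and the query uses the whole-word class test so that a class such as test20 would not match.

```php
<?php
// Shortened version of the sample string from the question.
$html = "<div>"
      . "<p><span class='test1 test2 test3'>text 1</span></p>"
      . "<p><span class='test1'>text 3</span></p>"
      . "</div>";

$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($doc);

// Match spans whose class list contains test2 as a whole word.
$query = "//span[contains(concat(' ', normalize-space(@class), ' '), ' test2 ')]";

// Copy the node list to an array first, since the tree is modified in the loop.
foreach (iterator_to_array($xpath->query($query)) as $span) {
    // Move every child of the span in front of it, then drop the empty span.
    while ($span->firstChild !== null) {
        $span->parentNode->insertBefore($span->firstChild, $span);
    }
    $span->parentNode->removeChild($span);
}

$div = $doc->getElementsByTagName('div')->item(0);
$result = $doc->saveHTML($div);
echo $result;
```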
using the PHP Simple HTML DOM Parser
Updated and tested!
You can get the simple_html_dom.php include from the above link or here.
for both cases:
include('../simple_html_dom.php');
$html = str_get_html("<div><p><span class='test1 test2 test3'>text 1</span></p>
<p><span class='test1 test2'>text 2</span></p>
<p><span class='test1'>text 3</span></p>
<p><span class='test1 test3 test2'>text 4</span></p></div>");
case 1:
foreach($html->find('span[class*="test2"]') as $e)
$e->class = '';
echo $html;
case 2:
foreach($html->find('span[class*="test2"]') as $e)
$e->parent()->innertext = $e->plaintext;
echo $html;
$notest2 = preg_replace(
"/class\s*=\s*'[^\']*test2[^\']*'/",
"class=''",
$src);
C.
You can use any DOM parser and iterate over every element. Check whether its class attribute contains the test2 class (with strpos()); if so, set the class attribute to an empty string.
You can also use regular expressions to do that, which is a much shorter way. Just find and replace (preg_replace()) using the following expression, adjusting the quote character to match your markup (the sample above uses single quotes): #class=".*?test2.*?"#is