How can I parse HTML in batches using xpath [PHP]?


I tried all sorts of things but couldn't find a solution.
I want to retrieve elements from html code using xpath in php.
Ex:
<div class='student'>
<div class='name'>Michael</div>
<div class='age'>26</div>
</div>
<div class='student'>
<div class='name'>Joseph</div>
<div class='age'>27</div>
</div>
I want to retrieve the information and put it in an array as follows:
$student[0]['name'] = 'Michael';
$student[0]['age'] = 26;
$student[1]['name'] = 'Joseph';
$student[1]['age'] = 27;
In other words, I want each age to stay with its matching name.
I tried the following:
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpathDom = new DOMXPath($dom);
$homepostcontentNodes = $xpathDom->query("//*[contains(@class, 'student')]//*[contains(@class, 'name')]");
However, this only returns the 'name' nodes.
How can I get the matching age nodes?

Of course it only grabs the name nodes: that is exactly what the query asks for!
What you need to do takes two steps:
Pick out all the student nodes
For each student node, pick out the columns
This is a pretty standard step in linearization of data, and the XPath queries are simple:
Step 1
You pretty much have it:
$studentNodes = $xpathDom->query("//div[contains(@class, 'student')]");
This will return all your student nodes.
Step 2
This is where the magic happens. We have our nodes, and we can loop through them (DOMNodeList implements Traversable, so a foreach loop works). What we need to figure out is how to find each node's children...
...Oh wait. DOMNode has a method called getNodePath, which returns the full, direct XPath path to the node. That lets us simply append /div to select all the div elements that are direct children of the node!
Another quick foreach, and we get this code:
$studentNodes = $xpathDom->query("//div[contains(@class, 'student')]");
$result = array();
foreach ($studentNodes as $v) {
    // One row per student node
    $r = array();
    // Direct div children of this student node
    $columns = $xpathDom->query($v->getNodePath() . "/div");
    foreach ($columns as $v2) {
        // Use the child's class attribute ('name', 'age') as the array key
        $r[$v2->getAttribute("class")] = $v2->textContent;
    }
    $result[] = $r;
}
var_dump($result);
Full fiddle: http://codepad.viper-7.com/t868Wh
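As an alternative sketch (not the answer's original method), DOMXPath::query() accepts a context node as its second argument, so a relative expression such as ./div can be evaluated against each student node without building a path string. The sample markup is copied from the question; the variable names $students and $row are illustrative only.

```php
<?php
// Sample markup copied from the question.
$html = <<<'HTML'
<div class='student'>
<div class='name'>Michael</div>
<div class='age'>26</div>
</div>
<div class='student'>
<div class='name'>Joseph</div>
<div class='age'>27</div>
</div>
HTML;

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$students = array();
foreach ($xpath->query("//div[contains(@class, 'student')]") as $studentNode) {
    $row = array();
    // "./div" is evaluated relative to $studentNode (the context node).
    foreach ($xpath->query("./div", $studentNode) as $column) {
        // The child's class ('name' or 'age') becomes the array key.
        $row[$column->getAttribute('class')] = trim($column->textContent);
    }
    $students[] = $row;
}
print_r($students);
```

Note that contains(@class, 'student') is a substring test, so it would also match a class such as "students"; that is acceptable for this markup but worth keeping in mind on a real page.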

Related

Add "first" and "last" classes to strings containing one or more <p> tags in PHP

I have two strings I'm outputting to a page
# string 1
<p>paragraph1</p>
# string 2
<p>paragraph1</p>
<p>paragraph2</p>
<p>paragraph3</p>
What I'd like to do is turn them into this
# string 1
<p class="first last">paragraph1</p>
# string 2
<p class="first">paragraph1</p>
<p>paragraph2</p>
<p class="last">paragraph3</p>
I'm essentially trying to replicate the CSS equivalent of first-child and last-child, but I have to physically add the classes to the tags as I cannot use CSS. The strings are part of an mPDF document and nth-child is not supported on <p> tags.
I can iterate through the strings easily enough to split the <p> tags into an array
$dom = new DOMDocument();
$question_paragraphs = array();
$dom->loadHTML($q['quiz_question']);
foreach ($dom->getElementsByTagName('p') as $node) {
    $question_paragraphs[] = $dom->saveHTML($node);
}
But once I have that array I'm struggling to find a nice clean way to append and prepend the first and last class to either end of the array. I end up with lots of ugly loops and array splicing that feels very messy.
I'm wondering if anyone has any slick ways to do this? Thank you :)
Edit Note: The two strings are outputting within a while(array) loop as they're stored in a database.

You can index the node list with the item() method, so you can add the attribute to the first and last elements in the list.
$dom = new DOMDocument();
$question_paragraphs = array();
$dom->loadHTML($q['quiz_question']);
$par = $dom->getElementsByTagName('p');
if ($par->length == 1) {
    $par->item(0)->setAttribute("class", "first last");
} elseif ($par->length > 1) {
    $par->item(0)->setAttribute("class", "first");
    $par->item($par->length - 1)->setAttribute("class", "last");
}
foreach ($par as $node) {
    $question_paragraphs[] = $dom->saveHTML($node);
}
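One caveat: setAttribute("class", ...) replaces whatever class the paragraph already had. If the stored strings can contain classes of their own, a small helper that appends instead of overwriting might look like the sketch below. The function name addClass and the sample markup are assumptions made for illustration, not part of the original answer.

```php
<?php
// Hypothetical helper: append a class name, keeping any classes already present.
function addClass(DOMElement $el, string $class): void
{
    $existing = preg_split('/\s+/', $el->getAttribute('class'), -1, PREG_SPLIT_NO_EMPTY);
    if (!in_array($class, $existing, true)) {
        $existing[] = $class;
    }
    $el->setAttribute('class', implode(' ', $existing));
}

// Illustrative input; in the question this would be $q['quiz_question'].
$dom = new DOMDocument();
$dom->loadHTML('<p class="intro">paragraph1</p><p>paragraph2</p><p>paragraph3</p>');
$par = $dom->getElementsByTagName('p');

if ($par->length > 0) {
    // With a single paragraph, both calls hit the same node: "first last".
    addClass($par->item(0), 'first');
    addClass($par->item($par->length - 1), 'last');
}

$question_paragraphs = array();
foreach ($par as $node) {
    $question_paragraphs[] = $dom->saveHTML($node);
}
print_r($question_paragraphs);
```

Because the helper is applied to the first and last items unconditionally, the separate length == 1 branch is no longer needed.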

Simple html dom parser get tr from table

I am trying to scrape http://spys.one/free-proxy-list/ but I only want to get the "Proxy by ip:port" column.
I checked the website and there are 3 tables.
Can anyone help me out?
<?php
require "scrapper/simple_html_dom.php";
$html=file_get_html("http://spys.one/free-proxy-list/");
$html=new simple_html_dom($html);
$rows = array();
$table = $html->find('table',3);
var_dump($table);

Try the below script. It should fetch you only the required items and nothing else:
<?php
include 'simple_html_dom.php';
$url = "http://spys.one/free-proxy-list/";
$html = file_get_html($url);
foreach ($html->find("table[width='65%'] tr[onmouseover]") as $file) {
    $data = $file->find('td', 0)->plaintext;
    echo $data . "<br/>";
}
?>
It produces output like:
176.94.2.84
178.150.141.93
124.16.84.208
196.53.99.7
31.146.161.238

I really don't know what your Simple HTML DOM library does. Anyway, nowadays PHP has everything on board that you need for parsing specific DOM elements. Just use PHP's own DOMXPath class for querying DOM elements.
Here's a short example for getting the first column of a table.
$dom = new \DOMDocument();
// loadHTMLFile() reads and parses a file or URL; loadHTML() expects an HTML string
$dom->loadHTMLFile('https://your.url.goes.here');
$xpath = new \DOMXPath($dom);
// in each row of the table with class "attributes", query the first td with class "value"
$elements = $xpath->query('//table[@class="attributes"]//tr/td[@class="value"][1]');
// iterate through all found td elements
foreach ($elements as $element) {
    echo $element->nodeValue;
}
This is only one possible example. It does not solve your exact issue with http://spys.one/free-proxy-list/, but it shows how you could easily get the first column of a specific table. The only thing left to do is to find the right query for the table you want in the DOM of the given site. Because that DOM is a pretty complex table layout from ages ago, and the table you want to parse does not have a unique id or anything similar, you will have to work that out yourself.
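To make the first-column idea concrete, here is a minimal, self-contained sketch that runs against an inline table instead of a live URL. The table markup, its "attributes" class, and the addresses in it are made up for illustration; the real page is structured differently, so the query would need adjusting.

```php
<?php
// Hypothetical sample table; the real page has a different structure.
$html = <<<'HTML'
<table class="attributes">
<tr><td class="value">10.0.0.1:8080</td><td>HTTP</td></tr>
<tr><td class="value">10.0.0.2:3128</td><td>HTTPS</td></tr>
</table>
HTML;

libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$firstColumn = array();
// td[1] selects the first td child of every row, i.e. the first column.
foreach ($xpath->query('//table[@class="attributes"]//tr/td[1]') as $cell) {
    $firstColumn[] = trim($cell->textContent);
}
print_r($firstColumn);
```

The predicate [1] applies per row here because it is attached to the td step; wrapping the whole expression in parentheses before [1] would instead return only the very first matching cell in the document.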

Simple HTML DOM Parser - find class with random number

I'm trying to scrape data from a website. I'm stuck on the ratings.
They have something like this:
<div class="rating-static rating-10 margin-top-none margin-bottom-sm"></div>
<div class="rating-static rating-13 margin-top-none margin-bottom-sm"></div>
<div class="rating-static rating-46 margin-top-none margin-bottom-sm"></div>
Here rating-10 is actually one star, rating-13 is two stars, and rating-46 will be five stars in my script.
The rating can range from 0 to 50.
My plan is to create a switch: if the class number is in the range 1-10, I will know that it is one star, 11-20 is two stars, and so on.
Any ideas or help will be appreciated.

Try this:
<?php
$data = '<div class="rating-static rating-10 margin-top-none margin-bottom-sm"></div>';
$dom = new DOMDocument;
$dom->loadHTML($data);
// Take the first div and split its class attribute on spaces
$div = $dom->getElementsByTagName('div')->item(0);
$div_class = $div->getAttribute('class');
$final_data = explode(" ", $div_class);
echo $final_data[1]; // the second class name, rating-10
?>
This prints the second class name, which is rating-10 for the sample above.

I had a similar project; this should be the way to do it if you want to parse the whole HTML page:
$dom = new DOMDocument();
$dom->loadHTML($html); // The HTML source of the website
foreach ($dom->getElementsByTagName('div') as $node) {
    // getAttribute("class") returns the whole class string, so test for a substring
    if (strpos($node->getAttribute("class"), "rating-static") !== false) {
        $array = explode(" ", $node->getAttribute("class"));
        $ratingArray = explode("-", $array[1]); // $array[1] is rating-10
        // $ratingArray[1] would be 10
        // do whatever you like with the information
    }
}
I haven't tested this script. Note that getAttribute("class") returns all of the element's classes as one string, so comparing it to "rating-static" with == would never match here; that is why the condition is a strpos check:
if (strpos($node->getAttribute("class"), "rating-static") !== false)
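The switch the question plans can be replaced by a little arithmetic. The sketch below assumes the mapping stated in the question (1-10 is one star, 11-20 is two stars, and so on) and pulls the number out of the rating-N class with a regex. The helper name starsFromClass is hypothetical.

```php
<?php
// Hypothetical helper: map a "rating-N" class (N from 0 to 50) to a star count.
function starsFromClass(string $classAttr): ?int
{
    // Match a whole class token of the form rating-<digits>.
    if (!preg_match('/(?:^|\s)rating-(\d+)(?:\s|$)/', $classAttr, $m)) {
        return null; // no rating-N class on this element
    }
    // 1-10 => 1, 11-20 => 2, ..., 41-50 => 5; 0 stays 0.
    return (int) ceil(((int) $m[1]) / 10);
}

// Sample markup copied from the question.
$html = '<div class="rating-static rating-10 margin-top-none margin-bottom-sm"></div>'
      . '<div class="rating-static rating-13 margin-top-none margin-bottom-sm"></div>'
      . '<div class="rating-static rating-46 margin-top-none margin-bottom-sm"></div>';

$dom = new DOMDocument();
$dom->loadHTML($html);

$stars = array();
foreach ($dom->getElementsByTagName('div') as $node) {
    $count = starsFromClass($node->getAttribute('class'));
    if ($count !== null) {
        $stars[] = $count;
    }
}
print_r($stars);
```

Matching the class by pattern rather than by position means the result does not depend on rating-N being the second class in the attribute.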

FYI, try using QueryPath for future parsing needs. It's just a wrapper around the PHP DOM parser and works really, really well.

How to get ID using a specific word in regex?

My string:
<div class="sect1" id="s9781473910270.i101">
<div class="sect2" id="s9781473910270.i102">
<h1 class="title">1.2 Summations and Products[label*summation]</h1>
<p>text</p>
</div>
</div>
<div class="sect1" id="s9781473910270.i103">
<p>sometext [ref*summation]</p>
</div>
<div class="figure" id="s9781473910270.i220">
<div class="metadata" id="s9781473910270.i221">
</div>
<p>fig1.2 [label*somefigure]</p>
<p>sometext [ref*somefigure]</p>
</div>
Objective: 1. In the string above, label*string and ref*string are cross-references. Each [ref*string] needs to be replaced with an a element that has class and href attributes. The href is the id of the div in which the related label* resides, and the class of the a element is derived from the class of that div.
As mentioned above, the a element's class and href come from the class name and id of the enclosing div. But if a div with class="metadata" exists, it must be ignored: its class name and id should not be used.
Expected output:
<div class="sect1" id="s9781473910270.i101">
<div class="sect2" id="s9781473910270.i102">
<h1 class="title">1.2 Summations and Products[label*summation]</h1>
<p>text</p>
</div>
</div>
<div class="sect1" id="s9781473910270.i103">
<p>sometext <a class="section-ref" href="s9781473910270.i102">1.2</a></p>
</div>
<div class="figure" id="s9781473910270.i220">
<div class="metadata" id="s9781473910270.i221">
<p>fig1.2 [label*somefigure]</p>
</div>
<p>sometext <a class="fig-ref" href="s9781473910270.i220">fig 1.2</a></p>
</div>
How can I do this in a simpler way, without using a DOM parser?
My idea is to store each label* string and its id in an array, then loop over the ref* strings and match them against the label* strings; when a string matches, the related id and class should replace the ref* string.
So I have tried this regex to get the label* strings and their related id and class name.

This approach uses the HTML structure to retrieve the needed elements with DOMXPath. Regexes are then used in a second step to extract information from text nodes or attributes:
$classRel = ['sect2'  => 'section-ref',
             'figure' => 'fig-ref'];
libxml_use_internal_errors(true);
$dom = new DOMDocument;
$dom->loadHTML($html); // or $dom->loadHTMLFile($url);
$xp = new DOMXPath($dom);
// make a custom php function available for the XPath query
// (it isn't really necessary, but it is more rigorous than writing
// "contains(@class, 'myClass')" )
$xp->registerNamespace("php", "http://php.net/xpath");
function hasClass($classNode, $className) {
    if (!empty($classNode))
        return in_array($className, preg_split('~\s+~', $classNode[0]->value, -1, PREG_SPLIT_NO_EMPTY));
    return false;
}
$xp->registerPHPFunctions('hasClass');
// The XPath query will find the first ancestor of a text node with '[label*'
// that is a div tag with an id and a class attribute,
// if the class attribute doesn't contain the "metadata" class.
$labelQuery = <<<'EOD'
//text()[contains(., 'label*')]
/ancestor::div
[@id and @class and not(php:function('hasClass', @class, 'metadata'))][1]
EOD;
$idNodeList = $xp->query($labelQuery);
$links = [];
// For each div node, a new link node is created in the associative array $links.
// The keys are labels.
foreach ($idNodeList as $divNode) {
    // The pattern extracts the first text part in group 1 and the label in group 2
    if (preg_match('~(\S+) .*? \[label\* ([^]]+) ]~x', $divNode->textContent, $m)) {
        $links[$m[2]] = $dom->createElement('a');
        $links[$m[2]]->setAttribute('href', $divNode->getAttribute('id'));
        $links[$m[2]]->setAttribute('class', $classRel[$divNode->getAttribute('class')]);
        $links[$m[2]]->nodeValue = $m[1];
    }
}
if ($links) { // if $links is empty no need to do anything
    $refNodeList = $xp->query("//text()[contains(., '[ref*')]");
    foreach ($refNodeList as $refNode) {
        // split the text on the square-bracket parts; the reference name is preserved in a capture
        $parts = preg_split('~\[ref\*([^]]+)]~', $refNode->nodeValue, -1, PREG_SPLIT_DELIM_CAPTURE);
        // create a fragment to receive text parts and links
        $frag = $dom->createDocumentFragment();
        foreach ($parts as $k => $part) {
            if ($k % 2 && isset($links[$part])) { // delimiters are always odd items
                $clone = $links[$part]->cloneNode(true);
                $frag->appendChild($clone);
            } elseif ($part !== '') {
                $frag->appendChild($dom->createTextNode($part));
            }
        }
        $refNode->parentNode->replaceChild($frag, $refNode);
    }
}
$result = '';
$childNodes = $dom->getElementsByTagName('body')->item(0)->childNodes;
foreach ($childNodes as $childNode) {
    $result .= $dom->saveXML($childNode);
}
echo $result;

This is not a task for regular expressions. Regular expressions are (usually) for regular languages, and what you want to do is work on a context-sensitive language (referencing an identifier that has been declared before).
So you should definitely go with a DOM parser. The algorithm for this would be very easy, because you can operate on one node and its children.
So the theoretical answer to your question is: you can't. Though it might work out, in some crappy way, with the many regex extensions.

Simple HTML DOM getting all attributes from a tag

Sort of a two part question but maybe one answers the other. I'm trying to get a piece of information out of an
<div id="foo">
<div class="bar"><a data1="xxxx" data2="xxxx" href="http://foo.bar">Inner text"</a>
<div class="bar2"><a data3="xxxx" data4="xxxx" href="http://foo.bar">more text"</a>
Here is what I'm using now.
$articles = array();
$html = file_get_html('http://foo.bar');
foreach ($html->find('div[class=bar] a') as $a) {
    $articles[] = array($a->href, $a->innertext);
}
This works perfectly to grab the href and the inner text from the first div class. I tried adding a $a->data1 to the foreach but that didn't work.
How do I grab those inner data attributes at the same time as I grab the href and innertext?
Also, is there a good way to get both classes with one statement? I assume I could build the find off of the id and grab all the div information.
Thanks

To grab all those attributes, you should first inspect the parsed element, like this:
foreach ($html->find('div[class=bar] a') as $a) {
    var_dump($a->attr);
}
...and see if those attributes exist. They don't seem to be valid HTML, so maybe the parser discards them.
If they exist, you can read them like this:
foreach ($html->find('div[class=bar] a') as $a) {
    $article = array($a->href, $a->innertext);
    if (isset($a->attr['data1'])) {
        $article['data1'] = $a->attr['data1'];
    }
    if (isset($a->attr['data2'])) {
        $article['data2'] = $a->attr['data2'];
    }
    //...
    $articles[] = $article;
}
To get both classes you can use a multiple selector, separated by a comma:
foreach($html->find('div[class=bar] a, div[class=bar2] a') as $a){
...
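For comparison, here is a sketch of the same task with PHP's built-in DOM extension rather than Simple HTML DOM (a swapped-in technique, not what the answer above uses). It collects every attribute of each link without naming them in advance. The markup is adapted from the question, with the tags closed and distinct placeholder values so the result is easy to read.

```php
<?php
// Markup adapted from the question, with the tags closed.
$html = <<<'HTML'
<div id="foo">
<div class="bar"><a data1="xxxx" data2="yyyy" href="http://foo.bar">Inner text</a></div>
<div class="bar2"><a data3="zzzz" data4="wwww" href="http://foo.bar">more text</a></div>
</div>
HTML;

libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$articles = array();
// One query covers both inner divs, whatever their class.
foreach ($xpath->query('//div[@id="foo"]/div/a') as $a) {
    $article = array('text' => $a->textContent);
    // $a->attributes is a DOMNamedNodeMap of every attribute on the element.
    foreach ($a->attributes as $attr) {
        $article[$attr->nodeName] = $attr->nodeValue;
    }
    $articles[] = $article;
}
print_r($articles);
```

Building the query off the outer id, as the question suggests, also answers the "both classes in one statement" part.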

I know this question is old, but the OP asked how they could get all the attributes in one statement. I just did this for a project I'm working on.
You can get all the attributes for an element with the getAllAttributes() method. The results are automatically stored in an array property called attr.
In the example below I am grabbing all links but you can use this with whatever you want. NOTE: This also works with data- attributes. So if there is an attribute called data-url it will be accessible with $e->attr['data-url'] after you run the getAllAttributes method.
In your case, the attributes you're looking for will be $e->attr['data1'] and $e->attr['data2']. Hope this helps someone, if not the OP.
Get all Attributes
$html = file_get_html('somefile.html');
foreach ($html->find('a') as $e) { // used the a tag here, but use whatever you want
    $e->getAllAttributes();
    // testing that it worked
    print_r($e->attr);
}

Check this code
<?php
$html = file_get_html('somefile.html');
foreach ($html->find('a') as $e) {
    $filter = $e->getAttribute('data-filter-string');
}
?>

$data1 = $html->find('.bar > a', 0)->attr['data1'];
$data2 = $html->find('.bar > a', 0)->attr['data2'];
