PHP HTML parsing: I want to save parsed elements into an array - php

I'm trying to parse an HTML page and access some of its tags. I parse the heading tags (h1, h2, h3, etc.) and display the result indented according to each tag's level. Now I want to save the resulting data (the indented table of contents) into an array, along with the names of the tags. Please help me sort this out.
Here is my PHP code. I'm using the Simple HTML DOM parser.
include ("simple_html_dom.php");
session_start();
error_reporting(0);
$string = file_get_contents('test.php');
$tags = array(0 => '<h1', 1 => '<h2', 2 => '<h3', 3 => '<h4', 4 => '<h5', 5 => '<h6');
function parser($html, $needles = array()){
$positions = array();
foreach ($needles as $needle){
$lastPos = 0;
while (($lastPos = strpos($html, $needle, $lastPos))!== false)
{
$positions[] = $lastPos;
$lastPos = $lastPos + strlen($needle);
}
unset($needles[0]);
if(count($positions) > 0){
break;
}
}
if(count($positions) > 0){
for ($i = 0; $i < count($positions); $i++) {
?>
<div class="<?php echo $i; ?>" style="padding-left: 20px; font-size: 14px;">
<?php
if($i < count($positions)-1){
$temp = explode('</', substr($html, $positions[$i]+4));
$pos = strpos($temp[0], '>');
echo substr($temp[0], $pos);
parser(substr($html, $positions[$i]+4, $positions[$i+1]-$positions[$i]-4), $needles);
} else {
$temp = explode('</', substr($html, $positions[$i]+4));
$pos = strpos($temp[0], '>');
echo substr($temp[0], $pos+1);
parser(substr($html, $positions[$i]+4), $needles);
}
?>
</div>
<?php
}
} else {
// not found any position of a tag
}
}
parser($string, $tags);

If you wanted to do it using SimpleXML and XPath, there is a shorter and much more readable version you could try...
$xml = new SimpleXMLElement($string);
$tags = $xml->xpath("//h1 | //h2 | //h3 | //h4");
$data = [];
foreach ( $tags as $tag ) {
$elementData['name'] = $tag->getName();
$elementData['content'] = (string)$tag;
$data[] = $elementData;
}
print_r($data);
You can see the pattern in the XPath: it combines all of the elements you need. The // means "find at any level", and it is followed by the name of the element you want to find. The expressions are combined using |, which is the union ('or') operator. The same type of expression could easily be expanded to cover the full set of tags you need.
The program then loops over the elements found and builds the array one element at a time, taking the name and content of each and adding them to the $data array.
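As a rough sketch of that expansion, the XPath can be generated from a list of tag names instead of being typed out. The sample markup and the extra 'level' key are assumptions added for illustration; they are not part of the original answer.

```php
<?php
// Sample markup is an assumption; it must be well-formed XML for SimpleXML.
$string = '<html><body><h1>Intro</h1><h2>Setup</h2><h5>Note</h5><p>text</p></body></html>';

// Build "//h1 | //h2 | ... | //h6" from a list of names.
$names = ['h1', 'h2', 'h3', 'h4', 'h5', 'h6'];
$query = implode(' | ', array_map(function ($name) {
    return '//' . $name;
}, $names));

$xml = new SimpleXMLElement($string);
$data = [];
foreach ($xml->xpath($query) as $tag) {
    $data[] = [
        'name'    => $tag->getName(),
        'level'   => (int) substr($tag->getName(), 1), // "h2" -> 2
        'content' => (string) $tag,
    ];
}
print_r($data);
```

Keeping the numeric level next to the name makes it easier to rebuild the indentation later.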
Update:
If your file isn't well-formed XML, you may have to use DOMDocument and loadHTML. It is only a slight difference, but it is more tolerant of errors...
$string = file_get_contents("links.html");
$xml = new DOMDocument();
libxml_use_internal_errors(true);
$xml->loadHTML($string);
$xp = new DOMXPath($xml);
$tags = $xp->query("//h1 | //h2 | //h3 | //h4");
$data = [];
foreach ( $tags as $tag ) {
$elementData['name'] = $tag->tagName;
$elementData['content'] = $tag->nodeValue;
$data[] = $elementData;
}
print_r($data);
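The question also asks for the indented table of contents itself, so here is one possible sketch that nests each heading under the nearest shallower heading before it. The sample input, the 'children' key and the $open bookkeeping array are my own assumptions, not something from the answer above.

```php
<?php
// Assumed sample input; a real page would come from file_get_contents().
$string = '<h1>Guide</h1><h2>Install</h2><h3>Linux</h3><h2>Usage</h2><h1>FAQ</h1>';

libxml_use_internal_errors(true); // tolerate imperfect HTML
$doc = new DOMDocument();
$doc->loadHTML($string);
$xp = new DOMXPath($doc);

$toc = [];
// $open[$n] references the children list of the most recent heading of level $n;
// level 0 stands for the root list.
$open = [0 => &$toc];
foreach ($xp->query('//h1 | //h2 | //h3 | //h4 | //h5 | //h6') as $tag) {
    $level = (int) substr($tag->tagName, 1);
    // This heading closes every open heading of the same or a deeper level.
    foreach (array_keys($open) as $seen) {
        if ($seen >= $level) {
            unset($open[$seen]);
        }
    }
    // Attach to the nearest shallower heading that is still open.
    $parentLevel = max(array_keys($open));
    $open[$parentLevel][] = [
        'name'     => $tag->tagName,
        'content'  => trim($tag->nodeValue),
        'children' => [],
    ];
    $lastIndex = count($open[$parentLevel]) - 1;
    $open[$level] = &$open[$parentLevel][$lastIndex]['children'];
}
unset($open); // drop the references, $toc keeps the data
print_r($toc);
```

With this input, $toc holds two top-level entries (Guide and FAQ), and Install and Usage sit in Guide's 'children' list.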

Related

Extracting multiple strong tags using PHP Simple HTML DOM Parser

I have over 500 static pages whose content is structured this way:
<section>
Some text
<strong>Dynamic Title (Different on each page)</strong>
<strong>Author name (Different on each page)</strong>
<strong>Category</strong>
(<b>Content</b> <b>MORE TEXT HERE)</b>
</section>
And I need to extract the data as formatted below, using PHP Simple HTML DOM Parser
$title = <strong>Dynamic Title (Different on each page)</strong>
$author = <strong>Author name (Different on each page)</strong>
$category = <strong>Category</strong>
$content = (<b>Content</b> <b>MORE TEXT HERE</b>)
I have failed so far and can't get my head around it. I'd appreciate any advice or code snippet to help me get going.
EDIT 1,
I have now solved the part with strong tags using,
$html = file_get_html($url);
$links = array();
foreach($html->find('strong') as $a) {
$content[] = $a->innertext;
}
$title= $content[0];
$author= $content[1];
The only remaining issue is how to extract the content within the parentheses. Can a similar method be used?
OK, first you want to get all of the <section> tags.
Then you want to search through those again for the <strong> tags and <b> tags.
Something like this:
// Create DOM from URL or file
$html = file_get_html('http://www.example.com/');
// Find all <section> elements
foreach ($html->find('section') as $section) {
    $strong = array();
    // Collect the contents of the <strong> tags inside this <section>
    foreach ($section->find('strong') as $tag) {
        $strong[] = $tag->innertext;
    }
    $title = $strong[0];
    $author = $strong[1];
    $category = $strong[2];
}
To get the parts in parentheses - just get the b tag text and then add the () brackets.
Or if you're asking how to get parts in between the brackets - use explode then remove the closing bracket:
$pieces = explode("(", $title);
$different_on_each_page = str_replace(")","",$pieces[1]);
$html_code = 'html';
$dom = new \DOMDocument();
$dom->loadHTML($html_code);
$xpath = new \DOMXPath($dom);
$nodelist = $xpath->query("//strong");
for($i = 0; $i < $nodelist->length; $i++){
$nodelist->item($i)->nodeValue; //gives you the text inside
}
My final code that works now looks like this.
$html = file_get_html($url);
$links = array();
foreach($html->find('strong') as $a) {
$content[] = $a->innertext;
}
$title= $content[0];
$author= $content[1];
$category = $content[2];
$details = file_get_html($url)->plaintext;
$input = $details;
preg_match_all("/\(.*?\)/", $input, $matches);
print_r($matches[0]);
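For comparison, here is a minimal sketch of the same extraction done per section with PHP's built-in DOMDocument instead of Simple HTML DOM. The sample page is hypothetical, and its titles deliberately contain no parentheses of their own, since the pattern would otherwise match those first.

```php
<?php
// Hypothetical page; the titles contain no parentheses of their own.
$page = '<section>Some text
<strong>My Title</strong>
<strong>Jane Doe</strong>
<strong>News</strong>
(<b>Content</b> <b>MORE TEXT HERE</b>)
</section>';

libxml_use_internal_errors(true); // keep parser complaints about the markup quiet
$doc = new DOMDocument();
$doc->loadHTML($page);
libxml_clear_errors();

$records = [];
foreach ($doc->getElementsByTagName('section') as $section) {
    $strong = [];
    foreach ($section->getElementsByTagName('strong') as $tag) {
        $strong[] = trim($tag->textContent);
    }
    // textContent drops the <b> tags, leaving "(Content MORE TEXT HERE)".
    $content = preg_match('/\((.*?)\)/s', $section->textContent, $m) ? $m[1] : null;
    $records[] = [
        'title'    => $strong[0] ?? null,
        'author'   => $strong[1] ?? null,
        'category' => $strong[2] ?? null,
        'content'  => $content,
    ];
}
print_r($records);
```

Working section by section keeps each title, author, category and content together, which matters once a page holds more than one section.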

Parse html with regexp

I want to find all <h3> blocks in this example:
<h3>sdf</h3>
sdfsdf
<h3>sdf</h3>
32
<h2>fs</h2>
<h3>23sd</h3>
234
<h1>h1</h1>
(From one h3 to the next h3 or h2.) This regexp finds only the first h3 block:
~\<h3[^>]*\>[^>]+\<\/h3\>.+(?:\<h3|\<h2|\<h1)~is
I use the PHP function preg_match_all. (Quote from the docs: "After the first match is found, the subsequent searches are continued on from end of the last match.")
What do I have to modify in my regexp?
P.S.
<h3>1</h3>
1content
<h3>2</h3>
2content
<h2>h2</h2>
<h3>3</h3>
3content
<h1>h1</h1>
This content has to be parsed as:
[0] => <h3>1</h3>1content
[1] => <h3>2</h3>2content
[2] => <h3>3</h3>3content
with DOMDocument:
$dom = new DOMDocument();
@$dom->loadHTML($html);
$nodes = $dom->getElementsByTagName('body')->item(0)->childNodes;
$flag = false;
$results = array();
foreach ($nodes as $node) {
if ( $node->nodeType == XML_ELEMENT_NODE &&
preg_match('~^h(?:[12]|(3))$~i', $node->nodeName, $m) ):
if ($flag)
$results[] = $tmp;
if (isset($m[1])) {
$tmp = $dom->saveXML($node);
$flag = true;
} else
$flag = false;
elseif ($flag):
$tmp .= $dom->saveXML($node);
endif;
}
echo htmlspecialchars(print_r($results, true));
with regex:
preg_match_all('~<h3.*?(?=<h[123])~si', $html, $matches);
echo htmlspecialchars(print_r($matches[0], true));
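To check the lookahead version against the second sample from the question, a small self-contained run could look like this. The only additions of mine are the inline sample string and the step that strips line breaks for display.

```php
<?php
// Sample input copied from the question's second example.
$html = "<h3>1</h3>\n1content\n<h3>2</h3>\n2content\n<h2>h2</h2>\n<h3>3</h3>\n3content\n<h1>h1</h1>";

// Lazy match from each <h3 up to (but not including) the next <h1, <h2 or <h3.
preg_match_all('~<h3.*?(?=<h[123])~si', $html, $matches);

// Remove the line breaks so each block prints on one line.
$blocks = array_map(function ($block) {
    return str_replace("\n", '', $block);
}, $matches[0]);
print_r($blocks);
```

Because the lookahead does not consume the next heading, the following search can start right at it, which is what the original pattern was missing. One caveat: a final h3 block that is not followed by any other heading would not be matched by this pattern.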
You shouldn't use Regex to parse HTML if there is any nesting involved.
Regex
(<(h\d)>.*?<\/\2>)[\r\n]([^\r\n<]+)
Replacement
\1\3
or
$1$3
http://regex101.com/r/uQ3uC2
preg_match_all('/<h3>(.*?)<\/h3>/is', $stringHTML, $matches);

extracting anchor values hidden in div tags

From an HTML page I need to extract the value of v from every anchor link. Each anchor link is nested inside about 5 div tags.
<a href="/watch?v=value to be retrived&list=blabla&feature=plpp_play_all">
Each v value has 11 characters. For now I am trying to read the file character by character, like this:
<?php
$file=fopen("xx.html","r") or exit("Unable to open file!");
$d='v';
$dd='=';
$vd=array();
while (!feof($file))
{
$f=fgetc($file);
if($f==$d)
{
$ff=fgetc($file);
if ($ff==$dd)
{
$idea='';
for($i=0;$i<=10;$i++)
{
$sData = fgetc($file);
$id=$id.$sData;
}
array_push($vd, $id);
That is, I read each character of v into the $sData variable and append it to $id, so as to get those 11 characters as a string ($id).
The problem is that searching for 'v=' through the entire HTML file and, when it is found, reading the 11 characters and pushing them into the $vd array is very slow; it takes a considerable amount of time. Please help me make this more efficient.
<?php
function substring(&$string,$start,$end)
{
$pos = strpos(">".$string,$start);
if(! $pos) return "";
$pos--;
$string = substr($string,$pos+strlen($start));
$posend = strpos($string,$end);
$toret = substr($string,0,$posend);
$string = substr($string,$posend);
return $toret;
}
$contents = @file_get_contents("xx.html");
$old="";
$videosArray=array();
while ($old <> $contents)
{
$old = $contents;
$v = substring($contents,"?v=","&");
if($v) $videosArray[] = $v;
}
//$videosArray is array of v's
?>
I would rather parse the HTML with SimpleXML and XPath:
// Get your page HTML string
$html = file_get_contents('xx.html');
// As per comment by Gordon to suppress invalid markup warnings
libxml_use_internal_errors(true);
// Create SimpleXML object
$doc = new DOMDocument();
$doc->strictErrorChecking = false;
$doc->loadHTML($html);
$xml = simplexml_import_dom($doc);
// Find a nodes
$anchors = $xml->xpath('//a[contains(@href, "v=")]');
foreach ($anchors as $a)
{
$href = (string)$a['href'];
$url = parse_url($href);
parse_str($url['query'], $params);
// $params['v'] contains what we need
$vd[] = $params['v']; // push into array
}
// Clear invalid markup error buffer
libxml_clear_errors();
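If a full DOM parse is more than the task needs, the 11-character rule from the question can also be expressed as a single pattern. This is only a sketch: the sample fragment and IDs are made up, and it assumes the values consist solely of letters, digits, _ and -, which the question does not confirm.

```php
<?php
// Hypothetical page fragment; the 11-character IDs are made up for illustration.
$html = '<div><a href="/watch?v=abcDEF12345&list=blabla&feature=plpp_play_all">one</a></div>'
      . '<div><a href="/watch?list=blabla&amp;v=zyx_WV-9876">two</a></div>';

// v= must follow ?, & or &amp; and be exactly 11 characters from the assumed set.
preg_match_all('/(?:[?&]|&amp;)v=([A-Za-z0-9_-]{11})(?![A-Za-z0-9_-])/', $html, $matches);
$vd = $matches[1];
print_r($vd);
```

Unlike the DOM version, this scans the raw text, so it would also pick up v= values that appear outside of anchor tags.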

Return in array

I have these php lines:
<?php
$start_text = '<username="';
$end_text = '" userid=';
$source = file_get_contents('http://mysites/users.xml');
$start_pos = strpos($source, $start_text) + strlen($start_text);
$end_pos = strpos($source, $end_text) - $start_pos;
$found_text = substr($source, $start_pos, $end_pos);
echo $found_text;
?>
I want to see all the names from the entire file, but it shows me just the first name.
I think it is something like foreach ($found_text as $username)..., but here I am stuck.
Update from OP post, below:
<?php
$xml = simplexml_load_file("users.xml");
foreach ($xml->children() as $child)
{
foreach($child->attributes() as $a => $b)
{
echo $a,'="',$b,"\"</br>";
}
foreach ($child->children() as $child2)
{
foreach($child2->attributes() as $c => $d)
{
echo "<font color='red'>".$c,'="',$d,"\"</font></br>";
}
}
}
?>
With this code, I receive all the details about my users, but of all these details I want to see just 2 or 3.
Now I see:
name="xxx"
type="default"
can_accept="true"
can_cancel="false"
image="avatars/trophy.png"
title="starter"
........etc
Other details from the same user, shown in red (as defined in the script):
reward_value="200"
reward_qty="1"
expiration_date="12/07/2012"
.....etc
What do I want to see?
For example, the first line from the first group (name="xxx") and expiration_date="12/07/2012" from the second group.
You will need to repeat the loop, using the 3rd parameter, offset, of the strpos function. That way, you can look for a new name each time.
Something like this (untested)
<?php
$start_text = '<username="';
$end_text = '" userid=';
$source = file_get_contents('http://mysites/users.xml');
$offset = 0;
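while (false !== ($start_pos = strpos($source, $start_text, $offset)))
{
    $start_pos += strlen($start_text);
    // Look for the end marker after the start marker, not from the old offset
    $end_pos = strpos($source, $end_text, $start_pos);
    if ($end_pos === false) break; // unterminated entry: stop instead of looping forever
    $offset = $end_pos + strlen($end_text);
    $text_length = $end_pos - $start_pos;
    $found_text = substr($source, $start_pos, $text_length);
    echo $found_text, "\n";
}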
?>
You should either use XMLReader or DOM or SimpleXML to read XML files. If you don't see the necessity, try the following regular expressions approach to retrieve all usernames:
<?php
$xml = '<xml><username="hello" userid="123" /> <something /> <username="foobar" userid="333" /></xml>';
if (preg_match_all('#<username="(?<name>[^"]+)"#', $xml, $matches, PREG_PATTERN_ORDER)) {
var_dump($matches['name']);
} else {
echo 'no <username="" found';
}
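For the updated question (showing only name and expiration_date), a SimpleXML sketch might look like the following. The XML layout here is purely an assumption, since the real users.xml is not shown; only the attribute names are taken from the output listed in the question.

```php
<?php
// Assumed structure: the element names "users", "user" and "reward" are invented.
$source = '<users>
  <user name="xxx" type="default" title="starter">
    <reward reward_value="200" reward_qty="1" expiration_date="12/07/2012"/>
  </user>
  <user name="yyy" type="default" title="pro">
    <reward reward_value="50" reward_qty="2" expiration_date="01/02/2013"/>
  </user>
</users>';

$xml = simplexml_load_string($source);
$rows = [];
foreach ($xml->children() as $child) {
    // Read one attribute directly instead of looping over all of them.
    $row = ['name' => (string) $child['name']];
    foreach ($child->children() as $child2) {
        if (isset($child2['expiration_date'])) {
            $row['expiration_date'] = (string) $child2['expiration_date'];
        }
    }
    $rows[] = $row;
}
print_r($rows);
```

Accessing attributes by name with array syntax avoids printing every attribute and then filtering the output afterwards.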

Strip tag with class in PHP

So I need to strip the span tags of class tip.
So that would be <span class="tip"> and the corresponding </span>, and everything inside it...
I suspect a regular expression is needed but I terribly suck at this.
Laugh...
<?php
$string = 'April 15, 2003';
$pattern = '/(\w+) (\d+), (\d+)/i';
$replacement = '${1}1,$3';
echo preg_replace($pattern, $replacement, $string);
?>
Gives no error... But
<?php
$str = preg_replace('<span class="tip">.+</span>', "", '<span class="rss-title"></span><span class="rss-link">linkylink</span><span class="rss-id"></span><span class="rss-content"></span><span class=\"rss-newpost\"></span>');
echo $str;
?>
Gives me the error:
Warning: preg_replace() [function.preg-replace]: Unknown modifier '.' in <A FILE> on line 4
previously, the error was at the ); in the 2nd line, but now.... >.>
This is the "proper" method (adapted from this answer).
Input:
<?php
$str = '<div>lol wut <span class="tip">remove!</span><span>don\'t remove!</span></div>';
?>
Code:
<?php
function recurse($doc, $parent) {
    if (!$parent->hasChildNodes())
        return;
    for ($i = 0; $i < $parent->childNodes->length; ) {
        $elm = $parent->childNodes->item($i);
        // getAttribute() returns "" when the attribute is missing,
        // so spans without a class are handled safely
        if ($elm->nodeName == "span" && $elm->getAttribute("class") == "tip") {
            $parent->removeChild($elm);
            continue; // the list shrank, so do not advance $i
        }
        recurse($doc, $elm);
        $i++;
    }
}
// Load in the DOM (remembering that XML requires one root node)
$doc = new DOMDocument();
$doc->loadXML("<document>" . $str . "</document>");
// Iterate the DOM
recurse($doc, $doc->documentElement);
// Output the result
foreach ($doc->childNodes->item(0)->childNodes as $node) {
echo $doc->saveXML($node);
}
?>
Output:
<div>lol wut <span>don't remove!</span></div>
A simple regular expression like:
<span class="tip">.+</span>
won't work reliably. A greedy .+ runs on to the last </span> on the line, and a lazy .+? stops at the first one, so if another span is opened and closed inside the tip span (or follows it), the match will not end at the tip span's own closing tag. DOM-based tools like the one linked in the comments will really provide a more reliable answer.
As per my comment below, you need to add pattern delimiters when working with regular expressions in PHP.
<?php
$str = preg_replace('/<span class="tip">.+<\/span>/', "", '<span class="rss-title"></span><span class="rss-link">linkylink</span><span class="rss-id"></span><span class="rss-content"></span><span class=\"rss-newpost\"></span>');
echo $str;
?>
may be moderately more successful. Please take a look at the documentation page for the function in question.
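As an illustration of the DOM-based route, the removal can be sketched with DOMDocument and DOMXPath. This is a different technique from the regex above; the sample string is made up, and the query only matches spans whose class attribute is exactly "tip".

```php
<?php
// Made-up sample with a nested span inside the tip span.
$str = '<div>lol wut <span class="tip">remove! <span>nested</span></span><span>don\'t remove!</span></div>';

libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($str); // libxml wraps the fragment in <html><body>
$xp = new DOMXPath($doc);

// Remove each tip span together with everything inside it.
foreach (iterator_to_array($xp->query('//span[@class="tip"]')) as $span) {
    $span->parentNode->removeChild($span);
}

// Serialize only the children of <body> to get the fragment back.
$body = $doc->getElementsByTagName('body')->item(0);
$result = '';
foreach ($body->childNodes as $node) {
    $result .= $doc->saveHTML($node);
}
echo $result;
```

Because the parser tracks nesting, the inner span is removed along with its parent and the sibling span is left untouched.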
Now without regexp, and without heavy XML parsing:
$html = ' ... <span class="tip"> hello <span id="x"> man </span> </span> ... ';
$tag = '<span class="tip">';
$tag_close = '</span>';
$tag_familly = '<span';
$tag_len = strlen($tag);
$p1 = -1;
$p2 = 0;
while ( ($p2!==false) && (($p1=strpos($html, $tag, $p1+1))!==false) ) {
// the tag is found, now we will search for its corresponding closing tag
$level = 1;
$p2 = $p1;
$continue = true;
while ($continue) {
$p2 = strpos($html, $tag_close, $p2+1);
if ($p2===false) {
// error in the html contents, the analysis cannot continue
echo "ERROR in html contents";
$continue = false;
$p2 = false; // will stop the loop
} else {
$level = $level -1;
$x = substr($html, $p1+$tag_len, $p2-$p1-$tag_len);
$n = substr_count($x, $tag_familly);
if ($level+$n<=0) $continue = false;
}
}
if ($p2!==false) {
// delete the pair of tags, the farthest one first
$html = substr_replace($html, '', $p2, strlen($tag_close));
$html = substr_replace($html, '', $p1, $tag_len);
}
}
