Find and separate the HTML blocks to an array - php

First, the idea: almost any CMS or simple website has repeating blocks, such as the list of articles on the main page of a WordPress site, where each article is shown in a block of information: title, author, content, date, etc.
So the main question is how to find and separate such blocks of HTML and append each of them to an array.
My first thought is that they need to be stripped of classes, ids and styles.
Step 1:
<div id="box1">
<h3 class="title_style">Title1</h3>
<p>content for box1</p>
<div class="author">Author Name1<span class="style_date">date1</span>any text</div>
</div>
<div id="box2">
<h3 class="title_style">Title2</h3>
<p>content for box2</p>
<div class="author">Author Name2<span class="style_date">date2</span>any text2</div>
</div>
to
<div>
<h3>Title1</h3>
<p>content for box1</p>
<div>Author Name1<span>date1</span>any text</div>
</div>
<div>
<h3>Title2</h3>
<p>content for box2</p>
<div>Author Name2<span>date2</span>any text2</div>
</div>
Step 2:
I need to find each block and write it to an array, so that I can put each block into a table row like this (note that such blocks are present on almost any site, so it doesn't matter which tags they use; they just repeat with different content and attributes, and only the structure is the same):
<table>
<tr id="block1">
<td>Title1</td>
<td>content for box1</td>
<td>Author Name1</td>
<td>date1</td>
<td>any text</td>
</tr>
<tr id="block2">
<td>Title2</td>
<td>content for box2</td>
<td>Author Name2</td>
<td>date2</td>
<td>any text2</td>
</tr>
</table>
Any ideas? I need the logic for how to do this, not the code itself.

You can walk the DOM of the document using PHP's DOMDocument class.
So you can do something like this:
$str = <<<STR
<div id="box1">
<h3 class="title_style">Title1</h3>
<p>content for box1</p>
<div class="author">Author Name1<span class="style_date">date1</span>any text</div>
</div>
<div id="box2">
<h3 class="title_style">Title2</h3>
<p>content for box2</p>
<div class="author">Author Name2<span class="style_date">date2</span>any text2</div>
</div>
STR;
$dom = new DOMDocument();
$dom->loadHTML($str);
$divs = $dom->getElementsByTagName('div');
foreach ($divs as $div) {
//read child elements
}
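To make the "read child elements" step concrete, here is a minimal sketch of one way to turn each block into an array of cell values. It assumes that a "block" is any <div> that is not nested inside another <div>, and that every non-empty text node should become one cell; both are assumptions for this sample markup, not a general rule for every site.

```php
<?php
$str = <<<STR
<div id="box1">
<h3 class="title_style">Title1</h3>
<p>content for box1</p>
<div class="author">Author Name1<span class="style_date">date1</span>any text</div>
</div>
<div id="box2">
<h3 class="title_style">Title2</h3>
<p>content for box2</p>
<div class="author">Author Name2<span class="style_date">date2</span>any text2</div>
</div>
STR;

libxml_use_internal_errors(true); // keep parser warnings quiet
$dom = new DOMDocument();
$dom->loadHTML($str);
$xpath = new DOMXPath($dom);

$rows = [];
// Assumption: a block is a <div> with no <div> ancestor (skips the inner "author" div).
foreach ($xpath->query('//div[not(ancestor::div)]') as $block) {
    $cells = [];
    // Each non-whitespace text node inside the block becomes one cell, in document order.
    foreach ($xpath->query('.//text()[normalize-space()]', $block) as $text) {
        $cells[] = trim($text->nodeValue);
    }
    $rows[] = $cells;
}

print_r($rows);
```

Because only text nodes are collected, classes, ids and styles never enter the result, so the separate cleaning step may not be needed. Each entry of $rows can then be written out as one <tr> with one <td> per cell.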

Try the Simple HTML DOM Parser library.

Related

Using DOMXpath to find data in not so nice html

I am trying to get some data from a plant list site. This proves to be a bit problematic because their html isn't really well-formed. These are two lines from the search result (disclaimer: I am not responsible for this code):
<tr>
<td>
<i class="glyphicons-icon leaf"></i>
</td>
<td>
<a title="Cimicifuga simplex" href="/taxon/wfo-0000604773" class="result">
<h4 class="h4Results"><em>Cimicifuga simplex</em>(DC.) Wormsk. ex Turcz.</h4>
</a>
Bull. Soc. Imp. Naturalistes Moscou<br/>
<div>
<em>Status:</em><span id="entryStatus">Synonym of </span>
<em>Actaea simplex</em>(DC.) Wormsk. ex Prantl
</div>
<div>
<em>Rank:</em><span id="entryRank">Species</span>
</div>
<div>
<em>Family:</em> Ranunculaceae
</div>
</td>
<td>
<img title="No Image Available" src="/css/images/no_image.jpg" class="thumbnail pull-right"/>
</td>
</tr>
<tr>
<td>
<i class="glyphicons-icon leaf"></i>
</td>
<td>
<a title="Actaea simplex" href="/taxon/wfo-0000519124" class="result">
<h4 class="h4Results"><strong><em>Actaea simplex</em>(DC.) Wormsk. ex Prantl</strong></h4>
</a>
Bot. Jahrb. Syst.<br/>
<div>
<em>Status:</em><span id="entryStatus">Accepted Name</span>
</div>
<div>
<em>Rank:</em><span id="entryRank">Species</span>
</div>
<div>
<em>Family:</em> Ranunculaceae</div>
<div>
<em>Order:</em> Ranunculales
</div>
</td>
<td>
<img title="No Image Available" src="/css/images/no_image.jpg" class="thumbnail pull-right"/>
</td>
</tr>
I added some indentation myself, otherwise it wasn't readable.
Anyway, I loaded the page in PHP with DOMXPath and now I want to do two things:
Select the row that has Accepted Name in it
Get the species name and the corresponding link from it
In this case the result would be "Actaea simplex" and "/taxon/wfo-0000519124". Keep in mind that there will be more results resembling the first row, and that the row I am looking for doesn't have to be the second one.
Normally I just try, use Google and try some more, and in the end I get there, but in this case IDs are used as classes and are not unique. This makes it impossible to use an XPath tester, and perhaps even makes DOMXPath useless.
So, is it possible to get my data with DOMXPath, and if yes, what query do I use?
Try something like:
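libxml_use_internal_errors(true); // the page has duplicate ids and other invalid markup
$dom = new DOMDocument();
$dom->loadHTML($html); // $html holds the downloaded page source
$xpath = new DOMXPath($dom);
// the <a> in the cell whose status span reads exactly "Accepted Name"
$target = $xpath->query("//td[.//span[.='Accepted Name']]/a");
if ($target->length > 0) {
    $link = $target->item(0)->getAttribute('href');
    $title = $target->item(0)->getAttribute('title');
    echo $title, " ", $link;
}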
Output
Actaea simplex /taxon/wfo-0000519124

Extract links from specific table

I have an HTML document with many tables. I want to extract the links from one specific table, which is identified by a specific div above it.
Here's my sample code:
<div class="boxuniwersal_header">Table 1</div>
<img src="img/boxuniwersal_top.gif" width="210" height="18" alt="" style="margin-top: 5px" />
<div class="boxuniwersal_content">
<div class="boxuniwersal_subcontent">
<div class='menu_m1'><table cellpadding="3"><tr><td><img src="some.jpg" width="45" /></td><td>Some text</td></tr></table></div>
<br />
</div>
</div>
<!-- /box -->
<!-- box -->
<div class="boxuniwersal_header">Table 2</div>
<img src="img/boxuniwersal_top.gif" width="210" height="18" alt="" style="margin-top: 5px" />
<div class="boxuniwersal_content">
<div class="boxuniwersal_subcontent">
<div class='menu_m1'><table cellpadding="3"><tr><td><img src="some2.jpg" width="45" /></td><td>Some text2</td></tr></table></div>
<br />
</div>
</div>
$domXPath = new DOMXPath($domDocument);
$results = $domXPath->query("//div/div/table/tr/td/a|//table//tr/td//a"); //querying domdocument
foreach($results as $result)
{
$links[]=$result->getAttribute("href");
}
This code returns all links. I want to grab only links from Table1. Is it possible?
Your main problem is just tuning the XPath expression to select the right elements.
Change your XPath to
//div[text()="Table 1"]/following-sibling::div[1]//table//a
This first finds the <div> element whose text is the one you're after.
The following-sibling::div[1] part then selects the first <div> element at the same level that follows the <div> already selected (this is the one that contains the <table>).
The last part selects all <a> elements inside that <table>.
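Here is a minimal sketch of that expression in use. The sample markup in the question contains no <a> elements, so the links below (/one, /two, /three) are made up for illustration; the variable names follow the question's code.

```php
<?php
// Hypothetical markup modelled on the question's structure, with <a> elements added.
$html = <<<HTML
<div class="boxuniwersal_header">Table 1</div>
<img src="img/boxuniwersal_top.gif" alt="" />
<div class="boxuniwersal_content">
<div class="menu_m1"><table><tr><td><a href="/one">One</a></td><td><a href="/two">Two</a></td></tr></table></div>
</div>
<div class="boxuniwersal_header">Table 2</div>
<img src="img/boxuniwersal_top.gif" alt="" />
<div class="boxuniwersal_content">
<div class="menu_m1"><table><tr><td><a href="/three">Three</a></td></tr></table></div>
</div>
HTML;

libxml_use_internal_errors(true); // keep parser warnings quiet
$domDocument = new DOMDocument();
$domDocument->loadHTML($html);
$domXPath = new DOMXPath($domDocument);

$links = [];
// Header div with the text "Table 1" -> next sibling div -> every link in its table.
$query = '//div[text()="Table 1"]/following-sibling::div[1]//table//a';
foreach ($domXPath->query($query) as $a) {
    $links[] = $a->getAttribute('href');
}

print_r($links); // only the links under "Table 1"
```

Note that text()="Table 1" needs an exact match; if the real header contains extra whitespace, normalize-space(.)="Table 1" is a more forgiving test.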

PHP echo HTML from MySQL Inserting as Sibling not Child

I have a PHP page that is pulling data from a MySQL table. One field (content) contains HTML to populate on the page. When trying to insert the record inside of a paragraph tag, the result starts and ends with a paragraph tag, but does not insert correctly as a child element, but as a sibling. Can anyone see the issue here?
HTML/PHP
<?php
foreach ($pages as $page) {
?>
<div class="slide" id="about-content">
<h1 class="pic-title"><?=$page->title;?></h1>
<p class="pic-caption overlay">
<?=$page->content;?>
</p>
</div>
<?php
}
?>
Output HTML:
<div class="fp-tableCell" style="height:419px;">
<h1 class="pic-title" style="margin-left: 25px;">Splash Page 2</h1>
<p class="pic-caption overlay" style="display: block;">
</p>
<p>dfajdfn<strong>akdjfnas</strong></p>
<p></p>
</div>
MySQL Data:
Title: Splash Page 2
Content: <p>dfajdfn<strong>akdjfnas</strong></p>
I can't seem to trace this one. Thanks!
That's to be expected: you're inserting a <p> inside another <p>. You cannot nest paragraphs, and starting a new paragraph while inside a paragraph terminates the earlier one.
e.g.
<p>foo
<p>bar
<p>baz
will internally generate
<p>foo</p>
<p>bar</p>
<p>baz</p>
In the DOM tree.
You should probably switch to using <div> instead:
<div class="pic-caption overlay">
^^^---
<?=$page->content;?>
</div>
^^^

regular expression to remove a div

I have a file like:
<div clas='dsfdsf'> this is first div </div>
<div clas='dsfdsf'> this is second div </div>
<div class="remove">
<table>
<thead>
<tr>
<th colspan="2">Mehr zum Thema</th>
</tr>
</thead>
<tbody>
<tr> this is tr</tr>
<tr> this row no 2 </tr>
</tbody>
</table>
</div>
<div clas='sasas'> this is last div </div>
I have read the file content into a variable like this:
$Cont = file_get_contents('myfile');
Now I want to replace the div with class name 'remove' using preg_replace. I have tried this:
$patterns = "%<div class='remove'>(.+?)</div>%";
$strPageSource = preg_replace($patterns, '', $Cont);
It did not work. What should be the correct regular expression for this replace?
Try this code.
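// accept either quote style around the class name; the s modifier lets . match newlines
$strPageSource = preg_replace('%<div class=["\']remove["\']>.*?</div>%is', '<div class="newClass">Newthings</div>', $Cont);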
As has been stated in the comments, you should not be using regex to parse HTML, because there's no sane way to extract that <div> if there are other <div>s nested inside it, e.g.:
<div clas='dsfdsf'> this is second div </div>
<div class="remove">
some text <div>nested div</div> more text and some elements<br />
</div>
What you want to do is find the location of your <div class="remove"> and then advance through the HTML (parse it) in the following manner:
1) set $nesting_counter = 0
2) proceed through the HTML until you encounter either <div> or </div>
   a) if you found <div>:
      $nesting_counter++ and go to point 2)
   b) if you found </div>:
      if $nesting_counter > 0
         $nesting_counter-- and go to point 2)
      else
         you've found the closing tag for your `<div class="remove">`. Remember the current position and just remove that substring.
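The steps above can be sketched in PHP roughly as follows. The function name removeDivBlock and its signature are made up for illustration, and the sketch assumes reasonably tidy markup: it does not handle comments, scripts, or attribute values that contain ">".

```php
<?php
/**
 * Remove the first <div> whose opening tag matches $openPattern,
 * including any <div>s nested inside it. Hypothetical helper.
 */
function removeDivBlock(string $html, string $openPattern): string
{
    // Locate the opening tag of the target div.
    if (!preg_match($openPattern, $html, $m, PREG_OFFSET_CAPTURE)) {
        return $html; // nothing to remove
    }
    $start = $m[0][1];
    $pos = $start + strlen($m[0][0]);
    $nesting = 0;

    // Step through every following <div ...> or </div> tag.
    while (preg_match('~<div\b[^>]*>|</div\s*>~i', $html, $tag, PREG_OFFSET_CAPTURE, $pos)) {
        $pos = $tag[0][1] + strlen($tag[0][0]);
        if ($tag[0][0][1] !== '/') {
            $nesting++;   // a nested opening tag
        } elseif ($nesting > 0) {
            $nesting--;   // closes a nested div
        } else {
            // Closes the target div: cut everything from $start up to here.
            return substr($html, 0, $start) . substr($html, $pos);
        }
    }
    return $html; // unbalanced markup: leave the input untouched
}

$cont = <<<HTML
<div clas='dsfdsf'> this is second div </div>
<div class="remove">
some text <div>nested div</div> more text and some elements<br />
</div>
<div clas='sasas'> this is last div </div>
HTML;

$result = removeDivBlock($cont, '~<div\s+class=(["\'])remove\1\s*>~i');
echo $result;
```

A regex is still used here, but only to find individual tags; the counter is what matches the opening tag to its closing tag. For anything beyond simple input, loading the document into DOMDocument and removing the node there is likely to be more robust.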

How to parse HTML with nested tags using Simple DOM Parser?

I have a HTML file that I'm trying to parse. It has a bunch of DIVs like this:
<div class="doc-overview">
<h2>Description</h2>
<div id="doc-description-container" class="" style="max-height: 605px;">
<div class="doc-description toggle-overflow-contents" data-collapsed-height="200">
<div id="doc-original-text">
Content of the div without paragraph tags.
<p>Content from the first paragraph </p>
<p>Content from the second paragraph</p>
<p>Content from the third paragraph</p>
</div>
</div>
<div class="doc-description-overflow"></div>
</div>
I tried this:
foreach($html->find('div[id=doc-original-text]') as $div) {
echo $div->innertext;
}
Note that I search for doc-original-text directly, but I also tried parsing from the outer divs to the inner divs.
Try this:
foreach($html->find('div#doc-original-text') as $div) {
echo $div->innertext;
}
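If Simple HTML DOM does not give the expected result, the same lookup can be sketched with PHP's built-in DOM extension instead. This is an alternative technique, not the library's API. The markup below is shortened from the question, and the closing </div> tags are balanced here for the sake of the example.

```php
<?php
$source = <<<HTML
<div class="doc-overview">
<h2>Description</h2>
<div id="doc-description-container">
<div class="doc-description">
<div id="doc-original-text">
Content of the div without paragraph tags.
<p>Content from the first paragraph </p>
<p>Content from the second paragraph</p>
</div>
</div>
</div>
</div>
HTML;

libxml_use_internal_errors(true); // keep parser warnings quiet
$dom = new DOMDocument();
$dom->loadHTML($source);
$xpath = new DOMXPath($dom);

// Go straight to the inner div by id, however deeply it is nested.
$div = $xpath->query('//div[@id="doc-original-text"]')->item(0);

$inner = '';
if ($div !== null) {
    // Serialising each child node gives the equivalent of innertext (inner HTML).
    foreach ($div->childNodes as $child) {
        $inner .= $dom->saveHTML($child);
    }
}

echo $inner;
```

If only the text is needed, without the <p> tags, $div->textContent returns it directly.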
