Extract two strings from HTML code - PHP

I have a HTML table which has the following structure:
<tr>
<td class='tablesortcolumn'>atest</td>
<td >Kunde</td>
<td >email#example.com</td>
<td align="right"><img src="images/iconedit.gif" border="0"/> <img src="images/pixel.gif" width="2" height="1" border="0"/> <img src="images/icontrash.gif" border="0"/></td>
</tr>
There are hundreds of these tr blocks.
I want to extract atest and email#example.com
I tried the following:
$document = new DOMDocument();
$document->loadHTML($data);
$selector = new DOMXPath($document);
$elements = $selector->query("//*[contains(@class, 'tablesortcolumn')]");
foreach($elements as $element) {
$text = $element->nodeValue;
print($text);
print('<br>');
}
Extracting atest is no problem, because I can get the element with the tablesortcolumn class. How can I get the email address?
I cannot simply use //table/tr/td/a because there are other elements on the website which are structured like this. So I need to get it by choosing an empty href tag. I already tried //table/tr/td/a[contains(@href, '')] but it returns the same as //table/tr/td/a.
Does anyone have an idea how to solve this?

Can you try running an XPath query that looks for text containing the string #? It seems unlikely that this would be used for anything else.
So something like this might work:
//*[text()[contains(.,'#')]]

The following XPath expression does exactly what you want
//*[@class = 'tablesortcolumn' or contains(text(),'#')]
using the input document you have shown will yield (individual results separated by -------------):
<td class="tablesortcolumn">atest</td>
-----------------------
email#example.com
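If the email always sits in the second cell after the name cell, as in the row shown in the question, both values can also be read per row with a relative XPath. The following is only a minimal sketch under that assumption; the sample $data string and the $rows array are illustrative and not taken from the real page.

```php
<?php
// Sample row modeled on the question; the real page may differ.
$data = "<table><tr>
<td class='tablesortcolumn'>atest</td>
<td>Kunde</td>
<td>email#example.com</td>
</tr></table>";

libxml_use_internal_errors(true); // keep warnings from imperfect HTML quiet
$document = new DOMDocument();
$document->loadHTML($data);
$selector = new DOMXPath($document);

$rows = [];
foreach ($selector->query("//td[@class='tablesortcolumn']") as $nameCell) {
    // Assumption: the email cell is the second <td> sibling after the name cell.
    $emailCell = $selector->query("following-sibling::td[2]", $nameCell)->item(0);
    $rows[] = [
        'name'  => trim($nameCell->nodeValue),
        'email' => $emailCell ? trim($emailCell->nodeValue) : null,
    ];
}
print_r($rows);
```

Keeping the name and the email together per row avoids having to pair up two separate result lists afterwards.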

If you are looking for an email field, you could use a regex. Here is an article that could be useful.
EDIT
According to Nisse Engström, I will put the interesting part of the article here in case the blog goes down. Thanks for the advice.
// Suppress XML parsing errors (this is needed to parse Wikipedia's XHTML)
libxml_use_internal_errors(true);
// Load the PHP Wikipedia article
$domDoc = new DOMDocument();
$domDoc->load('http://en.wikipedia.org/wiki/PHP');
// Create XPath object and register the XHTML namespace
$xPath = new DOMXPath($domDoc);
$xPath->registerNamespace('html', 'http://www.w3.org/1999/xhtml');
// Register the PHP namespace if you want to call PHP functions
$xPath->registerNamespace('php', 'http://php.net/xpath');
// Register preg_match to be available in XPath queries
//
// You can also pass an array to register multiple functions, or call
// registerPhpFunctions() with no parameters to register all PHP functions
$xPath->registerPhpFunctions('preg_match');
// Find all external links in the article
$regex = '#^http://[^/]+(?<!wikipedia.org)/#';
$links = $xPath->query("//html:a[ php:functionString('preg_match', '$regex', @href) > 0 ]");
// Print out matched entries
echo "Found " . (int) $links->length . " external links\n\n";
foreach($links as $linkDom) { /* @var $linkDom DOMElement */
$link = simplexml_import_dom($linkDom);
$desc = (string) $link;
$href = (string) $link['href'];
echo " - ";
if ($desc && $desc != $href) {
echo "$desc: ";
}
echo "$href\n";
}

If you are using Chrome, you can test your XPath queries in the console, like this:
$x("//*[contains(@class, 'tablesortcolumn')]")

Related

Using preg_match to get the value between two tags

I am trying to get the value between HTML tags:
preg_match('/<span class="value">(.*)<\/span>/i', $file_string, $title);
HTML:
<p class="upc">
<label>UPC/EAN/ISBN:</label>
<span class="value">746775319571</span>
</p>
Do not parse HTML with regular expressions; use the PHP DOM extension instead:
$html = '<p class="upc">
<label>UPC/EAN/ISBN:</label>
<span class="value">746775319571</span>
</p>';
$dom = new DOMDocument();
$dom->loadHTML($html);
$spans = $dom->getElementsByTagName('span');
if ($spans->length > 0) {
echo $spans->item(0)->nodeValue; // outputs 746775319571
}
Online demo: http://ideone.com/9W8gsv
If having a particular class value is a required constraint, you can either perform the check manually by iterating over $spans and checking the class attribute (using DOMElement::getAttributeNode), or use DOMXPath instead.
Either way, I'm leaving it as homework, because we all know how satisfying it is to solve issues yourself!
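For readers who want to compare notes, one possible DOMXPath version of the class check could look like the sketch below. It assumes the same sample markup as above; the $query string and the $values array are names chosen here for illustration.

```php
<?php
$html = '<p class="upc">
<label>UPC/EAN/ISBN:</label>
<span class="value">746775319571</span>
</p>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Match <span> elements whose class attribute contains the token "value",
// even when other class names are present on the same element.
$query = "//span[contains(concat(' ', normalize-space(@class), ' '), ' value ')]";

$values = [];
foreach ($xpath->query($query) as $span) {
    $values[] = trim($span->nodeValue);
}
print_r($values);
```

The concat/normalize-space idiom matches whole class tokens, so a class such as "valueless" would not be picked up by accident.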

PHP - SimpleHTMLDom - How to access table elements?

I'm trying to grab the artists for every album release based on metacritic using simplehtmldom - http://www.metacritic.com/browse/albums/release-date/coming-soon/date?view=detailed
The artist names are contained within separate td elements which have the class name artistName.
What I've managed to figure out so far is
$html = file_get_html('http://www.metacritic.com/browse/albums/release-date/coming-soon/date?view=detailed');
$es = $html->find('table.musicTable td');
Where do I go from here? I'm finding examples and the documentation a bit confusing. Any help will be really appreciated.
I suggest using the PHP DOM extension (see the DOM manual), which is a very powerful tool for parsing and manipulating XML or HTML documents.
For your case you can do something like this:
<?php
$html = file_get_contents('http://www.metacritic.com/browse/albums/release-date/coming-soon/date?view=detailed');
$doc = new DOMDocument();
$doc->loadHTML($html);
$searchNode = $doc->getElementsByTagName("table");
foreach ($searchNode as $table)
{
// do your work with each $table element here
}
?>
Or you can even use XPath to query the document nodes (see the XPath usage documentation).
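As a rough illustration of the XPath route, the sketch below runs against a small hand-written table instead of the live page. The markup is hypothetical and only imitates the td elements with class artistName described in the question; the real page structure may differ.

```php
<?php
// Hypothetical markup imitating the structure described in the question.
$html = '<table class="musicTable">
<tr><td class="artistName"><a href="#">Peter Gabriel</a></td></tr>
<tr><td class="artistName"><a href="#">TOY</a></td></tr>
</table>';

libxml_use_internal_errors(true); // real-world pages often trigger parser warnings
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$artists = [];
foreach ($xpath->query("//td[@class='artistName']") as $cell) {
    // textContent includes the text of nested elements such as <a>.
    $artists[] = trim($cell->textContent);
}
print_r($artists);
```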
Every name is contained in an anchor inside a <td class="artistName">; that is all you need to know in this case to write the following code:
$url = "http://www.metacritic.com/browse/albums/release-date/coming-soon/date?view=detailed";
//Create a DOM object
$html = new simple_html_dom();
// Load HTML from a URL
$html->load_file($url);
// Find the anchor containing the name inside all "td.artistName" elements
$anchors = $html->find('td.artistName a');
// loop through all found anchors and print the content
foreach($anchors as $anchor) {
$name = $anchor->plaintext;
echo $name . "<br>";
}
// Clear DOM object
$html->clear();
unset($html);
OUTPUT
Peter Gabriel
Stephen Malkmus & The Jicks
TOY
Black Knights
Broken Bells
Bruce Springsteen
David Broza
Eskimo Callboy
...
Working DEMO
Please read the MANUAL for more examples and details

PHP Simple HTML Dom - tag attrib

I'm trying to grab the plain text from within <font size="3" color="blue"> ... It's not picking up the font tag, although it does work if I do "font", 3. But there are a lot of font tags on the site and I'd like to make the search a bit more specific. Is it possible to match multiple attributes on a tag?
<?php
include('simple_html_dom.php');
$html = new simple_html_dom();
$html = file_get_html('http://cwheel.domain.com/');
##### <font size="3" color="blue">Certified Genuine</font>
$element = $html->find("font[size=3][color=blue]", 0);
echo $element-> plaintext . '<br>';
$html->clear();
?>
I don't know Simple_html_dom, but it seems the query you are trying to pass is an XPath query. In that case you need to prefix attributes with @. You also need to prefix the whole query with // to make sure it searches for font tags at any depth. The final query should look something like this:
//font[@size='3'][@color='blue']
Using DOMDocument and DOMXPath, it works pretty well.
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
$fonts = $xpath->query('//font[@size="3"][@color="blue"]');
foreach($fonts as $font){
echo $font->textContent. "\n";
}

Strip an entire block of html based on class or id with php

I have the following php function which is supposed to remove a block of html tag based on a given classname or id. I got this function at http://www.katcode.com/php-html-parsing-extracting-and-removing-html-tag-of-specific-class-from-string/
This function works for flat markup but has problems with nested tags: it cannot work out where the block begins and ends, so the div is not removed properly. In the example below I'm trying to remove the entire div block that has class 'two'.
How can I rework this function to remove an entire tag regardless of how many nested elements it contains? I'm open to other PHP suggestions. I can easily do this with jQuery, but I'm looking for a PHP server-side solution.
The HTML looks like this:
<div class="test">
<div>testing1</div>
<div class="two">
<div>testing3</div>
<div>testing3</div>
</div>
<div>testing3</div>
<div>testing4</div>
</div>
php
<?php
$x = '<div class="test"><div>testing1</div><div class="two"><div>testing3</div><div>testing3</div></div><div>testing3</div><div>testing4</div></div>';
function removeTag($str,$id,$start_tag,$end_tag){
while(($pos_srch = strpos($str,$id))!==false){
$beg = substr($str,0,$pos_srch);
$pos_start_tag = strrpos($beg,$start_tag);
$beg = substr($beg,0,$pos_start_tag);
$end = substr($str,$pos_srch);
$end_tag_len = strlen($end_tag);
$pos_end_tag = strpos($end,$end_tag);
$end = substr($end,$pos_end_tag+$end_tag_len);
$str = $beg.$end;
}
return $str;
}
echo removeTag($x,'two','<div','/div>');
?>
Not tested but try something like:
$doc = new DOMDocument();
$doc->loadHTML($x);
$xpath = new DOMXPath($doc);
$query = "//div[contains(@class, 'two')]";
$oldnodes = $xpath->query($query);
foreach ($oldnodes as $node) {
$fragment = $doc->createDocumentFragment();
while($node->childNodes->length > 0) {
$fragment->appendChild($node->childNodes->item(0));
}
$node->parentNode->replaceChild($fragment, $node);
}
echo $doc->saveHTML();
Hope it helps
HTML should probably never be parsed with PHP string functions that way.
Use PHP's DOMDocument class to open the HTML as an object. You can then use an XPath query to search the document for the block you are looking for, loop through the results and remove them, and then save the document back to text.
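The steps described above could be sketched roughly as follows. Note that the loop in the earlier answer keeps the children of the matched div by moving them into a fragment, whereas this version drops the whole block, nested elements included. The variable names $targets, $outer and $result are chosen here for illustration.

```php
<?php
$x = '<div class="test"><div>testing1</div><div class="two"><div>testing3</div><div>testing3</div></div><div>testing3</div><div>testing4</div></div>';

$doc = new DOMDocument();
$doc->loadHTML($x);
$xpath = new DOMXPath($doc);

// Copy the matches into an array first so that removing nodes
// cannot interfere with the iteration.
$query = "//div[contains(concat(' ', normalize-space(@class), ' '), ' two ')]";
$targets = iterator_to_array($xpath->query($query));
foreach ($targets as $node) {
    // Removing the node also removes everything nested inside it.
    $node->parentNode->removeChild($node);
}

// Serialize only the outer <div class="test"> instead of the whole document,
// which would otherwise include the doctype, <html> and <body> wrappers.
$outer = $xpath->query("//div[@class='test']")->item(0);
$result = $doc->saveHTML($outer);
echo $result;
```

Because the parser tracks nesting itself, the depth of the removed block does not matter, which is exactly where the string-based removeTag function runs into trouble.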

simplehtmldom php: How do you search for one thing or another

I want to scrape some HTML with simple html dom in PHP. I have a bunch of <tr> tags containing nested tags. The <tr> tags I want alternate between bgcolor=#ffffff and bgcolor=#cccccc. There are some <tr> tags that have other background colors.
I want to get all the code in each <tr> tag that has either bgcolor=#ffffff or bgcolor=#cccccc. I can't just use $html->find('tr') as there are other <tr> tags that I don't want to find.
Any help would be appreciated.
You can use simplehtmldom for this too.
This is my solution for your problem:
<?php
include_once "simple_html_dom.php";
// the html code example
$html = '<table>
<tr bgcolor="#ffffff"><td>1</td></tr>
<tr bgcolor="#cccccc"><td>2</td></tr>
<tr bgcolor="#ffffff"><td>3</td></tr>
</table>';
// in this case I load the html code via string
$code = str_get_html($html);
// find elem by attribute
$trs = $code -> find('tr[bgcolor=#ffffff]');
foreach($trs as $tr){
echo $tr -> innertext;
}
$trs = $code -> find('tr[bgcolor=#cccccc]');
foreach($trs as $tr){
echo $tr -> innertext;
}
?>
You could load the DOM into a simplexml class and then use xpath, like so:
$xml = simplexml_import_dom($simple_html_dom);
$goodies = $xml -> xpath('//*[@bgcolor = "#ffffff"] | //*[@bgcolor = "#cccccc"]');
You might even be able to put that OR syntax within the same set of brackets, but I'd need to double check.
Update:
Sorry, I thought you were talking about the DOM extension. I just looked up simplehtmldom, and it appears that its find feature is loosely based on XPath. Why not just do:
$goodies = $html -> find('[bgcolor=#ffffff], [bgcolor=#cccccc]');
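On the open question of combining both conditions in one set of brackets: XPath 1.0 does have an or operator inside a predicate, so with the DOM extension a single query should be enough. The sketch below assumes sample markup similar to the other answer, with one row in a different color added to show that it is skipped.

```php
<?php
$html = '<table>
<tr bgcolor="#ffffff"><td>1</td></tr>
<tr bgcolor="#cccccc"><td>2</td></tr>
<tr bgcolor="#eeeeee"><td>3</td></tr>
</table>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// A single predicate with "or" matches rows in either color, in document order.
$cells = [];
foreach ($xpath->query('//tr[@bgcolor="#ffffff" or @bgcolor="#cccccc"]') as $tr) {
    $cells[] = trim($tr->textContent);
}
print_r($cells);
```

Unlike two separate find() calls, a single query returns the rows in the order they appear in the document.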
