hey guys,
a curl function returns a string $widget that contains regular html -> two divs where the first div holds a table with various values inside of <td>'s.
i wonder what's the easiest and best way for me to extract only all the values inside of the <td>'s so i have blank values without the remaining html.
any idea what the pattern for the preg_match should look like?
thank you.
Regex is not a suitable solution. You're better off loading it up in a DOMDocument and parsing it.
You're better off using a DOM parser for that task:
$html = <<<HTML
<div>
<table>
<tr>
<td>foo</td>
<td>bar</td>
</tr>
<tr>
<td>hello</td>
<td>world</td>
</tr>
</table>
</div>
<div>
Something irrelevant
</div>
HTML;
$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$tds = $xpath->query('//div/table/tr/td');
foreach ($tds as $cell) {
echo "{$cell->textContent}\n";
}
Would output:
foo
bar
hello
world
You shouldn't use regexps to parse HTML. Use DOM and XPath instead. Here's an example:
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
$nodes = $xpath->query('//td');
$result = array();
foreach ($nodes as $node) {
$result[] = $node->nodeValue;
}
// $result holds the values of the tds
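If you'd rather keep the values grouped by table row instead of one flat list, something along these lines should work. It's only a sketch: the markup below is made up to stand in for whatever your curl call puts into $widget, and the $rows/$cells names are just for illustration.

```php
<?php
// Hypothetical sample markup standing in for the cURL response in $widget.
$widget = '<div><table>'
    . '<tr><td>foo</td><td>bar</td></tr>'
    . '<tr><td>hello</td><td>world</td></tr>'
    . '</table></div><div>Something irrelevant</div>';

libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$doc = new DOMDocument();
$doc->loadHTML($widget);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$rows = array();
foreach ($xpath->query('//tr') as $tr) {
    $cells = array();
    // 'td' is evaluated relative to the current row (second argument)
    foreach ($xpath->query('td', $tr) as $td) {
        $cells[] = trim($td->textContent);
    }
    $rows[] = $cells;
}
print_r($rows);
```

Passing the row as the second argument to query() keeps each inner lookup scoped to that row, so cells from other rows don't leak in.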
Only if you have very limited, well-defined HTML can you expect to parse it with regular expressions. The highest ranked SO answer of all time addresses this issue.
He comes ...
Within a curl request I have a html table that has the below structure. I now want to extract only table rows that contain a span element with the empty class and not the ones with the class="subcomponent".
I successfully used XPath to find the elements with the empty class, but how do I get the entire <tr> or, even better, the specific <td> nodes that contain Version and Partnumber?
Thanks in advance.
<table>
...
<tbody>
<tr>
<td></td>
<td></td>
<td>
<span class="">Product</span>
</td>
<td>Version</td>
<td>Partnumber</td>
</tr>
<tr>
<td></td>
<td></td>
<td>
<span class="subcomponent">Component</span>
</td>
<td>Version</td>
<td>Partnumber</td>
</tr>
</tbody>
My PHP code
$doc = new DOMdocument();
libxml_use_internal_errors(true);
$doc->loadHTML($page);
$doc->saveHTML();
$xpath = new DOMXpath($doc);
$query = '//span[@class=""]';
$entries = $xpath->query($query);
foreach ($entries as $entry) {
echo $entry->C14N();
}
To access the table rows themselves using SimpleXML, you can use the following:
$sxml = simplexml_load_string('<table>...</table>');
$rows = $sxml->xpath('//tr[td/span[@class=""]]');
foreach ($rows as $row) {
echo "Version: ", $row->td[3], ", Partnumber: ", $row->td[4];
}
The XPath works by selecting all <tr> tags that have a child <td>, which itself has a child <span> with a blank class.
In the loop, you need to access the child cells of each row by number, since your sample doesn't indicate that they're labelled any other way. I'm assuming a table structure won't change too often though, so that should be fine.
See https://eval.in/860169 for an example.
Alternative DOMDocument Version
If you're fetching a full webpage, which won't necessarily be well-formed, you might need to use DOMDocument as you have in your first example. It's a bit less clean to access the child-elements, but something like the following will work:
$doc = new DOMdocument;
libxml_use_internal_errors(true);
$doc->loadHTML($page);
$xpath = new DOMXpath($doc);
$rows = $xpath->query('//tr[td/span[@class=""]]');
foreach ($rows as $row) {
$cells = $row->getElementsByTagName('td');
$version = $cells->item(3)->nodeValue;
$partNumber = $cells->item(4)->nodeValue;
echo "Version: {$version}, Part Number: {$partNumber}", PHP_EOL;
}
See https://eval.in/860217
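As a variation on the DOMDocument version, you can also pull the cells out with XPath relative to each matched row instead of item(). This is a rough sketch, assuming the five-column layout from the question; the cell values (1.0, PN-1 and so on) are invented sample data, and note that XPath positions are 1-based, so td[4] and td[5] correspond to item(3) and item(4).

```php
<?php
// Hypothetical sample mirroring the table layout in the question.
$page = '<table><tbody>'
    . '<tr><td></td><td></td><td><span class="">Product</span></td>'
    . '<td>1.0</td><td>PN-1</td></tr>'
    . '<tr><td></td><td></td><td><span class="subcomponent">Component</span></td>'
    . '<td>2.0</td><td>PN-2</td></tr>'
    . '</tbody></table>';

libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($page);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$found = array();
foreach ($xpath->query('//tr[td/span[@class=""]]') as $row) {
    // evaluate() with string() returns plain text; $row is the context node
    $found[] = array(
        'version'    => $xpath->evaluate('string(td[4])', $row),
        'partNumber' => $xpath->evaluate('string(td[5])', $row),
    );
}
print_r($found);
```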
I would use the following XPath expression:
//td[text()="Version"] | //td[text()="Partnumber"]
Which gives me:
Element='<td>Version</td>'
Element='<td>Partnumber</td>'
Element='<td>Version</td>'
Element='<td>Partnumber</td>'
I was wondering if there was any way to use DOM to select elements that have dynamic ids. All of the ids start with link_(some id).
Example:
<tr id="link_111111">something in here...</tr>
<tr id="link_222222">something in here...</tr>
<tr id="link_333333">something in here...</tr>
<tr id="link_444444">something in here...</tr>
<tr id="link_555555">something in here...</tr>
I was wondering if I could grab all the tr's that have the id with link_ because I don't have the specific id's, they are random.
You can use an XPath expression to achieve this:
//tr[starts-with(@id, "link")]
Example:
$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$nodes = $xpath->query('//tr[starts-with(@id, "link")]');
foreach ($nodes as $node) {
// Do whatever
}
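For a self-contained version you can paste and run, here is a small sketch. The table markup and the other_1 id are made up for illustration, and I'm matching on "link_" (with the underscore) to be a bit stricter than the expression above.

```php
<?php
// Hypothetical markup: two rows with link_ ids and one unrelated row.
$html = '<table>'
    . '<tr id="link_111111"><td>a</td></tr>'
    . '<tr id="other_1"><td>b</td></tr>'
    . '<tr id="link_222222"><td>c</td></tr>'
    . '</table>';

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$ids = array();
// starts-with() compares the beginning of the id attribute's value
foreach ($xpath->query('//tr[starts-with(@id, "link_")]') as $tr) {
    $ids[] = $tr->getAttribute('id');
}
print_r($ids);
```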
DOM way using some string functions ...
$dom = new DOMDocument;
$dom->loadHTML($html);
$tagK = 'link_';
foreach ($dom->getElementsByTagName('tr') as $tag) {
if (substr(strtolower($tag->getAttribute('id')),0,strlen($tagK))===$tagK) {
echo $tag->getAttribute('id').PHP_EOL;
}
}
Or, if you want a more flexible and easier way to scrape the web, I suggest you take a look at
https://github.com/fabpot/goutte, which acts as a wrapper that you can also use for clicking a link or submitting a form.
I wrote some tutorials on web scraping with the Goutte class. Feel free to check them out:
http://iapdesign.com/webdev/laravel-4-webdev/superb-web-scraping-tutorials-using-laravel-4/
I'm trying to write a script that will go through a poorly coded webpage and return the title element. However, the person who made the website I plan on scraping did not use ANY classes, just div elements. Here's the source of the webpage I'm trying to scrape:
<tbody>
<tr>
<td style = "...">
<div style = "...">
<div style = "...">TEXT I WANT</div>
</div>
</td>
</tr>
</tbody>
and when I copy the xpath in chrome I get this string:
/html/body/table/tbody/tr[2]/td[3]/table/tbody/tr[1]/td/div/div[3]
I'm having trouble figuring out where I put that string in an xpath query.
If not an xpath query maybe I should do a preg_match?
I tried this:
$location = '/html/body/table/tbody/tr[2]/td[3]/table/tbody/tr[1]/td/div/div[3]';
$html = file_get_contents($URL);
$doc = new DomDocument();
$doc->loadHtml($html);
$xpath = new DomXPath($doc);
// Now query the document:
foreach ($xpath->query($location) as $node) {
echo $node, "\n";
}
but nothing is printed to the page.
Thanks.
EDIT: Full source code here:
http://pastebin.com/K5tZ4dFH
EDIT2: Cleaner code screen shot: http://i.imgur.com/lWKheBy.png
From looking at your source, try the following:
$html = file_get_contents($URL);
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
$nodes = $xpath->query("//div[contains(@style, 'left:20px')]");
foreach ($nodes as $node) {
echo $node->textContent;
}
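A side note on the attempt in the question: a DOMElement can't be converted to a string, so echoing $node directly fails instead of printing the text. Read textContent, or serialize the node with saveHTML($node). Here is a minimal sketch; the markup and style values are invented, since the real page isn't shown here.

```php
<?php
// Hypothetical markup; the real page's style attributes are elided in the question.
$html = '<table><tr><td style="padding:0">'
    . '<div style="position:relative">'
    . '<div style="left:20px">TEXT I WANT</div>'
    . '</div></td></tr></table>';

libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$node = $xpath->query("//div[contains(@style, 'left:20px')]")->item(0);

$text = $node->textContent;      // just the text inside the div
$markup = $doc->saveHTML($node); // the div serialized back to HTML
echo $text, PHP_EOL;
```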
It looks like you want the text just before the first </div>, so this regex will find that:
[^<>]+(?=<\/div>)
Using regex in PHP how can I get the 108 from this tag?
<td class="registration">108</td>
Regex isn't a good solution for parsing HTML. Use a DOM Parser instead:
$str = '<td class="registration">108</td>';
$dom = new DOMDocument();
$dom->loadHTML($str);
$tds = $dom->getElementsByTagName('td');
foreach($tds as $td) {
echo $td->nodeValue;
}
Output:
108
The above code loads your HTML string using the loadHTML() method, finds all the <td> tags, loops through them, and echoes each node's value.
If you want to get only the specific class name, you can use an XPath:
$dom = new DOMDocument();
$dom->loadHTML($str);
$xpath = new DomXPath($dom);
// get the td tag with 'registration' class
$tds = $xpath->query("//*[contains(@class, 'registration')]");
foreach($tds as $td) {
echo $td->nodeValue;
}
This is similar to the above code, except that it uses XPath to find the required tag. You can find more information about XPaths in the PHP manual documentation. This post should get you started.
If you wish to force regex, use the <td class=["']?registration["']?>(.*)</td> expression
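If you do go the regex route, this is roughly how that expression plugs into preg_match(). It's only a sketch: I've used ~ as the delimiter so the slash in </td> needs no escaping, and I've swapped in a non-greedy (.*?) so the match stops at the first </td> when several cells sit on one line. It will still break on markup that differs even slightly, for example extra attributes or classes.

```php
<?php
$str = '<td class="registration">108</td>';

// Non-greedy capture of whatever sits between the opening and closing tag.
$pattern = '~<td class=["\']?registration["\']?>(.*?)</td>~';

$value = null;
if (preg_match($pattern, $str, $m)) {
    $value = $m[1]; // first capture group
}
echo $value, PHP_EOL;
```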
I'm new to both HTML & PHP. I'm attempting to remove multiple DOM elements from a parsed HTML string.
For example:
<tbody>
<tr>
<td>I'd like to find and remove this text</td>
<td>& possibly this too</td>
<td>can you help?</td>
</tr>
</tbody>
Thanks in advance!
DOMDocument is much better for dealing with DOM manipulation (SimpleXML is good for just parsing):
$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$textNodes = $xpath->query('//text()');
foreach ($textNodes as $node) {
$node->parentNode->removeChild($node);
}
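Note that the loop above strips every text node in the document. If you only want to drop particular cells, you can narrow the XPath. A minimal sketch, assuming (purely for illustration) that the cells to remove are the ones whose text contains the word "this":

```php
<?php
// Sample markup based on the question.
$html = '<table><tbody><tr>'
    . '<td>I\'d like to find and remove this text</td>'
    . '<td>&amp; possibly this too</td>'
    . '<td>can you help?</td>'
    . '</tr></tbody></table>';

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
// Hypothetical criterion: remove each <td> whose text contains "this".
foreach ($xpath->query('//td[contains(., "this")]') as $td) {
    $td->parentNode->removeChild($td);
}

// Collect what is left to check the result.
$remaining = array();
foreach ($xpath->query('//td') as $td) {
    $remaining[] = $td->textContent;
}
print_r($remaining);
```

The node list returned by query() isn't live, so removing nodes while looping over it is safe. That's not true of getElementsByTagName(), whose list shrinks as you remove elements.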
Very simply, if I wanted to remove lots of things from a string in PHP, I would create an array of the parts to remove, loop through them, and remove them one at a time from the big string.
Like this:
$html = "Loads of html blah blah blah blah......";
$array = array(
"Remove me",
"and me",
"remove me too"
);
foreach($array as $string){
// str_replace() returns the modified string, so assign it back
$html = str_replace($string, '', $html);
}
For more information on str_replace see http://php.net/manual/en/function.str-replace.php