PHP DOM parsing

I'm trying to get the values of the following table. I tried both curl/regex (I know it's not recommended) and DOM separately, but wasn't able to get the values properly.
There are multiple rows in the page, so I'll need to use a foreach. I need an exact match of the structure below.
<tr>
<td width="75" style="NS">
<img src="NS" width="64" alt="INEEDTHISVALUE">
</td>
<td style="NS">
NS
</td>
<td style="NS">INEEDTHISVALUETOO</td>
</tr>
NS = non-static values. They change for each td, since it's a colored table (inline CSS). They may contain special characters like ; or /, as well as digits and letters.
I'm using simple_html_dom class which can be found here : http://htmlparsing.com/php.html
I'm using the code below to get all td's, but I need more specific output (I included the table row above)
What I've tried so far :
$html = file_get_html("URL");
foreach($html->find('td') as $td) {
echo $td."<br>";
}
REGEX & CURL
$site = "URL";
$ch = curl_init();
$hc = "YahooSeeker-Testing/v3.9 (compatible; Mozilla 4.0; MSIE 5.5; Yahoo! Search - Web Search)";
curl_setopt($ch, CURLOPT_REFERER, 'http://www.google.com');
curl_setopt($ch, CURLOPT_URL, $site);
curl_setopt($ch, CURLOPT_USERAGENT, $hc);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$site = curl_exec($ch);
curl_close($ch);
preg_match_all('#<tr><td width="75" style="(.*?)"><img src="/folder/link/(.*?)" width="64" alt="(.*?)"></td><td style="(.*?)">(.*?)</td><td style="(.*?)">(.*?)</td></tr>#', $site, $arr);
var_dump($arr); // returns empty array, WHY?

You can do it like this without a library:
$results = array();
$doc = new DOMDocument();
$doc->loadHTML($site);
$xpath = new DOMXPath($doc);
foreach ($xpath->query('//tr') as $tr) {
$results[] = array(
'img_alt' => $xpath->query('td[1]/img', $tr)->item(0)->getAttribute('alt'),
'td_text' => $xpath->query('td[last()]', $tr)->item(0)->nodeValue
);
}
print_r($results);
It will give you:
Array
(
[0] => Array
(
[img_alt] => INEEDTHISVALUE 1
[td_text] => INEEDTHISVALUETOO 1
)
[1] => Array
(
[img_alt] => INEEDTHISVALUE 2
[td_text] => INEEDTHISVALUETOO 2
)
)
Relevant documentation: PHP: DOMXPath::query
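A row without an image in its first cell would make item(0) return null in the loop above, so it may help to let XPath select only the rows that match the structure. The following is a minimal sketch under that assumption; the sample markup and its values are made up for illustration and stand in for the fetched page.

```php
<?php
// Sample markup standing in for the fetched page (hypothetical data).
$site = '<table>
<tr><td width="75" style="a"><img src="x.png" width="64" alt="FIRST ALT"></td><td style="b">skip</td><td style="c">FIRST TEXT</td></tr>
<tr><td>no image here</td><td>OTHER</td></tr>
</table>';

libxml_use_internal_errors(true); // keep warnings about imperfect HTML quiet
$doc = new DOMDocument();
$doc->loadHTML($site);
$xpath = new DOMXPath($doc);

$results = array();
// Only rows whose first cell holds an <img> with an alt attribute.
foreach ($xpath->query('//tr[td[1]/img[@alt]]') as $tr) {
    $img  = $xpath->query('td[1]/img', $tr)->item(0);
    $last = $xpath->query('td[last()]', $tr)->item(0);
    $results[] = array(
        'img_alt' => $img->getAttribute('alt'),
        'td_text' => trim($last->nodeValue),
    );
}
print_r($results);
```

The second sample row has no image, so it is skipped rather than causing an error.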

Related

How to get a specified row using cUrl PHP

I use cURL to communicate with an external web server, but the response is HTML. I was able to convert it to JSON (more than 4000 rows), but I have no idea how to get the specific row that contains my result. Any ideas?
Here is my cUrl code :
require_once('getJson.php');
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, 'http://www.reputationauthority.org/domain_lookup.php?ip=website.com&Submit.x=9&Submit.y=5&Submit=Search');
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 1.1.4322)');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 5);
curl_setopt($ch, CURLOPT_TIMEOUT, 5);
$data = curl_exec($ch);
$httpcode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
curl_close($ch);
$data = '<<<EOF'.$data.'EOF';
$json = new GetJson();
header("Content-Type: text/plain");
$res = json_encode($json->html_to_obj($data), JSON_PRETTY_PRINT);
$myArray = json_decode($res,true);
For getJson.php
class GetJson{
function html_to_obj($html) {
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
return $this->element_to_obj($dom->documentElement);
}
function element_to_obj($element) {
if ($element->nodeType == XML_ELEMENT_NODE){
$obj = array( "tag" => $element->tagName );
foreach ($element->attributes as $attribute) {
$obj[$attribute->name] = $attribute->value;
}
foreach ($element->childNodes as $subElement) {
if ($subElement->nodeType == XML_TEXT_NODE) {
$obj["html"] = $subElement->wholeText;
}
else {
$obj["children"][] = $this->element_to_obj($subElement);
}
}
return $obj;
}
}
}
Rather than walking the tree level by level to reach line 2175 (something like $data['children'][2]['children'][7]['children'][3]['children'][1]['children'][1]['children'][0]['children'][1]['children'][0]['children'][1]['children'][2]['children'][0]['children'][0]['html'] does not seem like a good idea to me), I want to go directly to it.
If the HTML being returned has a consistent structure every time, and you just want one particular value from one part of it, you may be able to use regular expressions to find the part you need. This is an alternative to trying to put the whole thing into an array. I have used this technique before to parse an HTML document and find a specific item. Here's a simple example. You will need to adapt it to your needs, since you haven't specified the exact nature of the data you're seeking. You may need to go down several levels of matching to find the right bit:
$data = curl_exec($ch);
//Split the output into an array that we can loop through line by line
$array = preg_split('/\n/',$data);
//For each line in the output
foreach ($array as $element)
{
//See if the line contains a hyperlink
if (preg_match("/<a href/", "$element"))
{
// ...do something here, e.g. store the data retrieved, or do more matching to find something within it...
}
}
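Since the question already loads the response into a DOMDocument, another option is to query that document with DOMXPath and jump straight to the node, instead of converting it to an array first. This is only a sketch: the real page structure is not shown in the question, so the markup and the class names "label" and "value" below are hypothetical.

```php
<?php
// Hypothetical response body; the real page structure is not shown in the question.
$data = '<html><body><div id="wrap"><table><tr>'
      . '<td class="label">Score</td><td class="value">42/100</td>'
      . '</tr></table></div></body></html>';

libxml_use_internal_errors(true); // real-world HTML is rarely perfectly valid
$dom = new DOMDocument();
$dom->loadHTML($data);
$xpath = new DOMXPath($dom);

// Jump straight to the cell of interest rather than indexing ['children'] repeatedly.
$nodes = $xpath->query('//td[@class="value"]');
$value = $nodes->length > 0 ? trim($nodes->item(0)->nodeValue) : null;

var_dump($value);
```

The XPath expression would need to be adapted to whatever uniquely identifies the target element in the actual response.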

Extract parts of a php array to client side

I'm trying to keep all products in one central PHP file and have my client-side PHP just request the info, so that I only have to edit the central file to add and remove products.
My server side:
$varProduct = array(
// [0] [1] [2] [3 4 5 6 7] [8]
array("Title" , 0001 , 100, 0,0,1,1,0, "/womens/tops/s/2.png", "/womens/tops/s/2.jpg", "/womens/tops/s/2.jpg", 50 )
);
In my html client side I want to display the title, the price [2] and the url [8]
basically
for ($i = 0; $i < count($varProduct); $i++) {
//display $varProduct[i][0];
//display the Image for $varProduct[i][8];
//display $varProduct[i][2];
}
How can I get the values from my server-side file into HTML tags on the client side? I need to display them inline. Will I be able to format the variables?
Try something like this
<?php
for ($i = 0; $i < count($varProduct); $i++) {
// full URL first, then the POST parameters
$return = sendPostData("http://stackoverflow.com/", array('parm1' => $varProduct[$i][0], 'parm2' => $varProduct[$i][0]));
print_r($return);
}
?>
<?php
//send data function
function sendPostData($url, Array $post) {
$data = "";
foreach ($post as $key => $row) {
$row = urlencode($row); //fix the url encoding
$key = urlencode($key); //fix the url encoding
if ($data == "") {
$data .="$key=$row";
} else {
$data .="&$key=$row";
}
}
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_CUSTOMREQUEST, "POST");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, $data);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($ch, CURLOPT_POST, 1);
$result = curl_exec($ch);
curl_close($ch); // Seems like good practice
return $result;
}
?>
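If both files live on the same server, it may be enough to include the central file and echo the values inside the markup. Below is a minimal sketch of that idea. The file name products.php and the function renderProduct() are hypothetical, the array is inlined here so the example runs on its own, and the product ID is written as a string because a literal 0001 is read by PHP as the octal number 1.

```php
<?php
// Stand-in for the central file; in practice this array would live in its own
// file (hypothetical name: products.php) and be pulled in with require.
$varProduct = array(
    //    [0]      [1]    [2]  [3..7]          [8]
    array("Title", "0001", 100, 0, 0, 1, 1, 0, "/womens/tops/s/2.png"),
);

// Build one inline block of HTML per product (hypothetical helper).
function renderProduct(array $p) {
    $title = htmlspecialchars($p[0], ENT_QUOTES); // escape text for HTML output
    $image = htmlspecialchars($p[8], ENT_QUOTES);
    $price = number_format($p[2], 2);             // format the price with 2 decimals
    return '<span class="product">' . $title
         . ' <img src="' . $image . '" alt="' . $title . '"> '
         . $price . '</span>';
}

$out = '';
foreach ($varProduct as $product) {
    $out .= renderProduct($product);
}
echo $out;
```

A span is an inline element, so the products appear next to each other, and number_format() covers the formatting part of the question.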

Problems with multiple attributes while using PHP Simple HTML DOM

I use this code for getting elements of left navigation bar:
function parseInit($url) {
$ch = curl_init();
$timeout = 0;
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, FALSE);
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, 2);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
$data = curl_exec($ch);
curl_close($ch);
return $data;
}
$data = parseInit("https://www.smile-dental.de/index.php");
$data = preg_replace('/<(d[ldt])( |>)/smi', '<div data-type="$1"$2', $data);
$data = preg_replace('/<\/d[ldt]>/smi', '</div>', $data);
$html = new simple_html_dom();
$html = $html->load($data);
But I ran into the following problem.
For example, if I use $html->find("div[data-type=dd].level2"), I get ALL elements with data-type DT, DD or DL and the class name LEVEL2. If I use $html->find("div.level2[data-type=dd]") instead, I get ALL elements with data-type DD, but with class names LEVEL1, LEVEL2, LEVEL3, etc.
Could you explain what the problem is? Thanks in advance!
P.S.: All DT, DL and DD elements were converted with a regexp to DIV elements with the appropriate data attributes, because this parser counts these elements incorrectly.
REGEXes are not made to manipulate HTML, DOM parsers are... And simple_html_dom you're using can do it easily...
The following code will do what you want just fine (check comments):
$data = parseInit("https://www.smile-dental.de/index.php");
// Create a DOM object
$html = new simple_html_dom();
$html = $html->load($data);
// Find all tags to replace (dt, dd and dl, as in the question)
$nodes = $html->find('dt, dd, dl');
// Loop through every node and make the wanted changes
foreach ($nodes as $key => $node) {
// Get the original tag's name
$originalTag = $node->tag;
// Replace it with the new tag
$node->tag = 'div';
// Set a new attribute with the original tag's name
$node->{'data-type'} = $originalTag;
}
// Clear DOM variable
$html->clear();
unset($html);
Here it is in action
Now, for multiple attributes filtering, you can use either of the following methods:
foreach ( $html->find("div.level2") as $key => $node) {
if ( $node->{'data-type'} == 'dt' ) {
# code...
}
}
OR (courtesy of h0tw1r3):
// array containing all the filtered nodes
$dts = array_filter($html->find('div.level2'), function($node){return $node->{'data-type'} == 'dt';});
Please read the MANUAL for more details...
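For comparison, the same two-condition filter can be written as a single query with PHP's built-in DOMDocument and DOMXPath. This swaps simple_html_dom for a different technique, and the sample markup below is hypothetical; it only mimics the converted navigation described in the question.

```php
<?php
// Hypothetical markup mimicking the converted navigation.
$data = '<div data-type="dl" class="level1">
  <div data-type="dt" class="level2">A</div>
  <div data-type="dd" class="level2">B</div>
  <div data-type="dd" class="level3">C</div>
</div>';

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($data);
$xpath = new DOMXPath($dom);

// Both conditions must hold: data-type is "dd" AND the class list contains "level2".
$query = '//div[@data-type="dd"]'
       . '[contains(concat(" ", normalize-space(@class), " "), " level2 ")]';

$texts = array();
foreach ($xpath->query($query) as $node) {
    $texts[] = trim($node->nodeValue);
}
print_r($texts);
```

The concat/normalize-space idiom matches "level2" as a whole class name, so an element with class "level22" would not be picked up by accident.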

Why does this JSON array appear empty?

I am using a webservice to synchronize products from our distributor. As a first step, I have built a table to display the data to verify that everything is working properly. I have been able to successfully retrieve product descriptions using the following function:
function get_description($api_key, $item){
$url = 'http://www.stl-distribution.com/webservices/json/GetProductDescription.php';
$post_vars = 'api_key='.$api_key.'&item='.$item;
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_POST,1);
curl_setopt($ch, CURLOPT_POSTFIELDS,$post_vars);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION,1);
curl_setopt($ch, CURLOPT_HEADER,0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER,1);
$return_data = curl_exec($ch);
$json_array = json_decode($return_data,true);
return $json_array;
}
I am displaying the information using a foreach loop as follows:
<table>
<tr>
<th>ISBN</th>
<th>Description:</th>
<th>Categories:</th>
<th>Options:</th>
</tr>
<?php foreach($product_ISBNs as $item) : ?>
<tr>
<td class="center"><?php echo $item[0]?></td>
<td class="center"><?php $item_description = get_description($api_key,$item[0]); echo $item_description['description'];?>
<td class="center"><?php $item_data = get_meta_data($api_key,$item[0]); echo $item_data['product_type']; ?>
</tr>
<?php endforeach?>
</table>
While I have successfully been able to retrieve the description using the get_description() function, I consistently get the following error from the get_meta_data() function:
Notice: Undefined index: product_type in C:\xampp\htdocs\STLImport\view\user_form.php on line 21
The code for the get_meta_data() function is as follows:
function get_meta_data($api_key, $item){
$url = 'http://www.stl-distribution.com/webservices/json/GetProductMetaBasic.php';
$post_vars = 'api_key='.$api_key.'&item='.$item;
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_POST,1);
curl_setopt($ch, CURLOPT_POSTFIELDS,$post_vars);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION,1);
curl_setopt($ch, CURLOPT_HEADER,0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER,1);
$return_data = curl_exec($ch);
$json_array = json_decode($return_data,true);
return $json_array;
}
When I do a print_r() immediately before the return $json_array; statement I get the following error:
Array ( [0] => Array ( [error] => "" not found ) )
According to our distributor's website, our requests are being sent and received. Every time I refresh the page my usage stats go up. So, it appears that either no data is being returned, or I am not referencing it correctly. I know that these products exist in the database because it is returning the description. Therefore, I must not be referencing it correctly but I can't find where my error is. I have tried referencing it using various combinations to no avail. The webservice documentation gives this example for how the data should be returned:
{
"9781434768513":
{
"isbn13":"9781434768513",
"isbn":"1434768511",
"upc":"000000912992",
"title":"Crazy Love",
"subtitle":"Overwhelmed By A Relentless God",
"contributor1":"Chan, Francis",
"contributor2":"Yankoski, Danae",
"contributor3":"",
"vendor":"David C. Cook",
"release_date":"20080430",
"retail":"14.99",
"binding":"Paperback",
"product_type":"Books",
"category1":"Christian Living",
"category2":"",
"category3":"",
"grade_level_start":"",
"grade_level_end":"",
"inventory_updated":"20100825 11:25",
"tn_available":993,
"tn_onorder":240,
"nv_available":735,
"nv_onorder":0,
"image_small":"http:\/\/www.stl-distribution.com\/covers\/7814\/sm-9781434768513.jpg",
"image_medium":"http:\/\/www.stl-distribution.com\/covers\/7814\/md-9781434768513.jpg",
"image_large":"http:\/\/www.stl-distribution.com\/covers\/7814\/lg-9781434768513.jpg"
}
}
It seems that my get_meta_data() function had an error in it. My API provider informed me that this line
$post_vars = 'api_key='.$api_key.'&item='.$item;
Should have been
$post_vars = 'api_key='.$api_key.'&items='.$item;
Notice the addition of an 's'. So, $json_array appeared empty because it was empty. My other function worked because it did not require the 's' as GetProductMetaBasic did.
In addition, this line
<td class="center"><?php $item_data = get_meta_data($api_key,$item[0]); echo $item_data['product_type']; ?>
Should have been
<td class="center"><?php $item_data = get_meta_data($api_key,$item[0]); echo $item_data[$item[0]]['product_type']; ?></td>
because, as the sample data implied, the array was multidimensional. The desired value (product_type) was contained within the $item[0] array. Therefore, $item_data['product_type'] did not exist.
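The lookup by ISBN key can be sketched on its own, using a trimmed copy of the documented response. The API key value 'YOUR_KEY' is a placeholder, and http_build_query() is shown only as one way to build the POST string with URL encoding; it is not something the distributor's documentation prescribes.

```php
<?php
// Trimmed copy of the documented response shape.
$return_data = '{"9781434768513":{"isbn13":"9781434768513","title":"Crazy Love","product_type":"Books"}}';
$isbn = '9781434768513';

$json_array = json_decode($return_data, true);

// Look the record up by its ISBN key, then read the field; fall back to null
// instead of triggering an "Undefined index" notice.
$product_type = isset($json_array[$isbn]['product_type'])
    ? $json_array[$isbn]['product_type']
    : null;

// http_build_query() URL-encodes the POST fields instead of concatenating by hand.
$post_vars = http_build_query(array('api_key' => 'YOUR_KEY', 'items' => $isbn));

var_dump($product_type, $post_vars);
```

Checking with isset() before reading also makes an error response, such as the "not found" array shown earlier, easier to detect.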

php cURL. preg_match , extract text from xhtml

I'm trying to extract the price from the HTML page/link below using PHP cURL and preg_match. I expect this code to output 4,550, but for some reason I get
Notice: Undefined offset: 1 in C:\wamp\www\test.php on line 22
I think the pattern is correct, because if I put the HTML itself in a variable and escape the double quotes, it works.
Also, if I output it (echo $result;), the HTML grabbed from the Foxtons website displays properly, so I can't figure out why the whole thing doesn't work. I need to make this work, and I would also appreciate it if you could tell me why that notice is generated and why my current script doesn't work.
$url = "http://www.foxtons.co.uk/search?bedrooms_from=0&property_id=727717";
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch,CURLOPT_RETURNTRANSFER, 1);
$result = curl_exec($ch);
curl_exec($ch);
curl_close($ch);
$result2 = str_replace('"', '\"', $result);
$tagname1= ");</script>
";
$tagname2= "</noscript>
per month</a>";
$pattern = "/$tagname1(.*?)$tagname2/";
preg_match($pattern, $result, $matches);
$prices = $matches[1];
print_r($prices);
?>
I rewrote the script a bit to account for more than 1 <noscript> on the page. You needed to use preg_match_all which will look for all the matches not just stop at the first one.
$url = "http://www.foxtons.co.uk/search?bedrooms_from=0&property_id=727717";
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch,CURLOPT_RETURNTRANSFER, 1);
$result = curl_exec($ch); // one call is enough; a second curl_exec() would repeat the request
curl_close($ch);
preg_match_all("/<noscript>(.*)<\/noscript>/", $result, $matches);
print_r($matches);
Outputs
Array
(
[0] => Array
(
[0] => £1,050
[1] => 4,550
)
[1] => Array
(
[0] => £1,050
[1] => 4,550
)
)
I tried this on my box and it worked - let me know if it worked for you
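As for the notice in the question, a likely cause is that the marker strings are dropped into the pattern unescaped: with "/" as the delimiter, the "/" in "</script>" ends the pattern early and the ")" is an unmatched parenthesis, so the pattern fails to compile and $matches stays empty. Here is a sketch of the same approach with preg_quote(); the $result string is a made-up fragment shaped like the question's markers, not the real page.

```php
<?php
// Hypothetical fragment shaped like the markers used in the question.
$result = "foo();</script>\n4,550</noscript>\nper month</a>";

$tagname1 = ");</script>\n";
$tagname2 = "</noscript>\nper month</a>";

// Escape regex metacharacters and the "/" delimiter inside the literal markers.
$pattern = '/' . preg_quote($tagname1, '/') . '(.*?)' . preg_quote($tagname2, '/') . '/s';

// Only read $matches[1] when the pattern actually matched.
$price = preg_match($pattern, $result, $matches) === 1 ? $matches[1] : null;
var_dump($price);
```

Checking the return value of preg_match() before indexing $matches avoids the "Undefined offset" notice whenever nothing matches.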
Don't use REGEX to parse html, use an html dom parser instead, like PHP Simple HTML DOM Parser
include("simple_html_dom.php") ;
$html = file_get_html("http://www.foxtons.co.uk/search?bedrooms_from=0&property_id=727717");
foreach($html->find('noscript') as $noscript)
{
echo $noscript->innertext."<br>";
}
Output:
£1,600
6,934
£1,500
6,500
£1,350
5,850
£950
4,117
£925
4,009
£850
3,684
£795
3,445
£795
3,445
£775
3,359
£750
3,250
