php cURL. preg_match , extract text from xhtml

I'm trying to extract the price from the HTML page linked below using PHP cURL and preg_match. I expect this code to output 4,550, but for some reason I get:
Notice: Undefined offset: 1 in C:\wamp\www\test.php on line 22
I think the pattern is correct, because if I put the HTML itself in a variable and escape the double quotes, it works.
Also, if I output the result (echo $result;), it displays the HTML fetched from the Foxtons website properly, so I can't figure out why the whole thing doesn't work. I need to make this work, and I would also appreciate it if you could tell me why that notice is generated and why my current script doesn't work.
$url = "http://www.foxtons.co.uk/search?bedrooms_from=0&property_id=727717";
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch,CURLOPT_RETURNTRANSFER, 1);
$result = curl_exec($ch);
curl_exec($ch);
curl_close($ch);
$result2 = str_replace('"', '\"', $result);
$tagname1= ");</script>
";
$tagname2= "</noscript>
per month</a>";
$pattern = "/$tagname1(.*?)$tagname2/";
preg_match($pattern, $result, $matches);
$prices = $matches[1];
print_r($prices);
?>

I rewrote the script a bit to account for more than one <noscript> on the page. You need preg_match_all, which finds every match instead of stopping at the first one.
$url = "http://www.foxtons.co.uk/search?bedrooms_from=0&property_id=727717";
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch,CURLOPT_RETURNTRANSFER, 1);
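$result = curl_exec($ch); // one call is enough; a second curl_exec() would repeat the request
curl_close($ch);
// Non-greedy (.*?) keeps each match inside a single <noscript> element
preg_match_all("/<noscript>(.*?)<\/noscript>/", $result, $matches);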
print_r($matches);
Outputs
Array
(
[0] => Array
(
[0] => £1,050
[1] => 4,550
)
[1] => Array
(
[0] => £1,050
[1] => 4,550
)
)
I tried this on my box and it worked - let me know if it worked for you
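As for the notice itself: "Undefined offset: 1" appears when $matches[1] is read although preg_match() found nothing (or failed), so $matches has no index 1. A minimal sketch of a guarded version follows, assuming a made-up HTML sample rather than the real Foxtons markup; preg_quote() is used so that characters such as / and ) in the marker strings cannot break the pattern.

```php
<?php
// Hypothetical sample standing in for the fetched page (not the real markup).
$html = "<script>track();</script>\n<noscript>4,550</noscript>\nper month</a>";

// preg_quote() escapes regex metacharacters; the second argument also escapes the delimiter.
$start = preg_quote('<noscript>', '/');
$end   = preg_quote('</noscript>', '/');

$price = null;
// preg_match() returns 1 on a match, 0 on no match and false on error,
// so only read $matches[1] when the result is exactly 1.
if (preg_match("/{$start}(.*?){$end}/s", $html, $matches) === 1) {
    $price = $matches[1];
}

echo $price ?? 'no match', "\n";
```

Checking the return value first means a changed page layout produces a "no match" result instead of a notice.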

Don't use regex to parse HTML; use an HTML DOM parser instead, such as PHP Simple HTML DOM Parser:
include("simple_html_dom.php") ;
$html = file_get_html("http://www.foxtons.co.uk/search?bedrooms_from=0&property_id=727717");
foreach($html->find('noscript') as $noscript)
{
echo $noscript->innertext."<br>";
}
This echoes:
£1,600
6,934
£1,500
6,500
£1,350
5,850
£950
4,117
£925
4,009
£850
3,684
£795
3,445
£795
3,445
£775
3,359
£750
3,250

Related

PHP - json_decode - issues decoding string

I'm playing with the API from deepl.com that provides automatic translations. I call the API through cURL and I get a json string in return which appears to be fine but cannot be decoded by PHP for some reason.
Let me show first how I make the cURL call :-
$content = "bonjour <caption>monsieur</caption> madame";
$url = 'https://api.deepl.com/v2/translate';
$fields = array(
'text' => $content,
'target_lang' => $lg,
'tag_handling' => 'xml',
'ignore_tags' => 'caption',
'auth_key' => 'my_api_key');
$fields_string = "";
foreach($fields as $key=>$value)
{
$fields_string .= $key.'='.$value.'&';
}
rtrim($fields_string, '&');
$ch = curl_init();
curl_setopt($ch,CURLOPT_URL, $url);
curl_setopt($ch,CURLOPT_POST, count($fields));
curl_setopt($ch,CURLOPT_POSTFIELDS, $fields_string);
curl_setopt($ch, CURLOPT_HTTPHEADER, array('Content-Type: application/x-www-form-urlencoded','Content-Length: '. strlen($fields_string)));
$result = curl_exec($ch);
curl_close($ch);
If at this stage I do
echo $result;
I get:
{"translations":[{"detected_source_language":"FR","text":"Hola <caption>monsieur</caption> Señora"}]}
Which seems ok to me. Then if I use code below -
echo gettype($result);
I get "string" which is still ok but now, the following code fails:
$result = json_decode($result,true);
print_r($result);
The output is empty!
If I now do something like this:
$test = '{"translations":[{"detected_source_language":"FR","text":"Hola <caption>monsieur</caption> Señora"}]}';
echo gettype($test);
$test = json_decode($test,true);
print_r($test);
I get a perfectly fine array:
(
[translations] => Array
(
[0] => Array
(
[detected_source_language] => FR
[text] => Hola <caption>monsieur</caption> Señora
)
)
)
I did nothing else than copy/pasting the content from the API to a static variable and it works but coming from the API, it doesn't. It's like the data coming from the API is not understood by PHP.
Do you have any idea of what's wrong?
Thanks!
Laurent

I've had very similar issues before, and for me the cause was the encoding of the data returned by the API: it was not the UTF-8 that json_decode expects. I'm guessing that when you copy/paste, the string you hard-code ends up in a different encoding, so it works fine when passed into json_decode.
The PHP docs specify json_decode only works with UTF-8 encoded strings:
http://php.net/manual/en/function.json-decode.php
You may be able to use mb_convert_encoding() to convert to UTF-8:
http://php.net/manual/en/function.mb-convert-encoding.php
Try this before calling json_decode:
$result = mb_convert_encoding($result, "UTF-8");
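To confirm that encoding really is the cause, it can help to ask PHP why decoding failed before converting anything. The sketch below uses a made-up response body and assumes, purely for illustration, that the bytes are ISO-8859-1; the real API may use a different encoding or already send valid UTF-8. It relies on the mbstring extension.

```php
<?php
// Hypothetical response body: "Señora" with ñ as the single ISO-8859-1 byte 0xF1,
// which is not valid UTF-8.
$result = "{\"text\":\"Se\xF1ora\"}";

$decoded = json_decode($result, true);
if ($decoded === null && json_last_error() !== JSON_ERROR_NONE) {
    // Reports the reason, e.g. a malformed UTF-8 error.
    echo 'json_decode failed: ', json_last_error_msg(), "\n";

    // Assumption: the source encoding is ISO-8859-1. Naming it explicitly avoids
    // relying on mbstring's default source encoding.
    $utf8    = mb_convert_encoding($result, 'UTF-8', 'ISO-8859-1');
    $decoded = json_decode($utf8, true);
}

print_r($decoded);
```

If json_last_error_msg() reports something other than a UTF-8 problem, converting the encoding will not help and the raw response is worth inspecting instead.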

Make sure to set CURLOPT_RETURNTRANSFER to true. Only then will curl_exec actually return the response, otherwise it will output the response and return a boolean, indicating success or failure.
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$result = curl_exec($ch);
curl_close($ch);
if ($result !== false) {
$response = json_decode($result, true);
// do something with $response
} else {
// handle curl error
}
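The request in the question also builds the POST body by hand: the values are not URL-encoded and the return value of rtrim() is discarded. A minimal sketch of the same request using http_build_query() follows; build_post_body() and post_form() are hypothetical helper names, and the field values are placeholders.

```php
<?php
// Hypothetical helper: encode fields as application/x-www-form-urlencoded.
function build_post_body(array $fields): string
{
    // http_build_query() URL-encodes keys and values and joins them with '&'.
    return http_build_query($fields, '', '&');
}

// Hypothetical helper: POST the fields and return the response body,
// or null when cURL reports an error.
function post_form(string $url, array $fields): ?string
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_POST, true);
    curl_setopt($ch, CURLOPT_POSTFIELDS, build_post_body($fields));
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
    $result = curl_exec($ch);
    curl_close($ch);

    return $result === false ? null : $result;
}

// Only the body builder is exercised here, so the sketch runs without network access.
echo build_post_body(['text' => 'a b', 'tag_handling' => 'xml']), "\n";
```

With the body built this way, cURL sets the Content-Type and Content-Length headers for the form post itself.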

As @Eilert Hjelmeseth said, you have a special character in your JSON string: "Señora".
Another way to convert a string to UTF-8 is utf8_encode(). Note that it assumes the input is ISO-8859-1 and is deprecated as of PHP 8.2:
$result = json_decode(utf8_encode($result),true);

How to get text within td element from remote webpage via PHP

I have a temperature monitor set up, and I would like to use some of the data for other things (cron jobs, etc). The data from the sensor can be accessed from our local network (192.168.123.123). The element in question is:
<td id="1E5410ECC9D90FC3-entity-0-measurement-0" class="">69.08</td>
<!-- I NEED THE 69.08 -->
I can't do it via Ajax since I get the Access-Control-Allow-Origin error (CORS).
I tried this:
$url = 'http://192.168.123.123';
$content = file_get_contents($url);
$first = explode( '<div id="1E5410ECC9D90FC3-entity-0-measurement-0">' , $content );
$second = explode("</div>" , $first[0] );
echo $second[0];
but I got this:
��UMS�0��+��$���94С�2����؋-�%#Ʉ�뻲���Bۓ%����ݷr��m4�yyF*_+ry���ӈP������S��|��&�ȵ�2���}��V�7ǜO��dz�[�� (�!�_2��$�/�p/ g�=B� D����<��1�#�=h���J�˨�'��I^ ��g7��=�=��^�0��ϔ����p�Q��L��I�%TF�Q�) ������;c��o$��a����g��mWr�ܹ��;�(��bE��O�i� ��y�҉)f=�6=�,2� �#I��s����>����kNƕt/W2^��# Xp�3^݅$ѵ��T U�ʲ�#f��db�ԁ%��b�`G|��D�{񠐏sι1�� ]#2ZH�(1;&�h8��^0er��3���D�Q�5B�u� ^!5X:�{a U\:߰0�~Ɍ�3+S�^1��qB:�g����C>�.�P~n��$\֢D����%J+�b�ELc�Gq���K �]��xV��j�[���Ԧ��nAɍ��<�ZT#���zc�Q(f܁�(~�^�ZKwk:8�·n>��(=�"aB)�Fl5�b]/�_�$���_��ɴ��9�H}��B [#�V�ԅp��r�g�A�j���2����Ju*������{�bY�,O4�����M��B�#�e���,� ��_֔���o����
How can I properly get the 'td' text within the specific div id?

You are trying to retrieve data from <td id="1E5410ECC9D90FC3-entity-0-measurement-0" class="">, not from <div id="1E5410ECC9D90FC3-entity-0-measurement-0">, so split on the <td> markup instead. The delimiter has to match the markup exactly (including class=""), and the text after the delimiter ends up in element 1 of the result, not element 0:
$url = 'http://192.168.123.123';
$content = file_get_contents($url);
$first = explode( '<td id="1E5410ECC9D90FC3-entity-0-measurement-0" class="">' , $content );
$second = explode("</td>" , $first[1] );
echo $second[0];
Or am I crazy?

Step 1:
I suggest using php's curl library to manage and configure your web request/response.
Using this mechanism allows you to better manage/control encoding, compression and encryption.
http://php.net/manual/en/book.curl.php
// create curl resource
$ch = curl_init();
// set url
curl_setopt($ch, CURLOPT_URL, "http://192.168.123.123");
//return the transfer as a string
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
// $output contains the output string
$output = curl_exec($ch);
// close curl resource to free up system resources
curl_close($ch);
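The garbled output in the question looks like a compressed response body, although that is only a guess from the symptoms. If so, cURL can be asked to decompress it by setting CURLOPT_ENCODING to an empty string before curl_exec(). The sketch below shows the equivalent manual step for a body that has already been fetched; decode_if_gzipped() is a hypothetical helper, and it needs the zlib extension.

```php
<?php
// With cURL, the usual approach is:
//   curl_setopt($ch, CURLOPT_ENCODING, ''); // accept and decode supported encodings
// The helper below does the same job by hand for a gzip-compressed string.

// Hypothetical helper: return the decompressed body if it is gzip data,
// otherwise return the input unchanged.
function decode_if_gzipped(string $body): string
{
    // gzip streams start with the magic bytes 0x1f 0x8b.
    if (strncmp($body, "\x1f\x8b", 2) === 0) {
        $decoded = gzdecode($body);
        return $decoded === false ? $body : $decoded;
    }

    return $body;
}

// Simulate a compressed response so the sketch runs without the sensor.
$compressed = gzencode('<td id="t" class="">69.08</td>');
echo decode_if_gzipped($compressed), "\n";
```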
Step 2:
Let's extract the details out of the returned response string from the web server. I suggest PHP's PCRE function preg_match to extract the needed data.
http://php.net/manual/en/ref.pcre.php
// Looking for <td id="1E5410ECC9D90FC3-entity-0-measurement-0" class="">69.08</td>
$pattern = '/id="1E5410ECC9D90FC3-entity-0-measurement-0".*>([\d]{1,2}?\.[\d]{1,2})<\//';
// run the regex match and collect the hit
preg_match($pattern, $output, $matches);
// print_r of the array
/*
Array
(
[0] => id="1E5410ECC9D90FC3-entity-0-measurement-0" class="">69.08</
[1] => 69.08
)
*/
// Print out the result to check
echo $matches[1];
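As an alternative to the regex, the cell can be looked up by its id with PHP's built-in DOM extension. This is a minimal sketch that parses an inline HTML sample instead of the live sensor page; it assumes the page contains the element exactly as quoted in the question.

```php
<?php
// Inline sample standing in for the cURL response in $output.
$output = '<html><body><table><tr>'
        . '<td id="1E5410ECC9D90FC3-entity-0-measurement-0" class="">69.08</td>'
        . '</tr></table></body></html>';

// Collect parser warnings quietly; real-world HTML is rarely perfectly valid.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($output);
libxml_clear_errors();

// Select the cell by its id attribute.
$xpath = new DOMXPath($doc);
$nodes = $xpath->query('//td[@id="1E5410ECC9D90FC3-entity-0-measurement-0"]');

// item(0) would be null if the cell were missing, so check the length first.
$temperature = $nodes->length > 0 ? trim($nodes->item(0)->textContent) : null;

echo $temperature ?? 'not found', "\n";
```

Unlike the regex, this does not depend on attribute order or on the number of digits in the reading.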

foreach saves only one array even though print_r prints the result correctly: what am I missing?

I am building a function that downloads images from an external source and then writes the ID of each downloaded product, along with the image path, to a JSON file.
The download loop works correctly, but when I try to save the product ID and image path to a text file, only one record is saved.
I tried printing the output before calling file_put_contents, and it turns out I'm getting the correct result. Why is only one result saved instead of all of them?
I spent almost all day trying to figure it out, but no luck. Any help will be appreciated.
Here is my code:
$url = 'product-test2.json'; // path to JSON file
$data = file_get_contents($url);
$characters = json_decode($data, true);
$i = array_column ($characters , 'Id');
foreach ( $i as $value => $productid){
include 'login.php';
$url = "http://xxxx.com/api/products/image/";
$url .= implode($matches [0]); // get the token form login.php
$url .= '/' . $productid ; //product id
// echo $url . '<br>';
$opts = array ('http' => array (
'methos' => 'GET',
'header' => 'Content-typpe: application/json',
)
);
$context = stream_context_create ($opts);
$urlImage = file_get_contents ($url , false , $context);
$urlImage = substr_replace ($urlImage , "",-1);
$urlImage = substr ($urlImage , 1);
$fopenpath = 'C:\\xampp\htdocs\\test\\api\\images\\' ;
$fopenpath .= $productid . '.jpg';
$fp = fopen ( $fopenpath, 'w');
$c = curl_init ($urlImage);
curl_setopt ($c , CURLOPT_FILE , $fp);
curl_setopt ($c , CURLOPT_HEADER, 0);
curl_setopt ($c , CURLOPT_POST, false);
curl_setopt ($c , CURLOPT_SSL_VERIFYPEER, false);
$rawdata = curl_exec ($c);
fwrite ($fp, $rawdata);
fclose ($fp);
//
$result3= ["id" => $productid ,"image_path" => $fopenpath] ;
$image_file = "images.json";
$output = json_encode ($result3);
print_r ($output);
file_put_contents ($image_file , $output );
};
And here is the output of print_r ($output):
{"id":2977,"image_path":"C:\\xampp\\htdocs\\test\\api\\images\\2977.jpg"} {"id":2981,"image_path":"C:\\xampp\\htdocs\\test\\api\\images\\2981.jpg"} {"id":3009,"image_path":"C:\\xampp\\htdocs\\test\\api\\images\\3009.jpg"} {"id":3018,"image_path":"C:\\xampp\\htdocs\\test\\api\\images\\3018.jpg"} {"id":11531,"image_path":"C:\\xampp\\htdocs\\test\\api\\images\\11531.jpg"}
As you can see, the print_r output is correct. All I want to do is save that output to a file, but the file written by file_put_contents currently contains only one record:
{"id":11531,"image_path":"C:\\xampp\\htdocs\\test\\api\\images\\11531.jpg"}
Can you please help and tell me what I'm doing wrong? I have already spent a lot of time on this :)

file_put_contents is the same as opening, writing, and closing the file, which doesn't really make sense to do in a foreach loop. If you really want to use it, I think you could add the FILE_APPEND flag, which will stop it from overwriting itself over and over and over.
An alternative would be to fopen before the foreach loop with the 'a' flag, which will create the file then keep appending lines instead of wiping out the file contents every loop. (And fclose after the foreach.)
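A third option is to collect the records in an array inside the loop and write the file once after it, which also produces a single valid JSON document instead of several objects placed back to back. This is a minimal sketch with made-up product IDs and a temporary file path; the download step is left out.

```php
<?php
// Hypothetical product IDs; in the question they come from product-test2.json.
$productIds = [2977, 2981, 3009];

// Temporary location so the sketch can run anywhere.
$imageFile = sys_get_temp_dir() . '/images.json';

$records = [];
foreach ($productIds as $productId) {
    // ... download the image here ...
    $records[] = [
        'id'         => $productId,
        'image_path' => "images/{$productId}.jpg", // placeholder path
    ];
}

// One write after the loop: nothing is overwritten, and the file holds a JSON array.
file_put_contents($imageFile, json_encode($records, JSON_PRETTY_PRINT));

// Read the file back to confirm that every record was saved.
$saved = json_decode(file_get_contents($imageFile), true);
echo count($saved), " records saved\n";
```

If the file has to grow across several runs, file_put_contents($imageFile, $line . PHP_EOL, FILE_APPEND) with one JSON object per line is the closer match to the append approach described above.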

PHP dom parsing

I'm trying to get the values of the following table. I tried both curl/regex (I know it's not recommended) and DOM separately, but wasn't able to get the values properly.
There are multiple rows in the page, so I'll need to use a foreach. I need an exact match of the structure below.
<tr>
<td width="75" style="NS">
<img src="NS" width="64" alt="INEEDTHISVALUE">
</td>
<td style="NS">
NS
</td>
<td style="NS">INEEDTHISVALUETOO</td>
</tr>
NS = non-static values. They change for each td and a, since it's a colored (inline CSS) table. They may contain special characters such as ; and /, as well as digits and letters.
I'm using simple_html_dom class which can be found here : http://htmlparsing.com/php.html
I'm using the code below to get all td's, but I need more specific output (I included the table row above)
What I've tried so far :
$html = file_get_html("URL");
foreach($html->find('td') as $td) {
echo $td."<br>";
}
REGEX & CURL
$site = "URL";
$ch = curl_init();
$hc = "YahooSeeker-Testing/v3.9 (compatible; Mozilla 4.0; MSIE 5.5; Yahoo! Search - Web Search)";
curl_setopt($ch, CURLOPT_REFERER, 'http://www.google.com');
curl_setopt($ch, CURLOPT_URL, $site);
curl_setopt($ch, CURLOPT_USERAGENT, $hc);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$site = curl_exec($ch);
curl_close($ch);
preg_match_all('#<tr><td width="75" style="(.*?)"><img src="/folder/link/(.*?)" width="64" alt="(.*?)"></td><td style="(.*?)">(.*?)</td><td style="(.*?)">(.*?)</td></tr>#', $site, $arr);
var_dump($arr); // returns empty array, WHY?

You can do it like this without a library:
$results = array();
$doc = new DOMDocument();
$doc->loadHTML($site);
$xpath = new DOMXPath($doc);
foreach ($xpath->query('//tr') as $tr) {
$results[] = array(
'img_alt' => $xpath->query('td[1]/img', $tr)->item(0)->getAttribute('alt'),
'td_text' => $xpath->query('td[last()]', $tr)->item(0)->nodeValue
);
}
print_r($results);
It will give you:
Array
(
[0] => Array
(
[img_alt] => INEEDTHISVALUE 1
[td_text] => INEEDTHISVALUETOO 1
)
[1] => Array
(
[img_alt] => INEEDTHISVALUE 2
[td_text] => INEEDTHISVALUETOO 2
)
)
Relevant documentation: PHP: DOMXPath::query
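On a real page, some rows may not have the expected structure (header rows, for example), in which case item(0) returns null and the getAttribute() call fails. Here is a self-contained sketch of the same idea that only visits rows with exactly three cells and an image in the first one; the inline HTML is a made-up sample, not the actual page.

```php
<?php
// Made-up sample: one row with the expected structure, one without.
$site = <<<HTML
<table>
<tr>
<td width="75" style="a"><img src="x.png" width="64" alt="First alt"></td>
<td style="b">NS</td>
<td style="c">First text</td>
</tr>
<tr><td>header only</td></tr>
</table>
HTML;

// Collect parser warnings quietly instead of printing them.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($site);
libxml_clear_errors();

$xpath   = new DOMXPath($doc);
$results = [];

// The predicate skips rows that lack three cells or an <img> in the first cell.
foreach ($xpath->query('//tr[count(td) = 3 and td[1]/img]') as $tr) {
    $img  = $xpath->query('td[1]/img', $tr)->item(0);
    $last = $xpath->query('td[3]', $tr)->item(0);

    $results[] = [
        'img_alt' => $img->getAttribute('alt'),
        'td_text' => trim($last->nodeValue),
    ];
}

print_r($results);
```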

Regex to find out specific part of an html page

I want a regex that finds the lines below in a block of code.
The part that I want to find:
-->Copy frame link\",\"url240\":\"http:\/\/cs534515v4.vk.me\/u163220668\/videos\/1c1b06aec9.240.mp4\",\"url360\":\"http:\/\/cs534515v4.vk.me\/u163220668\/videos\/1c1b06aec9.360.mp4\",\"jpg\"<--
This code forms part of an HTML page, and I want to retrieve only the part shown. I am writing the code in PHP.
My complete code:
<?php
set_time_limit(0);
function get_content_of_url($url){
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_BINARYTRANSFER, true);
$content = curl_exec($ch);
curl_close($ch);
return $content;
}
$plyst = get_content_of_url("http://vk.com/video56612186_167113956");
preg_match('/link\\".*"jpg\\"/', $plyst , $matches);
var_dump($matches);
//preg_match('/http:\/\/[a-zA-Z0-9\\/-_.]+/', $matches[0][0], $id);
//start_script($id[0]);
?>

How about this.
$str = "video_get_current_url\":\"Copy frame link\",\"url240\":\"http:\\\/\\\/cs534515v4.vk.me\\\/u163220668\\\/videos\\\/1c1b06aec9.240.mp4\",\"url360\":\"http:\\\/\\\/cs534515v4.vk.me\\\/u163220668\\\/videos\\\/1c1b06aec9.360.mp4\",\"jpg\":\"http:\\\/\\\/cs534515.vk.me\\\/u163220668\\\/video\\\/l_8a5b0712.jpg\",\"ip_subm\":1,\"nologo";
preg_match('/\\"Copy\sframe.*"jpg\\"/is', $str, $matches);
var_dump($matches);
Output:
array(1) {
[0]=>
string(187) ""Copy frame link","url240":"http:\\/\\/cs534515v4.vk.me\\/u163220668\\/videos\\/1c1b06aec9.240.mp4","url360":"http:\\/\\/cs534515v4.vk.me\\/u163220668\\/videos\\/1c1b06aec9.360.mp4","jpg""
}
Edit:
And then, if you wanted to extract the video url's from that:
preg_match_all('/(https?:.*?\.mp4)/', $matches[0], $id);
//Then echo out the url's
foreach ($id[0] as $url) {
// the preg_replace strips out the double backslashes.
echo preg_replace('/\\\\/', '', $url)."<br />";
}
Output:
http://cs534515v4.vk.me/u163220668/videos/1c1b06aec9.240.mp4
http://cs534515v4.vk.me/u163220668/videos/1c1b06aec9.360.mp4
Working example: http://sandbox.onlinephpfunctions.com/code/329106d990fe8927a7670b9448770643afbd0865
