How do I extract text data from a web page? [duplicate]

This question already has answers here:
How do you parse and process HTML/XML in PHP?
(31 answers)
Closed 9 years ago.
Okay, so I have the following function that grabs the web page I need:
function login2($url2) {
    // start with a fresh cookie jar
    $fp = fopen("cookies.txt", "w");
    fclose($fp);
    $login2 = curl_init();
    curl_setopt($login2, CURLOPT_COOKIEJAR, "cookies.txt");
    curl_setopt($login2, CURLOPT_COOKIEFILE, "cookies.txt");
    curl_setopt($login2, CURLOPT_TIMEOUT, 40000); // timeout in seconds
    curl_setopt($login2, CURLOPT_RETURNTRANSFER, TRUE);
    curl_setopt($login2, CURLOPT_URL, $url2);
    curl_setopt($login2, CURLOPT_USERAGENT, $_SERVER['HTTP_USER_AGENT']);
    curl_setopt($login2, CURLOPT_FOLLOWLOCATION, false);
    [...]
I then issue this to use the function:
echo login2("https://example.com/clue/holes.aspx");
This echoes the page I am requesting but I only want it to echo a specific piece of data from the HTML source. Here's the specific markup:
<h4>
<label id="cooling percent" for="symbol">*</label>
4.50
</h4>
The only piece of information I want is the figure, which in this specific example is 4.50.
So how can I go about this and make my cURL grab this and echo it instead of echoing the entire page?

You can solve this with XPath:
$html = login2('https://example.com/clue/holes.aspx');
$dom = new DOMDocument();
@$dom->loadHTML($html); // @ silences warnings about malformed HTML
$xpath = new DOMXPath($dom);
$value = $xpath->query('//label[@for="symbol"]/following-sibling::text()')->item(0)->nodeValue;
echo trim($value); // prints 4.50
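If the label can be missing from the page, it's safer to check the query result before calling item(0); a minimal defensive sketch:
$nodes = $xpath->query('//label[@for="symbol"]/following-sibling::text()');
if ($nodes !== false && $nodes->length > 0) {
    echo trim($nodes->item(0)->nodeValue);
} else {
    echo 'Value not found';
}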

How to echo json data in WordPress plugin using php [duplicate]

This question already has answers here:
Get JSON object from URL
(11 answers)
Closed 2 years ago.
I am creating a courier tracking plugin that fetches data using the courier's API. The response they return is in JSON format. Here is the courier API call:
$curl_handle = curl_init();
// For Direct Link Access use below commented link
//curl_setopt($curl_handle, CURLOPT_URL, 'http://new.leopardscod.com/webservice/trackBookedPacket/?api_key=XXXX&api_password=XXXX&track_numbers=XXXXXXXX'); // For Get Mother/Direct Link
curl_setopt($curl_handle, CURLOPT_URL, 'http://new.leopardscod.com/webservice/trackBookedPacket/format/json/'); // Write here Test or Production Link
curl_setopt($curl_handle, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($curl_handle, CURLOPT_POST, 1);
curl_setopt($curl_handle, CURLOPT_POSTFIELDS, array(
    'api_key'       => 'your_api_key',
    'api_password'  => 'your_api_password',
    'track_numbers' => 'string' // e.g. 'XXYYYYYYYY' or 'XXYYYYYYYY,XXYYYYYYYY,XXYYYYYYYY' -- 10 digits each number
));
$buffer = curl_exec($curl_handle);
curl_close($curl_handle);
How do I echo the values returned by this cURL call?
Use the json_decode function after $buffer = curl_exec($curl_handle);:
$buffer = curl_exec($curl_handle);
$json = json_decode($buffer, true);
print_r($json);
With true as the second argument you get a PHP associative array; omit it to get stdClass objects instead.
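From there you can pull out individual fields. A minimal sketch; the key names below are assumptions, since the actual response structure of the Leopards API isn't shown in the question:
if (is_array($json) && isset($json['packet_list'])) { // 'packet_list' is a hypothetical key
    foreach ($json['packet_list'] as $packet) {
        // 'track_number' and 'booked_packet_status' are hypothetical keys too
        echo $packet['track_number'] . ': ' . $packet['booked_packet_status'] . "\n";
    }
}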

Font or Unicode issue on Scraping [duplicate]

This question already has answers here:
PHP DOMDocument failing to handle utf-8 characters (☆)
(3 answers)
Closed 7 years ago.
I am trying to scrape info from a site.
The page contains this:
127 East Zhongshan No 2 Rd; 中山东二路127号
But when I scrape it and echo it, I get:
127 East Zhongshan No 2 Rd; 中山ä¸äºè·¯127å·
I already tried UTF-8. Here is my PHP code; please help me solve this problem.
function GrabPage($site) {
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
    curl_setopt($ch, CURLOPT_USERAGENT, $_SERVER['HTTP_USER_AGENT']);
    curl_setopt($ch, CURLOPT_TIMEOUT, 40);
    curl_setopt($ch, CURLOPT_COOKIEFILE, "cookie.txt");
    curl_setopt($ch, CURLOPT_URL, $site);
    $html = curl_exec($ch); // with RETURNTRANSFER set, curl_exec() returns the page
    curl_close($ch);
    return $html;
}
$GrabData = GrabPage($site);
$dom = new DOMDocument();
@$dom->loadHTML($GrabData);
$xpath = new DOMXpath($dom);
$mainElements = $xpath->query("//div[@class='col--one-whole mv--col--one-half wv--col--one-whole'][1]/dl/dt");
foreach ($mainElements as $Names2) {
    $Name2 = $Names2->nodeValue;
    echo $Name2;
}
First off, set the charset before any output, at the top of the PHP file:
header('Content-Type: text/html; charset=utf-8');
Then convert the HTML markup you fetched with mb_convert_encoding before loading it:
@$dom->loadHTML(mb_convert_encoding($GrabData, 'HTML-ENTITIES', 'UTF-8'));
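As an aside, the 'HTML-ENTITIES' target of mb_convert_encoding is deprecated as of PHP 8.2; a common alternative with the same effect here is to prepend an encoding hint before loading:
@$dom->loadHTML('<?xml encoding="utf-8"?>' . $GrabData);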
First, check whether the captured HTML source is properly encoded. If it is, try:
utf8_decode($Name2)
This should get your string ready for saving as well as printing.
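Note that utf8_decode is deprecated as of PHP 8.2; the equivalent with mbstring is:
$Name2 = mb_convert_encoding($Name2, 'ISO-8859-1', 'UTF-8');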

How to extract innerHTML using the PHP Dom [duplicate]

This question already has answers here:
How to get innerHTML of DOMNode?
(9 answers)
Closed 2 years ago.
I'm currently using nodeValue to give me HTML output, but it strips the HTML tags and just gives me plain text. Does anyone know how I can modify my code to give me the inner HTML of an element by using its ID?
function getContent($url, $id) {
    // This first section gets the HTML using cURL
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_BINARYTRANSFER, true);
    $html = curl_exec($ch);
    curl_close($ch);
    // This second section parses the HTML and outputs it
    $newDom = new domDocument;
    $newDom->loadHTML($html);
    $newDom->preserveWhiteSpace = false;
    $newDom->validateOnParse = true;
    $sections = $newDom->getElementById($id)->nodeValue;
    echo $sections;
}
This works for me:
$sections = $newDom->saveXML($newDom->getElementById($id));
http://www.php.net/manual/en/domdocument.savexml.php
If you have PHP 5.3.6 or later, this might also be an option:
$sections = $newDom->saveHTML($newDom->getElementById($id));
http://www.php.net/manual/en/domdocument.savehtml.php
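Note that saveHTML($node) (and saveXML($node)) returns the element's outer HTML, i.e. including its own tag. If you want strictly the inner HTML, a small helper along these lines (a sketch) serializes only the children:
function innerHTML(DOMNode $node) {
    $html = '';
    foreach ($node->childNodes as $child) {
        $html .= $node->ownerDocument->saveHTML($child); // needs PHP 5.3.6+
    }
    return $html;
}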
I have modified the code and it's working fine for me. Here is the code:
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_BINARYTRANSFER, true);
$html = curl_exec($ch);
curl_close($ch);
$newDom = new domDocument;
libxml_use_internal_errors(true); // suppress warnings from real-world (malformed) HTML
$newDom->loadHTML($html);
libxml_use_internal_errors(false);
$newDom->preserveWhiteSpace = false;
$newDom->validateOnParse = true;
$sections = $newDom->saveHTML($newDom->getElementById('colophon'));
echo $sections;

how to get a list of links in a webpage in PHP? [duplicate]

This question already has answers here:
Closed 11 years ago.
Possible Duplicate:
Parse Website for URLs
How do I get all the links in a webpage using PHP?
I need to get a list of the links from markup like this:
<a href="http://www.google.com">Google</a>
I want to fetch the href (http://www.google.com) and the text (Google).
The situation: I'm building a crawler and I want it to grab every link it finds and store them in a database table.
There are a couple of ways to do this, but the way I would approach this is something like the following,
Use cURL to fetch the page, ie:
// $target_url has the url to be fetched, ie: "http://www.website.com"
// $userAgent should be set to a friendly agent, sneaky but hey...
$userAgent = 'Googlebot/2.1 (http://www.googlebot.com/bot.html)';
$ch = curl_init();
curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
curl_setopt($ch, CURLOPT_URL,$target_url);
curl_setopt($ch, CURLOPT_FAILONERROR, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER,true);
curl_setopt($ch, CURLOPT_TIMEOUT, 10);
$html = curl_exec($ch);
if (!$html) {
    echo "<br />cURL error number: " . curl_errno($ch);
    echo "<br />cURL error: " . curl_error($ch);
    exit;
}
If all goes well, page content is now all in $html.
Let's move on and load the page in a DOM Object:
$dom = new DOMDocument();
@$dom->loadHTML($html);
So far so good, XPath to the rescue to scrape the links out of the DOM object:
$xpath = new DOMXPath($dom);
$hrefs = $xpath->evaluate("/html/body//a");
Loop through the result and get the links:
for ($i = 0; $i < $hrefs->length; $i++) {
    $href = $hrefs->item($i);
    $link = $href->getAttribute('href');
    $text = $href->nodeValue;
    // Do what you want with the link, e.g. print it out:
    echo $text, ' -> ', $link;
    // Or save it in an array for later processing:
    $links[$i]['href'] = $link;
    $links[$i]['text'] = $text;
}
$hrefs is an object of type DOMNodeList and item() returns a DOMNode object for the specified index. So basically we’ve got a loop that retrieves each link as a DOMNode object.
This should pretty much do it for you.
The only part I am not 100% sure of is what happens when an anchor wraps an image rather than text; I have no idea, so you would need to test and filter those out.
Hope this gives you an idea of how to scrape links, happy coding.
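As a rough filter for anchors that wrap images rather than text (an assumption about what you would want to drop), you could skip links whose text content is empty inside the loop above:
$text = trim($href->nodeValue);
if ($text === '') {
    continue; // probably an anchor wrapping an image
}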

Downloading multiple images using PHP cURL [duplicate]

This question already has answers here:
Saving image from PHP URL
(11 answers)
Closed 8 years ago.
I want to download images from a web page, for example, www.yahoo.com, and store it in a folder using PHP.
I am getting the page source using file_get_contents() and extracting the img src tag. I am passing this src to cURL code. The code does not give any error, but the images are not getting downloaded. Please check out the code. I am not getting where I am going wrong.
<?php
$html = file_get_contents('www.yahoo.com');
$ptn = '/< *img[^>]*src *= *["\']?([^"\']*)/i';
preg_match_all($ptn, $html, $matches, PREG_PATTERN_ORDER);
$seq = 1;
foreach ($matches as $img) {
    $fp = fopen("root/Images/image_$seq.jpg", 'wb');
    $ch = curl_init($img);
    curl_setopt($ch, CURLOPT_FILE, $fp);
    curl_setopt($ch, CURLOPT_URL, $img);
    curl_setopt($ch, CURLOPT_HEADER, 0);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_BINARYTRANSFER, 1);
    $image = curl_exec($ch);
    curl_close($ch);
    fwrite($fp, $image);
    fclose($fp);
    $seq++;
}
echo "IMAGES DOWNLOADED";
?>
foreach($matches as $img)
should be changed to
foreach($matches[1] as $img)
preg_match_all fills $matches[0] with the full pattern matches and $matches[1] with the first capture group (the src values), so $matches itself is an array of arrays.
BTW: you should replace file_get_contents with cURL; it's usually noticeably faster. ;)
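A minimal cURL replacement for that file_get_contents call might look like this (a sketch):
$ch = curl_init('http://www.yahoo.com/');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$html = curl_exec($ch);
curl_close($ch);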
Is $img the full URL of the image?
Is the image protected (use a referer)?
$image = false;
$ch = curl_init();
curl_setopt($ch, CURLOPT_REFERER, $url);
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_TIMEOUT, 7);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_ENCODING, 'gzip');
curl_setopt($ch, CURLOPT_BINARYTRANSFER, 1);
$image = curl_exec($ch);
Try debugging first.
First try it with a single, known image URL, e.g. http://www.depers.nl/beeld/w100/2011/201105/20110510/anp/sport/img-100511-349.onlinebild.jpg.
Also, why use both file_get_contents and cURL? Use cURL alone:
Make a small helper function for cURL, e.g. function simple_curl($url, $binary = false) { set your cURL options, return curl_exec() } -- see the sketch below.
Get yahoo.com: $result = simple_curl($url);
Get the links with the pattern (check whether the matches contain the full URL: domain + directory + file).
Loop over each pattern match (don't forget it's a multi-dimensional array, so loop on $matches[1]).
cURL the binary file and save it: $image = simple_curl($match, true);
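A minimal sketch of such a helper, assuming the simple_curl name and signature suggested above; adapt the options to your needs:
function simple_curl($url, $binary = false) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);  // return the response instead of printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);  // follow redirects
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);
    if ($binary) {
        curl_setopt($ch, CURLOPT_BINARYTRANSFER, true); // no effect since PHP 5.1.3; kept to mirror the signature
    }
    $result = curl_exec($ch);
    curl_close($ch);
    return $result;
}
// Usage:
$html = simple_curl('http://www.yahoo.com/');
preg_match_all('/< *img[^>]*src *= *["\']?([^"\']*)/i', $html, $matches);
foreach ($matches[1] as $seq => $src) {
    $image = simple_curl($src, true);
    if ($image !== false) {
        file_put_contents("root/Images/image_$seq.jpg", $image);
    }
}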
www.yahoo.com is not a URL; http://www.yahoo.com/ is.
$img is an array; you need to iterate over $matches[1].
You both tell cURL to write to a file (CURLOPT_FILE) and to return the result (CURLOPT_RETURNTRANSFER); use one or the other.
I don't know how you aren't seeing any errors; I would look into that. Copying, pasting, and running your code gave me plenty of errors.
