Extract text with DOM Parser - php

I am just beginning to learn DOM Parser.
Let's assume that http://test.com contains 4 lines like the one below, and I am trying to extract the content as text.
All I need is LPPR 051600Z 35010KT CAVOK 27/14 Q1020 to send as a JSON payload to an incoming webhook.
<FONT FACE="Monospace,Courier">LPPR 051600Z 35010KT CAVOK 27/14 Q1020</FONT><BR>
From this example, how can I do it using $html = str_get_html() and $html->find()?
I managed to send the complete HTML content, but that's not what I want.
<?php
include_once('simple_html_dom.php');
$html = file_get_html('http://test.com')->plaintext;
// The data to send to the API
$postData = array('text' => $html);
// Setup cURL
$ch = curl_init('https://uri.com/test');
curl_setopt_array($ch, array(
CURLOPT_POST => TRUE,
CURLOPT_RETURNTRANSFER => TRUE,
CURLOPT_HTTPHEADER => array(
'Authorization: '.$authToken, // $authToken must be defined earlier in the script
'Content-Type: application/json'
),
CURLOPT_POSTFIELDS => json_encode($postData)
));
// Send the request
$response = curl_exec($ch);
// Check for errors
if($response === FALSE){
die(curl_error($ch));
}
// Decode the response
$responseData = json_decode($response, TRUE);
// Print the 'published' field from the response
echo $responseData['published'];
?>
Many Thanks

If you are certain that the line is always exactly like that one, you can split the response on the line breaks:
$line = explode('<BR>', $response);
This creates an array with the <FONT>xxxxx</FONT> chunk of each line in each position. Note that explode() is case-sensitive, so the delimiter has to match the page's <BR> casing. To get only the text of the 2nd line:
$filteredResponse = strip_tags($line[1]);
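To make that concrete, here is a self-contained sketch with the fetched page replaced by a hard-coded string (the first sample line is invented for illustration):

```php
<?php
// Hard-coded stand-in for the fetched page body
$response = '<FONT FACE="Monospace,Courier">LPPO 051600Z 33008KT CAVOK 26/15 Q1019</FONT><BR>'
          . '<FONT FACE="Monospace,Courier">LPPR 051600Z 35010KT CAVOK 27/14 Q1020</FONT><BR>';

// Split on the <BR> separators; each element holds one <FONT>...</FONT> chunk
$line = explode('<BR>', $response);

// Strip the tags from the 2nd line to keep only its text
$filteredResponse = strip_tags($line[1]);

echo $filteredResponse; // LPPR 051600Z 35010KT CAVOK 27/14 Q1020
```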

You can use PHP's DOM extension as an alternative to simple_html_dom.
The example below fetches the Google homepage with cURL and prints the text of every <font> tag.
<?php
# Use the Curl extension to query Google and get back a page of results
$url = "http://www.google.com";
$ch = curl_init();
$timeout = 5;
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
$html = curl_exec($ch);
curl_close($ch);
# Create a DOM parser object
$dom = new DOMDocument();
# Parse the HTML from Google.
# The @ before the method call suppresses any warnings that
# loadHTML might throw because of invalid HTML in the page.
@$dom->loadHTML($html);
# Iterate over all the <font> tags
foreach($dom->getElementsByTagName('font') as $font) {
# Show the tag's text content
echo $font->textContent;
echo "<br />";
}
?>
In $dom->getElementsByTagName('font'), replace 'font' with whichever tag you want.
Happy scraping
References:
http://htmlparsing.com/php.html
http://php.net/manual/en/book.dom.php
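A self-contained variant of the same idea, parsing a literal HTML string instead of a live page (the sample markup mirrors the METAR line from the question above):

```php
<?php
// Literal HTML standing in for a fetched page
$html = '<html><body>'
      . '<FONT FACE="Monospace,Courier">LPPR 051600Z 35010KT CAVOK 27/14 Q1020</FONT><BR>'
      . '</body></html>';

$dom = new DOMDocument();
// @ suppresses warnings loadHTML throws on imperfect real-world markup
@$dom->loadHTML($html);

// textContent returns the tag's text with all markup removed
foreach ($dom->getElementsByTagName('font') as $font) {
    echo $font->textContent, "\n"; // LPPR 051600Z 35010KT CAVOK 27/14 Q1020
}
```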

Related

php curl to get img src value on a page

I want to get the img src value of every image on a page. If the page is
https://www.google.com
then the result should look like
https://www.google.com/images/branding/googlelogo/1x/googlelogo_color_272x92dp.png
https://www.google.com/ff.png
https://www.google.com/idk.jpg
I want something like this!
Thanks
<?php
# Use the Curl extension to query Google and get back a page of results
$url = "https://www.google.com";
$ch = curl_init();
$timeout = 5;
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
$html = curl_exec($ch);
curl_close($ch);
# Create a DOM parser object
$dom = new DOMDocument();
# Parse the HTML from Google.
# The @ before the method call suppresses any warnings that
# loadHTML might throw because of invalid HTML in the page.
@$dom->loadHTML($html);
# Iterate over all the <img> tags
foreach($dom->getElementsByTagName('img') as $img) {
# Show the src attribute
echo $img->getAttribute('src');
echo "<br />";
}
?>
Here it is

Retrieving XML from URL with authentication

I have been all over the internet trying to find a solution to my specific problem but no luck.
Basically I have a URL that I log into that looks similar to this:
https://some-website.university.edu.au:8781/elements/v4.9/users/
which returns an XML block of text with all of the users to the browser.
I am looking to use curl or SimpleXMLElement() or whatever it takes to bring that XML into my php variable and output it.
The closest I feel I have got is:
$username = 'usernameX';
$password = 'passwordX';
$URL = 'https://some-website.university.edu.au:8781/elements/v4.9/users/';
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL,$URL);
curl_setopt($ch, CURLOPT_TIMEOUT, 30); //timeout after 30 seconds
curl_setopt($ch, CURLOPT_RETURNTRANSFER,1);
curl_setopt($ch, CURLOPT_HTTPAUTH, CURLAUTH_ANY);
curl_setopt($ch, CURLOPT_USERPWD, $username.":".$password);
$result=curl_exec ($ch);
$status_code = curl_getinfo($ch, CURLINFO_HTTP_CODE); //get status code
echo $result;
or
$url = 'https://usernameX:passwordX@some-website.university.edu.au:8781/elements/v4.9/users/';
echo $url."<br />";
$xml = new SimpleXMLElement($url);
print_r($xml);
I'm not sure if either is close or whether curl is better than SimpleXMLElement() or if one or both just will never work.
I have added a screenshot to show what is returned on the website. The login screen is just the browser default one. Any help would be amazing. Thanks!
XML Returned on web page
You can try something like this for authentication, using a stream context with file_get_contents():
$username = 'admin';
$password = 'mypass';
$server = 'myserver.com';
$context = stream_context_create(array(
'http' => array(
'header' => "Authorization: Basic " . base64_encode("$username:$password")
)
)
);
$data = file_get_contents("http://$server/", false, $context);
$xml=simplexml_load_string($data);
if ($xml === false) {
echo "Failed loading XML: ";
foreach(libxml_get_errors() as $error) {
echo "<br>", $error->message;
}
} else {
print_r($xml);
}
Another way to parse XML from a remote URL (not tested):
<?php
$username = 'usernameX';
$password = 'passwordX';
$url = 'https://some-website.university.edu.au:8781/elements/v4.9/users/';
$context = stream_context_create(array(
'http' => array(
// Combine the headers into one string: duplicate 'header' keys in the
// array would silently overwrite each other
'header' => "Authorization: Basic " . base64_encode("$username:$password") . "\r\n"
          . "Accept: application/xml"
)
));
$data = file_get_contents($url, false, $context);
$xml = simplexml_load_string($data);
print_r($xml);
PHP documentation:
stream_get_contents — Reads remainder of a stream into a string
file_get_contents — Reads entire file into a string
simplexml_load_string — Interprets a string of XML into an object
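Once the XML string is in hand, simplexml_load_string() turns it into a traversable object. A minimal sketch with an inline document (the <users>/<user> element names are made up; the real feed's structure may differ):

```php
<?php
// Hypothetical XML standing in for the server response
$data = <<<XML
<users>
  <user><name>Alice</name></user>
  <user><name>Bob</name></user>
</users>
XML;

$xml = simplexml_load_string($data);
if ($xml === false) {
    die('Failed to parse XML');
}

// Child elements are reachable as properties of the SimpleXMLElement
foreach ($xml->user as $user) {
    echo (string) $user->name, "\n"; // Alice, then Bob
}
```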

Unable to scrape content using html dom parser from a particular website

I have been trying to scrape content from websites and have been successful with some sites, but my code fails to scrape content from flipkart.com. I use simple_html_dom, and this is my code:
<?php
include ('simple_html_dom.php');
$scrape_url = 'https://www.flipkart.com/lenovo-f309-2-tb-external-hard-disk-drive/p/itmehwha6zkhkgfw';
$html = file_get_html($scrape_url);
foreach($html->find('h1._3eAQiD') as $title_s)
echo $title_s->plaintext;
foreach($html->find('div.hGSR34') as $ratings_s)
echo $ratings_s->plaintext;
?>
This code gives an empty result. Can someone tell me what the problem is? Is there any other way to scrape content from this site?
This code worked for me.
get_content_by_class(curl('https://www.flipkart.com/lenovo-f309-2-tb-external-hard-disk-drive/p/itmehwha6zkhkgfw'), "hGSR34");
function curl($url) {
$ch = curl_init(); // Initialising cURL
//curl_setopt($ch, CURLOPT_IPRESOLVE, CURL_IPRESOLVE_V4);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT , 0);
curl_setopt($ch, CURLOPT_URL, $url); // Setting cURL's URL option with the $url variable passed into the function
curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE); // Setting cURL's option to return the webpage data
$data = curl_exec($ch); // Executing the cURL request and assigning the returned data to the $data variable
curl_close($ch); // Closing cURL
return $data; // Returning the data from the function
}
function get_content_by_class($html, $container_class_name) {
//preg_match_all('/<div class="' . $container_class_name .'">(.*?)<\/div>/s', $html, $matches);
preg_match_all('#<\s*?div class="'. $container_class_name . '\b[^>]*>(.*?)</div\b[^>]*>#s', $html, $matches);
// Check $matches[0]: after preg_match_all, $matches itself is never empty
if (empty($matches[0])) {
echo 'no matches found';
echo '<br />';
return;
}
// $matches[1] holds the captured inner HTML of each matching div
foreach($matches[1] as $match){
// Escape the angle brackets so the markup is displayed, not rendered
$match1 = str_replace('<','&lt;',$match);
$match2 = str_replace('>','&gt;',$match1);
print_r($match2);
}
//return $matches;
}
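Run against a literal snippet, the preg_match_all() pattern behaves like this ($matches[1] holds the captured inner HTML; note the lazy (.*?) stops at the first closing tag, so nested divs get truncated):

```php
<?php
// Literal HTML standing in for the cURL response
$html  = '<div class="hGSR34">4.3</div><div class="other">x</div><div class="hGSR34">4.1</div>';
$class = 'hGSR34';

preg_match_all('#<\s*?div class="' . $class . '\b[^>]*>(.*?)</div\b[^>]*>#s', $html, $matches);

// $matches[0] holds the full tags, $matches[1] the captured inner text
print_r($matches[1]); // Array ( [0] => 4.3 [1] => 4.1 )
```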

simple_html_dom - how to send request to each url

I have got 16 links in my get-listing.php output, and I need to send a request to each URL and collect the responses, so that I receive the list of elements for each one.
$base1 = "http://testbox.elementfx.com/get-listing.php";
$html = file_get_html($base1);
$links = $html->find('p[id=links] a');
foreach ($links as $element)
{
//open each url in each array
$urls[] = $url = $element->href;
$data = file_get_html($url);
}
When I use the code above, I only get 9 responses, but I should have more than 9.
Can you please tell me how I can send a request to every URL and get the responses using simple_html_dom?
If your question is to send a simple request to each of the urls you've already parsed and get a response back, try file_get_contents:
foreach ($links as $element)
{
// This array stack is only necessary if you plan on using it later
$urls[] = $url = $element->href;
// $opts and $context are optional for specifying options like method
$opts = array(
'http'=>array(
'method'=>"GET", // "GET" or "POST"
)
);
$context = stream_context_create($opts);
// Remove context argument if not using options array
$data = file_get_contents($url, false, $context);
// ... Do something with $data
}
Your other option, which is more complex but has more flexibility (depending on the application), is cURL:
foreach ($links as $element)
{
$urls[] = $url = $element->href;
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
// Used for GET request
curl_setopt($ch, CURLOPT_POSTFIELDS, null);
curl_setopt($ch, CURLOPT_POST, FALSE);
curl_setopt($ch, CURLOPT_HTTPGET, TRUE);
// Necessary to return data
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$data = curl_exec($ch);
curl_close($ch);
// ... Do something with $data
}
This is scratching the surface of Curl, the documentation on PHP's website (linked above) has more information.
If the data being returned from the urls is HTML, you can pass it through PHP's DomDocument to parse after it has been pulled. Documentation and examples are readily available on PHP's site (I can't post more links right now, sorry).
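For that last point, a small sketch of feeding already-fetched HTML (a literal string here, mimicking the get-listing.php markup; the URLs are placeholders) into DOMDocument to pull out the link URLs:

```php
<?php
// Literal HTML standing in for a fetched page
$data = '<p id="links">'
      . '<a href="http://example.com/a">A</a> '
      . '<a href="http://example.com/b">B</a>'
      . '</p>';

$dom = new DOMDocument();
@$dom->loadHTML($data); // @ silences warnings on imperfect markup

// Print the href attribute of every anchor
foreach ($dom->getElementsByTagName('a') as $a) {
    echo $a->getAttribute('href'), "\n";
}
// Output:
// http://example.com/a
// http://example.com/b
```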

PHP cURL and SimpleHTMLDom - Can't get ASP Viewstate Value

As the title says, how can I get the ASP ViewState Value? I'm using the code below that I thought would work. Thanks!
// create curl resource
$ch = curl_init();
// set url
curl_setopt($ch, CURLOPT_URL, "http://butlercountyclerk.org/bcc-11112005/ForeclosureSearch.aspx");
$data = array(
'Search:btnSearch' => 'Search',
'Search:ddlMonth' => '1',
'Search:ddlYear' => '2011',
'Search:txtCaseNumber' => '',
'Search:txtCompanyName' => '',
'Search:txtLastName' => '',
'__EVENTTARGET' => urlencode('Search:dgSearch:_ctl14:_ctl2'),
'__VIEWSTATE' => 'dDwtMjk2Mjk5NzczO3Q8O2w8aTwxPjs+O2w8dDw7bDxpPDE+Oz47bDx0PDtsPGk8Mz47aTwxOT47PjtsPHQ8dDw7cDxsPGk8MD47aTwxPjtpPDI+O2k8Mz47aTw0PjtpPDU+Oz47bDxwPDIwMDY7MjAwNj47cDwyMDA3OzIwMDc+O3A8MjAwODsyMDA4PjtwPDIwMDk7MjAwOT47cDwyMDEwOzIwMTA+O3A8MjAxMTsyMDExPjs+Pjs+Ozs+O3Q8QDA8Ozs7Ozs7Ozs7Oz47Oz47Pj47Pj47Pj47PmVlaXw5JK161vti9TC+QMdeTNQI'
);
curl_setopt($ch, CURLOPT_POST, 1);
curl_setopt($ch, CURLOPT_POSTFIELDS, $data);
//return the transfer as a string
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
// $output contains the output string
$output = curl_exec($ch);
// Create a DOM object
$html = new simple_html_dom();
// Load HTML from a string
$html->load($output);
// Find the hidden __VIEWSTATE input
$value = $html->find('input[name="__VIEWSTATE"]');
foreach ($value as $v) {
echo $v->value;
}
//echo $value->value;
// close curl resource to free up system resources
curl_close($ch);
//$output;
It would appear that simple_html_dom is unable to parse some of the HTML. You can "clean" the HTML up with HTML Tidy. Add this before you load $output into simple_html_dom:
$tidy = new tidy();
$output = $tidy->repairString($output);
Also, you do not need quotes around the attribute name in the selector.
HTML Tidy is an extension and will need to be loaded by PHP to work.
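As an alternative that sidesteps simple_html_dom's parsing limits entirely, PHP's bundled DOMXPath can pull the hidden field straight out of the response; the short HTML string below is a stand-in for the real $output, with a dummy value:

```php
<?php
// Stand-in for the page returned by curl_exec(); the value is a dummy
$output = '<form><input type="hidden" name="__VIEWSTATE" value="dDwtMjk2..." /></form>';

$dom = new DOMDocument();
@$dom->loadHTML($output); // @ suppresses warnings from messy ASP.NET markup

$xpath = new DOMXPath($dom);
$nodes = $xpath->query('//input[@name="__VIEWSTATE"]');

if ($nodes->length > 0) {
    echo $nodes->item(0)->getAttribute('value'); // dDwtMjk2...
}
```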
