I appreciate you taking the time to help with my question.
What I am doing is trying to parse HTML fetched from a link. I use cURL first to request the website, then I pass the result through htmlentities() so it doesn't render on the page, which leaves me with a string. Then I use the DOM extension to extract tags from it. I checked different parsing methods on Google and learned a little about them, but when I execute my script the problem is that the string gets saved as textContent and not as a real HTML document. How can I convert an htmlentities() string back into a real DOM document and extract elements from it?
Here is my script:
$curl = curl_init();
curl_setopt($curl, CURLOPT_URL, 'https://www.usatoday.com/story/news/world/2021/02/17/dubai-princess-sheikha-latifa-says-she-hostage-after-flee-attempt/6778014002/?utm_source=feedblitz&utm_medium=FeedBlitzRss&utm_campaign=usatodaycomworld-topstories');
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
$result = curl_exec($curl);
curl_close($curl);
$htmlentities = htmlentities($result);
// I added the code here
$htmlDom = new DOMDocument();
$htmlDom->loadHTML($htmlentities);
$htmlDom->preserveWhiteSpace = false;
$styles = $htmlDom->getElementsByTagName('style');
foreach ($styles as $style) {
    $item = $style->getElementsByTagName('td');
    // echo the values
    echo '1: ' . $item->item(0)->nodeValue . '<br />';
    echo '2: ' . $item->item(1)->nodeValue . '<br />';
    echo '3: ' . $item->item(2)->nodeValue;
}
EDIT:
What I added to the code next is this:
$htmlentities = htmlentities($result);
$htmlentities = str_replace("&quot;", '"', $htmlentities);
$htmlentities = str_replace("&#039;", "'", $htmlentities);
$htmlentities = str_replace("&lt;", "<", $htmlentities);
$htmlentities = str_replace("&gt;", ">", $htmlentities);
libxml_use_internal_errors(true);
$htmlDom = new DOMDocument();
$htmlDom->loadHTML($htmlentities);
libxml_clear_errors();
var_dump($htmlDom);
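A minimal sketch of one way to do the conversion being asked about, assuming the goal is simply to get parseable markup back: html_entity_decode() is PHP's built-in inverse of htmlentities(), so it can replace the hand-rolled str_replace() calls above before the string is handed to DOMDocument:
$decoded = html_entity_decode($htmlentities, ENT_QUOTES);

libxml_use_internal_errors(true);
$htmlDom = new DOMDocument();
$htmlDom->loadHTML($decoded); // parses as real markup, not one big text node
libxml_clear_errors();

// sanity check: elements are now reachable
echo $htmlDom->getElementsByTagName('title')->item(0)->textContent;
(Simpler still: skip htmlentities() altogether and pass $result straight to loadHTML().)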
I tried to extract the download URL from the webpage.
The code I tried is below:
function getbinaryurl($url)
{
    $curl = curl_init($url);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($curl, CURLOPT_FRESH_CONNECT, true);
    $value1 = curl_exec($curl);
    curl_close($curl);

    $start = preg_quote('<script type="text/x-component">', '/');
    $end = preg_quote('</script>', '/');
    $rx = preg_match("/$start(.*?)$end/", $value1, $matches);
    var_dump($matches);
}
$url = "https://www.sourcetreeapp.com/download-archives";
getbinaryurl($url);
This way I am getting the tag info but not the content inside the script tag. How do I get the info inside?
The expected result is:
https://product-downloads.atlassian.com/software/sourcetree/ga/Sourcetree_4.0.1_234.zip,
https://product-downloads.atlassian.com/software/sourcetree/windows/ga/SourceTreeSetup-3.3.6.exe,
https://product-downloads.atlassian.com/software/sourcetree/windows/ga/SourcetreeEnterpriseSetup_3.3.6.msi
I am very new to writing regular expressions. Can anyone help me, please?
Instead of using regex, using DOMDocument and XPath allows you to have more control of the elements you select.
Although XPath can be difficult (same as regex), this can look more intuitive to some. The code uses //script[@type="text/x-component"][contains(text(), "macURL")], which broken down is:
//script = any script node
[@type="text/x-component"] = which has an attribute called type with the specific value
[contains(text(), "macURL")] = whose text contains the string macURL
The query() method returns a list of matches, so loop over them. The content is JSON, so decode it and output the values...
function getbinaryurl($url)
{
    $curl = curl_init($url);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($curl, CURLOPT_FRESH_CONNECT, true);
    $value1 = curl_exec($curl);
    curl_close($curl);

    $doc = new DOMDocument();
    libxml_use_internal_errors(true);
    $doc->loadHTML($value1);
    libxml_use_internal_errors(false);

    $xp = new DOMXPath($doc);
    // each matching <script> carries a JSON payload with the download URLs
    $srcs = $xp->query('//script[@type="text/x-component"][contains(text(), "macURL")]');
    foreach ($srcs as $src) {
        $content = json_decode($src->textContent, true);
        echo $content['params']['macURL'] . PHP_EOL;
        echo $content['params']['windowsURL'] . PHP_EOL;
        echo $content['params']['enterpriseURL'] . PHP_EOL;
    }
}
$url = "https://www.sourcetreeapp.com/download-archives";
getbinaryurl($url);
which outputs
https://product-downloads.atlassian.com/software/sourcetree/ga/Sourcetree_4.0.1_234.zip
https://product-downloads.atlassian.com/software/sourcetree/windows/ga/SourceTreeSetup-3.3.8.exe
https://product-downloads.atlassian.com/software/sourcetree/windows/ga/SourcetreeEnterpriseSetup_3.3.8.msi
I'm creating a little web app to help me manage and analyze the content of my websites, and cURL is my favorite new toy. I've figured out how to extract info about all sorts of elements, how to find all elements with a certain class, etc., but I am stuck on two problems (see below). I hope there is some nifty XPath answer, but if I have to resort to regular expressions I guess that's OK. I'm not so great with regex, though, so if you think that's the way to go, I'd appreciate examples...
Pretty standard starting point:
$ch = curl_init();
curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
curl_setopt($ch, CURLOPT_URL,$target_url);
curl_setopt($ch, CURLOPT_FAILONERROR, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER,true);
curl_setopt($ch, CURLOPT_TIMEOUT, 10);
$html = curl_exec($ch);
if (!$html) {
    $info .= "<br />cURL error number:" . curl_errno($ch);
    $info .= "<br />cURL error:" . curl_error($ch);
    return $info;
}
$dom = new DOMDocument();
@$dom->loadHTML($html); // @ suppresses warnings from malformed HTML
$xpath = new DOMXPath($dom);
and extraction of info, for example:
// iframes
$iframes = $xpath->evaluate("/html/body//iframe");
$info .= '<h3>iframes ('.$iframes->length.'):</h3>';
for ($i = 0; $i < $iframes->length; $i++) {
    // get iframe attributes
    $iframe = $iframes->item($i);
    $framesrc = $iframe->getAttribute("src");
    $framewidth = $iframe->getAttribute("width");
    $frameheight = $iframe->getAttribute("height");
    $framealt = $iframe->getAttribute("alt");
    $frameclass = $iframe->getAttribute("class");
    $info .= $framesrc.' ('.$framewidth.'x'.$frameheight.'; class="'.$frameclass.'")'.'<br />';
}
Questions/Problems:
How to extract HTML comments?
I can't figure out how to identify the comments – are they considered nodes, or something else entirely?
How to get the entire content of a div, including child nodes? So if the div contains an image and a couple of hrefs, it would find those and hand it all back to me as a block of HTML.
Comment nodes should be easy to find in XPath with the comment() test, analogous to the text() test:
$comments = $xpath->query('//comment()'); // or another path, as you prefer
They are standard nodes: here is the manual entry for the DOMComment class.
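For example, a minimal sketch of walking the matched comment nodes (a DOMComment exposes its text through nodeValue or data):
foreach ($xpath->query('//comment()') as $comment) {
    // nodeValue is the comment text, without the <!-- --> delimiters
    echo trim($comment->nodeValue), PHP_EOL;
}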
To your other question, it's a bit trickier. The simplest way is to use saveXML() with its optional $node argument:
$html = $dom->saveXML($el); // $el should be the element you want the HTML for
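Since PHP 5.3.6, saveHTML() also accepts an optional node argument, which serializes with HTML rules rather than XML rules:
$html = $dom->saveHTML($el); // HTML serialization of just that element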
For the HTML comments a fast method is:
function getComments($html) {
    $comments = array();
    if (preg_match_all('#<!--(.*?)-->#is', $html, $rcomments)) {
        // $rcomments[1] holds the first capture group of every match
        foreach ($rcomments[1] as $c) {
            $comments[] = $c;
        }
        return $comments;
    } else {
        // no comments matched
        return null;
    }
}
This regex helps too:
\s*<!--[\s\S]+?-->
You can try it out in a regex tester.
For comments you're looking for a recursive regex. For instance, to get rid of HTML comments:
preg_replace('/<!--(?(?=<!--)(?R)|.)*?-->/s', '', $yourHTML);
to find them:
preg_match_all('/(<!--(?(?=<!--)(?R)|.)*?-->)/s',$yourHTML,$comments);
I want the HTML code from the URL.
Actually, I want the following things from the data at one URL:
1. blog title
2. blog image
3. blog posted date
4. blog description or actual blog text
I tried the code below but had no success.
<?php
$c = curl_init('http://54.174.50.242/blog/');
curl_setopt($c, CURLOPT_RETURNTRANSFER, true);
//curl_setopt(... other options you want...)
$html = curl_exec($c);
if (curl_error($c)) {
    die(curl_error($c));
}
// Get the status code
$status = curl_getinfo($c, CURLINFO_HTTP_CODE);
curl_close($c);
echo "Status :".$status; die;
?>
Please help me out to get the necessary data from the URL (http://54.174.50.242/blog/).
Thanks in advance.
You are halfway there. Your cURL request is working and the $html variable contains the blog page source code. Now you need to extract the data you need from the HTML string. One way to do it is by using the DOMDocument class.
Here is something you could start with:
$c = curl_init('http://54.174.50.242/blog/');
curl_setopt($c, CURLOPT_RETURNTRANSFER, true);
$html = curl_exec($c);
$dom = new DOMDocument;
// disable errors on invalid html
libxml_use_internal_errors(true);
$dom->loadHTML($html);
$list = $dom->getElementsByTagName('title');
$title = $list->length ? $list->item(0)->textContent : '';
// and so on ...
You can also simplify that by using the loadHTMLFile() method on the DOMDocument class; that way you don't have to worry about all the cURL boilerplate code:
$dom = new DOMDocument;
// disable errors on invalid html
libxml_use_internal_errors(true);
$dom->loadHTMLFile('http://54.174.50.242/blog/');
$list = $dom->getElementsByTagName('title');
$title = $list->length ? $list->item(0)->textContent : '';
echo $title;
// and so on ...
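Continuing the "and so on", here is a sketch of pulling the other requested fields with XPath. The selectors below (the post-date and post-content classes, article/img) are hypothetical placeholders, not taken from the actual blog; inspect the real markup and adjust them:
$xpath = new DOMXPath($dom);

// hypothetical selectors; check the blog's real markup
$dateNodes = $xpath->query("//*[contains(@class, 'post-date')]");
$date = $dateNodes->length ? trim($dateNodes->item(0)->textContent) : '';

$imgNodes = $xpath->query("//article//img/@src");
$image = $imgNodes->length ? $imgNodes->item(0)->nodeValue : '';

$bodyNodes = $xpath->query("//*[contains(@class, 'post-content')]");
$text = $bodyNodes->length ? trim($bodyNodes->item(0)->textContent) : '';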
You should use Simple HTML DOM Parser, and extract the HTML using:
$html = @file_get_html($url);
foreach ($html->find('article') as $article) {
    $title = $article->find('h2', 0)->plaintext;
    // ...
}
I am also using this; hope it works for you.
I am using the code below to extract URLs from a webpage, and it works just fine, but I want to filter them. It displays all URLs on the page, but I want only those URLs which contain the word "super":
$regex = '|<a.*?href="(.*?)"|';
preg_match_all($regex, $result, $parts);
$links = $parts[1];
foreach ($links as $link) {
    echo $link . "<br>";
}
So it should echo only URLs where the word super is present.
For example, it should ignore the URL
http://xyz.com/abc.html
but it should echo
http://abc.superpower.com/hddll.html
as it contains the required word super.
Make your regex un-greedy and it should work:
$regex = '|<a.*?href="(.*?super[^"]*)"|is';
However, to parse and scrape HTML it is better to use PHP's DOM parser.
Update: Here is code using DOM parser:
$request_url = '1900girls.blogspot.in/';
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $request_url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$result = curl_exec($ch);

$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($result); // loads your html
$xpath = new DOMXPath($doc);

$needle = 'blog';
$nodelist = $xpath->query("//a[contains(@href, '" . $needle . "')]");
for ($i = 0; $i < $nodelist->length; $i++) {
    $node = $nodelist->item($i);
    echo $node->getAttribute('href') . "\n";
}
Now, preg has always been a tool I like, but I can't figure out for the life of me whether what I want to do is possible, let alone how to do it; it's going over my head.
What I want is for preg_match to return a div's innerHTML. The problem is that the div I'm trying to read has more divs inside it, and my preg keeps closing on the first closing tag it finds.
Here is my actual code:
$scrape_address = "http://isohunt.com/torrent_details/133831593/98e034bd6382e0f4ecaa9fe2b5eac01614edc3c6?tab=summary";
$ch = curl_init($scrape_address);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, '1');
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_ENCODING, "");
$data = curl_exec($ch);

preg_match('% <div id="torrent_details">(.*)</div> %six', $data, $match);
print_r($match);
EDIT: This has been updated thanks to TomcatExodus's help.
Live at: http://megatorrentz.com/beta/details.php?hash=98e034bd6382e0f4ecaa9fe2b5eac01614edc3c6
<?php
$scrape_address = "http://isohunt.com/torrent_details/133831593/98e034bd6382e0f4ecaa9fe2b5eac01614edc3c6?tab=summary";
$ch = curl_init($scrape_address);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, '1');
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_ENCODING, "");
$data = curl_exec($ch);

$domd = new DOMDocument();
libxml_use_internal_errors(true);
$domd->loadHTML($data);
libxml_use_internal_errors(false);

$div = $domd->getElementById("torrent_details");
if ($div) {
    $dom2 = new DOMDocument();
    $dom2->appendChild($dom2->importNode($div, true));
    echo $dom2->saveHTML();
} else {
    echo "Has no element with the given ID\n";
}
Using regular expressions often leads to problems when parsing markup documents.
Here is an XPath version, independent of the source layout. The only thing you need is a div with that id.
<?php
$dom = new DOMDocument();
@$dom->loadHTMLFile($url); // @ suppresses warnings from malformed HTML
$xp = new DOMXPath($dom);
$result = $xp->query("//*[@id = 'torrent_details']");
$div = $result->item(0);
if ($result->length) {
    $out = new DOMDocument();
    $out->appendChild($out->importNode($div, true));
    echo $out->saveHTML();
} else {
    echo "No such id";
}
?>
And this is the fix for Maerlyn's solution. It didn't work because getElementById() wants a DTD with the id attribute specified. I mean, you can always build a document with "apple" as the record id, so you need something that says "id" is really the id for this tag.
<?php
$domd = new DOMDocument();
$domd->validateOnParse = true;
@$domd->loadHTML($data);
// this doesn't work as the DTD is not specified
// or the specified id attribute is not the attribute called "id"
//$div = $domd->getElementById("torrent_details");
/*
 * workaround found here: https://fosswiki.liip.ch/display/BLOG/GetElementById+Pitfalls
 * set the "id" attribute as the real id
 */
$elements = $domd->getElementsByTagName('div');
if (!is_null($elements)) {
    foreach ($elements as $element) {
        // try-catch needed because of elements with no id
        try {
            $element->setIdAttribute('id', true);
        } catch (Exception $e) {}
    }
}
// now it works
$div = $domd->getElementById("torrent_details");
// print its content or an error
if ($div) {
    $dom2 = new DOMDocument();
    $dom2->appendChild($dom2->importNode($div, true));
    echo $dom2->saveHTML();
} else {
    echo "Has no element with the given ID\n";
}
?>
Both of the solutions work for me.
You can do this:
/<div[^>]*>(.*)<\/div>/i
Which would give you the largest possible innerHTML.
You cannot. I will not link to the famous question, because I dislike the pointless drivel on top. But still regular expressions are unfit to match nested structures.
You can use some trickery, but this is neither reliable, nor necessarily fast:
preg_match_all('#<div id="1">((<div>.*?</div>|.)*?)</div>#ims', $html, $matches);
Your regex had a problem: due to the /x flag, the literal space in the pattern was ignored, so it never matched the opening div. You also used the wrong assertion notation.
preg_match_all('% <div \s+ id="torrent_details">(?<innerHtml>.*)</div> %six', $html, $match);
echo $match['innerHtml'][0]; // with preg_match_all the named group is an array of matches
That one will work, but you should only need preg_match, not preg_match_all: if the page is written well, there should only be one instance of id="torrent_details" on it.
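For instance, the same pattern with preg_match, where the named group is a plain string:
if (preg_match('% <div \s+ id="torrent_details">(?<innerHtml>.*)</div> %six', $html, $match)) {
    echo $match['innerHtml'];
}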
I'm retracting my answer. This will not work properly. Use DOM for navigating the document.
Haha, did it with a bit of tampering. Thanks for the DOMDocument idea; I just had to use SimpleXML:
$ch = curl_init($scrape_address);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, '1');
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_ENCODING, "");
$data = curl_exec($ch);

$doc = new DOMDocument();
$doc->strictErrorChecking = FALSE;
libxml_use_internal_errors(true);
$doc->loadHTML($data);

$xml = simplexml_import_dom($doc);
print_r($xml->body->table->tr->td->table[2]->tr->td[0]->span[0]->div);