I am trying to fetch the content inside a <div> via file_get_contents. What I want to do is fetch the content of the div resultStats on google.com. My problem, afaik, is printing it.
A bit of code:
$data = file_get_contents("https://www.google.com/?gws_rd=cr&#q=" . $_GET['keyword'] . "&gws_rd=ssl");
preg_match("#<div id='resultStats'>(.*?)<\/div>#i", $data, $matches);
Simply using
print_r($matches);
only returns Array(), but I want to preg_match the number. Any help is appreciated!
Edit: thanks for showing me the right direction! I got rid of the preg_ call and went for DOM instead. Although I am pretty new to PHP, and this is giving me a headache; I found this code here on Stack Overflow and I am trying to edit it to get it to work. So far I only get a blank page, and I don't know what I am doing wrong.
$str = file_get_contents("https://www.google.com/search?source=hp&q=" . $_GET['keyword'] . "&gws_rd=ssl");
$DOM = new DOMDocument;
@$dom->loadHTML($str);
//get
$items = $DOM->getElementsByTagName('resultStats');
//print
for ($i = 0; $i < $items->length; $i++)
echo $items->item($i)->nodeValue . "<br/>";
} else { exit("No keyword!") ;}
Posted on behalf of the OP.
I decided to use the PHP Simple HTML DOM Parser and ended up with something like this:
include_once('simple_html_dom.php');
$setDomain = "https://www.google.com/search?source=hp&q=" . $_GET['keyword'] . "&gws_rd=ssl";
$str = file_get_contents($setDomain); // str_get_html() expects an HTML string, not a simple_html_dom object
$html = str_get_html($str);
echo $html->find('div div[id=resultStats]', 0)->innertext . '<br>';
Problem solved!
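One small hardening note on the snippet above, for anyone copying it: the raw $_GET['keyword'] should be URL-encoded before being concatenated into the query string, or spaces and special characters will break the request. A minimal sketch (the isset() guard is my own addition):
$keyword = isset($_GET['keyword']) ? urlencode($_GET['keyword']) : '';
$setDomain = "https://www.google.com/search?source=hp&q=" . $keyword . "&gws_rd=ssl";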
Related
I have this function to get title of a website:
function getTitle($Url){
$str = file_get_contents($Url);
if(strlen($str)>0){
preg_match("/\<title\>(.*)\<\/title\>/",$str,$title);
return $title[1];
}
}
However, this function makes my page take too long to respond. Someone told me to fetch only the beginning of the response instead of reading the whole file, but I don't know how. Can anyone please tell me which code and function I should use to do this? Thank you very much.
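For what it's worth, the title is not in the response headers; what people usually mean is to stop reading the body as soon as </title> appears, instead of downloading the whole page. A minimal sketch of that idea (the function name and byte limit are my own):
function getTitleQuick($url, $maxBytes = 32768) {
    $fp = @fopen($url, 'r');
    if (!$fp) return null;
    $buf = '';
    // read in small chunks and stop once the closing tag shows up
    while (!feof($fp) && strlen($buf) < $maxBytes) {
        $buf .= fread($fp, 1024);
        if (stripos($buf, '</title>') !== false) break;
    }
    fclose($fp);
    return preg_match('/<title[^>]*>(.*?)<\/title>/is', $buf, $m) ? trim($m[1]) : null;
}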
Using regex is not a good idea for HTML; use a DOM parser instead:
$html = new simple_html_dom();
$html->load_file('****'); // put URL or filename here
$title = $html->find('title', 0); // find() returns an array, so grab the first match
echo $title->plaintext;
or
// Create DOM from URL or file
$html = file_get_html('*****');
// Find all <title> elements and print their text
foreach($html->find('title') as $element)
    echo $element->plaintext . '<br>';
Good read
RegEx match open tags except XHTML self-contained tags
Use jQuery instead to get the title of your page (client-side):
$(document).ready(function() {
alert($("title").text());
});
Demo : http://jsfiddle.net/WQNT8/1/
Try this; it should work:
include_once 'simple_html_dom.php';
$oHtml = str_get_html(file_get_contents($url)); // str_get_html() takes an HTML string
$Title = $oHtml->find('title', 0)->innertext;
$Description = $oHtml->find("meta[name='description']", 0)->content;
$keywords = $oHtml->find("meta[name='keywords']", 0)->content;
echo $Title;
echo $Description;
echo $keywords;
I started out building a single cURL session with curl, DOM, and XPath, and it worked great.
I am now building a scraper based on curl_multi to take data off multiple sites in one flow, and the script echoes the single phrase I put in, but it does not pick up variables.
do{
$n=curl_multi_exec($mh, $active);
}while ($active);
foreach ($urls as $i => $url){
$res[$i]=curl_multi_getcontent($conn[$i]);
echo ('<br />success');
}
So this does echo the success text as many times as there are URLs, but that is not really what I want. I want to break up the HTML like I could with the single cURL session.
What I did in the single cURL session:
//parse the html into a DOMDocument
$dom = new DOMDocument();
@$dom->loadHTML($res);
// grab all the links on the page
$xpath = new DOMXPath($dom);
$product_img = $xpath->query("//div[@id='MAIN']//a");
for ($i = 0; $i < $product_img->length; $i++){
$href = $product_img->item($i);
$url = $href->getAttribute('href');
echo "<br />Link : $url";
}
This DOM parsing / XPath works for the single-session cURL, but not when I run the multi-curl version.
With multi-curl I can do curl_multi_getcontent for each URL in the session, but that is not what I want.
I would like to get the same content as I picked up with DOM / XPath in the single session.
What can I do?
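Not a definitive answer, but a sketch of the direction: run the same DOM / XPath pass once per multi-curl result, assuming $res[$i] holds the HTML for $urls[$i] as in the loop above:
foreach ($res as $i => $html) {
    $dom = new DOMDocument();
    @$dom->loadHTML($html); // suppress warnings from messy real-world HTML
    $xpath = new DOMXPath($dom);
    $links = $xpath->query("//div[@id='MAIN']//a");
    foreach ($links as $a) {
        echo "<br />Link from " . $urls[$i] . ": " . $a->getAttribute('href');
    }
}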
EDIT
It seems I am having problems with getAttribute(). It is the link for an image that I am having trouble grabbing. The link shows up when scraping, but then it throws an error:
Fatal error: Call to a member function getAttribute() on a non-object in
The query:
// grab all the product images on the page
$xpath = new DOMXPath($dom);
$product_img = $xpath->query("//img[@class='product']");
$product_name = $xpath->query("//img[@class='product']");
This is working:
for ($i = 0; $i < $product_name->length; $i++) {
$prod_name = $product_name->item($i);
$name = $prod_name->getAttribute('alt');
echo "<br />Link stored: $name";
}
This is not working:
for ($i = 0; $i < $product_img->length; $i++) {
$href = $product_img->item($i);
$pic_link = $href->getAttribute('src');
echo "<br />Link stored: $pic_link";
}
Any idea of what I am doing wrong?
Thanks in advance.
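One likely cause, for what it's worth: item($i) returns null when the node is missing, and calling getAttribute() on null produces exactly that "non-object" fatal error. A guarded version of the failing loop:
for ($i = 0; $i < $product_img->length; $i++) {
    $node = $product_img->item($i);
    if ($node instanceof DOMElement) { // skip anything that is not an element
        echo "<br />Link stored: " . $node->getAttribute('src');
    }
}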
For some odd reason, it is only that one src that won't work right.
This question can be considered "solved".
I am trying to get issue details from JIRA 3.13 using PHP SOAP. I was able to log in and get the issues; however, on one of my fields, I could not get the newline formatting. All I got was the text for that particular field without newline characters (everything just appended into a single line of text). As of now, I am guessing PHP also did some re-formatting of the string from SOAP. The reason I say this is that I did some testing with SoapUI and was able to get the text out with the formatting intact. Can anyone help me out with a way to display the text with the formatting? Thanks in advance.
This is my php code:
try {
$soap = new SoapClient("<<JIRA URL>>");
$auth = $soap->login($formUsername, $formPassword);
if ($auth)
{
$result0 = $soap->getIssue($auth,'<<JIRA ISSUE ID>>');
$result = (array) $result0;
$z = ''; // initialize to avoid an undefined-variable notice in the loop
foreach ($result as $key => $a)
{
$z = $z . '<br/>' . $key . ' = ' . $a;
}
echo $z;
}
}
catch(Exception $e){
$string = urlencode($e->getMessage());
header("Location: login.php?message=".$string);
die();
}
I just realized that I do not need to convert it into an array.
Simply do the following:
foreach ($result0 as $key => $a)
{
$z = $z . '<br/>' . $key . ' = ' . $a;
}
This, however, still does not solve my problem with the new line.
Isn't it just because you don't change the linefeeds into <br/> before outputting it?
Should be easy to find out if that's the case just by looking at the source in the browser.
You need nl2br() to convert newline characters (\n et al.) into HTML <br> tags:
foreach ($result0 as $key => $a)
{
$z = $z . '<br/>' . $key . ' = ' . nl2br($a);
}
Could be that the text is stored with Unix line endings and you're displaying on a Windows machine? Which field is having the problem?
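If mixed line endings do turn out to be the problem, a small sketch that normalizes them before the nl2br() call:
$text = str_replace(array("\r\n", "\r"), "\n", $a); // CRLF and bare CR -> LF
echo nl2br($text); // LF -> <br />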
I have a string (not XML):
<headername>X-Mailer-Recptid</headername>
<headervalue>15772348</headervalue>
</header>
From this, I need to get the value 15772348, that is, the value of headervalue. How is this possible?
Use PHP DOM and traverse the headervalue tag using getElementsByTagName():
<?php
$doc = new DOMDocument;
@$doc->loadHTML('<headername>X-Mailer-Recptid</headername><headervalue>15772348</headervalue></header>');
$items = $doc->getElementsByTagName('headervalue');
for ($i = 0; $i < $items->length; $i++) {
echo $items->item($i)->nodeValue . "\n";
}
?>
This gives the following output:
15772348
[EDIT]: Code updated to suppress the non-HTML warning about the invalid headername and headervalue tags, as they are not really HTML tags. Also, if you try to load the string as XML, it fails to load entirely.
This looks XML-like to me. Anyway, if you don't want to parse the string as XML (which might be a good idea), you could try something like this:
<?
$str = "<headervalue>15772348</headervalue>";
preg_match("/<headervalue\>([0-9]+)<\/headervalue>/", $str, $matches);
print_r($matches);
?>
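And a sketch of the XML route mentioned above, which works once the stray closing </header> tag is dropped (the unmatched tag otherwise breaks well-formedness); the <root> wrapper is my own addition:
$fragment = '<headername>X-Mailer-Recptid</headername><headervalue>15772348</headervalue>';
$xml = simplexml_load_string('<root>' . $fragment . '</root>');
echo (string) $xml->headervalue; // prints 15772348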
// find the string, the short way
function my_url_search($se_action_data)
{
    // $regex = '/https?\:\/\/[^\" ]+/i';
    $regex = "/<headervalue\>([0-9]+)<\/headervalue>/";
    preg_match_all($regex, $se_action_data, $matches);
    $get_url = array_reverse($matches[0]);
    return array_unique($get_url);
}
print_r(my_url_search($se_action_data)); // the function returns an array, so echo would just print "Array"
<?php
$html = new simple_html_dom();
$html = str_get_html("<headername>X-Mailer-Recptid</headername><headervalue>15772348</headervalue></header>"); // use the HTML DOM parser here
$get_value=$html->find("headervalue", 0)->plaintext;
echo $get_value;
?>
http://simplehtmldom.sourceforge.net/manual.htm#section_find
I am "attempting" to scrape a web page that has the following structures within the page:
<p class="row">
<span>stuff here</span>
<a href="/path/to/link">Descriptive Link Text</a>
<div>Link Description Here</div>
</p>
I am scraping the webpage using curl:
<?php
$handle = curl_init();
curl_setopt($handle, CURLOPT_URL, "http://www.host.tld/");
curl_setopt($handle, CURLOPT_RETURNTRANSFER, true);
$html = curl_exec($handle);
curl_close($handle);
?>
I have done some research and found that I should not use a RegEx to parse the HTML that is returned from the curl, and that I should use PHP DOM. This is how I have done this:
$newDom = new domDocument;
$newDom->preserveWhiteSpace = false; // must be set before loadHTML() to take effect
$newDom->loadHTML($html);
$sections = $newDom->getElementsByTagName('p');
$nodeNo = $sections->length;
for($i=0; $i<$nodeNo; $i++){
$printString = $sections->item($i)->nodeValue;
echo $printString . "<br>";
}
Now I am not pretending that I completely understand this, but I get the gist, and I do get the sections I want. The only issue is that what I get is only the text of the HTML page, as if I had copied it out of my browser window. What I want is the actual HTML, because I want to extract the links and use them too, like so:
for($i=0; $i<$nodeNo; $i++){
    $printString = $sections->item($i)->nodeValue;
    // goal: wrap this in the extracted <a href="..."> link
    echo "LINK " . $printString . "<br>";
}
As you can see, I cannot get the link because I am only getting the text of the webpage and not the source, like I want. I know the "curl_exec" is pulling the HTML because I have tried just that, so I believe that the DOM is somehow stripping the HTML that I want.
According to comments on the PHP manual on DOM, you should use the following inside your loop:
$tmp_dom = new DOMDocument();
$tmp_dom->appendChild($tmp_dom->importNode($sections->item($i), true));
$innerHTML = trim($tmp_dom->saveHTML());
This will set $innerHTML to be the HTML content of the node.
But I think what you really want is to get the 'a' nodes under the 'p' node, so do this:
$sections = $newDom->getElementsByTagName('p');
$nodeNo = $sections->length;
for($i=0; $i<$nodeNo; $i++) {
$sec = $sections->item($i);
$links = $sec->getElementsByTagName('a');
$linkNo = $links->length;
for ($j=0; $j<$linkNo; $j++) {
$printString = $links->item($j)->nodeValue;
echo $printString . "<br>";
}
}
This will just print the body of each link.
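If the href itself is needed as well (the question does say the links should be usable), the inner loop can read the attribute too; a small variation:
for ($j = 0; $j < $linkNo; $j++) {
    $a = $links->item($j);
    // nodeValue is the visible link text, href is the target URL
    echo $a->getAttribute('href') . " : " . $a->nodeValue . "<br>";
}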
You can pass a node to DOMDocument::saveXML(). Try this:
$printString = $newDom->saveXML($sections->item($i));
You might want to take a look at phpQuery for doing server-side HTML parsing.
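A rough sketch of what that could look like (untested; the include path and selector are assumptions):
require_once 'phpQuery/phpQuery.php';
$doc = phpQuery::newDocumentHTML($html); // $html from the curl fetch above
foreach (pq('p.row a') as $a) {
    // wrap each DOMElement back in pq() to use the jQuery-style API
    echo pq($a)->attr('href') . ' : ' . pq($a)->text() . "<br>";
}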