Using Goutte to extract a namespaced attribute value - php

I'm trying to read the attributes of a webpage's <html> element to get the language declared by the site owner.
On 99% of the sites I checked, that information is written as <html lang="XX"> or <html lang="XX-YY">, but on one particular site it is written as <html xml:lang="XX">, and this last case is giving me a headache.
I tried
$scraper_client = new \Goutte\Client();
$scraper_crawler = $scraper_client->request('GET', $link);
$response = $scraper_client->getResponse();
var_dump( $scraper_crawler->filter('html')->extract('xml:lang') );
var_dump( $scraper_crawler->filter('html')->extract('xml|lang') );
var_dump( $scraper_crawler->filter('html')->extract('xml::lang') );
var_dump( $scraper_crawler->filter('html')->extract('#[xml:lang]') );
But none of them seems to work. Has anyone already done something similar?
Thank you in advance.
S.
EDIT
Just to complete the question, here is a link that contains the xml:lang attribute that is causing me problems:
http://www.ilgiornale.it/news/politica/silvio-berlusconi-centrodestra-oggi-pi-forte-passato-1482545.html

I don't know why, but it looks like Goutte drops this attribute.
I have only been able to get the value with a regular expression:
$scraper_client = new \Goutte\Client();
$scraper_crawler = $scraper_client->request('GET', $link);
$response = $scraper_client->getResponse();
// match against the response body rather than the response object itself
if (preg_match('/xml:lang=["\'](.*?)["\']/', $response->getContent(), $matches)) {
var_dump($matches[1]);
} else {
echo 'not found';
}
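
A parser-based alternative to the regular expression above is sketched below. It assumes that the attribute survives HTML parsing but cannot be reached under its prefixed name through the usual attribute lookup, so it walks the attributes of the root element and compares their names directly. The helper name declaredLanguage() is made up for this illustration and is not part of Goutte; whether the same approach works inside Goutte's own crawler depends on the parser it uses.

```php
<?php
// Hypothetical helper (not a Goutte API): returns the language declared on
// the <html> element, preferring lang over xml:lang, or null if neither is set.
function declaredLanguage(string $html): ?string
{
    if (trim($html) === '') {
        return null;
    }

    $dom = new DOMDocument();
    libxml_use_internal_errors(true);   // tolerate invalid real-world markup
    $dom->loadHTML($html);
    libxml_clear_errors();

    $root = $dom->documentElement;      // the <html> element
    if ($root === null) {
        return null;
    }

    // Collect the attributes by their literal names instead of asking for
    // one name, so a prefixed name such as xml:lang is compared as plain text.
    $found = [];
    foreach ($root->attributes as $attr) {
        $found[strtolower($attr->nodeName)] = trim($attr->nodeValue);
    }

    foreach (['lang', 'xml:lang'] as $name) {
        if (!empty($found[$name])) {
            return $found[$name];
        }
    }
    return null;
}

var_dump(declaredLanguage('<html xml:lang="it"><head></head><body></body></html>'));
```

With Goutte, the raw markup could be passed in as declaredLanguage($response->getContent()).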

Related

PHP - preg_match() result in 0 matching values

I am trying to scrape the bio and pictures from a website profile (http://about.me/fernandocaldas) for my personal webpage, so that whenever I change that profile the bio on my own site changes too.
The desired values are between
<script type="text/json" class="json user" data-scope="view_profile" data-lowercase_user_name="fernandocaldas">
and
</script>
Here is my code:
$thtml = file_get_contents('http://about.me/fernandocaldas');
$matchval = '/\<script type=\"text\/json\" class=\"json.*?>(.*?)\<\/script\>/i';
preg_match($matchval, $thtml, $match);
var_dump($match);
if($match){
echo "match!\n";
foreach($match[1] as $val)
{
echo $val."<br>";
}
}
But the result is always array(0) {} for the var_dump.
Regular expressions are never a good idea for HTML: today a regex seems to work, but tomorrow it will fail! [1]
Programmers frequently think: “Why should I init a parser, load the HTML, and perform a lot of queries if I can do it with only one line of regex code?” The answer is: “Why choose a road that leads in the wrong direction just because it is shorter?”
In your case, using a parser also shortens your code.
First, load your HTML page into a string ($html below), init a new DOMDocument object, load the HTML string into it, and init a DOMXPath object (DOMXPath lets you run complex queries against the HTML):
$dom = new DOMDocument();
libxml_use_internal_errors(1);
$dom->loadHTML( $html );
$xpath = new DOMXPath( $dom );
Search for the element(s) with tag <script> and class “json user”:
$found = $xpath->query( '//script[@class="json user"]' );
if( !$found->length ) die( 'Error retrieving JSON' );
Put the node value of the first (and only, in your page) node in a variable (I also trim it, although that is not necessary) and decode it with json_decode():
$json = trim( $found->item(0)->nodeValue );
$user = json_decode( $json );
Now the $user object holds all the data you need: $user->first_name is your first name and $user->bio is your biography. Use print_r( $user ) to display the complete $user structure and see how to access each element.
Read more about DOMDocument
Read more about DOMXPath
Read why you can't parse [X]HTML with regular expressions
[1] If the HTML structure changes, a parser will fail too.
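
The steps above can be put together as a small self-contained sketch. The inline $html string is a made-up stand-in for the downloaded profile page (the real markup and field values will differ), so the snippet runs without network access.

```php
<?php
// Stand-in for the downloaded page; the real markup will differ.
$html = '<html><body>'
      . '<script type="text/json" class="json user" data-scope="view_profile">'
      . '{"first_name": "Fernando", "bio": "Sample bio"}'
      . '</script></body></html>';

$dom = new DOMDocument();
libxml_use_internal_errors(true);   // keep warnings about invalid markup quiet
$dom->loadHTML($html);
libxml_clear_errors();

// Select <script> elements by their class attribute (note the @ for attributes).
$xpath = new DOMXPath($dom);
$found = $xpath->query('//script[@class="json user"]');

$user = null;
if ($found->length > 0) {
    // The text content of the script element is the JSON payload.
    $user = json_decode(trim($found->item(0)->nodeValue));
}

if ($user === null) {
    echo "Error retrieving JSON\n";
} else {
    echo $user->first_name, "\n";
}
```

Checking the result of json_decode() for null catches both a missing element and malformed JSON.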

Php title regular expression

I'm trying to replace a title tag from |title|Page title| to <title>Page Title</title>, using this regular expression. But being a complete amateur, it has not gone too well.
'^|title|^[a-zA-Z0-9_]{1,}|$' => '<title>$1</title>'
I would love to know how to fix it, and more importantly, what I did wrong and why it was wrong.
You almost got it:
You should escape the | characters, as they have a special meaning in a regex and you are using them as plain characters.
You should add the space character to your search group.
$string = '|title|Page title|';
$pattern = '/\|title\|([a-zA-Z0-9_ ]{1,})\|/';
$replacement = '<title>$1</title>';
echo preg_replace($pattern, $replacement, $string); //echoes <title>Page title</title>
See working demo
OP posted some code in comments which is wrong, try this version:
$regular_expressions = array( array( '/\|title\|([a-zA-Z0-9_ ]{1,})\|/' , '<title>$1</title>' ));
foreach($regular_expressions as $regexp){
$data = preg_replace($regexp[0], $regexp[1], $data);
}
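
To illustrate the escaping point: an unescaped | means "or" in a regex, so the original pattern was a list of alternatives rather than a literal |title| prefix. The sketch below lets preg_quote() do the escaping; the helper name pipeTagToHtml() is invented for this example, and it assumes the tag name consists of plain letters only.

```php
<?php
// Hypothetical helper that turns |tag|text| into <tag>text</tag>.
function pipeTagToHtml(string $tag, string $input): string
{
    // preg_quote() escapes regex metacharacters such as | so they match literally.
    $open = preg_quote('|' . $tag . '|', '/');

    // [^|]+ captures everything up to the closing pipe, spaces included.
    $pattern = '/' . $open . '([^|]+)\|/';

    return preg_replace($pattern, '<' . $tag . '>$1</' . $tag . '>', $input);
}

echo pipeTagToHtml('title', '|title|Page title|'), "\n"; // <title>Page title</title>
```

Inside a character class such as [^|] the pipe is already literal, which is why it needs no backslash there.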
Here's a little function I came up with a while back to scrape the title of a page when users submitted links through my service. It fetches the contents of the provided URL, looks for a title tag and, if one is found, returns what is between the opening and closing tags. With a little tweaking I am sure you can add a replace step for whatever you're doing and make it work for your needs. So this is more of a starting point than an answer, but I hope it helps to some extent.
$url = 'http://www.chrishacia.com';
function get_page_title($url){
if( !($data = file_get_contents($url)) ) return false;
if( preg_match("#<title>(.+)<\/title>#iU", $data, $t)) {
return trim($t[1]);
} else {
return false;
}
}
var_dump(get_page_title($url));
<?php
$s = "|title|Page title|";
$s = preg_replace('/^\|title\|([^\|]+)\|/', "<title>$1</title>", $s);
echo $s;
?>

Simple PHP Screen Scraping Function

I'm experimenting with autoblogging (i.e., RSS-driven blog posting) using WordPress, and all that's missing is a component to automatically fill in the content of the post with the content that the RSS item's URL links to (RSS is irrelevant to the solution).
Using standard PHP 5, how could I create a function called fetchHTML([URL]) that returns the HTML content of a webpage that's found between the <body>...</body> tags?
Please let me know if there are any prerequisite "includes".
Thanks.
Okay, here's a DOM parser code example as requested.
<?php
function fetchHTML( $url )
{
$content = file_get_contents($url);
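$html = new DOMDocument();
libxml_use_internal_errors(true); // keep warnings about invalid markup quiet
$html->loadHTML($content); // the markup has to be loaded before it can be queried
$body = $html->getElementsByTagName('body');
// note: textContent returns the text only, with the tags stripped
foreach($body as $b){ $content=$b->textContent; break; }//hmm, is there a better way to do that?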
return $content;
}
Assuming that it will always be <body> and not <BODY> or <body style="width:100%"> or anything except <body> and </body>, and with the caveat that you shouldn't use regex to parse HTML, even though I'm about to, here ya go:
<?php
function fetchHTML( $url )
{
$feed = '<body>Lots of stuff in here</body>';
$content = file_get_contents( $url );
preg_match( '/<body>([\s\S]{1,})<\/body>/m', $content, $match );
$content = $match[1];
return $content;
} // fetchHTML
?>
If you echo fetchHTML([some url]);, you'll get the html between the body tags.
Please note original caveats.
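
For comparison, here is a minimal DOM-based sketch that keeps the markup inside the body and does not care about <BODY> casing or attributes on the tag. It works on a string instead of a URL so it can be tried offline; the name bodyInnerHtml() is made up for this example, and the parser may normalize the markup slightly (tag case, quoting).

```php
<?php
// Hypothetical variant of fetchHTML() that takes the page source as a string.
function bodyInnerHtml(string $html): string
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true);   // tolerate invalid real-world markup
    $dom->loadHTML($html);
    libxml_clear_errors();

    $body = $dom->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return '';
    }

    // Serialize each child of <body> so the tags are preserved.
    $inner = '';
    foreach ($body->childNodes as $child) {
        $inner .= $dom->saveHTML($child);
    }
    return $inner;
}

echo bodyInnerHtml('<html><BODY style="width:100%"><p>Hi <b>there</b></p></BODY></html>');
```

To use it on a live page, pass in the result of file_get_contents($url) after checking that it is not false.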
I think you're better off using a class like SimpleDom (http://sourceforge.net/projects/simplehtmldom/) to extract the data, as you then don't need to write such complicated regular expressions.

Detect remote charset in php

I would like to determine a remote page's encoding through detection of the Content-Type header tag
<meta http-equiv="Content-Type" content="text/html; charset=XXXXX" />
if present.
I retrieve the remote page and try to do a regex to find the required setting if present.
I am still learning hence the problem below...
Here is what I have:
$EncStart = 'charset=';
$EncEnd = '" \/\>';
preg_match( "/$EncStart(.*)$EncEnd/s", $RemoteContent, $RemoteEncoding );
echo $RemoteEncoding[ 1 ];
The above does indeed echo the name of the encoding but it does not know where to stop so it prints out the rest of the line then most of the rest of the remote page in my test.
Example: When testing a remote russian page it printed:
windows-1251" />
rest of page ....
Which means that $EncStart was okay, but the $EncEnd part of the regex failed to stop the matching. This meta tag usually ends in one of 3 different ways after the name of the encoding:
"> | "/> | " />
I do not know whether this can be used to mark the end of the match and, if so, how to escape it. I played with different ways of doing it but none worked.
Thank you in advance for lending a hand.
Add a question mark to your pattern to make it non-greedy (the 's' modifier is not needed either):
preg_match( "/charset=\"(.+?)\"/", $RemoteContent, $RemoteEncoding );
echo $RemoteEncoding[ 1 ];
Note that this won't handle charset = "..." or charset='...' and many other combinations.
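
A somewhat more tolerant pattern is sketched below. It is still a regex and still easy to defeat, but it accepts optional whitespace around the equals sign and either quote style, and it stops at the first character that cannot be part of an encoding name. The helper name metaCharset() is invented for this example.

```php
<?php
// Hypothetical helper: returns the charset named inside a <meta> tag, or null.
function metaCharset(string $html): ?string
{
    // [^>]+ keeps the match inside a single tag; the capture group only
    // accepts characters that can appear in an encoding name, so it ends
    // at the closing quote, a space or the end of the tag.
    $pattern = '/<meta[^>]+charset\s*=\s*["\']?\s*([a-z0-9_.:-]+)/i';

    return preg_match($pattern, $html, $m) ? $m[1] : null;
}

var_dump(metaCharset('<meta http-equiv="Content-Type" content="text/html; charset=windows-1251" />'));
var_dump(metaCharset("<meta charset='UTF-8'>"));
```

Because the capture group is restricted to name characters, the three endings listed in the question ("> and "/> and " />) no longer need to be matched explicitly.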
Take a look at Simple HTML Dom Parser. With it, you can easily find the charset from the head without resorting to cumbersome regexes. But as David already commented, you should also examine the headers for the same information and prioritize it if found.
Tested example:
require_once 'simple_html_dom.php';
$source = file_get_contents('http://www.google.com');
$dom = str_get_html($source);
$meta = $dom->find('meta[http-equiv=content-type]', 0);
$src_charset = substr($meta ->content, stripos($meta ->content, 'charset=') + 8);
foreach ($http_response_header as $header) {
@list($name, $value) = explode(':', $header, 2);
if (strtolower($name) == 'content-type') {
$hdr_charset = substr($value, stripos($value, 'charset=') + 8);
break;
}
}
var_dump(
$hdr_charset,
$src_charset
);
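
The header lookup can also be isolated into a small function, which makes the "prefer the header, fall back to the meta tag" rule explicit. This is a sketch only: headerCharset() is a made-up name, the sample header lines are invented, and $meta stands for whatever value was extracted from the page itself.

```php
<?php
// Hypothetical helper: $headers has the same shape as $http_response_header,
// i.e. a list of raw header lines.
function headerCharset(array $headers): ?string
{
    foreach ($headers as $header) {
        // Only look at the Content-Type line (header names are case-insensitive).
        if (stripos($header, 'content-type:') !== 0) {
            continue;
        }
        // The charset value ends at whitespace, a semicolon or a quote.
        if (preg_match('/charset\s*=\s*"?([^\s;"]+)/i', $header, $m)) {
            return $m[1];
        }
    }
    return null;
}

$headers = ['HTTP/1.1 200 OK', 'Content-Type: text/html; charset=ISO-8859-1'];
$meta = 'windows-1251'; // value found in the page itself (assumed here)

// The header wins when present; otherwise fall back to the meta value.
echo headerCharset($headers) ?? $meta, "\n";
```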

How to write a PHP script to find the number of indexed pages in Google?

I need to find the number of indexed pages in google for a specific domain name, how do we do that through a PHP script?
So,
foreach ($allresponseresults as $responseresult)
{
$result[] = array(
'url' => $responseresult['url'],
'title' => $responseresult['title'],
'abstract' => $responseresult['content'],
);
}
What do I add to get the estimated number of results, and how do I do that?
I know the field is called estimatedResultCount, but how do I add it? For example, I read the title as $result['title'], so how do I get the number and print it?
Thank you :)
I think it would be nicer to Google to use their RESTful Search API. See this URL for an example call:
http://ajax.googleapis.com/ajax/services/search/web?v=1.0&q=site:stackoverflow.com&filter=0
(You're interested in the estimatedResultCount value)
In PHP you can use file_get_contents to get the data and json_decode to parse it.
You can find documentation here:
http://code.google.com/apis/ajaxsearch/documentation/#fonje
Example
Warning: The following code does not have any kind of error checking on the response!
function getGoogleCount($domain) {
$content = file_get_contents('http://ajax.googleapis.com/ajax/services/' .
'search/web?v=1.0&filter=0&q=site:' . urlencode($domain));
$data = json_decode($content);
return intval($data->responseData->cursor->estimatedResultCount);
}
echo getGoogleCount('stackoverflow.com');
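
Since the example above skips error checking, here is a sketch of the parsing step with the checks added. It only deals with the response body and does not assume the endpoint is still reachable; the function name parseEstimatedCount() and the sample JSON are made up for illustration, with the field names taken from the code above.

```php
<?php
// Hypothetical helper: extracts estimatedResultCount from a response body.
// Returns null when the request failed or the expected fields are missing.
function parseEstimatedCount($body): ?int
{
    // file_get_contents() returns false on failure, so reject non-strings.
    if (!is_string($body) || $body === '') {
        return null;
    }

    $data = json_decode($body);

    // isset() is false for invalid JSON and for any missing level of the chain.
    if (!isset($data->responseData->cursor->estimatedResultCount)) {
        return null;
    }
    return (int) $data->responseData->cursor->estimatedResultCount;
}

$sample = '{"responseData":{"cursor":{"estimatedResultCount":"111000000"}}}';
var_dump(parseEstimatedCount($sample));
var_dump(parseEstimatedCount(false));
```

Returning null instead of 0 lets the caller tell "no results" apart from "the lookup failed".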
You'd load http://www.google.com/search?q=domaingoeshere.com with cURL and then parse the file looking for the results <p id="resultStats" bit.
You'd have the resulting html stored in a variable $html and then say something like
$arr = explode('<p id="resultStats">', $html);
$bottom = $arr[1];
$middle = explode('</p>', $bottom);
Please note that this is untested and a very rough example. You'd be better off parsing the html with a dedicated parser or matching the line with regular expressions.
The estimatedResultCount value from the Google AJAX API doesn't give the right value.
And parsing the HTML result is not a good approach either, because Google blocks you after several searches.
Count the number of results for site:yourdomainhere.com - stackoverflow.com has about 830k
// This gives you the count you see on the search results web page;
// file_get_contents returns the HTML content of that page
header('Content-Type: text/plain');
$url = "https://www.google.com/search?q=your url";
$html = file_get_contents($url);
if (FALSE === $html) {
throw new Exception(sprintf('Failed to open HTTP URL "%s".', $url));
}
$arr = explode('<div class="sd" id="resultStats">', $html);
$bottom = $arr[1];
$middle = explode('</div>', $bottom);
echo $middle[0];
Output:
About 8,130 results
//vKj
Case 2: you can also use the Google API, but its count is different:
https://ajax.googleapis.com/ajax/services/search/web?v=1.0&q=ursitename&callback=processResults
https://ajax.googleapis.com/ajax/services/search/web?v=1.0&q=site:google.com
cursor":{"resultCount":"111,000,000","
"estimatedResultCount":"111000000",
