This question already has answers here:
How do you parse and process HTML/XML in PHP?
(31 answers)
Closed 9 years ago.
I need to read some content from an HTML page.
I've tested simple_html_dom, but it simply isn't usable for what I need.
I need something like this (pseudo-syntax based on simple_html_dom):
$html = file_get_contents($url);
$html_obj = parse_html($html);
$title = $html_obj->get('title');
$meta1 = $html_obj->get('meta[name=description]', 'innertext'); // text only
$meta2 = $html_obj->get('meta[name=keywords]', 'innertext'); // text only
$content = $html_obj->get('div[id=section_a]', 'outertext'); // HTML code
I've tested simple_html_dom in so many ways, and only managed to get parts of what I need.
It simply isn't "simple".
I've also tested PHP's DOMDocument::loadHTML, but I run into problems dealing with inline <script>.
Are there any PHP libraries that make it as easy to get content as jQuery does?
Update
One of my problems is a piece of third-party JavaScript from an ad agency:
<script language="javascript" type="text/javascript">
<!--
if (window.adgroupid == undefined) {
window.adgroupid = Math.round(Math.random()*100000);
}
document.write('<scr'+'ipt language="javascript1.1" type="text/javascript" src="http://adserver.adtech.de/addyn|3.0|994|3159100|0|-1|size=980x150|ADTECH;loc=100;target=_blank;key=startside,kvinner, kvinnesak, bryllup, graviditet, mamma, kosmetikk, markedsplass, dagbok, feminisme;grp='+window.adgroupid+';misc='+new Date().getTime()+'"></scri'+'pt>');
//-->
</script>
Even if I change <scr'+'ipt to <script it gives me invalid javascript code.
You can use DOMDocument with DOMXPath ..
<?php
$DOMDocument = new DOMDocument();
//libxml_use_internal_errors ( true ) ;
$DOMDocument->loadHTMLFile ( 'http://www.iconfinder.com' ) ;
$XPath = new DOMXPath( $DOMDocument );
$title = $DOMDocument->getElementsByTagName('title')->item(0)->nodeValue;
echo $title ;
#$desc = $XPath->query('//meta[@name="description"]')->item(0)->getAttribute('content');
#$keywords = $XPath->query('//meta[@name="keywords"]')->item(0)->getAttribute('content');
#$content = $XPath->query('//div[@id="section_a"]')->item(0)->nodeValue;
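A self-contained sketch of the same DOMDocument/DOMXPath approach, run against an inline HTML string instead of a live URL. The sample markup and variable names are assumptions made for illustration. libxml_use_internal_errors(true) is used here so that sloppy markup, including an inline script like the one in the question, is collected as internal errors rather than printed as warnings.

```php
<?php
// Hypothetical sample page; real input would come from file_get_contents($url).
$html = <<<'HTML'
<html>
<head>
<title>Example page</title>
<meta name="description" content="A short description">
<meta name="keywords" content="one, two, three">
<script type="text/javascript">
document.write('<scr'+'ipt src="http://example.com/ad.js"></scri'+'pt>');
</script>
</head>
<body>
<div id="section_a"><p>Hello <b>world</b></p></div>
</body>
</html>
HTML;

libxml_use_internal_errors(true); // collect parse warnings instead of printing them
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// string(...) returns an empty string when nothing matches, so no null checks are needed.
$title       = $xpath->evaluate('string(//title)');
$description = $xpath->evaluate('string(//meta[@name="description"]/@content)');
$keywords    = $xpath->evaluate('string(//meta[@name="keywords"]/@content)');

// saveHTML($node) serializes the node itself, i.e. its "outer" HTML.
$section = $xpath->query('//div[@id="section_a"]')->item(0);
$content = $section ? $doc->saveHTML($section) : '';

echo $title, "\n", $description, "\n", $keywords, "\n", $content, "\n";
```

The title and meta values come back as plain text, while $content keeps the markup of the div.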
phpQuery (http://code.google.com/p/phpquery/) lets you manipulate HTML through a jQuery-like syntax.
This question already has answers here:
Reference - What does this error mean in PHP?
(38 answers)
Closed 7 years ago.
I am trying to extract the title text from an HTML page and insert it into an object. I am using Symfony and PHP. The result from filterXPath does not seem to be plain text; instead I get the entire HTML page and an error is thrown. I don't know why.
My code is:
$html = $this->file_get_contents_curl("http://www.google.com/");
$urlData = [];
$crawler = new Crawler($html);
$urlData->title = $crawler->filterXPath('//title')->extract('_text');
I see the title text if I do:
return $crawler->filterXPath('//title')->extract('_text');
Try this,
libxml_use_internal_errors(true);
$html = file_get_contents("http://www.google.com/");
$dom1 = new DOMDocument;
$dom1->preserveWhiteSpace = false;
$dom1->loadHTML($html);
$xp = new DOMXPath($dom1);
$xp->registerNamespace("php", "http://php.net/xpath");
$urlData= $xp->query('//title');
foreach($urlData as $title) {
echo $title->textContent;
}
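One likely source of the error in the question is that $urlData is initialized as an array ([]) and then assigned a property with ->title, which needs an object. A minimal sketch, assuming the title should end up as a plain string on a stdClass object; the inline HTML is a stand-in for the fetched page:

```php
<?php
// Stand-in for the fetched page (assumption for illustration).
$html = '<html><head><title>Sample title</title></head><body></body></html>';

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
$xp = new DOMXPath($dom);

// Use an object, not an array, when assigning properties with "->".
$urlData = new stdClass();

$titleNode = $xp->query('//title')->item(0);
$urlData->title = $titleNode ? trim($titleNode->textContent) : null;

echo $urlData->title, "\n";
```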
I am totally new to PHP development and I would like to extract the contents of a meta tag.
I have this code that lets me extract the contents of the element #squad.
// Pull in PHP Simple HTML DOM Parser
include("simplehtmldom/simple_html_dom.php");
// Settings on top
$sitesToCheck = array(
// id is the page ID for selector
array("url" => "http://www.arsenal.com/first-team/players", "selector" => "#squad"),
array("url" => "http://www.liverpoolfc.tv/news", "selector" => "ul[style='height:400px;']")
);
$savePath = "cachedPages/";
$emailContent = "";
// For every page to check...
foreach($sitesToCheck as $site) {
$url = $site["url"];
// Calculate the cachedPage name, set oldContent = "";
$fileName = md5($url);
$oldContent = "";
// Get the URL's current page content
$html = file_get_html($url);
// Find content by querying with a selector, just like a selector engine!
foreach($html->find($site["selector"]) as $element) {
$currentContent = $element->plaintext;
}
// If a cached file exists
if(file_exists($savePath.$fileName)) {
// Retrieve the old content
$oldContent = file_get_contents($savePath.$fileName);
}
// If different, notify!
if($oldContent && $currentContent != $oldContent) {
// Build simple email content
$emailContent = "Hey, the following page has changed!\n\n".$url."\n\n";
}
// Save new content
file_put_contents($savePath.$fileName,$currentContent);
}
// Send the email if there's content!
if($emailContent) {
// Sendmail!
mail("me@myself.name","Sites Have Changed!",$emailContent,"From: alerts@myself.name","\r\n");
// Debug
echo $emailContent;
}
But I want to change this code to get the number of comments instead.
Here is the meta tag from which I want to extract just the number of comments:
<meta item="desc" content="Comments:645">
Is that clear enough? If not, please ask.
Thanks for your help.
There are two ways to do this. You could use the native PHP function get_meta_tags(), like so:
$tags = get_meta_tags('http://yoursite.com');
$comments = $tags['desc'];
Or you could use RegEx, but the above would be much more practical.
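One caveat worth checking: get_meta_tags() keys its result on the name attribute, while the tag in the question uses item="desc". A minimal sketch that reads the item attribute with DOMXPath and then pulls the number out of the content value; the inline HTML and the "Comments:<digits>" format are taken from the question, everything else is an assumption for illustration:

```php
<?php
// Stand-in for the downloaded page.
$html = '<html><head><meta item="desc" content="Comments:645"></head><body></body></html>';

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Empty string if no such meta tag exists.
$content = $xpath->evaluate('string(//meta[@item="desc"]/@content)');

// Extract the digits that follow "Comments:".
$comments = null;
if (preg_match('/Comments:\s*(\d+)/', $content, $m)) {
    $comments = (int) $m[1];
}

var_dump($comments);
```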
What you are looking for is probably screen scraping.
This is the process where a program written in a language like PHP, Python or Ruby loads a web page into memory and uses selectors to grab content from it.
Screen scraping is mostly used on websites that offer a lot of interesting data but have no JSON or XML API.
Having googled around for it, I stumbled on this post:
PHP equivalent of PyQuery or Nokogiri?
This article explains more about screen-scraping for web:
http://en.wikipedia.org/wiki/Web_scraping
Look into using DOMDocument:
$dom = new domDocument;
$dom->loadHTML($htmlPage);
$metas = $dom->documentElement->getElementsByTagName('meta');
$ar = array();
foreach ($metas as $meta) {
$name = $meta->getAttribute('name');
$value = $meta->getAttribute('content');
$ar[$name] = $value;
}
print_r($ar); // print array meta-values
I've just started with PHP and I want to scrape a small page, but I can't get it to work. I tried preg_match_all, but it doesn't give the result I want. Basically, I want to scrape only the YouTube video links from https://gdata.youtube.com/feeds/api/standardfeeds/most_shared, collect all of them, and use them later.
I tried the following code, which failed:
<?php
$data = file_get_contents('https://gdata.youtube.com/feeds/api/standardfeeds/most_shared');
preg_match_all("/src='(.+?)'>/", $data, $links);
$link_out = $links[0][0];
echo $link_out;
?>
I'm new to PHP, so little help please.
Thanks
As the feed is XML, you can use PHP's SimpleXMLElement to obtain the data.
<?php
$xml = new SimpleXMLElement(
'https://gdata.youtube.com/feeds/api/standardfeeds/most_shared',
0, // libxml options (none)
true // treat the first argument as a URL rather than an XML string
);
foreach($xml->entry as $entry) {
echo $entry->content['src'], PHP_EOL;
}
/*
https://www.youtube.com/v/IjWc43FCYlg?version=3&f=standard&app=youtube_gdata
https://www.youtube.com/v/Xw1C5T-fH2Y?version=3&f=standard&app=youtube_gdata
https://www.youtube.com/v/Kq0_dGKx4Os?version=3&f=standard&app=youtube_gdata
https://www.youtube.com/v/gbcBYs0ljI0?version=3&f=standard&app=youtube_gdata
https://www.youtube.com/v/78juOpTM3tE?version=3&f=standard&app=youtube_gdata
https://www.youtube.com/v/OOiZ-5DqwYI?version=3&f=standard&app=youtube_gdata
https://www.youtube.com/v/zjz614QVyfQ?version=3&f=standard&app=youtube_gdata
https://www.youtube.com/v/h15m87WsCHQ?version=3&f=standard&app=youtube_gdata
https://www.youtube.com/v/SXKOTdyOUBg?version=3&f=standard&app=youtube_gdata
https://www.youtube.com/v/BRAM8MpqIeA?version=3&f=standard&app=youtube_gdata
https://www.youtube.com/v/5yB3n9fu-rM?version=3&f=standard&app=youtube_gdata
https://www.youtube.com/v/NAOo9SnzRH8?version=3&f=standard&app=youtube_gdata
https://www.youtube.com/v/0KtILkzC-1g?version=3&f=standard&app=youtube_gdata
https://www.youtube.com/v/kWSIFh8ICaA?version=3&f=standard&app=youtube_gdata
https://www.youtube.com/v/Mi6AhogZCeg?version=3&f=standard&app=youtube_gdata
https://www.youtube.com/v/kWuIGAZ1x2I?version=3&f=standard&app=youtube_gdata
https://www.youtube.com/v/lKY5fmDGVLs?version=3&f=standard&app=youtube_gdata
https://www.youtube.com/v/C94PaCtqOk4?version=3&f=standard&app=youtube_gdata
https://www.youtube.com/v/V-fL8zopddI?version=3&f=standard&app=youtube_gdata
https://www.youtube.com/v/UWlzMIl7E48?version=3&f=standard&app=youtube_gdata
https://www.youtube.com/v/mcw6j-QWGMo?version=3&f=standard&app=youtube_gdata
https://www.youtube.com/v/-RSDaRttpzk?version=3&f=standard&app=youtube_gdata
https://www.youtube.com/v/8_RDx4skTp4?version=3&f=standard&app=youtube_gdata
https://www.youtube.com/v/7YDWdv9kR0M?version=3&f=standard&app=youtube_gdata
https://www.youtube.com/v/m96tYpEk1Ao?version=3&f=standard&app=youtube_gdata
*/
Anthony.
Try this preg_match_all pattern:
preg_match_all("/src='([^']+)'/si", $data, $links);
and show results:
echo "<pre>";
print_r($links);
<?php
$data = file_get_contents('https://gdata.youtube.com/feeds/api/standardfeeds/most_shared');
preg_match_all("/src='(.+?)'\/>/", $data, $links);
print_r($links[1]);
You forgot to match the closing / of the self-closing tags, and the captured URLs are in $links[1], not $links[0].
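To illustrate why the question's $links[0][0] printed more than the URL: preg_match_all() stores the full matches at index 0 and the first capture group at index 1. A small sketch on a made-up feed fragment (the two <content> elements and their video IDs are illustrative, not real feed data):

```php
<?php
// Made-up fragment shaped like the feed's <content ... src='...'/> elements.
$data = "<entry><content type='application/x-shockwave-flash' src='https://www.youtube.com/v/AAA?version=3'/></entry>"
      . "<entry><content type='application/x-shockwave-flash' src='https://www.youtube.com/v/BBB?version=3'/></entry>";

preg_match_all("/src='([^']+)'/", $data, $links);

echo $links[0][0], "\n"; // full match, including the src='...' wrapper
echo $links[1][0], "\n"; // first capture group: the URL only

print_r($links[1]);      // all captured URLs
```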
This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
How to parse and process HTML with PHP?
Let's say I want to extract a certain number/text from a table from here: http://www.fifa.com/associations/association=chn/ranking/gender=m/index.html
I want to get the first number on the right table td under FIFA Ranking position. That would be 88 right now. Upon inspection, it is <td class="c">88</td>.
How would I use PHP to extract the info from said webpage?
Edit: I am told jQuery/JavaScript is better suited for this.
This could probably be prettier, but it'd go something like:
<?php
$page = file_get_contents("http://www.fifa.com/associations/association=chn/ranking/gender=m/index.html");
// Escape the "/" in </td> (it is also the pattern delimiter) and capture the digits.
preg_match_all('/<td class="c">([0-9]+)<\/td>/', $page, $matches);
foreach($matches[1] as $match){
echo $match, "\n"; // $matches[1] holds only the captured numbers
}
?>
I've never done anything like this before with PHP, so it may not work.
If you can work your magic after page load, you can use JavaScript/jQuery:
<script type='text/javascript'>
var arr = [];
// .each() takes a callback; push each cell's HTML onto the array.
jQuery('table td.c').each(function () {
arr.push(jQuery(this).html());
});
console.log(arr);
</script>
Also, sorry for deleting my comment. You weren't specific about what needed to be done, so I initially thought jQuery would better fit your needs, but then I thought, "Maybe you want to get the page content before an HTML page is loaded".
Try http://simplehtmldom.sourceforge.net/,
$html = file_get_html('http://www.google.com/');
echo $html->find('div.rankings', 0)->find('table', 0)->find('tr',0)->find('td.c',0)->plaintext;
This is untested, just looking at the source. I'm sure you could target it faster.
In fact,
echo $html->find('div.rankings', 0)->find('td.c',0)->plaintext;
should work.
Using DOMDocument, which should be pre-loaded with your PHP installation:
$dom = new DOMDocument();
$dom->loadHTML(file_get_contents("http://www.example.com/file.html"));
$xpath = new DOMXPath($dom);
$cell = $xpath->query("//td[@class='c']")->item(0);
if( $cell) {
$number = intval(trim($cell->textContent));
// do stuff
}
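A self-contained variant of the DOMXPath lookup, run on an inline table rather than the live page. The sample row is invented for illustration. It also uses the common XPath idiom for matching one class token, since a plain @class='c' comparison would miss a cell whose class attribute holds several classes:

```php
<?php
// Invented sample row; the real page would be fetched with file_get_contents().
$html = '<table><tr><td class="l">China PR</td><td class="c highlight">88</td><td class="c">90</td></tr></table>';

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Matches "c" as a whole class token, even when other classes are present.
$query = "//td[contains(concat(' ', normalize-space(@class), ' '), ' c ')]";
$cell = $xpath->query($query)->item(0);

$number = $cell ? (int) trim($cell->textContent) : null;
var_dump($number);
```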
I'm experimenting with autoblogging (i.e., RSS-driven blog posting) using WordPress, and all that's missing is a component to automatically fill in the content of the post with the content that the RSS item's URL links to (RSS is irrelevant to the solution).
Using standard PHP 5, how could I create a function called fetchHTML([URL]) that returns the HTML content of a webpage that's found between the <body>...</body> tags?
Please let me know if there are any prerequisite "includes".
Thanks.
Okay, here's a DOM parser code example as requested.
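<?php
function fetchHTML( $url )
{
$content = file_get_contents($url);
libxml_use_internal_errors(true); // tolerate malformed real-world markup
$html = new DOMDocument();
$html->loadHTML($content); // the page must be loaded into the document before querying it
$body = $html->getElementsByTagName('body')->item(0);
$content = '';
if ($body) {
// Serialize each child of <body> to get the HTML between the tags, not just the text.
foreach ($body->childNodes as $child) { $content .= $html->saveHTML($child); }
}
return $content;
}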
Assuming that it will always be <body> and not <BODY> or <body style="width:100%"> or anything except <body> and </body>, and with the caveat that you shouldn't use regex to parse HTML, even though I'm about to, here ya go:
<?php
function fetchHTML( $url )
{
$content = file_get_contents( $url );
if ( ! preg_match( '/<body>([\s\S]+)<\/body>/', $content, $match ) ) {
return ''; // no literal <body>...</body> pair found
}
$content = $match[1];
return $content;
} // fetchHTML
?>
If you echo fetchHTML([some url]);, you'll get the html between the body tags.
Please note original caveats.
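If the literal <body> assumption is too strict, the pattern can be loosened a little. A sketch, still subject to the caveat about parsing HTML with regex, that accepts any letter case and attributes on the opening tag; the helper name extractBody and the sample string are hypothetical, and it works on a string so it can be tried without a network request:

```php
<?php
// Hypothetical helper: returns the markup between <body ...> and </body>, or null.
function extractBody(string $html): ?string
{
    // \b[^>]* allows attributes on the opening tag; "i" ignores case; "s" lets . match newlines.
    if (preg_match('/<body\b[^>]*>(.*)<\/body\s*>/is', $html, $match)) {
        return $match[1];
    }
    return null;
}

$sample = '<html><BODY style="width:100%"><p>Hi</p></BODY></html>';
echo extractBody($sample), "\n";
```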
I think you're better off using a class like Simple HTML DOM (http://sourceforge.net/projects/simplehtmldom/) to extract the data, as you then don't need to write such complicated regular expressions.