Simple PHP Screen Scraping Function - php

I'm experimenting with autoblogging (i.e., RSS-driven blog posting) using WordPress, and all that's missing is a component to automattically fill in the content of the post with the content that the RSS's URL links to (RSS is irrelevant to the solution).
Using standard PHP 5, how could I create a function called fetchHTML([URL]) that returns the HTML content of a webpage that's found between the <body>...</body> tags?
Please let me know if there are any prerequisite "includes".
Thanks.

Okay, here's a DOM parser code example as requested.
<?php
function fetchHTML( $url )
{
$content = file_get_contents($url);
$html=new DomDocument();
$body=$html->getelementsbytagname('body');
foreach($body as $b){ $content=$b->textContent; break; }//hmm, is there a better way to do that?
return $content;
}

Assuming that it will always be <body> and not <BODY> or <body style="width:100%"> or anything except <body> and </body>, and with the caveat that you shouldn't use regex to parse HTML, even though I'm about to, here ya go:
<?php
function fetchHTML( $url )
{
$feed = '<body>Lots of stuff in here</body>';
$content = file_get_contents( $url );
preg_match( '/<body>([\s\S]{1,})<\/body>/m', $content, $match );
$content = $match[1];
return $content;
} // fetchHTML
?>
If you echo fetchHTML([some url]);, you'll get the html between the body tags.
Please note original caveats.

I think you're better of using a class like SimpleDom -> http://sourceforge.net/projects/simplehtmldom/ to extract the data as you don't need to write such complicated regular expressions

Related

How to use htmlspecialchars only on <code></code> tags.

I'm using WordPress and would like to create a function that applies the PHP function htmlspecialchars only to code contained between <code></code> tags. I appreciate this may be fairly simple but I'm new to PHP and can't find any references on how to do this.
So far I have the following:
function FilterCodeOnSave( $content, $post_id ) {
return htmlspecialchars($content, ENT_NOQUOTES);
}
Obviously the above is very simple and performs htmlspecialchars on the entire content of my page. I need to limit the function to only apply to the HTML between code tags (there may be multiple code tags on each page).
Any help would be appreciated.
Thanks,
James
EDIT: updated to avoid multiple CODE tags
Try this:
<?php
// test data
$textToScan = "Hi <code>test12</code><br>
Line 2 <code><br>
Test <b>Bold</b><br></code><br>
Test
";
// debug:
echo $textToScan . "<hr>";
// the regex pattern (case insensitive & multiline
$search = "~<code>(.*?)</code>~is";
// first look for all CODE tags and their content
preg_match_all($search, $textToScan, $matches);
//print_r($matches);
// now replace all the CODE tags and their content with a htmlspecialchars() content
foreach($matches[1] as $match){
$replace = htmlspecialchars($match);
// now replace the previously found CODE block
$textToScan = str_replace($match, $replace, $textToScan);
}
// output result
echo $textToScan;
?>
Use DOMDocument to get all <code> tags;
// in this example $dom is an instance of DOMDocument
$code_tags = $dom->getElementsByTagName('code');
if ($code_tags) {
foreach ($code_tags as $node) {
// [...]
}
// [...]
}
i know this is a little bit late, but you can call the htmlspecialchars function first and then when outputting call the htmlspecialchars_decode function

preg_replace with wildcards?

I have HTML markup bearing the form
<div id='abcd1234A'><p id='wxyz1234A'>Hello</p></div>
which I need to replace to bear the form
<div id='abcd1234AN'><p id='wxyz1234AN'>Hello</p></div>
where N may be 1,2.. .
The best I have been able to do is as follows
function cloneIt($a,$b)
{
return substr_replace($a,$b,-1);
}
$ndx = "1'";
$str = "<div id='abcd1234A'><p id='wxyz1234A'>Hello</p></div>";
preg_match_all("/id='[a-z]{4}[0-9]{4}A'/",$str,$matches);
$matches = $matches[0];
$reps = array_merge($matches);
$ndxs = array_fill(0,count($reps),$ndx);
$reps = array_map("cloneIt",$reps,$ndxs);
$str = str_replace($matches,$reps,$str);
echo htmlspecialchars($str);
which works just fine. However, my REGEX skills are not much to write home about so I suspect that there is probably a better way to do this. I'd be most obliged to anyone who might be able to suggest a neater/quicker way of accomplishing the same result.
You can optimize your regex like this:
/id='[a-z]{4}\d{4}A'/
Sample code
preg_match_all("/id='[a-z]{4}\\d{4}A'/",$str,$matches);
However an alternative would consist in using en HTML parser. Here I'll use simple html dom:
// Load the HTML from URL or file
$html = file_get_html('http://www.mysite.com/');
// You can also load $html from string: $html = str_get_html($my_string);
// Find div with id attribute
foreach($html->find('div[id]') as $div) {
if (preg_match("/id='([a-z]{4}\\d{4})A'/" , $div->id, $matches)) {
$div->id = $matches[1] + $ndx;
}
}
echo $html->save();
Did you notice how elegant, concise and clear the code becomes with an html parser ?
References
Simple Html Dom Documentation

How to get data from url link to our website

I have a problem with my website. I want to get all the schedule flight data from another website. I see its source code and get the url link for processing data. Can somebody tell me how to get the data from the current url link, then display it to our website with PHP?
You can do it using file_get_contents() function. this function return html of provided url. then use HTML Parser to get required data.
$html = file_get_contents("http://website.com");
$dom = new DOMDocument();
$dom->loadHTML($html);
$nodes = $dom->getElementsByTagName('h3');
foreach ($nodes as $node) {
echo $node->nodeValue."<br>"; // return <h3> tag data
}
Another way to extract data using preg_match_all()
$html = file_get_contents($_REQUEST['url']);
preg_match_all('/<div class="swrapper">(.*?)<\/div>/s', $html, $matches);
// specify the class to get the data of that class
foreach ($matches[1] as $node) {
echo $node."<br><br><br>";
}
Use file_get_contents
Sample code
<?php
$homepage = file_get_contents('http://www.google.com/');
echo $homepage;
?>
Yes Sure ... Use file_get_contents('$URl') function to get the source code of the target page or use curl if you prefer using curl .. and scrap all data you need with preg_match_all() function
Note : If the target url has https:// then you should use curl to get the source code
Example
http://stackoverflow.com/questions/2838253/php-curl-preg-match-extract-text-from-xhtml

Replace an element with Dom Document PHP

I load a html page with PHP Dom Document :
$doc = new DOMDocument();
#$doc->loadHTMLFile($url);
I search in my page all "a" elements, and if they realize my condition i need to replace for example My link is beautiful by just My link is beautiful
Here my loop :
$liens = $div->getElementsByTagName('a');
foreach($liens as $lien){
if($lien->hasAttribute('href')){
if (preg_match("/metz2/i", $lien->getAttribute('href'))) {
//HERE I NEED TO REPLACE </a>
}
$cpt++;
}
}
Do you have any ideas ? Suggestions ? Thanks :)
Every time i need to manage DOM with PHP, i use a framework called PHP Simple HTLM DOM parser. (Link here)
It's very easy to use, something like this might work for you:
// Create DOM from URL or file
$html = file_get_html('http://www.page.com/');
// Find all links
foreach($html->find('a') as $element) {
//Do your custom logic here if you need it, for example this extracts the inner contents of the a-tag, and puts it freely.
$inner = $element->innertext;
$element->outertext($inner);
}
//To echo modified html again:
echo $html;
Could be done with preg_replace as well:
$sText = 'Stackoverflow';
$sText = preg_replace( '/<a.*>(.*)<\/a>/', '$1', $sText );
echo $sText;

Parsing data with REGEX (PHP)

I want to parse data in between brackets. Below is the code where you can see what I'm doing. I would also like to avoid using XML.
$query = '
[page:1]
<html>
all the html
</html>
[/page:1]
[page:2]
<html>
all the html
</html>
[/page:2]
';
I want to create a loop script that will use regex to find all instances of [page:x]; which in the example above is 2. And then with a get function we can specify the page we want.
if(isset($_GET['page'])) {
$page = $_GET['page'];
$regex = '\\['page':(.*?)\\';
echo preg_match($regex, $query);
}
Any thoughts?
This should find all the matching blocks at once:
preg_match_all('/\[page:([0-9]+)\](.+?)\[\/page:$1\]/', $page, $matches)
I strongly doubt regex is the most suitable solution for what you're trying to accomplish though.

Categories