PHP regex in simple_html_dom library - php

I was trying to scrape IMDb with the following code.
$url = "http://www.imdb.com/search/title?languages=en|1&explore=year";
$html = new simple_html_dom();
$html->load(str_replace(' ','',$data = get_data($url)));
foreach($html->find('#left') as $total_movies)
{
$content = $total_movies->plaintext;
if(preg_match("/(?<total>[0-9,]+) titles/",$content,$matches))
{
print_r($matches);
}
echo $content."<br>";
}
get_data() is just a cURL wrapper function I wrote.
The problem is that preg_match does not match. I don't know why, because the same pattern works in the snippet below, where $content holds the kind of text the code above scrapes.
$content = "1-50 of 101 titles.";
if(preg_match("/(?<total>[0-9,]+) titles/",$content,$matches))
print_r($matches);

The source on the site is actually:
<div id="left">
1-50 of 564,592
titles.
</div>
Notice the newline (\n) between the number and "titles": it needs to be stripped out or allowed for in your pattern.
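One way to allow for the newline is to match any whitespace with \s+ instead of a literal space. The snippet below is a minimal sketch that uses a hard-coded string in place of the scraped page; the variable $total is only illustrative. Note that the code in the question also removes every space with str_replace before matching, so a literal space in the pattern could never match there either.

```php
<?php
// Text as it appears in the page source: a newline sits between the number and "titles".
$content = "1-50 of 564,592\ntitles.";

// \s+ accepts any run of whitespace (space, tab, newline) before "titles".
if (preg_match('/(?<total>[0-9,]+)\s+titles/', $content, $matches)) {
    // Strip the thousands separators to get a plain integer.
    $total = (int) str_replace(',', '', $matches['total']);
    echo $total, "\n";
}
```

The named group keeps working as in the original pattern, so $matches['total'] still holds the formatted number.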
Here's a way to reach your goal without any extra library.
<?php
$url = "http://www.imdb.com/search/title?languages=en|1&explore=year";
$temp=file_get_contents($url);
$xml = new DOMDocument();
@$xml->loadHTML($temp); // @ suppresses the warnings loadHTML raises on malformed markup
foreach($xml->getElementsByTagName('div') as $div) {
if($div->getAttribute('id')=='left'){
if(preg_match("#of ([0-9,]+)#",$div->nodeValue,$match)){
// $match[1] is the captured number; strip the thousands separators
$matchs[]=preg_replace('/[^0-9]/', '', $match[1]);
}
}
}
echo number_format($matchs[0]); //564,592
?>

Related

php : parse html : extract script tags from body and inject before </body>?

I don't care what the library is, but I need a way to extract <script> elements from the <body> of a page (as a string). I then want to insert the extracted <script>s just before </body>.
Ideally, I'd like to split the <script>s into 2 types:
1) External (those that have the src attribute)
2) Embedded (those with code between <script></script>)
So far I've tried with phpDOM, Simple HTML DOM and Ganon.
I've had no luck with any of them (I can find links and remove/print them - but fail with scripts every time!).
Alternative to
https://stackoverflow.com/questions/23414887/php-simple-html-dom-strip-scripts-and-append-to-bottom-of-body
(Sorry to repost, but it's been 24 Hours of trying and failing, using alternative libs, failing more etc.).
Based on the lovely RegEx answer from @alreadycoded.com, I managed to botch together the following:
$output = "<html><head></head><body><!-- Your stuff --></body></html>";
$content = '';
$js = '';
// 1) Grab <body>
preg_match_all('#(<body[^>]*>.*?<\/body>)#ims', $output, $body);
$content = implode('',$body[0]);
// 2) Find <script>s in <body>
preg_match_all('#<script(.*?)<\/script>#is', $content, $matches);
foreach ($matches[0] as $value) {
$js .= '<!-- Moved from [body] --> '.$value;
}
// 3) Remove <script>s from <body>
$content2 = preg_replace('#<script(.*?)<\/script>#is', '<!-- Moved to [/body] -->', $content);
// 4) Add <script>s to bottom of <body>
$content2 = preg_replace('#<body(.*?)</body>#is', '<body$1'.$js.'</body>', $content2);
// 5) Replace <body> with new <body>
$output = str_replace($content, $content2, $output);
Which does the job, and isn't that slow (fraction of a second)
Shame none of the DOM stuff was working (or I wasn't up to wading through naffed objects and manipulating).
To select all script nodes with a src-attribute
$xpathWithSrc = '//script[@src]';
To select all script nodes with content:
$xpathWithBody = '//script[string-length(text()) > 1]';
Basic usage(Replace the query with your actual xpath-query):
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXpath($doc);
foreach($xpath->query('//body//script[string-length(text()) > 1]') as $queryResult) {
// access the element here. Documentation:
// http://www.php.net/manual/de/class.domelement.php
}
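Putting the XPath query to work, the sketch below moves every script inside the body to the end of the body using only the built-in DOM extension. The function name moveScriptsToBodyEnd is hypothetical, and the code assumes the input is a complete HTML document; it is an illustration, not a tested drop-in replacement for the regex version.

```php
<?php
// Sketch: move every <script> found inside <body> to the end of <body>.
function moveScriptsToBodyEnd(string $html): string
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // silence warnings from imperfect markup
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $body = $xpath->query('//body')->item(0);
    if ($body === null) {
        return $html; // nothing to do without a body element
    }

    // Copy the result into an array so the loop works on a fixed snapshot.
    $scripts = iterator_to_array($xpath->query('//body//script'));
    foreach ($scripts as $script) {
        // appendChild() on a node already in the document moves it.
        $body->appendChild($script);
    }
    return $doc->saveHTML();
}

$page = '<html><head></head><body><script>a()</script><p>x</p>'
      . '<script src="b.js"></script></body></html>';
echo moveScriptsToBodyEnd($page);
```

To treat external and embedded scripts differently, the single query can be swapped for the two queries above ('//body//script[@src]' and '//body//script[string-length(text()) > 1]').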
$js = "";
$content = file_get_contents("http://website.com");
preg_match_all('#<script(.*?)</script>#is', $content, $matches);
foreach ($matches[0] as $value) {
$js .= $value;
}
$content = preg_replace('#<script(.*?)</script>#is', '', $content);
echo $content = preg_replace('#<body(.*?)</body>#is', '<body$1'.$js.'</body>', $content);
If you're really looking for an easy lib for this, I can recommend this one:
$dom = str_get_html($html);
$scripts = $dom->find('script')->remove;
$dom->find('body', 0)->after($scripts);
echo $dom;
There's really no easier way to do things like this in PHP.

How to get page title in php?

I have this function to get title of a website:
function getTitle($Url){
$str = file_get_contents($Url);
if(strlen($str)>0){
preg_match("/\<title\>(.*)\<\/title\>/",$str,$title);
return $title[1];
}
}
However, this function makes my page take too long to respond. Someone told me to get the title from the request header of the website only, which wouldn't read the whole file, but I don't know how. Can anyone please tell me which code and function I should use to do this? Thank you very much.
Using regex is not a good idea for HTML; use a DOM parser instead:
$html = new simple_html_dom();
$html->load_file('****'); //put url or filename
$title = $html->find('title', 0); // index 0 returns the first match instead of an array
echo $title->plaintext;
or
// Create DOM from URL or file
$html = file_get_html('*****');
// Print the text of every <title> element
foreach($html->find('title') as $element)
echo $element->plaintext . '<br>';
Good read
RegEx match open tags except XHTML self-contained tags
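The same idea also works with PHP's built-in DOM extension, without a third-party library. This is a minimal sketch; the function name extractTitle is hypothetical, and it takes an HTML string that has already been fetched, so it does not address download time by itself.

```php
<?php
// Sketch: return the text of the first <title> element, or null if there is none.
function extractTitle(string $html): ?string
{
    if ($html === '') {
        return null; // loadHTML() rejects an empty string
    }

    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // silence warnings from imperfect markup
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $titles = $doc->getElementsByTagName('title');
    if ($titles->length === 0) {
        return null;
    }
    return trim($titles->item(0)->textContent);
}

$title = extractTitle('<html><head><title> Hello </title></head><body></body></html>');
echo $title === null ? 'no title' : $title, "\n";
```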
Use jQuery instead to get the title of your page:
$(document).ready(function() {
alert($("title").text());
});
Demo : http://jsfiddle.net/WQNT8/1/
Try this; it should work:
include_once 'simple_html_dom.php';
// file_get_html() loads a URL or file; str_get_html() expects an HTML string
$oHtml = file_get_html($url);
// Passing index 0 to find() returns the first match directly
$Title = $oHtml->find('title', 0)->innertext;
$Description = $oHtml->find("meta[name='description']", 0)->content;
$keywords = $oHtml->find("meta[name='keywords']", 0)->content;
echo $Title;
echo $Description;
echo $keywords;

php simple html dom parser how to get the content of html tag

I am trying to get the content of a specific tag, but I am not able to do so using the following function:
<?PHP
include_once('simple_html_dom.php');
function read_page($url = 'http://google.com')
{
$doc = new DOMDocument();
$data = file_get_html($url);
$content = $data->find('div#footer');
print_r( $content);
}
read_page();
?>
Try $data->find('div[id="footer"]')

Echoing only a div with php

I'm attempting to make a script that echoes only the div that encloses the image on Google.
$url = "http://www.google.com/";
$page = file($url);
foreach($page as $theArray) {
echo $theArray;
}
The problem is that this echoes the whole page.
I want to echo only the part between <div id="lga"> and the next closest </div>.
Note: I have tried using if statements, but they weren't working, so I deleted them.
Thanks
Use the built-in DOM methods:
<?php
$page = file_get_contents("http://www.google.com");
$domd = new DOMDocument();
libxml_use_internal_errors(true);
$domd->loadHTML($page);
libxml_use_internal_errors(false);
$domx = new DOMXPath($domd);
$lga = $domx->query("//*[@id='lga']")->item(0);
$domd2 = new DOMDocument();
$domd2->appendChild($domd2->importNode($lga, true));
echo $domd2->saveHTML();
In order to do this you need to parse the DOM and then get the ID you are looking for. Check out a parsing library like this http://simplehtmldom.sourceforge.net/manual.htm
After feeding your html document into the parser you could call something like:
$html = str_get_html($page);
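$element = $html->find('div[id=lga]', 0); // index 0 returns the first match instead of an array
echo $element->outertext; // outertext includes the div's own markup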
That, I think, would be your quickest and easiest solution.

Replacing words in php with preg_replace in a given div area

I want to replace some words on the fly on my website.
$content = preg_replace('/\bWord\b/i', 'Replacement', $content);
That works so far. But now I want to change only the words that are inside
div id="content"
How do I do that?
$dom = new DOMDocument();
$dom->loadHTML($html);
$x = new DOMXPath($dom);
$pattern = '/foo/';
foreach($x->query("//div[@id='content']//text()") as $text){
preg_match_all($pattern,$text->wholeText,$occurances,PREG_OFFSET_CAPTURE);
$occurances = array_reverse($occurances[0]);
foreach($occurances as $occ){
$text->replaceData($occ[1],strlen($occ[0]),'oof');
}
//alternative if you want to do it in one go:
//$text->parentNode->replaceChild(new DOMText(preg_replace($pattern,'oof',$text->wholeText)),$text);
}
echo $dom->saveHTML();
//replaces all occurrences of 'foo' with 'oof'
//if you don't really need a regex to match a word, you can limit the text-nodes
//searched by altering the xpath to "//div[@id='content']//text()[contains(.,'searchword')]"
Use the the_content filter; you can place it in your theme's functions.php file:
add_filter('the_content', 'your_custom_filter');
function your_custom_filter($content) {
$pattern = '/\bWord\b/i';
$content = preg_replace($pattern,'Replacement', $content);
return $content;
}
UPDATE: This applies only if you are using WordPress of course.
If the content is dynamically driven then just echo the return value of $content into the div with id of content. If the content is static then you'll have to either use this PHP snippet on the text then echo out the return into the div, or use JavaScript (dirty method!).
$content = "Your string of text goes here";
$content = preg_replace('/\bWord\b/i', 'Replacement', $content);
<div id="content">
<?php echo $content; ?>
</div>
