PHP Parse content from url - php

I need some help with a study script I'm building that fetches articles from a website.
Currently I can get the article from one element but fail to get all of them. This is an example of the HTML I'm trying to parse:
<div class="entry-content">
</div>
<div class="entry-content">
</div>
<div class="entry-content">
</div>
This is my PHP code to get the content of the first div:
function getArticle($url){
$content = file_get_contents($url);
$first_step = explode( '<div class="entry-content">' , $content );
$separate_news = explode("</div>" , $first_step[1] );
$article = $separate_news[0];
echo $article;
}

You should really use PHP's DOMDocument class for parsing HTML. In terms of your example code, the problem is that you only read $first_step[1] and never process the remaining entries of the $first_step array. You could try something like this:
$first_steps = explode( '<div class="entry-content">' , $content );
foreach ($first_steps as $first_step) {
if (strpos($first_step, '</div>') === false) continue;
$separate_news = explode("</div>" , $first_step );
$article = $separate_news[0];
echo $article;
}
Here's a small demo on 3v4l.org

I have used this library before http://simplehtmldom.sourceforge.net/ . Full documentation is found here http://simplehtmldom.sourceforge.net/manual.htm .
It's very easy to use and does a lot more.
You could select your articles like:
$html = file_get_html($url);
$articles = $html->find(".entry-content");
foreach($articles as $article) echo $article->plaintext;

You should use DOMDocument. Although it is a bit tricky to select nodes by CSS class, you can do it with DomXPath like this:
$dom = new DomDocument();
$dom->loadHTMLFile($url); // load() expects well-formed XML; loadHTMLFile() parses HTML
$xpath = new DomXPath($dom);
$classname = "entry-content";
$nodes = $xpath->query('//*[contains(concat(" ", normalize-space(@class), " "), " ' . $classname . ' ")]');
foreach($nodes as $node) {
echo $node->textContent . "\n";
}
Another advantage is that HTML entities and other markup that might occur inside the article content are converted as expected: &amp; becomes &, and <b>bold</b> just becomes bold.
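As a minimal sketch of that behaviour, the snippet below parses a hard-coded HTML string instead of a live URL. The sample markup and the function name entryContentTexts() are made up for illustration; warnings from imperfect markup are silenced with libxml_use_internal_errors().

```php
<?php
// Hypothetical sample markup standing in for the fetched page.
$html = '<div class="entry-content">Fish &amp; <b>chips</b></div>'
      . '<div class="sidebar">ignored</div>'
      . '<div class="post entry-content">Second &lt;article&gt;</div>';

// Returns the decoded text of every element carrying the entry-content class.
function entryContentTexts(string $html): array
{
    $dom = new DOMDocument();
    // Collect parser warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $query = '//*[contains(concat(" ", normalize-space(@class), " "), " entry-content ")]';

    $texts = [];
    foreach ($xpath->query($query) as $node) {
        // textContent drops the tags and decodes the entities.
        $texts[] = $node->textContent;
    }
    return $texts;
}

print_r(entryContentTexts($html));
```

The second div shows that the class test also matches elements with several classes, while the sidebar div is skipped.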

Related

PHP DOMDocument node.Value Replacement

I have 3 p tags in email.php
$output='<p>Hey Jim</p>';
$output.='<p>We appreciate you are looking at using our services!</p>';
$output.='<p>Thanks Again</p>';
I want to be able to replace the text within those p tags on the fly from test.php with the text from newp1, newp2, and newp3.
$newp1 = "Hello Mark";
$newp2 = "We have scheduled your pick-up for tomorrow morning.";
$newp3 = "Any questions gives us a call.";
$url = 'email.php';
$html = file_get_contents($url);
$doc = new DOMDocument();
@$doc->loadHTML($html);
$nodes = $doc->getElementsByTagName('p');
foreach($nodes as $item ){
echo $item->nodeValue.'<br>';
}
I am currently echoing them to see them, but have no clue on how to actually replace them.
No DOMDocument is required for this. In email.php you can use placeholders, something like this:
$output='<p>##msg1##</p>';
$output.='<p>##msg2##</p>';
$output.='<p>##msg3##</p>';
and in test.php:
$html = str_replace("##msg1##", $newp1, $html);
$html = str_replace("##msg2##", $newp2, $html);
$html = str_replace("##msg3##", $newp3, $html);
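If you would rather stay with DOMDocument as in the question, a rough sketch could look like the following. It assumes the paragraphs are replaced in document order; the function name replaceParagraphs() is hypothetical, and the markup is inlined here instead of being read from email.php.

```php
<?php
// Markup from the question, inlined for the example.
$html = '<p>Hey Jim</p>'
      . '<p>We appreciate you are looking at using our services!</p>'
      . '<p>Thanks Again</p>';

$replacements = [
    'Hello Mark',
    'We have scheduled your pick-up for tomorrow morning.',
    'Any questions gives us a call.',
];

// Replaces the text of the n-th <p> with the n-th replacement string.
function replaceParagraphs(string $html, array $replacements): string
{
    $doc = new DOMDocument();
    $doc->loadHTML($html);

    $i = 0;
    foreach ($doc->getElementsByTagName('p') as $p) {
        if (isset($replacements[$i])) {
            // Remove the old children, then append the new text node.
            while ($p->firstChild) {
                $p->removeChild($p->firstChild);
            }
            $p->appendChild($doc->createTextNode($replacements[$i]));
        }
        $i++;
    }

    // Serialize only the children of <body>, not the wrapper loadHTML() adds.
    $out = '';
    foreach ($doc->getElementsByTagName('body')->item(0)->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

echo replaceParagraphs($html, $replacements);
```

Using createTextNode() means the new text is escaped on output, so characters such as & or < in a message do not break the markup.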

php : parse html : extract script tags from body and inject before </body>?

I don't care what the library is, but I need a way to extract <script> elements from the <body> of a page (as a string). I then want to insert the extracted <script>s just before </body>.
Ideally, I'd like to extract the <script>s into 2 types:
1) External (those that have the src attribute)
2) Embedded (those with code between <script></script>)
So far I've tried with phpDOM, Simple HTML DOM and Ganon.
I've had no luck with any of them (I can find links and remove/print them - but fail with scripts every time!).
Alternative to
https://stackoverflow.com/questions/23414887/php-simple-html-dom-strip-scripts-and-append-to-bottom-of-body
(Sorry to repost, but it's been 24 Hours of trying and failing, using alternative libs, failing more etc.).
Based on the lovely RegEx answer from @alreadycoded.com, I managed to botch together the following:
$output = "<html><head></head><body><!-- Your stuff --></body></html>";
$content = '';
$js = '';
// 1) Grab <body>
preg_match_all('#(<body[^>]*>.*?<\/body>)#ims', $output, $body);
$content = implode('',$body[0]);
// 2) Find <script>s in <body>
preg_match_all('#<script(.*?)<\/script>#is', $content, $matches);
foreach ($matches[0] as $value) {
$js .= '<!-- Moved from [body] --> '.$value;
}
// 3) Remove <script>s from <body>
$content2 = preg_replace('#<script(.*?)<\/script>#is', '<!-- Moved to [/body] -->', $content);
// 4) Add <script>s to bottom of <body>
$content2 = preg_replace('#<body(.*?)</body>#is', '<body$1'.$js.'</body>', $content2);
// 5) Replace <body> with new <body>
$output = str_replace($content, $content2, $output);
Which does the job, and isn't that slow (fraction of a second)
Shame none of the DOM stuff was working (or I wasn't up to wading through naffed objects and manipulating).
To select all script nodes with a src attribute:
$xpathWithSrc = '//script[@src]';
To select all script nodes with content:
$xpathWithBody = '//script[string-length(text()) > 1]';
Basic usage (replace the query with your actual XPath query):
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXpath($doc);
foreach($xpath->query('//body//script[string-length(text()) > 1]') as $queryResult) {
// access the element here. Documentation:
// http://www.php.net/manual/de/class.domelement.php
}
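Building on that usage, here is a hedged sketch of actually moving the scripts with DOMDocument. The function name moveScriptsToBodyEnd() is made up for this example, and it relies on appendChild() moving a node that is already part of the document. To treat external and embedded scripts separately, the query could be swapped for '//body//script[@src]' or '//body//script[not(@src)]'.

```php
<?php
// Moves every <script> found inside <body> to the end of <body>,
// keeping the scripts in their original document order.
function moveScriptsToBodyEnd(string $html): string
{
    $doc = new DOMDocument();
    // Collect parser warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $body = $doc->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return $html; // nothing to do without a body element
    }

    $xpath = new DOMXPath($doc);
    // Copy the result into a plain array before changing the tree.
    $scripts = iterator_to_array($xpath->query('//body//script'));
    foreach ($scripts as $script) {
        // Appending an attached node detaches it from its old position.
        $body->appendChild($script);
    }

    return $doc->saveHTML();
}

$page = '<html><head></head><body><script src="a.js"></script>'
      . '<p>Hi</p><script>var x = 1;</script></body></html>';
echo moveScriptsToBodyEnd($page);
```

Note that saveHTML() re-serializes the whole document, so whitespace and the doctype may differ from the input string.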
$js = "";
$content = file_get_contents("http://website.com");
preg_match_all('#<script(.*?)</script>#is', $content, $matches);
foreach ($matches[0] as $value) {
$js .= $value;
}
$content = preg_replace('#<script(.*?)</script>#is', '', $content);
echo $content = preg_replace('#<body(.*?)</body>#is', '<body$1'.$js.'</body>', $content);
If you're really looking for an easy lib for this, I can recommend this one:
$dom = str_get_html($html);
$scripts = $dom->find('script')->remove;
$dom->find('body', 0)->after($scripts);
echo $dom;
There's really no easier way to do things like this in PHP.

PHP regex in simple_html_dom library

I was trying to scrape IMDb with the following code.
$url = "http://www.imdb.com/search/title?languages=en|1&explore=year";
$html = new simple_html_dom();
$html->load(str_replace(' ','',$data = get_data($url)));
foreach($html->find('#left') as $total_movies)
{
$content = $total_movies->plaintext;
if(preg_match("/(?<total>[0-9,]+) titles/",$content,$matches))
{
print_r($matches);
}
echo $content."<br>";
}
get_data() is just a cURL function I created.
The problem is that preg_match is not working. I don't know why, because the same pattern works in the standalone test below, where $content contains the text I scrape with the code above.
$content = "1-50 of 101 titles.";
if(preg_match("/(?<total>[0-9,]+) titles/",$content,$matches))
print_r($matches);
The source on the site is actually:
<div id="left">
1-50 of 564,592
titles.
</div>
Notice the line break (\n) before "titles." in the source; it needs to be stripped out or allowed for in your pattern.
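One way to allow for the line break, sketched under the assumption that only whitespace separates the number from the word, is to match \s+ instead of a literal space. The helper name totalTitles() is hypothetical.

```php
<?php
// Sample text with the line break seen in the page source.
$content = "1-50 of 564,592\ntitles.";

// Returns the total as an integer, or null when the pattern does not match.
function totalTitles(string $text): ?int
{
    // \s+ accepts spaces, tabs and newlines between the number and "titles".
    if (preg_match('/(?<total>[0-9,]+)\s+titles/', $text, $matches)) {
        return (int) str_replace(',', '', $matches['total']);
    }
    return null;
}

var_dump(totalTitles($content)); // int(564592)
```

The same pattern still matches the single-line test string "1-50 of 101 titles.".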
Here's a method to reach your goal without using any extra library.
<?php
$url = "http://www.imdb.com/search/title?languages=en|1&explore=year";
$temp=file_get_contents($url);
$xml = new DOMDocument();
@$xml->loadHTML($temp);
foreach($xml->getElementsByTagName('div') as $div) {
if($div->getAttribute('id')=='left'){
preg_match("#of ([0-9,]+)#",$div->nodeValue,$match);
$matchs[]=preg_replace('/[^0-9]/', '', $match[0]);
}
}
echo number_format($matchs[0]); //564,592
?>

Extract and dump a DOM node (and its children) in PHP

I have the following scenario and I've already spent hours trying to handle it: I'm developing a WordPress theme (hence PHP) and I want to check whether the content of a post (which is HTML) contains a tag with a certain id/class. If so, I want to extract it from the content and place it somewhere else.
Example: Let's say the text content of the Wordpress post is
<?php
/* $content actually comes from WP function get_the_content() */
$content = '<p>some text and so forth that I don\'t care about...</p> <div class="the-wanted-element"><p>I WANT THIS DIV!!!</p></div>';
?>
So how can I extract that div with the class (could also live with giving it an ID), output it (with tags and all that) in one place of the template, and output the rest (without the extracted tag, of course) in another place of the template?
I've already tried with the DOMDocument class, p.i.t.a. to me, maybe I'm too stupid.
Try:
$content = '<p>some text and so forth that I don\'t care about...</p> <div class="the-wanted-element"><p>I WANT THIS DIV!!!</p></div>';
$dom = new DomDocument;
$dom->loadHtml($content);
$xpath = new DomXpath($dom);
$contents = '';
foreach ($xpath->query('//div[@class="the-wanted-element"]') as $node) {
$contents = $dom->saveXml($node);
break;
}
echo $contents;
How to get the remaining xml/html:
$content = '<p>some text and so forth that I don\'t care about...</p> <div class="the-wanted-element"><p>I WANT THIS DIV!!!</p></div>';
$dom = new DomDocument;
$dom->loadHtml($content);
$xpath = new DomXpath($dom);
foreach ($xpath->query('//div[@class="the-wanted-element"]') as $node) {
$node->parentNode->removeChild($node);
break;
}
$contents = '';
foreach ($xpath->query('//body/*') as $node) {
$contents .= $dom->saveXml($node);
}
echo $contents;
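Both steps can also be combined so the content is parsed only once. This is a sketch, not a drop-in for WordPress: the function name splitContent() is made up, it uses saveHTML() with a node argument instead of saveXml(), and it assumes the class name contains no quote characters because it is placed directly into the XPath query.

```php
<?php
// Returns [extracted element markup, remaining markup].
function splitContent(string $content, string $class): array
{
    $dom = new DOMDocument();
    // Collect parser warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($content);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    // Matches the class even when the element has several classes.
    $query = sprintf(
        '//div[contains(concat(" ", normalize-space(@class), " "), " %s ")]',
        $class
    );

    $extracted = '';
    $node = $xpath->query($query)->item(0);
    if ($node !== null) {
        $extracted = $dom->saveHTML($node);    // serialize with tags
        $node->parentNode->removeChild($node); // then take it out of the tree
    }

    // Whatever elements are left directly under <body> form the rest.
    $rest = '';
    foreach ($xpath->query('//body/*') as $child) {
        $rest .= $dom->saveHTML($child);
    }

    return [$extracted, $rest];
}

$content = '<p>some text and so forth that I don\'t care about...</p> '
         . '<div class="the-wanted-element"><p>I WANT THIS DIV!!!</p></div>';
list($wanted, $rest) = splitContent($content, 'the-wanted-element');
echo $wanted, "\n", $rest;
```

In a template, $wanted and $rest could then be echoed in two different places.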

Replace the content of a tag with a certain class

I am looking for suitable replacement code that allows me replace the content inside of any HTML tag that has a certain class e.g.
$class = "blah";
$content = "new content";
$html = '<div class="blah">hello world</div>';
// code to replace, $html now looks like:
// <div class="blah">new content</div>
Bear in mind that:
It won't necessarily be a div; it could be <h2 class="blah">
The class attribute can hold more than one class and the content still needs to be replaced, e.g. <div class="foo blah green">hello world</div>
I am thinking regular expressions should be able to do this, if not I am open to other suggestions such as using the DOM class (although I would rather avoid this if possible because it has to be PHP4 compatible).
Do not use regular expressions to parse HTML. You can use the built in DOMDocument, or something like simple_html_dom:
require_once("simple_html_dom.php");
$class = "blah";
$content = "new content";
$html = '<div class="blah">hello world</div>';
$doc = new simple_html_dom();
$doc->load($html);
foreach ( $doc->find("." . $class) as $node ) {
$node->innertext = $content;
}
Sorry, I didn't see the PHP4 requirement. Here's a solution using the standard DOMDocument as mentioned above.
function DOM_getElementByClassName($referenceNode, $className, $index=false) {
$className = strtolower($className);
$response = array();
foreach ( $referenceNode->getElementsByTagName("*") as $node ) {
$nodeClass = strtolower($node->getAttribute("class"));
if (
$nodeClass == $className ||
preg_match("/\b" . preg_quote($className, "/") . "\b/", $nodeClass)
) {
$response[] = $node;
}
}
if ( $index !== false ) {
return isset($response[$index]) ? $response[$index] : false;
}
return $response;
}
$doc = new DOMDocument();
$doc->loadHTML($html);
foreach ( DOM_getElementByClassName($doc, $class) as $node ) {
$node->nodeValue = $content;
}
echo $doc->saveHTML();
If you are sure that $html is valid HTML code, you could use an HTML parser or even an XML parser if it's valid XML code.
But the quick and dirty way in Regex would be something like:
$html = preg_replace('/(<[^>]+ class="[^>]*' . $class . '[^"]*"[^>]*>)[^<]+(<\/[^>]+>)/siU', '$1' . $content . '$2', $html);
Didn't test it too much, but it should work. Tell me if you find cases where it doesn't. ;)
Edit: Added "and dirty"... ;)
Edit 2: New version of the RegEx:
<?php
$class = "blah";
$content = "new content";
$html = '<div class="blah test"><h1><span>hello</span> world</h1></div><div class="other">other content</div><h2 class="blah">remove this</h2>';
$html = preg_replace('/<([\w]+)(\s[^>]*class="[^"]*' . $class . '[^"]*"[^>]*>).+(<\/\\1>)/siU', '<$1$2' . $content . '$3', $html);
echo $html;
?>
The last problem left is if there's a class that only has "blah" in its name, like "tooMuchBlahNow". Let's see how we can address that. Btw: Is it obvious already that I love playing with RegEx? ;)
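One possible way around that, outside of the regex itself, is to compare whole class tokens once the value of the class attribute is available (for example from getAttribute("class")). The helper name hasClass() is hypothetical; this is only a sketch of the token comparison.

```php
<?php
// True only if $class appears as a complete, whitespace-separated token.
function hasClass(string $classAttr, string $class): bool
{
    // Split on any run of whitespace and drop empty pieces.
    $tokens = preg_split('/\s+/', trim($classAttr), -1, PREG_SPLIT_NO_EMPTY);
    return in_array($class, $tokens, true);
}

var_dump(hasClass('foo blah green', 'blah')); // bool(true)
var_dump(hasClass('tooMuchBlahNow', 'blah')); // bool(false)
var_dump(hasClass('blah-box', 'blah'));       // bool(false)
```

The last case is one that a \b word-boundary test would accept, because the hyphen counts as a boundary.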
There is no need to use the DOM class, this would probably be done quickest using jQuery, as Khnle said, or you could use the preg_replace() function. Give me some time, I may write a quick regex for you.
But I would recommend using something like jQuery, this way you can serve the page up to the user quickly and allow their computer to do the processing instead of your server.
