I am trying to extract the HTML content from a web page. I want only the content inside the <body> tags.
//$validLink is a link with .htm extension, source code is rather large
//contains 24,000 lines of html code
$thehtml = file_get_contents($validlink);
$thehtml = preg_match("/<body.*?>(.*?)<\/body>/is", $thehtml);
What else can I do? $thehtml is empty. I am trying to insert this into a WordPress post, but $thehtml is empty for some odd reason. Could there be a timeout issue?
It can't be a timeout issue, though: I can output the result of file_get_contents($validlink) on its own, yet for some reason BODY is not found.
Another possible solution would be to get just the content between the first div and the last div found in the document.
Get the positions of the opening and closing tags with strpos(), then pass those positions to substr() to extract the content between them.
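The strpos()/substr() idea above can be sketched roughly as follows. The function name extractBetween() is hypothetical, and the sketch assumes you want everything between the end of the first opening tag and the last closing tag; it uses the case-insensitive stripos()/strripos() variants so that <BODY> is found too.

```php
<?php
// Hypothetical helper: returns the text between the end of the first
// opening tag (e.g. "<body ...>") and the last closing tag (e.g. "</body>"),
// or null when either tag is missing.
function extractBetween(string $html, string $openTagStart, string $closeTag): ?string
{
    $start = stripos($html, $openTagStart);   // where "<body" begins
    if ($start === false) {
        return null;
    }
    $start = strpos($html, '>', $start);      // end of the opening tag
    if ($start === false) {
        return null;
    }
    $start++;                                 // first character after '>'

    $end = strripos($html, $closeTag);        // last "</body>"
    if ($end === false || $end < $start) {
        return null;
    }
    return substr($html, $start, $end - $start);
}

$sample = '<html><body class="x"><div>Hello</div></body></html>';
echo extractBetween($sample, '<body', '</body>'), "\n"; // <div>Hello</div>
```

The same call with '<div' and '</div>' would return the content between the first and the last div, as suggested in the question.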
$thehtml = file_get_contents($validlink);
$thehtml = preg_match("/<body.*?>(.*?)<\/body>/is", $thehtml,$matches);
$thehtml = $matches[0];
Here is the correct code:
$thehtml = file_get_contents($validlink);
preg_match('/<body.*?>(.*?)<\/body>/is', $thehtml, $matches);
$thehtml = $matches[1];
But I suggest you use a DOM parser instead.
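A minimal sketch of the DOM parser route, using PHP's built-in DOMDocument. The helper name bodyInnerHtml() is made up for illustration, and the sketch assumes the markup is passed in as a non-empty string (for example, the result of file_get_contents()).

```php
<?php
// Hypothetical helper: returns the inner HTML of <body>, or null if the
// parser produced no <body> element.
function bodyInnerHtml(string $html): ?string
{
    $doc = new DOMDocument();
    // Real-world pages are rarely valid; collect parser warnings silently.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $body = $doc->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return null;
    }

    // Serialize each child of <body> so the <body> tag itself is excluded.
    $inner = '';
    foreach ($body->childNodes as $child) {
        $inner .= $doc->saveHTML($child);
    }
    return $inner;
}

echo bodyInnerHtml('<html><body><div id="a">Hello</div><p>World</p></body></html>');
```

Unlike the regex, this does not depend on how the <body> tag is written or on attribute order, and it tolerates unclosed tags.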
Related
I need to check if a webpage outside of my site has a specific word on it. I’ve tried file_get_contents() but it doesn’t return anything. Is there any way I can do this in PHP?
edit: Here’s what I’ve tried:
$query = 'example';
$file = "https:// www.site.com/search?q=$query";
// tested url and it works, had to add space to post it
$contents = file_get_contents($file);
echo $contents;
I was expecting it to just output the entire page for me to use .includes() on later but it just doesn’t output anything.
Look into curl to get the contents of a web page. Then you can use preg_match to find the word.
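As a rough sketch of that suggestion: fetchPage() and pageContainsWord() below are hypothetical names, the user-agent string is a placeholder, and the sketch assumes the cURL extension is installed. Some sites return nothing to clients without a browser-like user agent, which may be why file_get_contents() came back empty, but that is only a guess.

```php
<?php
// Hypothetical helper: fetches a URL with cURL and returns the response
// body, or null on failure.
function fetchPage(string $url): ?string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true, // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true, // follow redirects
        CURLOPT_TIMEOUT        => 15,   // seconds
        CURLOPT_USERAGENT      => 'Mozilla/5.0 (compatible; ExampleBot/1.0)', // placeholder
    ]);
    $body = curl_exec($ch);
    if ($body === false) {
        error_log('cURL failed: ' . curl_error($ch));
        return null;
    }
    return $body;
}

// Hypothetical helper: true if $word appears as a whole word in the visible
// text of $html (tags are stripped first; the match is case-insensitive).
function pageContainsWord(string $html, string $word): bool
{
    $text = strip_tags($html);
    return preg_match('/\b' . preg_quote($word, '/') . '\b/i', $text) === 1;
}

// Demonstration on a literal string, so nothing is fetched here.
var_dump(pageContainsWord('<p>An <b>example</b> page</p>', 'example')); // bool(true)
```

With a real page you would pass the result of fetchPage($file) to pageContainsWord() after checking that it is not null.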
From the official manual I know that I can get all the comments with the following code:
// Find all comment (<!--...-->) blocks
$es = $html->find('comment');
But this creates an array of comment nodes. I want to get the content of the comments as string. How could I do that?
I've tried with $es->plaintext, $es->innertext and $es->outertext.
Here is an example of what I want:
HTML:
...
<div id='a'>
<!-- Some text -->
</div>
...
PHP:
...
$content = $html->find('div[id=a]', 0)->find('comment', 0)->some_attr;
echo 'Content:'.$content;
Browser:
Content: Some text
Thanks in advance !
I've found the solution!
When we load HTML with SimpleHTMLDom, the comments (along with scripts and other things) are removed from the document and saved in an array called 'noise'.
We can retrieve a comment, script, etc. by searching for a string pattern in the whole list of noises, and there is a function to do that.
This is the solution:
$html->search_noise($subString);
So, in my own example, the solution can be:
1.- $comment = $html->search_noise('Some');
2.- $comment = $html->search_noise('text');
3.- $comment = $html->search_noise('me te');
4.- etc etc
The search_noise function returns the first noise that matches the pattern, so we have to be a little careful with the chosen substring.
I was trying to get data from a webpage using PHP and file_get_contents along with regular expressions, but I can't seem to get the correct data from the page.
Here is my code,
<?php
$homepage = file_get_contents('http://www.website.com');
preg_match_all('/<p><b>(.*)<\ /b><br>(.*)<br>(.*)<\ /p>/ms', $homepage, $matches);
$def = $matches[0];
echo $def;
?>
My regular expressions aren't picking up anything even though there is html code that matches the expressions. As a test I also tried replacing the first preg_match function with the following one.
preg_match_all('/<div>(.*)<\ /div>/ms', $homepage, $matches);
This only picked up 2 of the many div tags on the page. What is wrong with my code and what is the correct way it should be written?
Thanks
Instead of using RegEx you could simply use PHP's Document Object Model.
$homepage = file_get_contents('http://www.website.com');
$DOM = new DOMDocument;
$DOM->loadHTML($homepage);
$items = $DOM->getElementsByTagName('div');
$def = $items->item(0)->nodeValue;
(referenced from this question).
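The snippet above reads only the first div. A sketch of collecting the text of every div, which is what the greedy (.*) in the original regex failed to do, might look like this; divTexts() is a hypothetical name and the input is assumed to be a non-empty HTML string.

```php
<?php
// Hypothetical helper: returns the trimmed text content of every <div>.
function divTexts(string $html): array
{
    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // silence warnings on messy HTML
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $texts = [];
    foreach ($dom->getElementsByTagName('div') as $div) {
        $texts[] = trim($div->textContent);       // text of the div and its descendants
    }
    return $texts;
}

print_r(divTexts('<div>one</div><p>skip</p><div>two <b>bold</b></div>'));
// Array ( [0] => one [1] => two bold )
```

Note that nested divs are each returned separately, so an outer div's text includes the text of any div inside it.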
What I want to do:
I have a div with an id. Whenever ">" occurs I want to replace it with ">>". I also want to prefix the div with "You are here: ".
Example:
<div id="bbp-breadcrumb">Home > About > Contact</div>
Context:
My div contains breadcrumb links for bbPress, but I'm trying to match its format to a site-wide breadcrumb plugin that I'm using for WordPress. The div is called as a function in PHP and output as HTML.
My question:
Do I use PHP or JavaScript to replace the symbols, and how do I go about getting the contents of the div in the first place?
Find the code that's generating the >, and either set the appropriate option (breadcrumb_separator or so) or modify the PHP code to change the separator.
Modifying supposedly static text with JavaScript is not only a maintenance nightmare, extremely brittle, and might lead to a strange rendering (as users see your site being modified if their system is slow), but will also not work in browsers without (or with disabled) JavaScript support.
You could use CSS to add the you are here text:
#bbp-breadcrumb:before {
content: "You are here: ";
}
Browser support:
http://www.quirksmode.org/css/beforeafter_content.html
You could change the > to >> with javascript:
var htmlElement = document.getElementById('bbp-breadcrumb');
htmlElement.innerHTML = htmlElement.innerHTML.split('&gt;').join('&gt;&gt;');
I don't recommend altering content like this; it is really hacky. You'd be better off changing the output rendering of the breadcrumb plugin if possible. Within WordPress this should be doable.
You can use a regex to match the breadcrumb content, make the changes to it, and put it back in its context.
Check if this helps you:
$the_existing_html = 'somethis before<div id="bbp-breadcrumb">Home > About > Contact</div>something after'; // let's say this is your current html.. just added some context
echo $the_existing_html, '<hr />'; // output.. so that you can see the difference at the end
$pattern ='|<div(.*)bbp-breadcrumb(.*)>(.*)<\/div>|sU'; // find some text that is in a div that has "bbp-breadcrumb" somewhere in its attributes list
$all = preg_match_all($pattern, $the_existing_html, $matches); // match that pattern
$current_bc = $matches[3][0]; // get the text inside that div
$new_bc = 'You are here: ' . str_replace('>', '>>', $current_bc);// replace entity for > with the same thing repeated twice
$the_final_html = str_replace($current_bc, $new_bc, $the_existing_html); // replace the initial breadcrumb with the new one
echo $the_final_html; // output to see where we got
After fetching an external page with cURL, I have its full source code, which contains something like this (the part I'm interested in):
(page...)<td valign='top' class='rdBot' align='center'><img src="/images/buy_tickets.gif" border="0" alt="T"></td> (page...)
So I'm using preg_match_all; I want to get only "buy_tickets.gif".
$pattern_before = "<td valign='top' class='rdBot' align='center'>";
$pattern_after = "</td>";
$pattern = '#'.$pattern_before.'(.*?)'.$pattern_after.'#si';
preg_match_all($pattern, $buffer, $matches, PREG_SET_ORDER);
Everything is fine up to now, but the problem is that the external page sometimes changes and the image I'm looking for is inside a link:
(page...)<td valign='top' class='rdBot' align='center'><a href="..."><img src="/images/buy_tickets.gif" border="0" alt="T"></a></td> (page...)
and I don't know how to make my code work in every case (not just when the image has no link).
Hope you understand.
Thanks in advance.
Don't use regex to parse HTML; use PHP's DOM extension. Try this:
$doc = new DOMDocument;
@$doc->loadHTMLFile( 'http://ventas.entradasmonumental.com/eventperformances.asp?evt=18' ); // Using the @ operator to hide parse errors
$xpath = new DOMXPath( $doc );
$img = $xpath->query( '//td[@class="BrdBot"][@align="center"][1]//img[1]')->item( 0 ); // Xpath->query returns a 'DOMNodeList', get the first item which is a 'DOMElement' (or null)
$imgSrc = $img->getAttribute( 'src' );
$imgSrcInfo = pathInfo( $imgSrc );
$imgFilename = $imgSrcInfo['basename']; // All you need
You're going to get lots of advice not to use regex for pulling stuff out of HTML code.
There are times when it's appropriate to use regex for this kind of thing, and I don't always agree with the somewhat rigid advice given on the subject here (and elsewhere). However in this case, I would say that regex is not the appropriate solution for you.
The problem with using regex for searching for things in HTML code is exactly the problem you've encountered -- HTML code can vary wildly, making any regex virtually impossible to get right.
It is just about possible to write a regex for your situation, but it will be an insanely complex regex, and very brittle -- ie prone to failing if the HTML code is even slightly outside the parameters you expect.
Contrast this with the recommended solution, which is to use a DOM parser. Load the HTML code into a DOM parser, and you will immediately have an object structure which you can query for individual elements and attributes.
The details you've given make it almost a no-brainer to go with this rather than a regex.
PHP has a built-in DOM parser, which you can call as follows:
$mydom = new DOMDocument;
$mydom->loadHTMLFile("http://....");
You can then use XPath to search the DOM for your specific element or attribute that you want:
$myxpath = new DOMXPath($mydom);
$myattr = $myxpath->query("//td[@class='rdBot']//img[1]/@src");
Hope that helps.
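To show how this copes with both variants of the markup from the question, here is a small self-contained sketch. The function name ticketImageFilename() and the href value "buy.asp" are made up for illustration; the class name rdBot is taken from the question's sample.

```php
<?php
// Hypothetical helper: returns the file name of the first image found in a
// <td class="rdBot"> cell, whether or not the image is wrapped in a link.
function ticketImageFilename(string $html): ?string
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // silence warnings on messy HTML
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // "//img" matches at any depth below the cell, so a wrapping <a> is fine.
    $nodes = $xpath->query("//td[@class='rdBot']//img/@src");
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }
    return basename($nodes->item(0)->nodeValue);
}

$plain  = "<table><tr><td valign='top' class='rdBot' align='center'>"
        . "<img src=\"/images/buy_tickets.gif\" border=\"0\" alt=\"T\"></td></tr></table>";
$linked = "<table><tr><td valign='top' class='rdBot' align='center'>"
        . "<a href=\"buy.asp\"><img src=\"/images/buy_tickets.gif\" border=\"0\" alt=\"T\"></a>"
        . "</td></tr></table>"; // "buy.asp" is a placeholder link target

echo ticketImageFilename($plain), "\n";  // buy_tickets.gif
echo ticketImageFilename($linked), "\n"; // buy_tickets.gif
```

Because the query describes where the image sits in the document tree rather than the exact characters around it, the same expression keeps working when the page adds or removes the link.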
function GetFilename($file) {
$filename = substr($file, strrpos($file,'/')+1,strlen($file)-strrpos($file,'/'));
return $filename;
}
echo GetFilename('/images/buy_tickets.gif');
This will output buy_tickets.gif
Do you only need images inside of the "td" tags?
$regex='/<img src="\/images\/([^"]*)"[^>]*>/im';
edit:
to grab the specific image this should work:
$regex='/<td valign=\'top\' class=\'rdBot\' align=\'center\'>.*src="\/images\/([^"]*)".*<\/td>/';
Parsing HTML with Regex is not recommended, as has been mentioned by several posters.
However, if the path of your images always follows the pattern src="/images/name.gif", you can easily extract it in Regex:
$pattern = <<<EOD
#src\s*=\s*['"]/images/(.*?)["']#
EOD;
If you are sure that the images always follow the path "/images/name.ext" and you don't care where the image link is located in the page, this will do the job. If you have more detailed requirements (such as matching only within a specific class), forget Regex; it's not the right tool for the job.
I just read in your comments that you need to match within a specific tag. Use a parser, it will save you untold headaches.
If you still want to go through regex, try this:
\(?<=<td .*?class\s*=\s*['"]rdBot['"][^<>]*?>.*?)(?<!</td>.*)<img [^<>]*src\s*=\s*["']/images/(.*?)["']\i
This should work. It does work in C#; I am not sure about PHP's flavor of regex.