I was trying to get data from a webpage using PHP and file_get_contents along with regular expressions, but I can't seem to get the correct data from the page.
Here is my code,
<?php
$homepage = file_get_contents('http://www.website.com');
preg_match_all('/<p><b>(.*)<\/b><br>(.*)<br>(.*)<\/p>/ms', $homepage, $matches);
$def = $matches[0];
echo $def;
?>
My regular expressions aren't picking up anything even though there is html code that matches the expressions. As a test I also tried replacing the first preg_match function with the following one.
preg_match_all('/<div>(.*)<\/div>/ms', $homepage, $matches);
This only picked up 2 of the many div tags on the page. What is wrong with my code and what is the correct way it should be written?
Thanks
Instead of using RegEx you could simply use PHP's Document Object Model.
$homepage = file_get_contents('http://www.website.com');
$DOM = new DOMDocument;
$DOM->loadHTML($homepage);
$items = $DOM->getElementsByTagName('div');
$def = $items->item(0)->nodeValue;
(referenced from this question).
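To read every <div> rather than only the first one, the returned node list can be looped over. The following is a minimal sketch, not the original poster's page: the $html string is made-up sample markup standing in for the result of file_get_contents(), and libxml_use_internal_errors() is used so that warnings about imperfect real-world markup are collected quietly instead of being printed.

```php
<?php
// Hypothetical sample markup standing in for the fetched page.
$html = '<html><body><div>first</div><div>second <b>bold</b></div><div>third</div></body></html>';

// Collect libxml parse warnings internally instead of emitting them.
libxml_use_internal_errors(true);

$dom = new DOMDocument;
$dom->loadHTML($html);
libxml_clear_errors();

// getElementsByTagName() returns every <div> in document order.
$texts = [];
foreach ($dom->getElementsByTagName('div') as $div) {
    // textContent is the text of the element and all of its descendants.
    $texts[] = trim($div->textContent);
}

print_r($texts);
```

Each entry of $texts holds the text of one <div>, so no pattern has to decide where a tag starts or ends.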
I am trying to get the plain text from some given HTML, but I have not managed to.
This is what I have done so far.
My HTML is in the $content variable.
Now I am passing the $content variable to PHP's DOMDocument:
$d = new DOMDocument();
@$d->loadHTML($content);
What is my next step to get the plain text from the loaded HTML?
Please help me in this. Thanks in advance!
I can't fully understand your question, but if you want the text of the HTML as a string, then try this:
$d = new DOMDocument();
$d->loadHTML($content);
$plainText = $d->textContent;
echo $plainText;
The DOM itself does not have such functionality. You may use the strip_tags() function though. Like this:
$d = new DOMDocument();
$d->loadHTML($content);
$plainText = strip_tags($d->textContent);
echo $plainText;
// which is probably equivalent to:
$plainText = strip_tags($content);
Note: using DOMDocument is useful to test that $content is correct, or if you want the text of a specific tag ($main = $d->getElementsByTagName('main'); $plainText = strip_tags($main->item(0)->textContent)); otherwise, calling strip_tags() directly is enough.
There are some problems as the strip_tags() function has no clue about the type of tag being removed. This means a sequence such as:
... word</p><p>more ...
will concatenate those two words:
... wordmore ...
This is a difficult problem, since some tags are expected to be removed that way and others are not. For example, if the user applied some form of emphasis, removing the tag without inserting a space is the right behavior:
che<u>val</u> -> cheval
che<u>vaux</u> -> chevaux
(Singular and plural of "horse" in French)
A browser has no clue either; the CSS is what tells it whether a tag is block-level (<div>) or inline (<u>).
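One possible workaround is to put a space in front of the tags that usually render as blocks before calling strip_tags(). This is only a sketch under assumptions: html_to_text() is a hypothetical helper name, and the list of block-level tags in it is a guess that is neither exhaustive nor aware of any CSS that changes how a tag is displayed.

```php
<?php
// Hypothetical helper; the block-level tag list is an assumption, not exhaustive.
function html_to_text(string $html): string
{
    $blockTags = 'p|div|br|li|h[1-6]|tr|td|th';

    // Put a space in front of every opening or closing block-level tag.
    // The \b keeps "<p" from matching longer names such as "<pre".
    $spaced = preg_replace('#<(/?)(' . $blockTags . ')\b#i', ' <$1$2', $html);

    // Remove the tags, then collapse the runs of whitespace created above.
    $text = strip_tags($spaced);
    return trim(preg_replace('/\s+/', ' ', $text));
}

echo html_to_text('<p>word</p><p>more</p>'), "\n";
echo html_to_text('che<u>val</u>'), "\n";
```

Inline tags such as <u> are not in the list, so the text on either side of them is still joined without a space.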
Alright, I have some code that finds a <code></code> tag set and escapes any code inside it, so that it is displayed instead of being interpreted as regular markup. Everything works, but my problem is this: how can I find the tag set (or multiple tag sets) inside, say, $content, clean the code, and still keep ALL of the other content? Here is my code. The problem is that it checks for matches and cleans each one it finds, but after cleaning it has no way to put the result back into its original position in $content. ($content is being grabbed from a form.)
<?php
preg_match_all("'<code>(.*?)</code>'si", $html, $match);
if ($match) {
foreach ($match[1] as $snippet) {
$fixedCode = htmlspecialchars($snippet, ENT_QUOTES);
}
}
?>
What do I do with $fixedCode, now that it is clean?
Using regex for parsing HTML is bad. I'd suggest getting familiar with a DOM parser, such as PHP's DOM module.
The DOM extension allows you to operate on XML documents through the DOM API with PHP 5.
Using the DOM module, in order to get the HTML/data from <code> tags in the document, you'd want to do something like this:
<?php
//So many variables!
$html = "<div> Testing some <code>code</code></div><div>Nother div, nother <code>Code</code> tag</div>";
$dom_doc = new DOMDocument;
$dom_doc->loadHTML($html);
$code = $dom_doc->getElementsByTagName('code');
foreach ($code as $scrap) {
echo htmlspecialchars($scrap->nodeValue, ENT_QUOTES), "<br />";
}
?>
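The DOM loop above prints the escaped snippets but does not put them back into the surrounding text. If a regular expression is acceptable for this narrow, well-delimited case, preg_replace_callback() can do the replacement in place. This is a minimal sketch that reuses the pattern from the question; the sample $content string is made up, and nested <code> tags are assumed not to occur.

```php
<?php
// Hypothetical form input; in the question this arrives as $content.
$content = 'Intro <code><b>bold</b></code> middle <code>a < b</code> end';

// Replace each <code>...</code> pair in place, escaping only its inner text.
// Everything outside the matched pairs is left exactly as it was.
$fixed = preg_replace_callback(
    '#<code>(.*?)</code>#si',
    function (array $m): string {
        return '<code>' . htmlspecialchars($m[1], ENT_QUOTES) . '</code>';
    },
    $content
);

echo $fixed, "\n";
```

Because the callback returns the replacement for each match, there is no need to track positions by hand.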
After using curl on an external page, I've got all of its source code, which includes something like this (the part I'm interested in):
(page...)<td valign='top' class='rdBot' align='center'><img src="/images/buy_tickets.gif" border="0" alt="T"></td> (page...)
So I'm using preg_match_all; I want to get only "buy_tickets.gif".
$pattern_before = "<td valign='top' class='rdBot' align='center'>";
$pattern_after = "</td>";
$pattern = '#'.$pattern_before.'(.*?)'.$pattern_after.'#si';
preg_match_all($pattern, $buffer, $matches, PREG_SET_ORDER);
Everything is fine up to now, but the problem is that sometimes the external page changes and the image I'm looking for is inside a link:
(page...)<td valign='top' class='rdBot' align='center'><img src="/images/buy_tickets.gif" border="0" alt="T"></td> (page...)
and I don't know how to make my code work in every case (not just when the image has no link).
hope u understand
thanks in advance
Don't use regex to parse HTML; use PHP's DOM extension. Try this:
$doc = new DOMDocument;
@$doc->loadHTMLFile( 'http://ventas.entradasmonumental.com/eventperformances.asp?evt=18' ); // Using the @ operator to hide parse errors
$xpath = new DOMXPath( $doc );
$img = $xpath->query( '//td[@class="BrdBot"][@align="center"][1]//img[1]')->item( 0 ); // Xpath->query returns a 'DOMNodeList', get the first item which is a 'DOMElement' (or null)
$imgSrc = $img->getAttribute( 'src' );
$imgSrcInfo = pathinfo( $imgSrc );
$imgFilename = $imgSrcInfo['basename']; // All you need
You're going to get lots of advice not to use regex for pulling stuff out of HTML code.
There are times when it's appropriate to use regex for this kind of thing, and I don't always agree with the somewhat rigid advice given on the subject here (and elsewhere). However in this case, I would say that regex is not the appropriate solution for you.
The problem with using regex for searching for things in HTML code is exactly the problem you've encountered -- HTML code can vary wildly, making any regex virtually impossible to get right.
It is just about possible to write a regex for your situation, but it will be an insanely complex regex, and very brittle, i.e. prone to failing if the HTML code is even slightly outside the parameters you expect.
Contrast this with the recommended solution, which is to use a DOM parser. Load the HTML code into a DOM parser, and you will immediately have an object structure which you can query for individual elements and attributes.
The details you've given make it almost a no-brainer to go with this rather than a regex.
PHP has a built-in DOM parser, which you can call as follows:
$mydom = new DOMDocument;
$mydom->loadHTMLFile("http://....");
You can then use XPath to search the DOM for your specific element or attribute that you want:
$myxpath = new DOMXPath($mydom);
$myattr = $myxpath->query("//td[@class='rdBot']//img[1]/@src");
Hope that helps.
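Putting the pieces together, here is a small self-contained sketch of the XPath approach. It runs on made-up inline markup rather than the live page: the second table row, its /buy link, and sold_out.gif are hypothetical and only illustrate the case where the image is wrapped in a link.

```php
<?php
// Hypothetical markup covering both cases: a bare image and an image inside a link.
$html = <<<HTML
<table>
<tr><td valign='top' class='rdBot' align='center'><img src="/images/buy_tickets.gif" border="0" alt="T"></td></tr>
<tr><td valign='top' class='rdBot' align='center'><a href="/buy"><img src="/images/sold_out.gif" border="0" alt="S"></a></td></tr>
</table>
HTML;

// Collect libxml parse warnings internally instead of emitting them.
libxml_use_internal_errors(true);
$dom = new DOMDocument;
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$filenames = [];

// "//img" matches at any depth below the cell, so an optional <a> wrapper does not matter.
foreach ($xpath->query("//td[@class='rdBot']//img/@src") as $src) {
    // basename() keeps only the part after the last slash.
    $filenames[] = basename($src->nodeValue);
}

print_r($filenames);
```

The same query handles both rows, which is the main advantage over a pattern that has to spell out every possible wrapper tag.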
function GetFilename($file) {
$filename = substr($file, strrpos($file,'/')+1,strlen($file)-strrpos($file,'/'));
return $filename;
}
echo GetFilename('/images/buy_tickets.gif');
This will output buy_tickets.gif
Do you only need images inside of the "td" tags?
$regex='/<img src="\/images\/([^"]*)"[^>]*>/im';
edit:
to grab the specific image this should work:
$regex='/<td valign=\'top\' class=\'rdBot\' align=\'center\'>.*src="\/images\/([^"]*)".*<\/td>/';
Parsing HTML with Regex is not recommended, as has been mentioned by several posters.
However, if the path of your images always follows the pattern src="/images/name.gif", you can easily extract it in Regex:
$pattern = <<<EOD
#src\s*=\s*['"]/images/(.*?)["']#
EOD;
If you are sure that the images always follow the path "/images/name.ext" and that you don't care where the image link is located in the page, this will do the job. If you have more detailed requirements (such as matching only within a specific class), forget Regex; it's not the right tool for the job.
I just read in your comments that you need to match within a specific tag. Use a parser, it will save you untold headaches.
If you still want to go through regex, try this:
\(?<=<td .*?class\s*=\s*['"]rdBot['"][^<>]*?>.*?)(?<!</td>.*)<img [^<>]*src\s*=\s*["']/images/(.*?)["']\i
This should work. It does work in C#, I am not totally sure about php's brand of regex.
I get the HTML from another site with file_get_contents. My question is: how can I get a specific tag's value?
let's say i have:
<div id="global"><p class="paragraph">1800</p></div>
how can i get paragraph's value? thanks
If the example is really that trivial you could just use a regular expression. For generic HTML parsing though, PHP has DOM support:
$dom = new domDocument();
$dom->loadHTML("<div id=\"global\"><p class=\"paragraph\">1800</p></div>");
echo $dom->getElementsByTagName('p')->item(0)->nodeValue;
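When the page contains more than one <p>, selecting by tag name alone may return the wrong paragraph. A minimal sketch using DOMXPath to select by id and class instead; the second paragraph in the sample string is made up to show that it is skipped.

```php
<?php
// Sample markup from the question, plus a hypothetical second paragraph.
$html = '<div id="global"><p class="paragraph">1800</p><p class="other">42</p></div>';

// Collect libxml parse warnings internally instead of emitting them.
libxml_use_internal_errors(true);
$dom = new DOMDocument;
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

// Select the <p class="paragraph"> that is a direct child of <div id="global">.
$node = $xpath->query("//div[@id='global']/p[@class='paragraph']")->item(0);

// item(0) is null when nothing matched, so guard before reading the value.
$value = $node !== null ? $node->nodeValue : null;

echo $value, "\n";
```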
You need to parse the HTML. There are several ways to do this, including using PHP's XML parsing functions.
However, if it is just a simple value (as you asked above) I would use the following simple code:
// your content
$contents='<div id="global"><p class="paragraph">1800</p></div>';
// define start and end position
$start='<div id="global"><p class="paragraph">';
$end='</p></div>';
// find the stuff
$contents=substr($contents,strpos($contents,$start)+strlen($start));
$contents=substr($contents,0,strpos($contents,$end));
// write output
echo $contents;
Best of luck!
Christian Sciberras
(tested and works)
$input = '<div id="global"><p class="paragraph">1800</p></div>';
$output = strip_tags($input);
preg_match_all('#paragraph">(.*?)<#is', $input, $output);
print_r($output);
Untested.
I'm new to Regular Expressions and things like that. I have only a little knowledge of them, and I think my current problem is about them.
I have a webpage, that contains text. I want to get links from the webpage that are only in SPANs that have class="img".
I go through those steps.
grab all the SPANs tagged with the "img" class (this is the hard step that I'm looking for)
move those SPANs to a new variable
Parse the variable to get an array with the links (Each SPAN has only 1 link, so this will be easy)
I'm using PHP, but any other language doesn't matter, I'm looking how to deal with the first step. Any one have a suggestion?
Thanks :D
Use PHP's DOMDocument class in combination with the DOMXPath class to navigate to the elements you need, like this:
<?php
$dom = new DOMDocument();
$dom->loadHTML(file_get_contents('http://foo.bar'));
$xpath = new DOMXPath($dom);
$elements = $xpath->query("/html/body//span[@class='img']//a");
foreach ($elements as $a)
{
echo $a->getAttribute('href'), "\n";
}
You can learn more about the XPath Language on the W3C page.
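One caveat: an exact comparison on the class attribute only matches when "img" is the whole attribute value. If a span may carry several classes, a common XPath 1.0 idiom is to pad the attribute with spaces and search for the padded token. The sketch below uses made-up markup and link targets to show which spans are picked up.

```php
<?php
// Hypothetical markup: only the first two spans carry the class token "img".
$html = '<body>'
    . '<span class="img"><a href="/a.png">A</a></span>'
    . '<span class="thumb img"><a href="/b.png">B</a></span>'
    . '<span class="imgwide"><a href="/c.png">C</a></span>'
    . '<span><a href="/d.png">D</a></span>'
    . '</body>';

// Collect libxml parse warnings internally instead of emitting them.
libxml_use_internal_errors(true);
$dom = new DOMDocument;
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

// Pad the class list with spaces so that " img " only matches a whole class name.
$query = "//span[contains(concat(' ', normalize-space(@class), ' '), ' img ')]//a";

$hrefs = [];
foreach ($xpath->query($query) as $a) {
    $hrefs[] = $a->getAttribute('href');
}

print_r($hrefs);
```

The span with class "imgwide" and the span without a class are both skipped, because neither contains "img" as a separate token.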
A pattern like <span.* class="img".*>([^<]*)</span> should work fine, assuming your code looks something like
<span class="img">http://www.img.com/img.jpg</span>
<span alt="yada" class="img">animage.png</span>
<span alt="yada" class="img" title="still works">link.txt</span>
<span>not an img class</span>
<?php
$pattern = '#<span.* class="img".*>([^<]*)</span>#i';
//$subject = html code above
preg_match_all($pattern, $subject, $matches);
print_r($matches);
?>
I'm using PHP, but any other language
doesn't matter, I'm looking how to
deal with the first step. Any one have
a suggestion?
We-e-ell...
import urllib
from BeautifulSoup import BeautifulSoup, SoupStrainer
html = urllib.urlopen(url).read()
sieve = SoupStrainer(name='span', attrs={'class': 'img'})
tag_soup = BeautifulSoup(html, parseOnlyThese=sieve)
for link in tag_soup('a'):
print link['href']
(that's Python, using BeautifulSoup - should work on most documents, well-formed or no).