How do I extract HTML content using Regex in PHP - php

I know, I know... regex is not the best way to extract HTML text. But I need to extract article text from a lot of pages, and I can store a regex in the database for each website. I'm not sure how XML parsers would work with multiple websites; you'd need a separate function for each website.
In any case, I don't know much about regexes, so bear with me.
I've got an HTML page in a format similar to this
<html>
<head>...</head>
<body>
<div class=nav>...</div><p id="someshit" />
<div class=body>....</div>
<div class=footer>...</div>
</body>
I need to extract the contents of the body class container.
I tried this.
$pattern = "/<div class=\"body\">\(.*?\)<\/div>/sui"
$text = $htmlPageAsIs;
if (preg_match($pattern, $text, $matches))
echo "MATCHED!";
else
echo "Sorry gambooka, but your text is in another castle.";
What am I doing wrong? My text ends up in another castle.
*EDIT: ooohh... never mind, I found readability's code

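Your pattern looks for class="body", but your document has class=body, without the quotes. The escaped parentheses \( and \) also match literal parentheses instead of forming a capture group. Make the quotes optional and unescape the group: "/<div class=\"?body\"?>(.*?)<\/div>/sui".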
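As a rough illustration of the fix above, here is a minimal sketch that runs the corrected pattern against a page shaped like the one in the question. The sample markup and the variable name $articleText are made up for the example.

```php
<?php
// Sample page modelled on the question; the class attribute is unquoted.
$html = <<<HTML
<html>
<body>
<div class=nav>menu</div>
<div class=body>Article text</div>
<div class=footer>footer</div>
</body>
</html>
HTML;

// \"? makes each quote optional; the parentheses are unescaped so they capture.
$pattern = "/<div class=\"?body\"?>(.*?)<\/div>/sui";

// $matches[1] holds the captured group when the pattern matches.
$articleText = preg_match($pattern, $html, $matches) ? $matches[1] : null;
echo $articleText, PHP_EOL;
```

Note that the lazy (.*?) stops at the first </div>, so a body container with nested divs would be cut short.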

Related

PHP get text from tag with regex

I want to get all the text from the tag below and put it into an array with regex
<div class="titr2">TEXT </div>
TEXT is UTF-8 and I cannot get it using regex
<meta charset='UTF-8' />
<?php
error_reporting(1);
$handle='http://www.namefa.ir/Names.asp?pn=3&sx=F&fc=%D8%A8';
$handle = file_get_contents($handle);
preg_match_all('<div class="titr2" href=".*">(.*)</div>)siU', $string, $matching_data);
print_r($matching_data);
?>
Try to use this regexp:
preg_match_all('/<div[^>]+class="titr2"[^>]*>\s*<a[^>]+>(.*?)<\/a>\s*<\/div>/si', $handle, $matching_data);
You shouldn't use regex to parse HTML: RegEx match open tags except XHTML self-contained tags
You should really use an HTML parser instead.
If this really is a one-time thing, limited to this case only, in a small HTML file that never changes, your regex is wrong:
<div class="titr2">(.+?)</div>
would be closer, and you should check out Victor's solution.
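For the parser route recommended above, a minimal sketch with PHP's DOMDocument and DOMXPath might look like this. The sample markup is hypothetical (the real page would come from file_get_contents()), and the XML encoding prefix is a commonly used workaround for UTF-8 input, not something taken from the answers.

```php
<?php
// Hypothetical sample markup standing in for the downloaded page.
$html = '<div class="titr2"><a href="/n/1">Alpha</a></div>'
      . '<div class="other">skip</div>'
      . '<div class="titr2"><a href="/n/2">Beta </a></div>';

libxml_use_internal_errors(true);   // keep warnings about sloppy markup quiet
$dom = new DOMDocument();
// Assumption: the encoding hint makes loadHTML() treat the input as UTF-8.
$dom->loadHTML('<?xml encoding="UTF-8">' . $html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$names = [];
foreach ($xpath->query('//div[@class="titr2"]') as $div) {
    // textContent gives the text of the div with all inner tags stripped.
    $names[] = trim($div->textContent);
}
print_r($names);
```

The XPath test @class="titr2" only matches when the class attribute is exactly that string; a div with several classes would need a contains() expression instead.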

Replace the content inside a DIV

I have a div called
<div id="form">Content</div>
and I want to replace the content of the div with new content using preg_replace.
What regex should be used?
You shouldn't be using a regex at all. HTML can come in many forms, and you would need to take all of them in account. What if the id/class doesn't come in the place you expect? The regex would have to be really complex to get you reasonable results.
Instead, you should use a DOM parser - or a really cool tool I recently stumbled across, phpQuery. With it, you can access your document in PHP almost exactly as you would with jQuery.
This will work in your case:
$html = '<div id="content">Content</div>';
$html = preg_replace('/(<\s*div[^>]*>)[^<]*(<\s*\/div\s*>)/', '$1New Content$2', $html);
echo $html; // <div id="content">New Content</div>
However note that since HTML is not a regular language it is impossible to handle all cases. The simple regex I provided will produce bad output in the following example:
<div class=">">Content</div>
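For comparison, the DOM approach suggested in the first answer could be sketched as follows. This assumes the div from the question (id="form") and uses only DOMDocument and DOMXPath; the second div is added to show that other elements are left alone.

```php
<?php
$html = '<div id="form">Content</div><div id="other">Keep me</div>';

libxml_use_internal_errors(true);   // silence parser warnings on fragments
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$div = $xpath->query('//div[@id="form"]')->item(0);

if ($div !== null) {
    // Drop the old children, then add the new text as a single text node.
    while ($div->firstChild) {
        $div->removeChild($div->firstChild);
    }
    $div->appendChild($dom->createTextNode('New Content'));
}

// Passing a node to saveHTML() serializes just that element.
$result = $dom->saveHTML($div);
echo $result, PHP_EOL;
```

Because the parser handles attribute quoting itself, the position or content of other attributes on the div does not affect the lookup.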

php match pattern to get images from text file

I have seen many answers where people ask how to grab and extract the actual image URLs from web page content, text, etc. However, in my database, sadly, I have this syntax:
&lt;img class=&quot;photo&quot; src=&quot;http://domain.com/image.jpg&quot; alt=&quot;alt goes here&quot; /&gt;
So the typical way, $pattern = '/src=["|\']([^"|\']+)/is';, does not work in my case due to those &quot;...
I have been trying for hours; I must be doing something very, very wrong...
Any help is much appreciated!
First of all, the 'usual way' is to use an HTML/XML parser, not regular expressions.
Secondly, what you have is HTML code encoded as HTML text, which smells bad for two reasons:
it's not HTML any more (why encode it as HTML text when it is in fact HTML code?)
you shouldn't encode HTML before putting it into the DB, but rather when writing it out to the user.
With these two issues aside, what you need to do is to htmlspecialchars_decode() that stuff and pass it through an HTML parser:
$stuff = '&lt;img class=&quot;photo&quot; src=&quot;http://domain.com/image.jpg&quot; alt=&quot;alt goes here&quot; /&gt;';
$code = htmlspecialchars_decode($stuff, ENT_QUOTES);
$xml = simplexml_load_string($code);
That said, to me this sounds like a hack to fix badly written code. But there may be a valid reason why it's there in the first place.
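To round off the snippet above, here is a small sketch of reading the src attribute once the string has been decoded and parsed. It assumes the stored value is the entity-encoded tag the answer describes; the names $stored and $src are only for illustration.

```php
<?php
// Assumed stored value: the tag with its markup characters entity-encoded.
$stored = '&lt;img class=&quot;photo&quot; src=&quot;http://domain.com/image.jpg&quot; alt=&quot;alt goes here&quot; /&gt;';

// Turn &lt; &gt; &quot; (and encoded single quotes) back into real markup.
$code = htmlspecialchars_decode($stored, ENT_QUOTES);

// A self-closed img tag is well-formed XML, so SimpleXML can parse it.
$xml = simplexml_load_string($code);

// Attributes are read with array syntax and cast to string.
$src = $xml !== false ? (string) $xml['src'] : null;
echo $src, PHP_EOL;
```

simplexml_load_string() only accepts a single well-formed root element, so text containing several tags or sloppy HTML is better handed to DOMDocument::loadHTML().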
Don't use regular expressions!
Use XML/DOM libraries like Simple HTML DOM.
BTW, the regular expression you are looking for is,
$pattern = '/src=(["\'])(.+?)(?=\1)/i';
Test Case (Optional):
Here is a simple program to test it. Obviously you need to use htmlspecialchars_decode() first to decode it from entity format.
$str = array(
"<script type=\"text/javascript\" src=\"script.js\"></script>",
"<script type=\"text/javascript\" src='script.js'></script>",
'<script type="text/javascript" src="script.js"></script>',
'<script type="text/javascript" src=\'script.js\'></script>',
);
// The lazy .+? stops at the first matching closing quote, so later attributes are not swallowed.
$pattern = '/src=(["\'])(.+?)(?=\1)/i';
foreach($str as $s){
preg_match($pattern, $s, $m);
echo $m[2], PHP_EOL;
}
Output
script.js
script.js
script.js
script.js
You can test Regex here:
http://gskinner.com/RegExr/
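Applied to the asker's situation, the decode-then-match idea could be sketched like this. The stored text is a made-up example with two entity-encoded img tags, and the pattern is a lazy variant of the one above that consumes the closing quote through a backreference.

```php
<?php
// Hypothetical database value: two entity-encoded img tags inside some text.
$stored = '&lt;img src=&quot;http://domain.com/a.jpg&quot; alt=&quot;first&quot; /&gt;'
        . ' text &lt;img src=&#039;http://domain.com/b.png&#039; alt=&#039;second&#039; /&gt;';

// Decode first so the pattern sees real quotes instead of &quot; and &#039;.
$decoded = htmlspecialchars_decode($stored, ENT_QUOTES);

// Group 1 is the opening quote, group 2 the URL, \1 the matching closing quote.
preg_match_all('/src=(["\'])(.+?)\1/i', $decoded, $m);
$urls = $m[2];
print_r($urls);
```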
What's not working?

Select the first paragraph tag not contained in within another tag using RegEx (Perl-style)

I have this block of html:
<div>
<p>First, nested paragraph</p>
</div>
<p>First, non-nested paragraph.</p>
<p>Second paragraph.</p>
<p>Last paragraph.</p>
I'm trying to select the first, non-nested paragraph in that block. I'm using PHP's (perl style) preg_match to find it, but can't seem to figure out how to ignore the p tag contained within the div.
This is what I have so far, but it selects the contents of the first paragraph contained above.
/<p>(.+?)<\/p>/is
Thanks!
EDIT
Unfortunately, I don't have the luxury of a DOM Parser.
I completely appreciate the suggestions to not use RegEx to parse HTML, but that's not really helping my particular use case. I have a very controlled case where an internal application generated structured text. I'm trying to replace some text if it matches a certain pattern. This is a simplified case where I'm trying to ignore text nested within other text and HTML was the simplest case I could think of to explain. My actual case looks something a little more like this (But a lot more data and minified):
#[BILLINGCODE|12345|11|15|2001|15|26|50]#
[ITEM1|{{Escaped Description}}|1|1|4031|NONE|15]
#[{{Additional Details }}]#
[ITEM2|{{Escaped Description}}|3|1|7331|NONE|15]
[ITEM3|{{Escaped Description}}|1|1|9431|NONE|15]
[ITEM4|{{Escaped Description}}|1|1|5131|NONE|15]
I have to reformat a certain column of certain rows in a ton of rows similar to that. Help with my first question would help with my actual project.
Your regex won't work. Even if you had only non nested paragraph, your capturing parentheses would match First, non-nested ... Last paragraph..
Try:
<([^>]+)>([^<]*<(?!/?\1)[^<]*)*<\1>
and grab \2 if \1 is p.
But an HTML parser would do a better job of that imho.
How about something like this?
<p>([^<>]+)<\/p>(?=(<[^\/]|$))
Does a look-ahead to make sure it is not inside a closing tag; but can be at the end of a string. There is probably a better way to look for what is in the paragraph tags but you need to avoid being too greedy (a .+? will not suffice).
Use a three step process. First, pray that everything is well formed. Second, remove everything that is nested.
s{<div>.*?</div>}{}g; # HTML example
s/#.*?#//g; # 2nd example
Then get your result. Everything that is left is now not nested.
$result = m{<p>(.*?)</p>}; # HTML example
$result = m{\[(.*?)\]}; # 2nd example
(this is Perl. Don't know how different it would look in PHP).
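A rough PHP translation of the Perl steps above might look like the following. It is only a sketch: the HTML is a shortened copy of the question's block, the bracketed records are abbreviated, hypothetical versions of the asker's rows, and # is used as the pattern delimiter for the HTML case.

```php
<?php
$html = <<<HTML
<div>
<p>First, nested paragraph</p>
</div>
<p>First, non-nested paragraph.</p>
<p>Second paragraph.</p>
HTML;

// Step 2: remove everything that is nested (the "s" flag lets . span newlines).
$flat = preg_replace('#<div>.*?</div>#s', '', $html);

// Step 3: the first remaining paragraph is the first non-nested one.
$paragraph = preg_match('#<p>(.*?)</p>#s', $flat, $m) ? $m[1] : null;

// Same idea for the bracketed records: strip the #...# rows first.
$records = "#[BILLINGCODE|12345]#\n[ITEM1|{{Escaped Description}}|1]\n"
         . "#[{{Additional Details }}]#\n[ITEM2|{{Escaped Description}}|3]";
$cleaned = preg_replace('/#.*?#/s', '', $records);
$item = preg_match('/\[(.*?)\]/', $cleaned, $m2) ? $m2[1] : null;

echo $paragraph, PHP_EOL, $item, PHP_EOL;
```

As with the Perl version, this only holds while the nested wrappers themselves are never nested inside each other.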
"You shouldn't use regex to parse HTML."
It is what everybody says but nobody really offers an example of how to actually do it, they just preach it. Well, thanks to some motivation from Levi Morrison I decided to read into DomDocument and figure out how to do it.
To everybody that says "Oh, it is too hard to learn the parser, I'll just use regex." Well, I've never done anything with DomDocument or XPath before and this took me 10 minutes. Go read the docs on DomDocument and parse HTML the way you're supposed to.
$myHtml = <<<MARKUP
<html>
<head>
<title>something</title></head>
<body>
<div>
<p>not valid</p>
</div>
<p>is valid</p>
<p>is not valid</p>
<p>is not valid either</p>
<div>
<p>definitely not valid</p>
</div>
</body>
</html>
MARKUP;
$DomDocument = new DOMDocument();
$DomDocument->loadHTML($myHtml);
$DomXPath = new DOMXPath($DomDocument);
$nodeList = $DomXPath->query('body/p');
$yourNode = $DomDocument->saveHtml($nodeList->item(0));
var_dump($yourNode);
// output '<p>is valid</p>'
You might want to have a look at this post about parsing HTML with Regex.
Because HTML is not a regular language (and Regular Expressions are), you can't parse out arbitrary chunks of HTML using Regex. Use an HTML parser, it'll get the job done considerably more smoothly than trying to hack together some regex.

How do I grab part of a page's HTML DOM with PHP?

I'm grabbing data from a published google spreadsheet, and all I want is the information inside of the content div (<div id="content">...</div>)
I know that the content starts off as <div id="content"> and ends as </div><div id="footer">
What's the best / most efficient way to grab the part of the DOM that is inside there? I was thinking of a regular expression (see my example below), but it is not working and I'm not sure if it is that efficient...
header('Content-type: text/plain');
$foo = file_get_contents('https://docs.google.com/spreadsheet/pub?key=0Ahuij-1M3dgvdG8waTB0UWJDT3NsUEdqNVJTWXJNaFE&single=true&gid=0&output=html&ndplr=1');
$start = '<div id="content">';
$end = '<div id="footer">';
$foo = preg_replace("#$start(.*?)$end#",'$1',$foo);
echo $foo;
UPDATE
I guess another question I have is basically about if it is just simpler and easier to use regex with start and end points rather than trying to parse through a DOM which might have errors and then extract the piece I need. Seems like regex would be the way to go but would love to hear your opinions.
Try changing your regex to $foo = preg_replace("#$start(.*?)$end#s",'$1',$foo);. The s modifier makes . match newlines as well. As it is, all the content between the tags would have to be on a single line for your regex to match.
If your HTML page is any more complex than that, then regex probably won't cut it and you'd need to look into a parser like DOMDocument or Simple HTML DOM
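If the goal is to pull the content out rather than rewrite the page, preg_match() with the s modifier may be a closer fit than preg_replace(). The sketch below assumes the start and end markers described in the question; the page string is a hypothetical stand-in for the downloaded spreadsheet HTML.

```php
<?php
// Hypothetical stand-in for the downloaded page.
$foo = "<div id=\"header\">h</div><div id=\"content\">\n"
     . "<table><tr><td>1</td></tr></table>\n"
     . "</div><div id=\"footer\">f</div>";

$start = '<div id="content">';
$end   = '</div><div id="footer">';

// preg_quote() escapes regex metacharacters in the markers;
// the "s" modifier lets . match newlines.
$pattern = '#' . preg_quote($start, '#') . '(.*?)' . preg_quote($end, '#') . '#s';

// Only the captured group is kept; the rest of the page is ignored.
$content = preg_match($pattern, $foo, $m) ? trim($m[1]) : null;
echo $content, PHP_EOL;
```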
If you have a lot to do, I would recommend you take a look at http://simplehtmldom.sourceforge.net
It's really good for this sort of thing.
Do not use regex, it can fail.
Use PHP's built-in DOM parser:
http://php.net/manual/en/class.domdocument.php
You can easily traverse and parse the relevant content.
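As a minimal sketch of that approach, assuming the page has a div with id="content" as in the question, the inner markup can be collected by serializing each child node. The sample page and the variable names are made up for the example.

```php
<?php
// Hypothetical stand-in for the published spreadsheet page.
$page = '<html><body><div id="header">h</div>'
      . '<div id="content"><p>Row 1</p><p>Row 2</p></div>'
      . '<div id="footer">f</div></body></html>';

libxml_use_internal_errors(true);   // tolerate markup errors without warnings
$dom = new DOMDocument();
$dom->loadHTML($page);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$node = $xpath->query('//div[@id="content"]')->item(0);

// Build the "inner HTML" by serializing every child of the content div.
$inner = '';
if ($node !== null) {
    foreach ($node->childNodes as $child) {
        $inner .= $dom->saveHTML($child);
    }
}
echo $inner, PHP_EOL;
```

Calling libxml_use_internal_errors(true) addresses the worry raised in the question about a DOM that might contain errors: the parser recovers where it can and the warnings are collected instead of printed.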
