Regex Everything Including Whitespace - php

I need a regular expression pattern that matches all characters, including whitespace, without capturing them into a variable in PHP.
<li class="xyz" data-name="abc">
<span id="XXX">some words</span>
<div data-attribute="values">
<a class="klm" href="http://example.com/blabla">somethings</a>
</div>
<div class="xyz sub" data-name="abc-sub"><img src="/images/any_image.jpg" class="qqwwee"></div>
</li><!--repeating li tags-->
I wrote this pattern:
preg_match_all('#<li((?s).*?)<div((?s).*?)href="((?s).*?)"((?s).*?)</li>#', $subject, $matches);
This works well but I don't want to get four variables. I just want to get
http://example.com/blabla
Can anyone tell me why the following does not work?
preg_match_all('#<li[[?s].*?]<div[[?s].*?]href="((?s).*?)"[[?s].*?]</li>#', $subject, $matches);

Using (?:) will allow grouping but make those groups not captured, for example, the following:
#<li(?:(?s).*?)<div(?:(?s).*?)href="((?s).*?)"(?:(?s).*?)</li>#
Will output:
array (
0 =>
array (
0 => '<li class="xyz" data-name="abc">
<span id="XXX">some words</span>
<div data-attribute="values">
<a class="klm" href="http://example.com/blabla">somethings</a>
</div>
<div class="xyz sub" data-name="abc-sub"><img src="/images/any_image.jpg" class="qqwwee"></div>
</li>',
),
1 =>
array (
0 => 'http://example.com/blabla',
),
)
All of your matches will be contained in $matches[1], so iterate through that.
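As a minimal sketch of that iteration, assuming $subject holds the markup from the question (the names $pattern, $urls and $url are illustrative only, and the s modifier is moved to the end of the pattern instead of being repeated in every group):

```php
<?php
// Sample markup from the question; in practice $subject would hold the whole page.
$subject = <<<HTML
<li class="xyz" data-name="abc">
<span id="XXX">some words</span>
<div data-attribute="values">
<a class="klm" href="http://example.com/blabla">somethings</a>
</div>
<div class="xyz sub" data-name="abc-sub"><img src="/images/any_image.jpg" class="qqwwee"></div>
</li><!--repeating li tags-->
HTML;

// The s modifier lets . match newlines; (?:...) groups are not stored in $matches.
$pattern = '#<li(?:.*?)<div(?:.*?)href="(.*?)"(?:.*?)</li>#s';
preg_match_all($pattern, $subject, $matches);

// $matches[0] holds the full matches, $matches[1] the only captured group.
$urls = $matches[1];
foreach ($urls as $url) {
    echo $url, "\n";
}
```

With several li blocks in $subject, $urls would contain one entry per block.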

Don't use RegExps to parse HTML
Read this famous answer on StackOverflow.
HTML is not a regular language, so it cannot be reliably processed with a RegExp. Instead, use a proper (and robust) HTML parser.
Also note that data mining (analysis) != web-scraping (data collection).
If you don't want a regexp group to store the "captured" data, use a non-capturing group:
(?:some-complex-regexp-here)
In your case, the following may work:
(?s)<li.*?<div.*?href="([^"]*?)".*?</li>
But seriously, don't use regexps for this; regexps are fragile. Use an XPath like //li//div//a/@href instead.
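A rough sketch of the parser-based route, assuming PHP's DOM extension is available and using the XPath //li//div//a/@href; the wrapping ul element and the variable names are assumptions added for illustration:

```php
<?php
// Markup from the question, wrapped in a <ul> so the fragment is a complete list.
$html = <<<HTML
<ul>
<li class="xyz" data-name="abc">
<span id="XXX">some words</span>
<div data-attribute="values">
<a class="klm" href="http://example.com/blabla">somethings</a>
</div>
<div class="xyz sub" data-name="abc-sub"><img src="/images/any_image.jpg" class="qqwwee"></div>
</li>
</ul>
HTML;

$dom = new DOMDocument();
// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();

// Select the href attribute of every <a> inside a <div> inside an <li>.
$xpath = new DOMXPath($dom);
$hrefs = [];
foreach ($xpath->query('//li//div//a/@href') as $attr) {
    $hrefs[] = $attr->nodeValue;
}
print_r($hrefs);
```

Unlike the regex, this does not depend on attribute order, quoting style, or line breaks inside the tags.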

Related

Matching wildcard without adding to the array with preg_match_all

I'm trying to capture the table text from an element that looks like this:
<span id="ctl00_MainContent_ListView2_ctrl2_ctl01_Label17" class="vehicledetailTable" style="display:inline-block;width:475px;">OWNED</span><br />
My preg_match_all looks like:
preg_match_all('~475px;">(.*?)</span><br />~', $ret, $vehicle);
The problem is that there are other tables on the page that also match but contain data not relevant to my query. The data that I want are all in "ListView2", but the "ctl01_Label17" part varies: Label18, Label19, Label20, etc.
Since I'm not interested in capturing the label, is there a method to match the subject string without capturing the match? Something along the lines of:
<span id="ctl00_MainContent_ListView2_ctrl2_ctl01_[**WILDCARD HERE**]" class="vehicledetailTable" style="display:inline-block;width:475px;">OWNED</span><br />
Any help would be greatly appreciated.
Here is a very poor solution that you are currently considering:
<span\b[^<>]*\bid="ctl00_MainContent_ListView2_ctrl2_ctl01_[^"]*"[^<>]*475px;">(.*?)</span><br\s*/>
See demo
It makes sure we have found a <span> tag with an id attribute starting with ctl00_MainContent_ListView2_ctrl2_ctl01_, followed by some attribute (here, style) ending with 475px;, and then captures anything up to the closing </span> tag.
You can get this with DOM and XPath, which is a much safer solution that uses the same logic as above:
$html = "<span id=\"ctl00_MainContent_ListView2_ctrl2_ctl01_Label17\" class=\"vehicledetailTable\" style=\"display:inline-block;width:475px;\">OWNED</span><br />";
$dom = new DomDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$spans = $xpath->query("//span[starts-with(@id,'ctl00_MainContent_ListView2_ctrl2_ctl01_') and @class='vehicledetailTable' and contains(@style,'475px;')]");
$data = array();
foreach ($spans as $span) {
array_push($data, $span->textContent);
}
print_r($data);
Output: [0] => OWNED
Note that the XPath expression contains 3 conditions, feel free to modify any:
//span - get all span tags that
starts-with(@id,'ctl00_MainContent_ListView2_ctrl2_ctl01_') - have an attribute id with value starting with ctl00_MainContent_ListView2_ctrl2_ctl01_
@class='vehicledetailTable' - and have class attribute with value equal to vehicledetailTable
contains(@style,'475px;') - and have a style attribute whose value contains 475px;.
Conditions are enclosed into [...] and are joined with or or and. They can also be grouped with round brackets. You can also use not(...) to invert the condition. XPath is very helpful in such situations.
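To illustrate grouping with or, and, and not(...), here is a small sketch. The markup is hypothetical (the Label18 and ListView3 spans and their text are invented for the example), and the query is only one possible way to combine the conditions:

```php
<?php
// Hypothetical markup: two relevant spans in ListView2 and one from another table.
$html = '<div>'
      . '<span id="ctl00_MainContent_ListView2_ctrl2_ctl01_Label17" class="vehicledetailTable">OWNED</span>'
      . '<span id="ctl00_MainContent_ListView2_ctrl2_ctl01_Label18" class="vehicledetailTable">LEASED</span>'
      . '<span id="ctl00_MainContent_ListView3_ctrl2_ctl01_Label17" class="vehicledetailTable">IGNORED</span>'
      . '</div>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// (A or B) and not(C): Label17 or Label18, but nothing from ListView3.
$query = "//span[(contains(@id,'_Label17') or contains(@id,'_Label18'))"
       . " and not(contains(@id,'_ListView3_'))]";

$data = array();
foreach ($xpath->query($query) as $span) {
    $data[] = $span->textContent;
}
print_r($data);
```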

need help in Regular Expression

I am in a weird scenario where I need to show the content in multiple columns. I am using CSS3 column-count and the jQuery Columnizer plugin for older versions of IE.
The problem is that I do not have complete control over the data as it is served by an external webservice.
In most cases the content is wrapped in multiple paragraph tags
Content#1
<p><strong>Heading</strong><br>This is a content</p>
<p><strong>Heading</strong><br>This is a content</p>
But in a few cases the data is not wrapped in <p> tags and looks like this:
Content#2
<strong>Day 1: xyz </strong><br>
lorem lipsum <br> <br>
<strong>Dag 2: lorem lipsum</strong><br>
Morgonflyg till Arequipa i södra Peru.
<br> <br>
The real problem is that the jQuery Columnizer plugin hangs the browser when it is asked to columnize such markup.
Now I want to transform Content#2 into Content#1 with the help of a regular expression, i.e. wrap the content in sensible paragraphs. I hope I have made myself clear.
I am using PHP.
Thank you in advance!
Your content is not stable, and a regular expression won't work magic on content this inconsistent. That said, whenever you receive data from another website, there is a good chance that it will someday return a different pattern, and your rules will no longer apply. You need a reliable source to get a reliable result.
This is crude string manipulation, but it will get you what you need as long as the pattern stays consistent. I still insist that you need a reliable source.
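$str = "<strong>Day 1: xyz </strong><br>
lorem lipsum <br> <br>
<strong>Dag 2: lorem lipsum</strong><br>
Morgonflyg till Arequipa i södra Peru.
<br> <br> ";
function parse($data)
{
    // Content that is already wrapped in paragraphs is returned untouched.
    if (substr(ltrim($data), 0, 3) == "<p>") return $data;
    $chunks = explode("<strong>", $data);
    $out = array();
    foreach ($chunks as $i => $chunk)
    {
        // explode() yields an empty first chunk when the data starts with <strong>.
        if (trim($chunk) === "") continue;
        // explode() strips the delimiter, so put the opening tag back
        // (except on any text that preceded the first <strong>).
        $item = ($i > 0 ? "<strong>" : "") . $chunk;
        // strpos() returns false, not -1, when nothing is found.
        $last_br = strpos($item, "<br> <br>");
        if ($last_br !== false) { $item = substr($item, 0, $last_br); }
        $item = "<p>" . trim($item) . "</p>";
        $out[] = $item;
    }
    return implode("\n", $out);
}
echo parse($str);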
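You can use this pattern (the g flag is regex-tester notation and is not a valid modifier in PHP; drop it and pass the pattern to preg_replace() or preg_match_all() instead):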
/(?<!^<p>)(<strong>.*?)(<strong>.*)$/gs
Demo
Notice that the exclusion in the negative lookbehind will ONLY work if your string starts with a <p>, so consider trimming it before applying your regex.
<br> tags have to be removed using another regex or str_replace().
Also, consider using an approach other than regex to parse HTML.

Extract Image SRC from string using preg_match_all

I have a string of data that is set as $content, an example of this data is as follows
This is some sample data which is going to contain an image in the format <img src="http://www.randomdomain.com/randomfolder/randomimagename.jpg">. It will also contain lots of other text and maybe another image or two.
I am trying to grab just the <img src="http://www.randomdomain.com/randomfolder/randomimagename.jpg"> and save it as another string for example $extracted_image
I have this so far....
if( preg_match_all( '/<img[^>]+src\s*=\s*["\']?([^"\' ]+)[^>]*>/', $content, $extracted_image ) ) {
$new_content .= 'NEW CONTENT IS '.$extracted_image.'';
All it is returning is...
NEW CONTENT IS Array
I realise my attempt is probably completely wrong but can someone tell me where I am going wrong?
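Your first problem is that preg_match_all() (http://php.net/manual/en/function.preg-match-all.php) places an array into its third argument, so you should be outputting the individual item(s) from the array. With preg_match_all(), $extracted_image[0] is itself an array of full matches, so try $extracted_image[0][0] for the first <img> tag (or $extracted_image[1][0] for its src value) to start.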
You need to use a different function, if you only want one result:
preg_match() returns the first and only the first match.
preg_match_all() returns an array with all the matches.
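A short sketch of the difference, reusing the pattern from the question; the second image URL and the variable names $first and $all are made up for the example:

```php
<?php
$content = 'Text <img src="http://www.randomdomain.com/randomfolder/randomimagename.jpg"> more text '
         . '<img src="http://www.example.com/second.jpg"> end.';
$pattern = '/<img[^>]+src\s*=\s*["\']?([^"\' ]+)[^>]*>/';

// preg_match() stops at the first match: $first[0] is the tag, $first[1] the src.
preg_match($pattern, $content, $first);
echo $first[1], "\n";

// preg_match_all() collects every match: $all[0] lists the tags, $all[1] the src values.
preg_match_all($pattern, $content, $all);
echo implode("\n", $all[1]), "\n";
```

In both cases the result is an array, which is why concatenating it directly into a string prints "Array".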
Using regex to parse valid html is ill-advised. Because there can be unexpected attributes before the src attribute, because non-img tags can trick the regular expression into false-positive matching, and because attribute values can be quoted with single or double quotes, you should use a dom parser. It is clean, reliable, and easy to read.
Code: (Demo)
$string = <<<HTML
This is some sample data which is going to contain an image
in the format <img src="http://www.randomdomain.com/randomfolder/randomimagename.jpg">.
It will also contain lots of other text and maybe another image or two
like this: <img alt='another image' src='http://www.example.com/randomfolder/randomimagename.jpg'>
HTML;
$srcs = [];
$dom=new DOMDocument;
$dom->loadHTML($string);
foreach ($dom->getElementsByTagName('img') as $img) {
$srcs[] = $img->getAttribute('src');
}
var_export($srcs);
Output:
array (
0 => 'http://www.randomdomain.com/randomfolder/randomimagename.jpg',
1 => 'http://www.example.com/randomfolder/randomimagename.jpg',
)

Search dynamic term twice in Regex

I know I can refer in replacement to dynamic parts of the term in regex in PHP:
preg_replace('/(test1)(test2)(test3)/',"$3$2$1",$string);
(Something like this; I'm not sure it is correct, but it's not what I am looking for.)
I want the same thing inside the regex itself, like:
preg_match_all("~<(.*)>.*</$1>~",$string,$matches);
The first part between the "<" and ">" is dynamic (so that every existing HTML tag, and even custom XML tags, can be found), and I want to refer to it again in the same regex.
But it doesn't work for me. Is this even possible?
I have a server with PHP 5.3
/edit:
my final goal is this:
I have an HTML page with, e.g., the following source code:
HTML
<html>
<head>
<title>Titel</title>
</head>
<body>
<div>
<p>
p-test<br />
br-test
</p>
<div>
<p>
div-p-test
</p>
</div>
</div>
</body>
</html>
And after processing it should look like
$htmlArr = array(
'html' => array(
'head' => array('title' => 'Titel'),
'body' => array(
'div0' => array(
'p0' => 'p-test<br />br-test',
'div1' => array(
'p1' => 'div-p-test'
)
)
)
));
Placeholders in the replacement string use the $1 syntax. In the regex itself they are called backreferences and use the syntax \1 (a backslash followed by the group number).
http://www.regular-expressions.info/brackets.html
So in your case:
preg_match_all("~<(.*?)>.*?</\\1>~",$string,$matches);
The backslash is doubled here because, in PHP strings, a backslash escapes itself. (This matters in particular for double-quoted strings, where "\1" would otherwise be read as an octal escape and become the character with code 1.)
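The following sketch shows the backreference in both quoting styles on a small, made-up input string (the sample $string and the names $single and $double are assumptions for illustration):

```php
<?php
$string = '<p>first</p> <div>second</div> <b>mismatch</i>';

// Single-quoted: \1 reaches the regex engine unchanged.
preg_match_all('~<(.*?)>.*?</\1>~', $string, $single);

// Double-quoted: the backslash must be doubled, otherwise "\1" is read as
// an octal escape and becomes the character with code 1.
preg_match_all("~<(.*?)>.*?</\\1>~", $string, $double);

// Both calls capture the same tag names; the mismatched <b>...</i> is skipped.
var_export($single[1]);
```

Note that this only pairs each opening tag with the nearest closing tag of the same name, so nested tags of the same type (as in the div-inside-div sample) are still not handled correctly by a regex like this.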

How to Ignore Whitespaces using preg_match()

I have a string that looks like:
">ANY CONTENT</span>(<a id="show
I need to fetch ANY CONTENT. However, there are spaces in between
</span> and (<a id="show
Here is my preg_match:
$success = preg_match('#">(.*?)</span>\s*\(<a id="show#s', $basicPage, $content);
\s* represents spaces. I get an empty array!
Any idea how to fetch CONTENT?
Use a real HTML parser. Regular expressions are not really suitable for the job. See this answer for more detail.
You can use DOMDocument::loadHTML() to parse into a structured DOM object that you can then query, like this very basic example (you need to do error checking though):
$dom = new DOMDocument;
$dom->loadHTML($data);
$span = $dom->getElementsByTagName('span');
$content = $span->item(0)->textContent;
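A version with basic error checking might look like the sketch below. The contents of $data are a hypothetical stand-in for the fetched page, and the handling of a missing span is just one possible choice:

```php
<?php
// Hypothetical stand-in for the fetched page.
$data = '<div><span class="x">ANY CONTENT</span>   (<a id="show" href="#">show</a>)</div>';

$dom = new DOMDocument;
// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);
$loaded = $dom->loadHTML($data);
libxml_clear_errors();

$content = null;
if ($loaded) {
    // item(0) returns null when the document contains no <span> at all.
    $span = $dom->getElementsByTagName('span')->item(0);
    if ($span !== null) {
        $content = $span->textContent;
    }
}
echo $content === null ? "No span found\n" : $content . "\n";
```

Because the parser works on the document structure, the whitespace between </span> and the following link no longer matters.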
I just had to:
">
define the above properly, because there were too many "> sequences in the page, so the regex didn't know which one to pick specifically. Therefore, it matched everything from an earlier "> until it hit the (
Solution:
.">
Sample:
$success = preg_match('#\.">(.*?)</span>\s*\(<a id="show#s', $basicPage, $content);
