Preg Match Regular expression - php

What will the expression be for preg_match_all to get the content part of this string:
< meta itemprop="url" content="whatever content is here" >
So far I have tried:
preg_match_all('| < meta itemprop=/"url/" content=/"(.*?)/" > |',$input,$outputArray);

Try this expression:
<?php
$input = '< meta itemprop="url" content="whatever content is here" >';
preg_match_all('/content="(.*?)"/',$input,$outputArray);
print_r($outputArray);
?>
Output
Array
(
[0] => Array
(
[0] => content="whatever content is here"
)
[1] => Array
(
[0] => whatever content is here
)
)
Edit
If you want to fetch the content of only the itemprop="url" tag, modify the regex as follows ([^>]* keeps the match inside a single tag):
preg_match_all('/itemprop="url"[^>]*content="(.*?)"/',$input,$outputArray);
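A greedy .* between the two attributes can run on into a later tag on the same line, so it helps to use something that cannot cross the closing >. The following is a minimal sketch, not a definitive pattern: the two-tag $input is made up for illustration, it is written without the extra spaces shown in the question, and it assumes itemprop appears before content inside the tag.

```php
<?php
// Hypothetical input: two meta tags on one line.
$input = '<meta itemprop="url" content="first"><meta itemprop="name" content="second">';

// [^>]* cannot cross the closing ">" of a tag, so the match stays inside
// the one tag that carries itemprop="url".
preg_match_all('/<meta\s[^>]*itemprop="url"[^>]*content="([^"]*)"/', $input, $m);

print_r($m[1]); // only the content of the itemprop="url" tag
```

If the attribute order can vary, a DOM parser is likely a safer choice than extending the pattern.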

Related

Is it possible to exclude parts of the matched string in preg_match?

While writing a script that is supposed to download content from a specific div, I was wondering whether it is possible to skip part of the pattern so that it is not included in the matching result.
Example:
<?php
$html = '
<div class="items">
<div class="item-s-1827">
content 1
</div>
<div class="item-s-1827">
content 2
</div>
<div class="item-s-1827">
content 3
</div>
</div>
';
preg_match_all('/<div class=\"item-s-([0-9]*?)\">([^`]*?)<\/div>/', $html, $match);
print_r($match);
/*
Array
(
[0] => Array
(
[0] => <div class="item-s-1827">
content 1
</div>
[1] => <div class="item-s-1827">
content 2
</div>
[2] => <div class="item-s-1827">
content 3
</div>
)
[1] => Array
(
[0] => 1827
[1] => 1827
[2] => 1827
)
[2] => Array
(
[0] =>
content 1
[1] =>
content 2
[2] =>
content 3
) ) */
Is it possible to omit class=\"item-s-([0-9]*?)\" in such a way that it does not show up in the $match variable?
In general, you can assert that strings precede or follow your search string with positive lookbehinds / positive lookaheads. In the case of a lookbehind, the pattern must be of a fixed length, which conflicts with your requirements. But fortunately there's a powerful alternative: you can make use of \K (keep text out of regex), see http://php.net/manual/en/regexp.reference.escape.php:
\K can be used to reset the match start since PHP 5.2.4. For example, the pattern foo\Kbar matches "foobar", but reports that it has matched "bar". The use of \K does not interfere with the setting of captured substrings. For example, when the pattern (foo)\Kbar matches "foobar", the first substring is still set to "foo".
So here's the regex (I made some additional changes to it), with \K and a positive lookahead:
preg_match_all('/<div class="item-s-[0-9]+">\s*\K[^<]*?(?=\s*<\/div>)/', $html, $match);
print_r($match);
prints
Array
(
[0] => Array
(
[0] => content 1
[1] => content 2
[2] => content 3
)
)
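To make the effect of \K concrete, here is a small sketch that runs the manual's foo/bar example both ways. The variable names $withGroup and $withK are only illustrative.

```php
<?php
// With a capture group, the wanted text ends up in index 1,
// and the full match is still reported in index 0.
preg_match('/foo(bar)/', 'foobar', $withGroup);

// With \K, everything matched before it is dropped from the reported
// match, so the wanted text is the whole match in index 0.
preg_match('/foo\Kbar/', 'foobar', $withK);

print_r($withGroup); // full match "foobar" plus the group "bar"
print_r($withK);     // only "bar"
```

This is why the \K version above returns a single sub-array: there are no capture groups left to report.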
The preferred way to parse HTML in PHP is to use DOMDocument to load the HTML and then DOMXPath to search the resulting object.
Update
Modified based on comments to question so that <div> class names just have to begin with item-s-.
$html = '<div class="items">
<div class="item-s-1827">
content 1
</div>
<div class="item-s-18364">
content 2
</div>
<div class="item-s-1827">
content 3
</div>
</div>';
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
// select every <div> whose class attribute starts with "item-s-"
$divs = $xpath->query("//div[starts-with(@class,'item-s-')]");
$values = array();
foreach ($divs as $div) {
    $values[] = trim($div->nodeValue);
}
print_r($values);
Output:
Array (
[0] => content 1
[1] => content 2
[2] => content 3
)

PHP preg_match_all cutting out text elements

I have a question.
I have an HTML string that I have imported into PHP with file_get_contents.
What I want to do is cut out some strings based on tags like {{ or [tagname] and [/tagname],
and then return them in an array so I can process them.
For example:
{{header.tpl}} should make PHP load the header.tpl file and replace {{header.tpl}} with its contents.
I THINK that regex is the way to go, but that's exactly my weak spot. I have tried, but to no avail.
I got as far as the following code:
<?php
$text = '
Hi this is a text<br />
[#title]
<br />
{{header.tpl}}
<br />link{{menu.tpl}}<br />
<hr/>
<h1>[#subtitle]</h1>
[#content]
{submenu}
{itemactive}<strong>[#link]</strong>{/itemactive}
{itema}[#link]{/item}
{/submenu}
';
$pattern = '^\{{.*}}^';
preg_match_all($pattern, $text, $matches, PREG_SET_ORDER);
print_r($matches);
?>
It gives some results, though:
Array
(
[0] => Array
(
[0] => {{header.tpl}}
)
[1] => Array
(
[0] => {{menu.tpl}}
)
)
Is this what I want?
No, but close!
The problem appears when I put the now nicely formatted $text string on one long line,
like:
$text = 'Hi this is a text<br />[#title]<br />{{header.tpl}}<br />link{{menu.tpl}}<br /><hr/><h1>[#subtitle]</h1>[#content]';
It goes wrong!
The result will become:
Array
(
[0] => Array
(
[0] => {{header.tpl}}<br />link{{menu.tpl}}
)
)
And even then I want the result to be just like the one above!
Then the second problem...
I think I should use the same option for getting the submenu.
Something like :
$pattern = '^\{submenu}.*{/submenu}^';
But strangely that does not work. :-(
And all that I get is:
Array
(
)
Would anyone be able to tell me what I am doing wrong?
TIAD!!
You were close.
The problem with ^\{{.*}}^ is that .* is greedy: it matches everything up to the last }} on the line.
Change it to the non-greedy .*?, or use the regex below.
A better regex would be
\{\{[^}]+}}
Example : http://regex101.com/r/gF4jZ6/1
\{\{ matches the {{
[^}]+ matches anything other than a }
}} matches }}
Will give an output as
Array ( [0] => Array ( [0] => {{header.tpl}} ) [1] => Array ( [0] => {{menu.tpl}} ) )
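The difference between the greedy, non-greedy and negated-class forms can be checked on a one-line string. This is a small sketch; the sample $text is made up, and / is used as the delimiter instead of the ^ from the question.

```php
<?php
$text = 'a{{header.tpl}}b{{menu.tpl}}c';

// Greedy: .* runs to the last }} on the line.
preg_match_all('/\{\{.*}}/', $text, $greedy);

// Non-greedy: .*? stops at the first }} it can.
preg_match_all('/\{\{.*?}}/', $text, $lazy);

// Negated class: [^}]+ can never run past a }.
preg_match_all('/\{\{[^}]+}}/', $text, $negated);

print_r($greedy[0]);  // one long match spanning both tags
print_r($lazy[0]);    // two separate matches
print_r($negated[0]); // two separate matches
```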
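Note the difference between the two regexes: .*? cannot cross a newline unless the s flag is set, whereas [^}]+ matches newlines too but stops at the first }.
Now, in order to match the submenu, just add the s flag so that . matches newlines as well: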
$pattern = '/\{submenu}.*?{\/submenu}/s';
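Since the goal in the question is to replace each {{name.tpl}} with the template's contents, one possible next step is preg_replace_callback. The sketch below is only an illustration: the $templates array stands in for the files the question would load from disk, and the choice to leave unknown names untouched is an assumption.

```php
<?php
// Hypothetical in-memory templates; in the question these would be files.
$templates = [
    'header.tpl' => '<header>Site header</header>',
    'menu.tpl'   => '<nav>Menu</nav>',
];

$text = 'Hi<br />{{header.tpl}}<br />link{{menu.tpl}}<br />{{missing.tpl}}';

// Capture the name between {{ and }} and swap in the template body.
// Names without a matching template are left as they are.
$result = preg_replace_callback(
    '/\{\{([^}]+)}}/',
    function (array $m) use ($templates) {
        return $templates[$m[1]] ?? $m[0];
    },
    $text
);

echo $result, "\n";
```

When reading real files instead, the captured name should be validated before it is used as a path.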

preg_match returns an empty string even though there is a match

I am trying to extract all meta tags in a web page. Currently I am using preg_match_all for that, but unfortunately it returns empty strings for the array indexes.
<?php
$meta_tag_pattern = '/<meta(?:"[^"]*"[\'"]*|\'[^\']*\'[\'"]*|[^\'">])+>/';
$meta_url = file_get_contents('test.html');
if(preg_match_all($meta_tag_pattern, $meta_url, $matches) == 1)
echo "there is a match <br>";
print_r($matches);
?>
Returned array:
Array ( [0] => Array ( [0] => [1] => [2] => [3] => ) ) Array ( [0] => Array ( [0] => [1] => [2] => [3] => ) )
An example with DOMDocument:
$url = 'test.html';
$dom = new DOMDocument();
@$dom->loadHTMLFile($url); // @ suppresses warnings about malformed HTML
$metas = $dom->getElementsByTagName('meta');
foreach ($metas as $meta) {
echo htmlspecialchars($dom->saveHTML($meta));
}
UPDATED: Example grabbing meta tags from URL:
$meta_tag_pattern = '/<meta\s[^>]+>/';
$meta_url = file_get_contents('http://stackoverflow.com/questions/10551116/html-php-escape-and-symbols-while-echoing');
if(preg_match_all($meta_tag_pattern, $meta_url, $matches))
echo "there is a match <br>";
foreach ( $matches[0] as $value ) {
print htmlentities($value) . '<br>';
}
Outputs:
there is a match
<meta name="twitter:card" content="summary">
<meta name="twitter:domain" content="stackoverflow.com"/>
<meta name="og:type" content="website" />
...
Looks like part of the problem is the browser rendering the meta tags as meta tags and not displaying the text when you print_r the output, so they need to be escaped.
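The escaping point can be checked in isolation. This is a minimal sketch with a made-up $tag value: echoing the raw string into an HTML page lets the browser treat it as a tag, while the escaped version shows up as visible text.

```php
<?php
$tag = '<meta name="og:type" content="website" />';

// Escape the markup so a browser displays it instead of interpreting it.
$escaped = htmlspecialchars($tag);

echo $escaped, "\n";
```

Viewing the page source, or running the script from the command line, is another way to see the unescaped matches.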

Regular expression in PHP to return array with all images from html, eg: all src="images/header.jpg" instances

I'd like to be able to return an array with a list of all images (src="" values) from html
[0] = "images/header.jpg"
[1] = "images/person.jpg"
is there a regular expression that can do this?
Many thanks in advance!
Welcome to the world of the millionth "how to extract these values using regex" question ;-) I suggest using the search tool before seeking an answer. Here is just a handful of topics that provide code to do exactly what you need:
replacing all image src tags in HTML text
getting image src in php
How to extract img src, title and alt from html using php?
Matching SRC attribute of IMG tag using preg_match
php regex : get src value
Dynamically replace the “src” attributes of all <img> tags (redux)
preg_match_all , get all img tag that include a string
/src="([^"]+)"/
The image will be in group 1.
Example:
preg_match_all('/src="([^"]+)"/', '<img src="lol"><img src="wat">', $arr, PREG_PATTERN_ORDER);
Returns:
Array
(
[0] => Array
(
[0] => src="lol"
[1] => src="wat"
)
[1] => Array
(
[0] => lol
[1] => wat
)
)
Here is a more polished version of the regular expression provided by Håvard:
/(?<=src=")[^"]+(?=")/
This expression uses Lookahead & Lookbehind Assertions to get only what you want.
$str = '<img src="/img/001.jpg"><img src="/img/002.jpg">';
preg_match_all('/(?<=src=")[^"]+(?=")/', $str, $srcs, PREG_PATTERN_ORDER);
print_r($srcs);
The output will look like the following:
Array
(
[0] => Array
(
[0] => /img/001.jpg
[1] => /img/002.jpg
)
)
I see that many people struggle with Håvard's post and the <script> issue. Here is the same solution in a stricter form:
<img.*?src="([^"]+)".*?>
Example:
preg_match_all('/<img.*?src="([^"]+)".*?>/', '<img src="lol"><img src="wat">', $arr, PREG_PATTERN_ORDER);
Returns:
Array
(
[0] => Array
(
[0] => <img src="lol">
[1] => <img src="wat">
)
[1] => Array
(
[0] => lol
[1] => wat
)
)
This avoids matching other tags.
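To illustrate the <script> issue, here is a short sketch comparing the loose pattern with the stricter one on a made-up $html string that contains a script tag as well as images.

```php
<?php
$html = '<script src="app.js"></script><img alt="x" src="a.jpg"><img src="b.png">';

// Loose: any src="..." attribute matches, including the one on <script>.
preg_match_all('/src="([^"]+)"/', $html, $loose);

// Stricter: the match has to start at an <img tag.
preg_match_all('/<img.*?src="([^"]+)".*?>/', $html, $strict);

print_r($loose[1]);  // includes app.js
print_r($strict[1]); // image sources only
```

Note that .*? can still run past the end of an <img> tag that has no src attribute, so this remains a heuristic rather than a parser.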

How to parse a file in PHP

I have a string with the following content
[text1] some content [text2] some
content some content [text3] some
content
The "[textn]" are finite and also have specific names. I want to get the content into an array. Any idea?
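If you don't wanna use regular expressions, then strtok() would do the trick here (this assumes the string starts with a [name] tag, because strtok() skips leading delimiters):
$id = strtok($txt, "[]"); // first name; the leading [ is skipped
while ($id !== false) {
    $result[$id] = strtok("["); // content up to the next [
    $id = strtok("]"); // next name, up to its closing ]
}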
In PHP there are functions for working with regular expressions, such as preg_split, preg_match and preg_match_all; look them up.
If you have a word list, you can split the string like this (obviously, one could write it much more nicely):
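$words = array('[text1]','[text2]','[text3]');
$str = "[text1] some content [text2] some content some content [text3] some content3";
$matches = array();
$del = ''; // initialised so the first loop pass does not read an undefined variable
for ($i = 0; $i < count($words); $i++) {
    $olddel = $del;
    $del = $words[$i];
    // split once at the current delimiter: the text before it, and the rest
    list($match, $str) = explode($del, $str, 2);
    if ($i > 0) { $matches[$i-1] = $olddel.' '.$match; }
}
$matches[] = $del." ".$str;
print_r($matches);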
This will output: Array ( [0] => [text1] some content [1] => [text2] some content some content [2] => [text3] some content3 )
preg_match or preg_match_all, you need to give us an example if you want regex.
$string = "[text1] some content [text2] some content some content [text3] some content";
preg_match_all("#\[([^\[\]]+)\]#is", $string, $matches);
print_r($matches); //Array ( [0] => Array ( [0] => [text1] [1] => [text2] [2] => [text3] ) [1] => Array ( [0] => text1 [1] => text2 [2] => text3 ) )
Non-recursive.
Are [ and ] part of the string, or did you just use them to highlight the part that you want to extract? If they are not, then you can use
if (preg_match_all("/\b(text1|text2|text3|foo|bar)\b/i", $string, $matches)) {
print_r($matches);
}
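If the goal is an array that maps each [name] to the content that follows it, another option is preg_split with PREG_SPLIT_DELIM_CAPTURE. This is a sketch under the assumption that the string starts with a [name] tag and that names never contain brackets; the $parts and $result names are illustrative.

```php
<?php
$string = '[text1] some content [text2] some content some content [text3] some content';

// Split on [name]; the capture group keeps each name in the result.
// $parts alternates: text before, name, content, name, content, ...
$parts = preg_split('/\[([^\[\]]+)\]/', $string, -1, PREG_SPLIT_DELIM_CAPTURE);

// Pair every name (odd index) with the content that follows it.
$result = [];
for ($i = 1; $i + 1 < count($parts); $i += 2) {
    $result[$parts[$i]] = trim($parts[$i + 1]);
}

print_r($result);
```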
