PHP Regular Expressions. HTML Parse

I'm looking for a way to find an HTML div with a certain ID using PHP:
<?php
$regex = "<div+[a-zA-Z0-9._-\"]+id=\"";
$string = '<html><body><div style="rubbish" id="man"></body></html>';
preg_match($regex, $string, $matches, PREG_OFFSET_CAPTURE);
$var_export = $matches;
$var = $var_export[1][1];
echo substr($string, $var, 3);
?>
I know this is a load of rubbish at the moment, but I can't quite get my head around regular expressions.

You may want to try this:
$html = '<html><body><div style="rubbish" id="man">something </div><div id="otherid">blabla</div></body></html>';
preg_match_all('%(<div.*?id="man">.*?</div>)%im', $html, $result, PREG_PATTERN_ORDER);
for ($i = 0; $i < count($result[1]); $i++) {
echo $result[1][$i];
}
DEMO
http://ideone.com/KQv3OA
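A regex like this can over-match when attributes or nesting vary, so an alternative is to let an HTML parser locate the element. The following is a minimal sketch, assuming the dom extension is enabled; the variable names are illustrative and not part of the original answer.

```php
<?php
$html = '<html><body><div style="rubbish" id="man">something </div><div id="otherid">blabla</div></body></html>';

$doc = new DOMDocument();
// Collect libxml parse warnings internally instead of emitting them.
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

// Select the div whose id attribute is exactly "man".
$xpath = new DOMXPath($doc);
$div = $xpath->query('//div[@id="man"]')->item(0);

if ($div !== null) {
    echo $div->textContent, "\n";    // text inside the div
    echo $doc->saveHTML($div), "\n"; // the div's markup as serialized by libxml
}
```

The XPath expression matches on the attribute value itself, so the order of attributes inside the tag does not matter.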

Related

Grab URL within a string which contains HTML code

I have a string, for example:
$html = '<p>helloworld</p><p>helloworld</p>';
And I want to search the string for the first URL that starts with youtube.com or youtu.be and store it in variable $first_found_youtube_url.
How can I do this efficiently?
I can do a preg_match or strpos looking for the URLs, but I'm not sure which approach is more appropriate.

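I wrote this function a while back. It uses a regex and returns an array of unique URLs. Note that it sorts the matches by length (longest first), so the first item in the array is the longest URL, not necessarily the first one in the string; remove the usort call if you need the original order.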
function getUrlsFromString($string) {
$regex = '#\bhttps?://[^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|/))#i';
preg_match_all($regex, $string, $matches);
$matches = array_unique($matches[0]);
usort($matches, function($a, $b) {
return strlen($b) - strlen($a);
});
return $matches;
}
Example:
$html = '<p>helloworld</p><p>helloworld</p>';
$urls = getUrlsFromString($html);
$first_found_youtube = $urls[0];
With YouTube specific regex:
function getYoutubeUrlsFromString($string) {
$regex = '#(https?:\/\/(?:www\.)?(?:youtube\.com\/watch\?v=|youtu\.be\/)([a-zA-Z0-9_-]+))#i';
preg_match_all($regex, $string, $matches);
$matches = array_unique($matches[0]);
usort($matches, function($a, $b) {
return strlen($b) - strlen($a);
});
return $matches;
}
Example:
$html = '<p>helloworld</p><p>helloworld</p>';
$urls = getYoutubeUrlsFromString($html);
$first_found_youtube = $urls[0];

You can parse the HTML with DOMDocument and look for YouTube URLs with stripos, something like this:
$html = '<p>helloworld</p><p>helloworld</p>';
$DOMD = new DOMDocument();
@$DOMD->loadHTML($html); // @ suppresses warnings about malformed HTML
foreach ($DOMD->getElementsByTagName("a") as $url)
{
if (0 === stripos($url->getAttribute("href"), "https://www.youtube.com/") || 0 === stripos($url->getAttribute("href"), "https://youtu.be/"))
{
$first_found_youtube_url = $url->getAttribute("href");
break;
}
}
Personally, I would probably use
"youtube.com"===parse_url($url->getAttribute("href"),PHP_URL_HOST)
instead, as it would match both http AND https links, which is probably what you want, though strictly speaking it is not what the question asks for right now. Note that the host can also be "www.youtube.com" or "youtu.be", so those need to be compared as well.

I think this will do what you are looking for. I have used preg_match_all simply because I find it easier to debug the regexes.
<?php
$html = '<p>helloworld</p><p>helloworld</p>';
$pattern = '/https?:\/\/(www\.)?youtu(\.be|be\.com)\/[a-zA-Z0-9\?=_-]*/i';
preg_match_all($pattern, $html, $matches);
// print_r($matches);
$first_found_youtube = $matches[0][0];
echo $first_found_youtube;
demo - https://3v4l.org/lFjmK
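Since only the first URL is wanted, preg_match (which stops at the first match) may be enough. Here is a minimal sketch under that assumption; the helper name firstYoutubeUrl and the sample HTML are made up for illustration, and the pattern only covers the youtube.com/watch?v= and youtu.be/ forms.

```php
<?php
// Hypothetical helper: returns the first YouTube URL in $html, or null if none is found.
function firstYoutubeUrl(string $html): ?string
{
    // [\w-]+ covers letters, digits, underscore and hyphen in the video ID.
    $pattern = '~https?://(?:www\.)?(?:youtube\.com/watch\?v=|youtu\.be/)[\w-]+~i';
    return preg_match($pattern, $html, $m) === 1 ? $m[0] : null;
}

// Sample input invented for this sketch.
$html = '<p><a href="http://example.com/">x</a></p>'
      . '<p><a href="https://youtu.be/abc_123-XYZ">video</a></p>'
      . '<p><a href="https://www.youtube.com/watch?v=second">other</a></p>';

$first_found_youtube_url = firstYoutubeUrl($html);
echo $first_found_youtube_url, "\n";
```

Returning null when nothing matches avoids the undefined-offset notice that indexing an empty match array would cause.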

Checking pattern as much as possible

How can I make preg find all possible matches for a regular expression pattern?
Here's the code:
<?php
$text = 'Amazing analyzing.';
$regexp = '/(^|\\b)([\\S]*)(a)([\\S]*)(\\b|$)/ui';
$matches = array();
if (preg_match_all($regexp, $text, $matches, PREG_SET_ORDER)) {
foreach ($matches as $match) {
echo "{$match[2]}[{$match[3]}]{$match[4]}\n";
}
}
?>
Output:
Am[a]zing
an[a]lyzing.
Output that I need:
[A]mazing
Am[a]zing
[A]nalyzing.
an[a]lyzing.

You have to use lookbehind/lookahead zero-length assertions (instead of a normal pattern which consumes the characters around what you are looking for): http://www.regular-expressions.info/lookaround.html

Lookaround assertions won't help, for two reasons:
Since they are zero-length, they won't return characters that you need.
As Avinash Raj noted, PHP lookbehind doesn't allow *.
This yields the output that you need:
<?php
$text = 'Amazing analyzing.';
foreach (preg_split('/\s+/', $text) as $word)
{
$matches = preg_split('/(a)/i', $word, 0, PREG_SPLIT_DELIM_CAPTURE);
for ($match = 1; $match < count($matches); $match += 2)
{
$prefix = join(array_slice($matches, 0, $match));
$suffix = join(array_slice($matches, $match+1));
echo "{$prefix}[{$matches[$match]}]{$suffix}\n";
}
}
?>
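The same output can also be produced by asking preg_match_all for the offset of every occurrence and slicing the word around it. This is a minimal sketch, not the answerer's code; the function name bracketEachMatch is made up for this example.

```php
<?php
// Hypothetical helper: for every occurrence of $letter in every word of $text,
// return a copy of the word with that single occurrence wrapped in brackets.
function bracketEachMatch(string $text, string $letter): array
{
    $out = [];
    foreach (preg_split('/\s+/', $text, -1, PREG_SPLIT_NO_EMPTY) as $word) {
        // PREG_OFFSET_CAPTURE makes each match a [matched text, byte offset] pair.
        preg_match_all('/' . preg_quote($letter, '/') . '/i', $word, $m, PREG_OFFSET_CAPTURE);
        foreach ($m[0] as [$found, $offset]) {
            $out[] = substr($word, 0, $offset)
                   . "[$found]"
                   . substr($word, $offset + strlen($found));
        }
    }
    return $out;
}

echo implode("\n", bracketEachMatch('Amazing analyzing.', 'a')), "\n";
```

Because every occurrence is reported with its own offset, no match "consumes" its neighbours, which is the limitation the original single pattern ran into.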

PHP Regex get reverse number

I have this:
$pattern = 'dev/25{LASTNUMBER}/P/{YYYY}';
$var = 'dev/251/P/2014';
In this situation {LASTNUMBER} = 1. How can I get this value from $var?
The pattern can contain more variables, always written in {}.
The pattern can also be different, for example:
$pattern = '{LASTNUMBER}/aa/bb/P/{OtherVar}';
In this situation $var will be 1/aa/bb/p/some and I want to get 1.
I need to get {LASTNUMBER}, given the pattern and the resulting string.
OK, maybe it is not possible :) or very, very hard.

Use a regex:
if (preg_match('~dev/25([0-9])/P/[0-9]{4}~', $var, $m)) {
$lastnum = $m[1];
}

$parts = explode("/", $var);
if (isset($parts[1])) {
$lastnum = substr($parts[1], -1);
}
This will be faster than a regex :), but it only works when the number is the last character of the second path segment.

You probably need this:
<?php
$pattern = 'dev/251/P/2014';
preg_match_all('%dev/25(.*?)/P/[\d]{4}%sim', $pattern, $match, PREG_PATTERN_ORDER);
$match = $match[1][0];
echo $match; // prints 1
?>
If you need to loop through the results, you can use:
<?php
$pattern = <<< EOF
dev/251/P/2014
dev/252/P/2014
dev/253/P/2014
dev/254/P/2014
dev/255/P/2014
EOF;
preg_match_all('%dev/25(.*?)/P/[\d]{4}%sim', $pattern , $match, PREG_PATTERN_ORDER);
for ($i = 0; $i < count($match[1]); $i++) {
echo $match[1][$i]; // prints 12345 over the whole loop
}
?>
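The question also asks for patterns that vary, with any number of {NAME} placeholders. One way to approach that is to turn the template itself into a regex with named groups. The sketch below assumes that each placeholder stands for one path segment (no "/"), that placeholder names are valid group names and appear only once, and that matching should be case-insensitive; the function name extractTemplateVars is hypothetical.

```php
<?php
// Hypothetical helper: convert a template such as 'dev/25{LASTNUMBER}/P/{YYYY}'
// into a regex with one named group per placeholder, then match $value against it.
function extractTemplateVars(string $template, string $value): ?array
{
    // Split into literal text (even indexes) and placeholder names (odd indexes).
    $parts = preg_split('/\{(\w+)\}/', $template, -1, PREG_SPLIT_DELIM_CAPTURE);

    $regex = '';
    foreach ($parts as $i => $part) {
        $regex .= ($i % 2 === 1)
            ? '(?<' . $part . '>[^/]+)'   // placeholder: one path segment
            : preg_quote($part, '~');     // literal text, escaped for the regex
    }

    if (preg_match('~^' . $regex . '$~i', $value, $m) !== 1) {
        return null; // the value does not fit the template
    }

    // preg_match returns both numeric and named keys; keep only the named ones.
    return array_filter($m, 'is_string', ARRAY_FILTER_USE_KEY);
}

$vars = extractTemplateVars('dev/25{LASTNUMBER}/P/{YYYY}', 'dev/251/P/2014');
echo $vars['LASTNUMBER'], "\n";

$vars2 = extractTemplateVars('{LASTNUMBER}/aa/bb/P/{OtherVar}', '1/aa/bb/p/some');
echo $vars2['LASTNUMBER'], "\n";
```

Building the regex from the template keeps the two in sync, so a new pattern does not need a hand-written expression.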

Most efficient way of Extracting tags from multiple strings

I have an HTML page with multiple instances of the following tags:
<INCLUDEFILE-1-/var/somepath/file1.php>
<INCLUDEFILE-2-/var/somepath/file2.php>
<INCLUDEFILE-3-/var/somepath/file3.php>
<INCLUDEFILE-4-/var/somepath/file4.php>
<INCLUDEFILE-5-/var/somepath/file5.php>
What code can I use to extract all of the paths above? I have so far got the following code but cannot get it to work properly:
preg_match_all('/INCLUDEFILE[^"]+/m', $html, $result, PREG_PATTERN_ORDER);
for ($i = 0; $i < count($result[0]); $i++)
{
$includefile = $result[0][$i];
}
I need to extract:
/var/somepath/file1.php
/var/somepath/file2.php
/var/somepath/file3.php
/var/somepath/file4.php
/var/somepath/file5.php
Can anyone see the obvious mistake(s)?!

The shortest way to happiness:
// \K resets the start of the reported match, so only the path is returned
$pattern = '`<INCLUDEFILE-\d+-\K/[^>\s]+`';
preg_match_all($pattern, $html, $results);
$results=$results[0];
print_r($results);

I changed your regex slightly and added parentheses to capture the subpattern you need. I didn't see quotes (") in the posted example, so I changed it to check for ">" to detect the end. I also added the ungreedy modifier; you can try it with and without. The loop reads $result[1], which contains the matches of the first subpattern.
preg_match_all('/<INCLUDEFILE-[0-9]+-([^>]+)>/Um', $html, $result, PREG_PATTERN_ORDER);
for ($i = 0; $i < count($result[1]); $i++)
{
$includefile = $result[1][$i];
}

You could do it this way:
$html = '
<INCLUDEFILE-1-/var/somepath/file1.php>fadsf
asdfasf<INCLUDEFILE-2-/var/somepath/file2.php>adsfaf
<INCLUDEFILE-3-/var/somepath/file3.php>asdfadsf
<INCLUDEFILE-4-/var/somepath/file4.php>
<INCLUDEFILE-5-/var/somepath/file5.php>
';
$lines = explode(PHP_EOL, $html);
$files = array();
foreach($lines as $line)
{
preg_match('/<INCLUDEFILE-\d+-(.+?)>/', $line, $match);
if(!empty($match)) {
$files[] = $match[1];
}
}
var_dump($files);

Inverse htmlentities / html_entity_decode

Basically I want to turn a string like this:
<code> &lt;div&gt; blabla &lt;/div&gt; </code>
into this:
&lt;code&gt; <div> blabla </div> &lt;/code&gt;
How can I do it?
The use case (because some people were curious):
A page like this with a list of allowed HTML tags and examples. For example, <code> is an allowed tag, and this would be the sample:
<code>&lt;?php echo "Hello World!"; ?&gt;</code>
I wanted a reverse function because there are many such tags with samples, and I store them all in an array which I iterate in one loop, instead of handling each one individually...

My version using regular expressions:
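$string = '<code> &lt;div&gt; blabla &lt;/div&gt; </code>';
// The /e modifier was removed in PHP 7, so preg_replace_callback is used instead.
$new_string = preg_replace_callback(
'/(.*?)(<.*?>|$)/s',
function ($m) {
return html_entity_decode($m[1]) . htmlentities($m[2]);
},
$string
);
It tries to match every text node and tag, and then applies html_entity_decode and htmlentities respectively.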

There isn't an existing function, but have a look at this.
So far I've only tested it on your example, but this function should work on all htmlentities
function html_entity_invert($string) {
$matches = $store = array();
preg_match_all('/(&(#?\w){2,6};)/', $string, $matches, PREG_SET_ORDER);
foreach ($matches as $i => $match) {
$key = '__STORED_ENTITY_' . $i . '__';
$store[$key] = html_entity_decode($match[0]);
$string = str_replace($match[0], $key, $string);
}
return str_replace(array_keys($store), $store, htmlentities($string));
}
Update:
Thanks to @Mike for taking the time to test my function with other strings. I've updated my regex from /(\&(.+)\;)/ to /(\&([^\&\;]+)\;)/ which should take care of the issue he raised.
I've also added {2,6} to limit the length of each match to reduce the possibility of false positives.
Changed regex from /(\&([^\&\;]+){2,6}\;)/ to /(&([^&;]+){2,6};)/ to remove unnecessary escaping.
Whooa, brainwave! Changed the regex from /(&([^&;]+){2,6};)/ to /(&(#?\w){2,6};)/ to reduce probability of false positives even further!

Replacing alone will not be good enough, whether with regular expressions or simple string replacement: if you replace the &lt; and &gt; entities and then the < and > signs (or vice versa), you will end up with everything in one form (all &lt; and &gt;, or all < and >).
So if you want to do this, you have to swap one set out first (I chose to replace it with a placeholder), do a replace, then put the placeholders back with another replace.
$str = "<code> &lt;div&gt; blabla &lt;/div&gt; </code>";
$search = array("<",">",);
//place holder for < and >
$replace = array("[","]");
//first replace to sub out < and > for [ and ] respectively
$str = str_replace($search, $replace, $str);
//second replace to turn the original &lt; and &gt; into < and >
$search = array("&lt;","&gt;");
$replace = array("<",">",);
$str = str_replace($search, $replace, $str);
//third replace to turn [ and ] into &lt; and &gt;
$search = array("[","]");
$replace = array("&lt;","&gt;");
$str = str_replace($search, $replace, $str);
echo $str;
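The placeholder steps can be avoided with strtr, which, when given an array, performs all replacements in a single pass and does not re-scan text it has already replaced. The following is a minimal sketch that handles only the angle brackets and their two entities, not every HTML entity; the function name swapAngleBrackets is made up for this example.

```php
<?php
// Swap raw angle brackets and their entities in a single pass.
// strtr() with an array never re-scans text it has already replaced,
// so the two directions cannot interfere with each other.
function swapAngleBrackets(string $str): string
{
    return strtr($str, [
        '<'    => '&lt;',
        '>'    => '&gt;',
        '&lt;' => '<',
        '&gt;' => '>',
    ]);
}

$str = '<code> &lt;div&gt; blabla &lt;/div&gt; </code>';
$swapped = swapAngleBrackets($str);
echo $swapped, "\n";
// Applying the swap a second time restores the original string.
echo swapAngleBrackets($swapped), "\n";
```

Unlike the placeholder version, this does not break when the input already contains [ or ] characters.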

I think I have a small solution: why not break the HTML into an array of tags and text, then compare and change each piece as needed?
function invertHTML($str) {
// start with an empty first chunk so text before the first tag can be appended
$res = array('');
for ($i=0, $j=0; $i < strlen($str); $i++) {
// $str[$i] replaces the $str{$i} syntax, which was removed in PHP 8
if ($str[$i] == "<") {
if (isset($res[$j]) && strlen($res[$j]) > 0){
$j++;
$res[$j] = '';
} else {
$res[$j] = '';
}
$pos = strpos($str, ">", $i);
$res[$j] .= substr($str, $i, $pos - $i+1);
$i += ($pos - $i);
$j++;
$res[$j] = '';
continue;
}
$res[$j] .= $str[$i];
}
$newString = '';
foreach($res as $html){
$change = html_entity_decode($html);
if($change != $html){
$newString .= $change;
} else {
$newString .= htmlentities($html);
}
}
return $newString;
}
Modified .... with no errors.

So, although other people on here have recommended regular expressions, which may be the absolute right way to go, I wanted to post this, as it is sufficient for the question you asked.
Assuming that you are always using html'esque code:
$str = '<code> &lt;div&gt; blabla &lt;/div&gt; </code>';
xml_parse_into_struct(xml_parser_create(), $str, $nodes);
$xmlArr = array();
foreach($nodes as $node) {
echo htmlentities('<' . $node['tag'] . '>') . html_entity_decode($node['value']) . htmlentities('</' . $node['tag'] . '>');
}
Gives me the following output:
&lt;CODE&gt; <div> blabla </div> &lt;/CODE&gt;
I'm fairly certain that this wouldn't support going backwards again, as the other posted solutions would, in the sense of:
$orig = '<code> &lt;div&gt; blabla &lt;/div&gt; </code>';
$modified = '&lt;CODE&gt; <div> blabla </div> &lt;/CODE&gt;';
$modifiedAgain = '<code> &lt;div&gt; blabla &lt;/div&gt; </code>';

I'd recommend using a regular expression, e.g. preg_replace():
http://www.php.net/manual/en/function.preg-replace.php
http://www.webcheatsheet.com/php/regular_expressions.php
http://davebrooks.wordpress.com/2009/04/22/php-preg_replace-some-useful-regular-expressions/

Edit: It appears that I haven't fully answered your question. There is no built-in PHP function to do what you want, but you can do find and replace with regular expressions or even simple expressions: str_replace, preg_replace
