php preg_match_all words starting with? - php

I'm pulling in a calendar from an external site with file_get_contents, so I can use jQuery .load on it.
In order to fix the relative path issues with this approach, I'm using
preg_match_all.
So doing
preg_match_all("/<a href='([^\"]*)'/iU", $string, $match);
Gets me all the occurrences of <a href = ''
What I'm after are just the links inside the single quotes.
Now each link starts with "?date" so I have <a href='?date=4%2F9%2F2014&a' etc.
How can I efficiently get the string between the single quotes in all <a href= occurrences?

Use the DOM parser to get the href from each <a> tag:
<?php
$file = "your.html";
$doc = new DOMDocument();
$doc->loadHTMLFile($file);
$elements = $doc->getElementsByTagName('a');
foreach ($elements as $tag) {
echo $tag->getAttribute('href');
}
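Since the markup in the question comes from file_get_contents as a string and the links are relative query strings, a variant along these lines may be closer to the use case. This is only a sketch: the function name absoluteCalendarLinks, the base URL and the sample markup are assumptions, not part of the question.

```php
<?php
// Hypothetical helper: collect every href that begins with "?date"
// and prefix it with a base URL so it is no longer relative.
function absoluteCalendarLinks(string $html, string $baseUrl): array
{
    $doc = new DOMDocument();
    // Collect libxml warnings internally instead of printing them,
    // since fetched real-world markup is rarely valid.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = [];
    foreach ($doc->getElementsByTagName('a') as $tag) {
        $href = $tag->getAttribute('href');
        if (strpos($href, '?date') === 0) {
            $links[] = $baseUrl . $href;
        }
    }
    return $links;
}

// Assumed sample markup and base URL.
$html = "<p><a href='?date=4%2F9%2F2014&amp;a=1'>9</a> <a href='/about'>About</a></p>";
print_r(absoluteCalendarLinks($html, 'https://calendar.example.com/'));
```

Only the first link is returned, because the second href does not start with "?date".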

Related

remove tags <a> to a specific URL domain php [duplicate]

This script is not mine; I am trying to modify it. It searches for all the <a> tags and removes them, keeping only their text. How would I change the code so that it strips only the links that point to a given domain or URL? For example, to remove the links to www.domainurl.com, given this input:
fsdf
<a title="Google Adsense" href="https://www.domainurl.com/refer/google-adsense/" target="_blank" rel="nofollow noopener">fgddf</a>
domain
<a title="Google Adsense" href="https://www.googlead.com/refer/google-adsense/" target="_blank" rel="nofollow noopener">googled</a>
result would look like this:
fsdf
fgddf
domain
<a title="Google Adsense" href="https://www.googlead.com/refer/google-adsense/" target="_blank" rel="nofollow noopener">google</a>
This is the code :
if (in_array ( 'OPT_STRIP', $camp_opt )) {
echo '<br>Striping links ';
//$abcont = strip_tags ( $abcont, '<p><img><b><strong><br><iframe><embed><table><del><i><div>' );
preg_match_all('{<a.*?>(.*?)</a>}' , $abcont , $allLinksMatchs);
$allLinksTexts = $allLinksMatchs[1];
$allLinksMatchs=$allLinksMatchs[0];
$j = 0;
foreach ($allLinksMatchs as $singleLink){
if(! stristr($singleLink, 'twitter.com'))
$abcont = str_replace($singleLink, $allLinksTexts[$j], $abcont);
$j++;
}
}
I tried restricting the search in preg_match_all with this regex, but it did not work for me:
preg_match_all('{<a.*?[^>]* href="((https?:\/\/)?([\w\-])+\.{1}domainurl\.([a-z]{2,6})([\/\w\.-]*)*\/?)">(.*?)</a>}' , $abcont , $allLinksMatchs);
Any ideas? I would be very grateful.
Rather than try and parse HTML with regular expressions, as you suggested, I have chosen to use the DOMDocument class instead.
function remove_domain($str, $domainsToRemove)
{
$domainsToRemove = is_array($domainsToRemove) ? $domainsToRemove : array_slice(func_get_args(), 1);
$dom = new DOMDocument;
$dom->loadHTML("<div>{$str}</div>", LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$anchors = $dom->getElementsByTagName('a');
// Code taken and modified from: http://php.net/manual/en/domnode.replacechild.php#50500
$i = $anchors->length - 1;
while ($i > -1) {
$anchor = $anchors->item($i);
foreach ($domainsToRemove as $domain) {
if (strpos($anchor->getAttribute('href'), $domain) !== false) {
// $new = $dom->createElement('p', $anchor->textContent);
$new = $dom->createTextNode($anchor->textContent);
$anchor->parentNode->replaceChild($new, $anchor);
// The anchor is no longer in the document, so stop checking further domains.
break;
}
}
$i--;
}
// Create HTML string, then remove the wrapping div.
$html = $dom->saveHTML();
$html = substr($html, 5, strlen($html) - (strlen('</div>') + 1) - strlen('<div>'));
return $html;
}
You can then use the above code in the following examples.
Notice how you can pass in a single domain as a string, pass an array of domains, or take advantage of func_get_args and pass any number of domains as separate arguments.
$str = <<<str
fsdf
<a title="Google Adsense" href="https://www.domainurl.com/refer/google-adsense/" target="_blank" rel="nofollow noopener">fgddf</a>
domain
<a title="Google Adsense" href="https://www.googlead.com/refer/google-adsense/" target="_blank" rel="nofollow noopener">googled</a>
str;
// Example usage
remove_domain($str, 'domainurl.com');
remove_domain($str, 'domainurl.com', 'googlead.com');
remove_domain($str, ['domainurl.com', 'googlead.com']);
Firstly, I have stored your string in a variable, but that is just so that I could utilize it for the answer; replace $str with wherever you get that code from.
The loadHTML function takes an HTML string. Because LIBXML_HTML_NOIMPLIED stops it from adding the implied html and body elements, the string needs a single root element, which is why I have wrapped it in a div.
The while loop will iterate over the anchor elements, and then replace any that match a specified domain with just the content of the anchor tags.
Note, I have left in a comment above this line which you can use instead. This will replace the anchor element with a p tag, which will have a default style of display: block; meaning that your layout won't be likely to break. However, since your expected output is just text nodes, I have left this as just an option.
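One thing to keep in mind is that strpos looks for the domain anywhere in the href, so a link whose query string merely mentions the domain would also be stripped. If that matters, the check could compare the host instead. A minimal sketch, assuming a hypothetical helper named hrefIsOnDomain that is not part of the answer above:

```php
<?php
// Hypothetical helper: true when the host of $href is $domain itself
// or a subdomain of it (www.domainurl.com matches domainurl.com).
function hrefIsOnDomain(string $href, string $domain): bool
{
    $host = parse_url($href, PHP_URL_HOST);
    if (!is_string($host)) {
        return false; // relative URL, or one that parse_url() could not handle
    }
    $host = strtolower($host);
    $domain = strtolower($domain);
    $suffix = '.' . $domain;
    return $host === $domain || substr($host, -strlen($suffix)) === $suffix;
}

var_dump(hrefIsOnDomain('https://www.domainurl.com/refer/google-adsense/', 'domainurl.com')); // bool(true)
var_dump(hrefIsOnDomain('https://example.org/?ref=domainurl.com', 'domainurl.com'));          // bool(false)
```

Inside remove_domain, the strpos condition could then be swapped for hrefIsOnDomain($anchor->getAttribute('href'), $domain).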
What about:
<a.*? href=\".*www\.googlead\.com.*\">(.*?)<\/a>
So it becomes:
preg_match_all('{<a.*? href=\".*www\.googlead\.com.*\">(.*?)<\/a>}' , $abcont , $allLinksMatchs);
This removes only a tags from www.googlead.com.
Assuming your HTML is held in a variable, preg_replace should be a better option. Here's a function that should help:
function removeLinkTagsOfDomain($html, $domain) {
// Escape all regex special characters, including the / delimiter
$domain = preg_quote($domain, '/');
// Search for <a> tags with a href attribute containing the specified domain;
// [^>]* and [^"]* keep the match inside a single tag and a single attribute
$pattern = '/<a\s[^>]*href="[^"]*' . $domain . '[^"]*"[^>]*>(.*?)<\/a>/is';
// Final replacement (should be the text node of <a> tags)
$replacer = '$1';
return preg_replace($pattern, $replacer, $html);
}
// Usage:
$domains = [...];
$html = '...';
foreach ($domains as $d) {
$html = removeLinkTagsOfDomain($html, $d);
}

How can I exclude regex href matches of a particular domain?

How can I exclude href matches for a domain (ex. one.com)?
My current code:
$str = 'This string has <a href="http://one.com">one link</a> and <a href="http://two.com">another link</a>';
$str = preg_replace('~<a href="(https?://[^"]+)".*?>.*?</a>~', '$1', $str);
echo $str; // This string has http://one.com and http://two.com
Desired result:
This string has one link and http://two.com
Using a regular expression
If you're going to use a regular expression to accomplish this task, you can use a negative lookahead. It basically asserts that the part // in the href attribute is not followed by one.com. It's important to note that a lookaround assertion doesn't consume any characters.
Here's what the regular expression would look like:
<a href="(https?://(?!one\.com)[^"]+)".*?>.*?</a>
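To see the lookahead at work, the pattern can be dropped into the preg_replace call from the question. The sample string here is an assumption, reconstructed from the output shown in the question:

```php
<?php
// Assumed input: two links, one of them pointing at one.com.
$str = 'This string has <a href="http://one.com">one link</a> and <a href="http://two.com">another link</a>';

// (?!one\.com) makes the match fail when the host starts with one.com,
// so that link is left untouched.
$pattern = '~<a href="(https?://(?!one\.com)[^"]+)".*?>.*?</a>~';

$result = preg_replace($pattern, '$1', $str);
echo $result;
// This string has <a href="http://one.com">one link</a> and http://two.com
```

The one.com anchor stays in the markup, so a browser still renders it as "one link", while the two.com anchor is replaced by its URL.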
Using a DOM parser
Even though this is a pretty simple task, the correct way to achieve this would be using a DOM parser. That way, you wouldn't have to change the regex if the format of your markup changes in future. The regex solution will break if the <a> node contains more attribute values. To fix all those issues, you can use a DOM parser such as PHP's DOMDocument to handle the parsing:
Here's what the solution would look like:
$dom = new DOMDocument();
$dom->loadHTML($html); // $html is the string containing markup
$links = $dom->getElementsByTagName('a');
//Loop through links and replace them with their anchor text
for ($i = $links->length - 1; $i >= 0; $i--) {
$node = $links->item($i);
$text = $node->textContent;
$href = $node->getAttribute('href');
if ($href !== 'http://one.com') {
$newTextNode = $dom->createTextNode($text);
$node->parentNode->replaceChild($newTextNode, $node);
}
}
echo $dom->saveHTML();
This should do it:
<a href="(https?://(?!one\.com)[^"]+)".*?>.*?</a>
We use a negative lookahead to make sure that one.com does not appear directly after the https?://.
If you also want to check for some subdomains of one.com, use this example:
<a href="(https?://(?!((www|example)\.)?one\.com)[^"]+)".*?>.*?</a>
Here we optionally check for www. or example. before one.com. This will still allow a URL like misc.one.com, though. If you want to exclude all subdomains of one.com, use this:
<a href="(https?://(?!([^.]+\.)?one\.com)[^"]+)".*?>.*?</a>
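A quick way to check which URLs the last lookahead lets through is to run it against a few hosts. This is a sketch with an assumed list of test URLs; the pattern is anchored with ^ and $ here only so that it can be tested on bare URLs rather than whole tags:

```php
<?php
// Lookahead taken from the last pattern above.
$pattern = '~^https?://(?!([^.]+\.)?one\.com)[^"]+$~';

// Assumed test URLs.
$urls = ['http://one.com', 'http://www.one.com', 'http://misc.one.com', 'http://two.com'];

$kept = [];
foreach ($urls as $url) {
    if (preg_match($pattern, $url) === 1) {
        $kept[] = $url;
    }
}
print_r($kept); // only http://two.com remains
```

Because [^.]+ cannot cross a dot, this form only covers one subdomain level; a host such as a.b.one.com would still get through.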

Regex preg_replace find image in string WITH img attributes

I'm trying to find ALL images in my blog posts with regex. The code below returns images IF the code is clean and the SRC tag comes right after the IMG tag. However, I also have images with other attributes such as height and width. The regex I have does not pick that up... Any ideas?
The following code returns images that looks like this:
<img src="blah_blah_blah.jpg">
But not images that looks like this:
<img width="290" height="290" src="blah_blah_blah.jpg">
Here is my code
$pattern = '/<img\s+src="([^"]+)"[^>]+>/i';
preg_match($pattern, $data, $matches);
echo $matches[1];
Use DOM or another parser for this, don't try to parse HTML with regular expressions.
$html = <<<DATA
<img width="290" height="290" src="blah.jpg">
<img src="blah_blah_blah.jpg">
DATA;
$doc = new DOMDocument();
$doc->loadHTML($html); // load the html
$xpath = new DOMXPath($doc);
$imgs = $xpath->query('//img');
foreach ($imgs as $img) {
echo $img->getAttribute('src') . "\n";
}
Output
blah.jpg
blah_blah_blah.jpg
Ever think of using the DOM object instead of regex?
$doc = new DOMDocument();
$doc->loadHTML('<img src="http://example.com/img/image.jpg" ... />');
$imageTags = $doc->getElementsByTagName('img');
foreach($imageTags as $tag) {
echo $tag->getAttribute('src');
}
You would be better off using a parser, but here is a way to do it with a regex:
$pattern = '/<img\s.*?src="([^"]+)"/i';
The problem is that your pattern only accepts whitespace (\s+) between <img and src. Try this instead, which allows other attributes before src and none after it:
$pattern = '/<img\s+[^>]*?src="([^"]+)"[^>]*>/i';
preg_match($pattern, $data, $matches);
echo $matches[1];
Try this:
$pattern = '/<img\s.*?src=["\']([^"\']+)["\']/i';
Single or double quote and dynamic src attr position.
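Note that the question asks for ALL images, while preg_match stops at the first match. A sketch with preg_match_all, using assumed sample markup and [^>]*? in place of .*? so the match cannot run past the end of the tag:

```php
<?php
// Assumed sample: one plain image, one with extra attributes and single quotes.
$data = '<img src="first.jpg"> <img width="290" height="290" src=\'second.jpg\'>';

// Any attributes may come before src; either quote style is accepted.
$pattern = '/<img\s[^>]*?src=["\']([^"\']+)["\']/i';

preg_match_all($pattern, $data, $matches);
print_r($matches[1]); // first.jpg, then second.jpg
```

$matches[1] holds the captured src values in document order.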

PHP regex - Find the highest value

I need to find the highest number in a string like this:
Example
<div id='pages'>
<a href='pages.php?start=0&end=20'>Page 1</a>
<a href='pages.php?start=20&end=40'>Page 2</a>
<a href='pages.php?start=40&end=60'>Page 3</a>
<a href='pages.php?start=60&end=80'>Page 4</a>
<a href='pages.php?start=80&end=89'>Page 5</a>
</div>
In this example, I should get 89, because it's the highest "end" value.
I think I should use regex, but I don't know how :(
Any help would be very appreciated!
You shouldn't be doing this with a regex. In fact, I don't even know how you would. You should be using an HTML parser, parsing out the end parameter from each <a> tag's href attribute with parse_str(), and then finding the max() of them, like this:
$doc = new DOMDocument;
$doc->loadHTML( $str); // All & should be encoded as &amp;
$xpath = new DOMXPath( $doc);
$end_vals = array();
foreach( $xpath->query( '//div[@id="pages"]/a') as $a) {
parse_str( $a->getAttribute( 'href'), $params);
$end_vals[] = $params['end'];
}
echo max( $end_vals);
The above will print 89.
Note that this assumes your HTML entities are properly escaped, otherwise DOMDocument will issue a warning.
One optimization you can do is instead of keeping an array of end values, just compare the max value seen with the current value. However this will only be useful if the number of <a> tags grows larger.
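The running-maximum idea could look roughly like this. It is a sketch on a shortened copy of the sample markup, and it passes only the query string to parse_str (via parse_url), which is a small change from the code above:

```php
<?php
// Shortened, assumed copy of the pagination markup, with & encoded as &amp;.
$str = "<div id='pages'>
<a href='pages.php?start=0&amp;end=20'>Page 1</a>
<a href='pages.php?start=60&amp;end=80'>Page 4</a>
<a href='pages.php?start=80&amp;end=89'>Page 5</a>
</div>";

$doc = new DOMDocument;
$doc->loadHTML($str);
$xpath = new DOMXPath($doc);

$maxEnd = null; // highest end value seen so far
foreach ($xpath->query('//div[@id="pages"]/a') as $a) {
    // Parse only the query string part of the href.
    parse_str((string) parse_url($a->getAttribute('href'), PHP_URL_QUERY), $params);
    if (isset($params['end']) && ($maxEnd === null || (int) $params['end'] > $maxEnd)) {
        $maxEnd = (int) $params['end'];
    }
}
echo $maxEnd; // 89
```

No array of values is kept; each link is compared against the current maximum and then discarded.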
Edit: As DaveRandom points out, if we can make the assumption that the <a> tag that holds the highest end value is the last <a> tag in this list, simply due to how paginated links are presented, then we don't need to iterate or keep a list of other end values, as shown in the following example.
$doc = new DOMDocument;
$doc->loadHTML( $str);
$xpath = new DOMXPath( $doc);
parse_str( $xpath->evaluate( 'string(//div[@id="pages"]/a[last()]/@href)'), $params);
echo $params['end'];
To find the highest number in the entire string, regardless of position, you can use
preg_split — Split string by a regular expression
max — Find highest value
Example:
echo max(preg_split('/\D+/', $html, -1, PREG_SPLIT_NO_EMPTY)); // prints 89
This works by splitting the string by anything that is not a number, leaving you with an array containing all the numbers in the string and then fetching the highest number from that array.
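To make the intermediate step visible, here is the same call on a single link (the one-line sample is an assumption). Keep in mind that every number in the string takes part, including start values and page labels, so this only gives the right answer when the end value really is the largest number present:

```php
<?php
// Assumed one-link sample.
$html = "<a href='pages.php?start=80&end=89'>Page 5</a>";

// Split on every run of non-digits and drop the empty pieces.
$numbers = preg_split('/\D+/', $html, -1, PREG_SPLIT_NO_EMPTY);

print_r($numbers);  // the strings "80", "89" and "5"
echo max($numbers); // 89
```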
First extract all the end values from the links, then apply the max function:
$str = "<div id='pages'>
<a href='pages.php?start=0&end=20'>Page 1</a>
<a href='pages.php?start=20&end=40'>Page 2</a>
<a href='pages.php?start=40&end=60'>Page 3</a>
<a href='pages.php?start=60&end=80'>Page 4</a>
<a href='pages.php?start=80&end=89'>Page 5</a>
</div>";
if(preg_match_all("/href=['][^']+end=([0-9]+)[']/i", $str, $matches))
{
$maxVal = max($matches[1]);
echo $maxVal;
}
function getHighest($html) {
$my_document = new DOMDocument();
$my_document->loadHTML($html);
$nodes = $my_document->getElementsByTagName('a');
$numbers = array();
foreach ($nodes as $node) {
if (preg_match('/\d+$/', $node->getAttribute('href'), $match) === 1) {
$numbers[] = intval($match[0]);
}
}
return max($numbers);
}

Using regex to remove HTML tags

I need to convert
$text = 'We had <i>fun</i>. Look at <a href="http://example.com">this photo</a> of Joe';
[Edit] There could be multiple links in the text.
to
$text = 'We had fun. Look at this photo (http://example.com) of Joe';
All HTML tags are to be removed and the href value from <a> tags needs to be added like above.
What would be an efficient way to solve this with regex? Any code snippet would be great.
First do a preg_replace to keep the link. You could use:
preg_replace('~<a href="(.*?)">(.*?)</a>~', '$2 ($1)', $str);
Then use strip_tags which will finish off the rest of the tags.
Try an XML parser to replace every tag with its inner HTML, and each <a> tag with its text plus its href attribute.
http://www.php.net/manual/en/book.domxml.php
The DOM solution:
$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
foreach($xpath->query('//a[@href]') as $node) {
$textNode = new DOMText(sprintf('%s (%s)',
$node->nodeValue, $node->getAttribute('href')));
$node->parentNode->replaceChild($textNode, $node);
}
echo strip_tags($dom->saveHTML());
and the same without XPath:
$dom = new DOMDocument;
$dom->loadHTML($html);
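// Copy the live node list first; replacing nodes while iterating it directly can skip links.
foreach(iterator_to_array($dom->getElementsByTagName('a')) as $node) {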
if($node->hasAttribute('href')) {
$textNode = new DOMText(sprintf('%s (%s)',
$node->nodeValue, $node->getAttribute('href')));
$node->parentNode->replaceChild($textNode, $node);
}
}
echo strip_tags($dom->saveHTML());
All it does is load the HTML into a DOMDocument instance. In the first case it uses an XPath expression, which is kinda like SQL for XML, and gets all links with an href attribute. It then creates a text node from the link text and the href attribute and replaces the link with it. The second version just uses the DOM API and no XPath.
Yes, it's a few lines more than Regex but this is clean and easy to understand and it won't give you any headaches when you need to add additional logic.
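Since the question mentions that there could be multiple links, here is a sketch that wraps the XPath variant in a function and runs it on a two-link string. The function name htmlToTextWithLinks and the sample text are assumptions, and it reads textContent from the root element instead of calling strip_tags on the saved HTML, which is a swap from the code above:

```php
<?php
// Hypothetical wrapper around the XPath variant above.
function htmlToTextWithLinks(string $html): string
{
    $dom = new DOMDocument;
    $dom->loadHTML($html);
    $xpath = new DOMXPath($dom);

    // Replace every link that has an href with "text (href)".
    foreach ($xpath->query('//a[@href]') as $node) {
        $text = sprintf('%s (%s)', $node->nodeValue, $node->getAttribute('href'));
        $node->parentNode->replaceChild($dom->createTextNode($text), $node);
    }

    // textContent of the root element is the text with all tags dropped.
    return trim($dom->documentElement->textContent);
}

// Assumed sample with two links.
$html = 'We had <i>fun</i>. Look at <a href="http://example.com">this photo</a> of <a href="http://example.org">Joe</a>';
echo htmlToTextWithLinks($html);
// We had fun. Look at this photo (http://example.com) of Joe (http://example.org)
```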
I've done things like this using variations of substring and replace. I'd probably use regex today but you wanted an alternative so:
For the <i> tags, I'd do something like:
$text = str_replace("<i>", "", $text);
$text = str_replace("</i>", "", $text);
(My PHP is really rusty; str_replace is the built-in function for this, but the idea is what I'm sharing.)
The <a> tag is a bit more tricky, but it can be done. You need to find where the opening <a tag starts and where its closing > is, cut that span out, and then remove the closing </a>.
That might go something like:
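$start = strrpos( $text, "<a" ); // start of the last opening <a tag
$end = strpos( $text, ">", $start ); // end of that opening tag
$text = substr( $text, 0, $start ) . substr( $text, $end + 1 ); // cut the opening tag out
$text = str_replace( "</a>", "", $text ); // drop the closing tag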
(I don't know if this will work, again the idea is what I want to communicate. I hope the code fragments help but they probably don't work "out of the box". There are also a lot of possible bugs in the code snippets depending on your exact implementation and environment)
Reference:
strrpos - http://www.php.net/manual/en/function.strrpos.php
replace - http://www.php.net/manual/en/function.str-replace.php
substr - http://php.net/manual/en/function.substr.php
It's also very easy to do with a parser:
# available from http://simplehtmldom.sourceforge.net
include('simple_html_dom.php');
# parse and echo
$html = str_get_html('We had <i>fun</i>. Look at <a href="http://example.com">this photo</a> of Joe');
$a = $html->find('a');
$a[0]->outertext = "{$a[0]->innertext} ( {$a[0]->href} )";
echo strip_tags($html);
And that produces the output you want for your test case.
