PHP Regex, remove string from another one if expression is valid


There are many questions like this one, but I cannot find an exact answer, and I am unfamiliar with regular expressions.
PHP 7: I want to check whether $str contains HTML code whose href refers to the URL "website.fr", like '*****'.
I used the pattern <\b[a>]\S*\<\/a> but it is not working.
Any help, please.

This regexp matches an a element with an href attribute that refers to a website.fr URL:
<a[^>]*\shref="([^"]*\.)?website\.fr([.\/][^"]*)?"
Explanation:
<a[^>]*: the beginning of an anchor tag
\shref=": ...followed by the opening of an href attribute
([^"]*\.)?: the URL may begin with anything except a quote, ending with a dot
website\.fr: your website
([.\/][^"]*)?: the URL may end with a dot or a slash followed by anything except a quote
This regexp may not cover all cases (for example, a URL containing a quote). Parsing HTML with regexes is generally discouraged; it is better to use an XML parser.
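To see how the pattern described above behaves, here is a minimal sketch that runs it through preg_match. The sample strings are made up for illustration and are not from the question; the pattern uses ~ as the delimiter so the slash needs no escaping, and the i flag is an added assumption to tolerate uppercase tags.

```php
<?php
// Pattern from the explanation above, with '~' delimiters and a case-insensitive flag.
$pattern = '~<a[^>]*\shref="([^"]*\.)?website\.fr([./][^"]*)?"~i';

// Hypothetical sample inputs, not taken from the question.
$samples = [
    '<a href="http://www.website.fr/page">link</a>',
    '<a class="x" href="website.fr">link</a>',
    '<a href="http://example.com/">link</a>',
];

foreach ($samples as $html) {
    // preg_match returns 1 on a match, 0 on no match.
    echo preg_match($pattern, $html) === 1 ? "found\n" : "not found\n";
}
```

The first two samples should match and the third should not. The pattern still only looks at double-quoted href values, so it remains a rough check rather than a robust one.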

In general, parsing HTML with regex is a bad idea (see this question). In PHP, you can use DOMDocument and DOMXPath to search for elements with specific attributes in an HTML document. Something like this, which searches for an <a> element somewhere in the HTML which has an href value containing the string 'website.fr/':
$html = '*****';
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
if (count($xpath->query("//a[contains(@href, 'website.fr/')]")))
    echo "found";
else
    echo "not found";
Demo on 3v4l.org
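A substring test on href also accepts hosts such as notwebsite.fr. If that matters, one possible refinement is to compare the host part of each link instead. The sketch below is only an illustration under that assumption; the helper name hasLinkTo and the sample markup are hypothetical.

```php
<?php
// Hypothetical helper: true if $html contains an <a> whose href host is
// $domain itself or one of its subdomains.
function hasLinkTo(string $html, string $domain): bool
{
    $doc = new DOMDocument();
    // @ suppresses the warnings libxml raises on imperfect real-world HTML.
    @$doc->loadHTML($html);

    foreach ($doc->getElementsByTagName('a') as $a) {
        $host = parse_url($a->getAttribute('href'), PHP_URL_HOST);
        if (!is_string($host)) {
            continue; // relative or malformed URL, no host to compare
        }
        $host = strtolower($host);
        // Accept an exact match, or a host that ends with ".$domain".
        if ($host === $domain || substr($host, -strlen($domain) - 1) === '.' . $domain) {
            return true;
        }
    }
    return false;
}

var_dump(hasLinkTo('<p><a href="http://www.website.fr/page">x</a></p>', 'website.fr'));
var_dump(hasLinkTo('<p><a href="http://notwebsite.fr/">x</a></p>', 'website.fr'));
```

Links without a scheme (for example href="website.fr") have no host according to parse_url, so this version skips them; whether that is acceptable depends on the input.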

Related

How to get text between div tags that contain class, style etc attributes before id attribute. I need to use regular expression

Hi, I'm using this regular expression to get the text inside the div with id "test":
<div id = "test">text</div>
$regex = "#\<div id=\"test\"\>(.+?)\<\/div\>#s";
But if the scenario changes, e.g.
<div class="testing" style="color:red" .... more attributes and id="test">text</div>
or
<div class="testing" ...some attributes... id="test".... some attributes....>text</div>
or
<div id="test" .........any number of attributes>text</div>
then the above regex cannot extract the text between the div tags. In the first case, more attributes are placed in front of the id attribute, i.e. id is the last attribute, and the above regex doesn't work. In the second case the id attribute sits between other attributes, and in the third case it is the first attribute of the div tag.
Can I have a regex that matches all three cases above, so as to extract the text between the div tags by specifying the ID only? I have to use regex only :(
Please help.
Thank you.

I would strongly recommend an HTML parser to save yourself from the never-ending grief of trying to write a regular expression to parse HTML/XML.

I suggest you obtain that DOM element via XPath. The XPath expression for that element is:
//div[@class="testing"]
All this can be done with the PHP DOMDocument extension or alternatively with the SimpleXML extension. Both ship with PHP in 99.9% of installations, just like the regular expression extension. Some rough example code (demo):
$doc = new DOMDocument();
@$doc->loadHTML($html); // @ suppresses warnings on imperfect HTML
echo simplexml_import_dom($doc)
    ->xpath('//div[@class="testing"]')[0];
XPath is a specialized language for querying elements and data from XML documents, whereas regular expressions are a language for simpler strings.
Edit: Same for ID: http://codepad.viper-7.com/h1FlO0
//div[@id="test"]
I guess you will understand quite quickly how these simple XPath expressions work.
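As a rough illustration of why the id-based XPath query does not care about attribute order, the sketch below runs it against three fragments modelled on the cases in the question. The helper name divTextById and the exact sample markup are assumptions made for this example.

```php
<?php
// Hypothetical helper: return the text of the first <div> with the given id, or null.
function divTextById(string $html, string $id): ?string
{
    $doc = new DOMDocument();
    @$doc->loadHTML($html); // @ silences libxml warnings on fragments
    $xpath = new DOMXPath($doc);
    // Assumes $id contains no double quote; XPath 1.0 has no escape syntax for it.
    $nodes = $xpath->query('//div[@id="' . $id . '"]');
    return $nodes->length > 0 ? $nodes->item(0)->textContent : null;
}

// Sample inputs: id last, id in the middle, id first.
$cases = [
    '<div class="testing" style="color:red" id="test">text</div>',
    '<div class="testing" id="test" style="color:red">text</div>',
    '<div id="test" class="testing">text</div>',
];

foreach ($cases as $html) {
    echo divTextById($html, 'test'), "\n";
}
```

Each of the three fragments should print "text", since the parser normalizes the attributes before the query runs.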

Here's the answer with DOM (kind of crudish but works)
$aPieceOfHTML = '<div class="testing" id="test" style="color:red">This is my text blabla</div>';
$doc = new DOMDocument();
$doc->loadHTML($aPieceOfHTML);
$div = $doc->getElementsByTagName("div");
$mytext = $div->item(0)->nodeValue;
echo $mytext;
Here's the Cthulhu way:
$regex = '/(?<=id\=\"test\"\>).*(?=\<\/div\>)/';
DISCLAIMER
By no means do I guarantee this will work in every case (far from it). In fact, this will fail if:
id="test" is not the last tag attribute
there is a space (or anything) between id="test" and the closing >
the div tag is not properly closed with </div>
the tags are written in uppercase
the tag attributes are written in uppercase
I don't know... this will probably fail in more cases
I could try to write a more complex regex, but I don't think I could come up with something much better than this. Besides, it seems like a waste of time when PHP has other built-in tools that parse HTML so much better.

I don't know if you still need this, but the regex below works for all of the given scenarios in your question.
(!?(<.*?>)|[^<]+)\s*
https://regex101.com/r/DAObw0/1
The matching groups can be accessed with:
const [_, group1, group2] = myRegex.exec(input)

PHP preg_match to find and locate a dynamic URL from HTML Pages

I need help with a regex that will find a link that comes in different formats depending on how it was inserted into the HTML page.
I am able to read the pages into PHP; I just can't work out the right regex to find the URLs and isolate them.
I have a few examples of how they are inserted. Sometimes they are plain-text links, and some have tags wrapped around them. On the odd occasion, text that is not part of the link even gets inserted without spacing.
The Article ID and Article Key are different for every link. The Article Key, however, always ends with a digit. If this is possible I sure could use the help. Thanks.
Here are a few examples.
http://www.example.com/ArticleDetails.aspx?ArticleID=3D10045411&AidKey=3D-2086622941
http://example.com/ArticleDetails.aspx?ArticleID=10919199&AidKey=1956996566
http://www.example.com/ArticleDetails.aspx?ArticleID=10773616&AidKey=1998267392
This is a link description
http://example.com/ArticleDetails.aspx?ArticleID=10975137&AidKey=701321736this is not part of the url.
In the end I am just looking for the URL.
http://example.com/ArticleDetails.aspx?ArticleID=10975137&AidKey=701321736

DO NOT USE A REGEX! Use an XML parser...
$dom = new DOMDocument();
@$dom->loadHTMLFile($pathToFile); // @ suppresses warnings on imperfect HTML
$finder = new DOMXPath($dom);
$anchors = $finder->query('//a[@href]');
foreach ($anchors as $anchor) {
    $href = $anchor->getAttribute('href');
    if (preg_match($regexToMatchUrls, $href)) {
        // do stuff
    }
}
So $regexToMatchUrls would be a regex just to match the URLs you are looking for, not any of the HTML, which is much simpler. Then you can take action when a match occurs.

This regex works for me:
/http:\/\/(www\.)?example\.com\/ArticleDetails.aspx\?ArticleID=(.*?)(\&|\&amp;)AidKey=([\d\w-]*)/g
UPDATE:
I added a \d at the end of the regex.
/http:\/\/(www\.)?example\.com\/ArticleDetails.aspx\?ArticleID=(.*?)(\&|\&amp;)AidKey=([\d\w-]*)\d/g
To use it in PHP you need /.../msi
PHP example in action: http://ideone.com/N0TKM
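For reference, here is a hedged sketch of how a pattern like the one above might be used from PHP with preg_match_all. It is an adaptation, not the answer's exact code: the delimiter is changed to ~, the dot in ArticleDetails.aspx is escaped, and the separator accepts either & or &amp;. The sample text is modelled on the question's examples.

```php
<?php
// Adapted from the regex above; '~' delimiters avoid escaping every '/'.
// The trailing \d makes the match end on the last digit of the AidKey.
$pattern = '~http://(www\.)?example\.com/ArticleDetails\.aspx\?ArticleID=(.*?)(&amp;|&)AidKey=([\w-]*)\d~i';

// Sample text: one plain link with trailing junk, one with an encoded ampersand.
$text = 'Plain: http://example.com/ArticleDetails.aspx?ArticleID=10975137&AidKey=701321736this is not part of the url. '
      . 'Encoded: http://www.example.com/ArticleDetails.aspx?ArticleID=10773616&amp;AidKey=1998267392';

preg_match_all($pattern, $text, $matches);

// $matches[0] holds the full matched URLs.
foreach ($matches[0] as $url) {
    echo $url, "\n";
}
```

Because the key is matched greedily and then backed off to the last digit, trailing text that itself contains digits would still be swallowed; the sketch only covers the shapes shown in the question.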

php - preg_match string not within the href attribute

I find regex kind of confusing, so I got stuck on this problem:
I need to insert <b> tags around certain keywords in a given text. The problem is that if the keyword is within an href attribute, the result is a broken link.
The code goes like this:
$text = preg_replace('/(\b'.$keyword.'\b)/i','<b>\1</b>',$text);
So for cases like
this keyword here
I end up with:
this <b>keyword</b> here
I tried all sorts of combinations but I still couldn't get the right pattern.
Thanks!

You can't do that with regex alone. Regexes are powerful, but they can't parse a recursive grammar like HTML.
Instead, you should properly parse the HTML using an existing HTML parser. You just echo the HTML as-is until you encounter a text node; in that case, you run your preg_replace on the text before echoing it.
If your HTML is valid XHTML, you can use the xml_parse function. If it's not, use whatever HTML parser is available.
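One way to follow the idea above with PHP's DOM extension is to visit only the text nodes and wrap the keyword there, so attribute values such as href are never touched. The sketch below is an illustration under assumptions: the helper name boldKeyword is made up, the input is wrapped in a temporary div so that fragments parse as a single tree, and the input is assumed to be plain ASCII or otherwise safe for loadHTML's default encoding handling.

```php
<?php
// Hypothetical helper: wrap $keyword in <b> inside text nodes only.
function boldKeyword(string $html, string $keyword): string
{
    $doc = new DOMDocument();
    // The wrapper <div> gives the fragment a single root element.
    @$doc->loadHTML('<div>' . $html . '</div>', LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
    $xpath = new DOMXPath($doc);
    $pattern = '/(\b' . preg_quote($keyword, '/') . '\b)/i';

    // Copy the node list first, because the loop changes the tree.
    foreach (iterator_to_array($xpath->query('//text()')) as $textNode) {
        $parts = preg_split($pattern, $textNode->nodeValue, -1, PREG_SPLIT_DELIM_CAPTURE);
        if (count($parts) < 2) {
            continue; // keyword not present in this text node
        }
        $fragment = $doc->createDocumentFragment();
        foreach ($parts as $i => $part) {
            // Odd indexes hold the captured keyword, even indexes the surrounding text.
            $node = $doc->createTextNode($part);
            if ($i % 2 === 1) {
                $b = $doc->createElement('b');
                $b->appendChild($node);
                $node = $b;
            }
            $fragment->appendChild($node);
        }
        $textNode->parentNode->replaceChild($fragment, $textNode);
    }

    // Serialize the children of the wrapper <div>, dropping the wrapper itself.
    $out = '';
    foreach ($doc->documentElement->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

echo boldKeyword('<a href="keyword.php">this keyword here</a>', 'keyword'), "\n";
```

The href value is left alone because attributes are not text nodes, which is the property the plain preg_replace approach lacks.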

You can use preg_replace again after the first replacement to remove b tags from href:
$text=preg_replace('#(href="[^"]*)<b>([^"]*)</b>#i',"$1$2",$text);

Yes, you can use regex like that, but the code might become a little convoluted.
Here is a quick example:
$string = '<a href="keyword.php">link text with keyword and stuff</a>';
$keyword = 'keyword';
$text = preg_replace(
    '/(<a href=")('.$keyword.')(.php">)(.*)(<\/a>)/',
    "$1$2$3<b>$4</b>$5",
    $string
);
echo $string."\n";
echo $text."\n";
The contents of each () group are stored in the variables $1, $2 ... $n, so I don't have to type them again. The match can also be made more generic to match different kinds of URL syntax if needed.
Seeing this solution, you might want to rethink the way you plan to match keywords in your code. :)
Output:
<a href="keyword.php">link text with keyword and stuff</a>
<a href="keyword.php"><b>link text with keyword and stuff</b></a>

str_replace within certain html tags only

I have an HTML page loaded into a PHP variable and am using str_replace to swap certain words for other words. The only problem is that if one of these words appears in an important piece of code, the whole thing falls to bits.
Is there any way to apply the str_replace function only within certain HTML tags? In particular: p, h1, h2, h3, h4, h5
EDIT:
The bit of code that matters:
$yay = str_ireplace($find, $replace , $html);
cheers and thanks in advance for any answers.
EDIT - FURTHER CLARIFICATION:
$find and $replace are arrays containing words to be found and replaced (respectively). $html is the string containing all the html code.
A good example of it falling to bits would be finding and replacing a word that occurs in, e.g., the domain name. Say I wanted to replace the word 'hat' with 'cheese'. Any occurrence of an absolute path like
www.worldofhat.com/images/monkey.jpg
would be replaced with:
www.worldofcheese.com/images/monkey.jpg
So if the replacements could only occur in certain tags, this could be avoided.

Do not treat the HTML document as a mere string. Like you already noticed, tags/elements (and how they are nested) have meaning in an HTML page and thus, you want to use a tool that knows what to make of an HTML document. This would be DOM then:
Here is an example. First some HTML to work with
$html = <<< HTML
<body>
<h1>Germany reached the semi finals!!!</h1>
<h2>Germany reached the semi finals!!!</h2>
<h3>Germany reached the semi finals!!!</h3>
<h4>Germany reached the semi finals!!!</h4>
<h5>Germany reached the semi finals!!!</h5>
<p>Fans in Germany are totally excited over their team's 4:0 win today</p>
</body>
HTML;
And here is the actual code you would need to make Argentina happy
$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$nodes = $xpath->query('//*[self::h1 or self::h2 or self::p]');
foreach( $nodes as $node ) {
$node->nodeValue = str_replace('Germany', 'Argentina', $node->nodeValue);
}
echo $dom->saveHTML();
Just add the tags whose content you want to replace to the XPath query. An alternative to using XPath would be to use DOMDocument::getElementsByTagName, which you might know from JavaScript:
$nodes = $dom->getElementsByTagName('h1');
In fact, if you know it from JavaScript, you might know a lot more of it, because DOM is actually a language agnostic API defined by the W3C and implemented in many languages. The advantage of XPath over getElementsByTagName is obviously that you can query multiple nodes in one go. The drawback is, you have to know XPath :)
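Assigning to nodeValue on an element replaces all of its children with plain text, so nested tags inside a paragraph would be lost. A possible variation, sketched here with made-up sample markup and the $find and $replace arrays from the question, is to update only the text nodes below the chosen tags. That leaves nested elements and attribute values such as image URLs untouched.

```php
<?php
// Hypothetical sample: "hat" appears both in visible text and inside a URL.
$html = '<p>My hat <img src="http://www.worldofhat.com/images/monkey.jpg"></p>';
$find = ['hat'];
$replace = ['cheese'];

$dom = new DOMDocument();
@$dom->loadHTML($html); // @ silences libxml warnings on fragments
$xpath = new DOMXPath($dom);

// Select only the text nodes inside the tags named in the question.
$query = '//p//text() | //h1//text() | //h2//text() | //h3//text() | //h4//text() | //h5//text()';
foreach ($xpath->query($query) as $text) {
    // Only the text content changes, so the tree structure stays intact.
    $text->nodeValue = str_ireplace($find, $replace, $text->nodeValue);
}

$result = $dom->saveHTML();
echo $result;
```

In this sample the visible word becomes "cheese" while the src attribute keeps worldofhat.com, because attributes are not text nodes.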

How to remove text between tags in php?

Despite using PHP for years, I've never really learnt how to use expressions to truncate strings properly... which is now biting me in the backside!
Can anyone provide me with some help truncating this? I need to chop out the text portion from the url, turning
text
into

$str = preg_replace('#(<a.*?>).*?(</a>)#', '$1$2', $str)

Using SimpleHTMLDom:
<?php
// example of how to modify anchor innerText
include('simple_html_dom.php');
// get DOM from URL or file
$html = file_get_html('http://www.example.com/');
// set innertext to an empty string for each anchor
foreach($html->find('a') as $e) {
    $e->innertext = '';
}
// dump contents
echo $html;
?>

What about something like this, considering you might want to re-use it with other hrefs :
$str = 'text';
$result = preg_replace('#(<a[^>]*>).*?(</a>)#', '$1$2', $str);
var_dump($result);
Which will get you :
string '' (length=24)
(I'm considering you made a typo in the OP ? )
If you don't need to match any other href, you could use something like :
$str = 'text';
$result = preg_replace('#().*?()#', '$1$2', $str);
var_dump($result);
Which will also get you :
string '' (length=24)
As a side note: for more complex HTML, don't try to use regular expressions. They work fine for this kind of simple situation, but they generally don't help much with real-life HTML, which is not quite "regular" enough to be parsed by regexes.

You could use substr in combination with strpos, even though this is not
a very nice approach.
Check: PHP Manual - String functions
Another way would be to write a regular expression to match your criteria.
But in order to get your problem solved quickly the string functions will do...
EDIT: I underestimated the audience. ;) Go ahead with the regexes... ^^
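For completeness, a minimal sketch of the string-function approach mentioned above might look like this. The helper name removeLinkText and the sample link are assumptions, and the code only handles a string that contains a single simple anchor.

```php
<?php
// Hypothetical helper: drop the text between the first '>' and the closing </a>.
function removeLinkText(string $str): string
{
    $open = strpos($str, '>');       // end of the opening <a ...> tag
    $close = stripos($str, '</a>');  // start of the closing tag
    if ($open === false || $close === false || $close < $open) {
        return $str; // not the expected shape, leave it untouched
    }
    // Keep everything up to and including '>', then everything from '</a>' on.
    return substr($str, 0, $open + 1) . substr($str, $close);
}

echo removeLinkText('<a href="http://www.example.com/">text</a>'), "\n";
```

This breaks as soon as an attribute value contains a > or the string holds more than one link, which is why the other answers lean on regexes or a DOM parser.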

You don't need to capture the tags themselves. Just target the text between the tags and replace it with an empty string. Super simple.
Demo of both techniques
Code:
$string = 'text';
echo preg_replace('/<a[^>]*>\K[^<]*/', '', $string);
//                  ^^^^^^^^ the opening tag
//                          ^^ \K restarts the fullstring match
//                            ^^^^^ match everything before the end tag
Output:
Or in fringe cases when the link text contains a <, use this: ~<a[^>]*>\K.*?(?=</a>)~
This avoids the expense of capture groups using a lazy quantifier, the fullstring restarting \K and a "lookahead".
Older & wiser:
If you are parsing valid html, you should use a dom parser for stability/accuracy. Regex is DOM-ignorant, so if there is a tag attribute value containing a >, my snippet will fail.
As a narrowly suited domdocument solution to offer some context:
$dom = new DOMDocument;
$dom->loadHTML($string, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD); // flags keep loadHTML from adding a DOCTYPE and html/body wrappers
$dom->getElementsByTagName('a')[0]->nodeValue = '';
echo $dom->saveHTML();

Just use strip_tags(); it gets rid of the tags and leaves only the desired text between them.
