PHP regular expression help?

I'm having trouble writing a regular expression that matches only a div with the class name "classBig1" that has a single anchor link as its child.
Here is my code, but it doesn't work:
preg_match_all ("/<div class=\"headline9\"><a[\s]+[^>]*?href[\s]?=[\s\"\']+".
"(.*?)[\"\']+.*?>"."([^<]+|.*?)?<\/a></div>/",
$var, &$matches);
//example HTML: <div class="classBig1">Go Index99</div>

If the HTML is as well formed as your example then the following regex is enough to solve your problem:
<div class="classBig1"><a .*?</div>
The full PHP code would be:
preg_match_all('%<div class="classBig1"><a .*?</div>%', $html,
$result, PREG_PATTERN_ORDER);
for ($i = 0; $i < count($result[0]); $i++) {
$match = $result[0][$i];
}

I guess you used the wrong class name in your code; I'll assume it should be "classBig1", so please take a look at the pattern I have given.
I believe:
You only want the "DIV" elements that have the class "classBig1".
These "DIVs" should contain only one "A" tag.
If so, feel free to use this piece of code :-).
It worked for me when I tried it with some sample HTML.
Pattern:
"/<div class=\"classBig1\"><a (.*)<\/a><\/div>/"
Hope it helps.
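A minimal sketch of how the patterns above could be tightened, assuming the markup is as regular as in the question. The sample $html string and its href values are invented for illustration. It uses [^>]* and [^<]* instead of .* so that a div holding anything other than a single anchor is skipped:

```php
<?php
// Hypothetical sample input; the href values are invented for this sketch.
$html = '<div class="classBig1"><a href="index99.html">Go Index99</a></div>'
      . '<div class="classBig1"><a href="a.html">A</a><a href="b.html">B</a></div>'
      . '<div class="other"><a href="c.html">C</a></div>';

// [^>]* keeps the attribute match inside the <a ...> tag, and [^<]* only
// allows plain text before </a></div>, so a second anchor breaks the match.
$pattern = '%<div class="classBig1"><a\s[^>]*href=["\']([^"\']*)["\'][^>]*>([^<]*)</a></div>%i';

preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);

foreach ($matches as $m) {
    // $m[1] is the href value, $m[2] is the link text.
    echo $m[1], ' => ', $m[2], "\n";
}
```

Only the first div in the sample satisfies the pattern; the div with two anchors and the div with a different class are ignored.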

Related

Regex to select url except when = is directly infront of it

I'm trying to use a regex to find and replace all URLs in a forum system. This works but it also selects anything that is within bbcode. This shouldn't be happening.
My code is as follows:
<?php
function make_links_clickable($text){
return preg_replace('!(([^=](f|ht)tp(s)?://)[-a-zA-Zа-яА-Я()0-9#:%_+.~#?&;//=]+)!i', '$1', $text);
}
//$text = "https://www.mcgamerzone.com<br>http://www.mcgamerzone.com/help/support<br>Just text<br>http://www.google.com/<br><b>More text</b>";
$text = "#Theareak We know this and [b][url=https://www.mcgamerzone.com/news/67/False-positive-proxy-bans-and-bot-attacks]here[/url] [/b]is an explanation, we are trying to fix this asap! https://www.mcgamerzone.com/news/67/False-positive-proxy-bans-and-bot-attacks aaa";
echo "<b>Unparsed text:</b><br>";
echo $text;
echo "<br><br>";
echo "<b>Parsed text:</b><br>";
echo make_links_clickable($text);
?>
All URLs that occur in BBCode follow a = character, so I don't want anything that starts with = to be selected.
I basically have that working, but it selects one extra character in front of the string that should be selected.
I'm not very familiar with regex. The final output of my code is this:
<b>Unparsed text:</b><br>
#Theareak We know this and [b][url=https://www.mcgamerzone.com/news/67/False-positive-proxy-bans-and-bot-attacks]here[/url] [/b]is an explanation, we are trying to fix this asap! https://www.mcgamerzone.com/news/67/False-positive-proxy-bans-and-bot-attacks aaa<br>
<br>
<b>Parsed text:</b><br>
#Theareak We know this and [b][url=https://www.mcgamerzone.com/news/67/False-positive-proxy-bans-and-bot-attacks]here[/url] [/b]is an explanation, we are trying to fix this asap! https://www.mcgamerzone.com/news/67/False-positive-proxy-bans-and-bot-attacks aaa

You can match and skip [url=...] like this:
\[url=[^\]]*](*SKIP)(?!)|(((f|ht)tps?://)[-a-zA-Zа-яёЁА-Я()0-9#:%_+.\~#?&;/=]+)
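That way, you will only match the URLs outside the [url=...] tag. The (*SKIP)(?!) part makes the regex engine discard whatever the first alternative consumed and continue searching after it.
The full function: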
function make_links_clickable($text){
return preg_replace('~\[url=[^\]]*](*SKIP)(?!)|(((f|ht)tps?://)[-a-zA-Zа-яёЁА-Я()0-9#:%_+.\~#?&;/=]+)~iu', '$1', $text);
}
$text = "#Theareak We know this and [b][url=https://www.mcgamerzone.com/news/67/False-positive-proxy-bans-and-bot-attacks]here[/url] [/b]is an explanation, we are trying to fix this asap! https://www.mcgamerzone.com/news/67/False-positive-proxy-bans-and-bot-attacks aaa";
echo "<b>Parsed text:</b><br>";
echo make_links_clickable($text);

You can use a negative lookbehind (?<!=) instead of your negated class. It asserts that the text about to be matched is not preceded by an = sign, without making that character part of the match.
Example
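A minimal sketch of the lookbehind approach, reusing the function name from the question. The character class is trimmed to ASCII for brevity, and the <a> markup in the replacement is an assumption about the intended output, since the question does not show it:

```php
<?php
// Sketch: skip URLs that directly follow "=" (as inside [url=...]).
function make_links_clickable(string $text): string
{
    // (?<!=) asserts the character before the URL is not "=", without
    // consuming it, so no extra leading character ends up in the match.
    $pattern = '~(?<!=)\b(?:f|ht)tps?://[-a-zA-Z0-9()#:%_+.\~?&;/=]+~i';

    // The <a> markup here is an assumption about the intended output.
    return preg_replace($pattern, '<a href="$0">$0</a>', $text);
}

// Hypothetical input: one URL inside BBCode, one bare URL.
$text = '[url=https://example.com/news]here[/url] and https://example.com/help aaa';
echo make_links_clickable($text), "\n";
```

The URL inside [url=...] is left untouched because it is preceded by =, while the bare URL is wrapped in a link.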

regular expression for href from html page in php ?

I want to get all /farsi_persian/*/subtitle-*.aspx links from an HTML page.
I tried several regular expressions in PHP, but none of them found anything.
Can anyone help me?
<span class="r0" title="Rating 0 out of 10">Farsi/Persian</span> Arrow - Third Season subtitle<br><small> Arrow.S03E03.HDTV.480p.x264-LOL </small>

Try
/farsi_persian/[^/]+/subtitle-[^.]+\.aspx

Try using preg_match_all(); this might work:
preg_match_all('/\/farsi_persian\/[\w-]+\/subtitle-\d+\.aspx/', $str, $matches);
This assumes there are always digits after subtitle-.

$myPattern = "/farsi_persian/[^/]+/subtitle-[^.]+.aspx";
preg_match_all($myPattern,$myText,$match);
var_dump($match);
This shows null.
This one worked:
$myPattern = "/farsi_persian\/(.*.)\/subtitle-\d*.aspx/";
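The null result above is most likely a delimiter problem: PHP treats the leading "/" as the pattern delimiter, the pattern then ends at the next "/", and the rest is read as invalid modifiers. A minimal sketch with "~" as the delimiter, using invented href values since the real ones are not shown in the question:

```php
<?php
// Hypothetical sample markup; the href values are invented to fit the
// /farsi_persian/*/subtitle-*.aspx shape described in the question.
$str = '<a href="/farsi_persian/arrow-third-season/subtitle-12345.aspx">Arrow</a>'
     . '<a href="/english/arrow-third-season/subtitle-67890.aspx">Arrow</a>';

// "~" is the delimiter, so the slashes in the path need no escaping.
// The dot before "aspx" is escaped so it only matches a literal dot.
$pattern = '~/farsi_persian/[^/"]+/subtitle-[^."]+\.aspx~';

preg_match_all($pattern, $str, $matches);
print_r($matches[0]);
```

Only the /farsi_persian/ link is collected; the /english/ link does not match.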

PHP DomDocument to replace pattern

I need to find http links and turn them into hyperlinks. These http links are inside span tags.
$text holds the HTML page. One of the span tags contains something like
<span class="styleonetwo" >http://www.cnn.com/live-event</span>
Here is my code:
$doc = new DOMDocument();
$doc->loadHTML($text);
foreach($doc->getElementsByTagName('span') as $anchor) {
$link = $anchor->nodeValue;
if(substr($link, 0, 4) == "http")
{
$link = "$link";
}
if(substr($link, 0, 3) == "www")
{
$link = "$link";
}
$anchor->nodeValue = $link;
}
echo $doc->saveHTML();
It works OK. However, I want this to work even if the data inside the span is something like:
<span class="styleonetwo" > sometexthere http://www.cnn.com/live-event somemoretexthere</span>
Obviously the above code won't work for this situation. Is there a way to search and replace a pattern using DOMDocument without using preg_replace?
Update: To answer phil's question regarding preg_replace:
I used regexpal.com to test the following pattern matching:
\b(?:(?:https?|ftp|file)://|(www|ftp)\.)[-A-Z0-9+&##/%?=~_|$!:,.;]*[-A-Z0-9+&##/%=~_|$]
It works great in the regex tester provided by regexpal. When I use the same pattern in PHP code, I get lots of strange errors, including an "unknown modifier" error even for an escape character! Here is my code for preg_replace:
$httpRegex = '/\b(\?:(\?:https?|ftp|file):\/\/|(www|ftp)\.)[-A-Z0-9+&##/%\?=~_|$!:,.;]*[-A-Z0-9+&##/%=~_|$]/';
$cleanText = preg_replace($httpRegex, "<a href='$0'>$0</a>", $text);
I was so frustrated with the "unknown modifier" errors that I turned to DOMDocument to solve my problem.

Regular expressions suit this problem well, so it is better to use preg_replace.
Your pattern just has several unescaped delimiters in it (and the ? in (?: must not be escaped), so escape them or choose another character as the delimiter, for instance ^. Thus, the correct pattern would be:
$httpRegex = '^\b(?:(?:https?|ftp|file):\/\/|(www|ftp)\.)[-A-Z0-9+&##\/%\?=~_|$!:,.;]*[-A-Z0-9+&##\/%=~_|$]^i';
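A minimal sketch of the delimiter idea applied to the span from the question. It uses "~" as the delimiter, and the character classes are shortened for the sketch, so this is an assumption-laden simplification and not the full pattern above:

```php
<?php
// "~" does not occur in this simplified pattern, so "/" needs no escaping.
// The character classes are trimmed for the sketch; they are an assumption,
// not the full pattern from the question.
$httpRegex = '~\b(?:(?:https?|ftp)://|www\.)[-A-Z0-9+&#/%?=_|!:,.;]*[-A-Z0-9+&#/%=_|]~i';

// Sample span taken from the question.
$text = '<span class="styleonetwo" > sometexthere http://www.cnn.com/live-event somemoretexthere</span>';

// $0 is the whole matched URL.
$cleanText = preg_replace($httpRegex, '<a href="$0">$0</a>', $text);
echo $cleanText, "\n";
```

The surrounding text in the span is kept and only the URL is wrapped in an anchor.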

How can I find the rest of a word from a string within it in PHP?

Let's say I have a page I want to scrape for words with "ice" in them, how can I do this easily? I see a lot of scrapers breaking things down into source code, but I don't need this. I just need something that searches through the plain text on the webpage.
Edit: I basically need something to search for .jpeg and find the entire file name. (it is in plain text on the website, not hidden in a tag)

Anything that matches the following is a word with ice in it:
/(\w*)ice(\w*)/i
(Do note that \w matches 0-9 and _ too. The following, which only allows letters around "ice", might give better results: /\b[a-z]*ice[a-z]*\b/i)
UPDATE
To match file names (must not contain whitespace):
/\S+\.jpeg/i
Example:
<?php
$str = 'Picture of me: 238484534.jpeg and someone else img-of-someone.jpeg here';
$cnt = preg_match_all('/\S+\.jpeg/i', $str, $matches);
print_r($matches);

1. Do you want to search the text inside the HTML tags too, such as attribute values and names?
2. Or only the visible part of the webpage?
For #1: the solutions are simple and already given in the other answers.
For #2:
Use the PHP DOMDocument class to extract the text content, and search only in that.
Documentation:
http://php.net/manual/en/class.domdocument.php
See this question for an example:
PHP DOMDocument stripping HTML tags
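A minimal sketch of option 2, assuming the DOM extension is available. The sample page is invented for illustration. Note that textContent also includes the text of script and style elements if the page has any, so a real page may need extra filtering:

```php
<?php
// Hypothetical page used only for this sketch.
$html = '<html><body><p title="ice-hidden">Cold icecream and <b>lice</b>, nothing else.</p></body></html>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // keep malformed-HTML warnings quiet
$doc->loadHTML($html);
libxml_clear_errors();

// textContent holds only the text nodes, so attribute values such as
// title="ice-hidden" are not searched.
$body = $doc->getElementsByTagName('body')->item(0);
$visibleText = $body->textContent;

preg_match_all('/\w*ice\w*/i', $visibleText, $matches);
print_r($matches[0]);
```

The attribute value "ice-hidden" is never seen by the regex, because only the text between the tags is searched.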

Some regex use will be needed for this. Below I use PCRE http://www.php.net/manual/en/ref.pcre.php and the function preg_match_all http://www.php.net/manual/en/function.preg-match-all.php
<?php
$html = <<<EOF
<html>
<head>
<title>Test</title>
</head>
<body>List of files:
<ul>
<li>test1.jpeg</li>
<li>test2.jpeg</li>
</ul>
</body>
</html>
EOF;
$matches = array();
// Group 1 captures each file name; $count is the number of matches found.
$count = preg_match_all('/([0-9a-zA-Z_-]+\.jpeg)/', $html, $matches);
if ($count > 0) {
    foreach ($matches[1] as $filename) {
        print "Filename: {$filename}\n";
    }
}
?>

try this:
preg_match_all('/\w*ice\w*/', 'abc icecream lice', $matches);
print_r($matches);

Unable to use regex to search in PHP?

I'm trying to extract the content of specific tags from an HTML document.
My method works for some tags, but not all, and it does not work for the tag whose content I want.
Here is my code:
<html>
<head></head>
<body>
<?php
$url = "http://sf.backpage.com/MusicInstruction/";
$data = file_get_contents($url);
$pattern = "/<div class=\"cat\">(.*)<\/div>/";
preg_match_all($pattern, $data, $adsLinks, PREG_SET_ORDER);
var_dump($adsLinks);
foreach ($adsLinks as $i) {
echo "<div class='ads'>".$i[0]."</div>";
}
?>
</body>
</html>
The above code doesn't work, but it works when I change the $pattern to:
$pattern = "/<div class=\"date\">(.*)<\/div>/";
or
$pattern = "/<div class=\"sponsorBoxPlusImages\">(.*)<\/div>/";
I can't see any difference between these patterns. Please help me find the error.
Thanks.

Use PHP DOM to parse HTML instead of regex.
For example in your case (code updated to show HTML):
$doc = new DOMDocument();
// @ suppresses the warnings that loadHTML() raises on malformed real-world HTML.
@$doc->loadHTML(file_get_contents("http://sf.backpage.com/MusicInstruction/"));
$nodes = $doc->getElementsByTagName('div');
for ($i = 0; $i < $nodes->length; $i++)
{
    $x = $nodes->item($i);
    // Only print the divs whose class attribute is exactly "cat".
    if ($x->getAttribute('class') == 'cat') {
        echo htmlspecialchars($x->nodeValue) . "<hr/>"; //this is the element that you want
    }
}

The reason your regex fails is that you are expecting . to match newlines, and it won't unless you use the s modifier, so try
$pattern = "/<div class=\"cat\">(.*)<\/div>/s";
When you do this, you might find the pattern a little too greedy, as it will try to capture everything up to the last closing div element. To make it non-greedy, so that it matches only up to the very next closing div, add a ? after the *:
$pattern = "/<div class=\"cat\">(.*?)<\/div>/s";
This just serves to illustrate that for all but the simplest cases, parsing HTML with regexes is the road to madness. So try using DOM functions for parsing HTML.
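A minimal sketch comparing the three variants discussed above on an invented two-div string, to show how the s modifier and the lazy quantifier change the number of matches:

```php
<?php
// Hypothetical markup with newlines inside the divs.
$data = "<div class=\"cat\">\nfirst\n</div>\n<div class=\"cat\">\nsecond\n</div>";

// Without /s the dot stops at newlines, so nothing matches here.
$noS = preg_match_all('/<div class="cat">(.*)<\/div>/', $data, $m1);

// With /s but a greedy .*, one match runs to the last </div>.
$greedy = preg_match_all('/<div class="cat">(.*)<\/div>/s', $data, $m2);

// With /s and a lazy .*?, each div is matched separately.
$lazy = preg_match_all('/<div class="cat">(.*?)<\/div>/s', $data, $m3);

echo "no s: $noS, greedy: $greedy, lazy: $lazy\n";
```

Nested divs would still defeat the lazy version, which is one more reason to prefer the DOM functions for anything beyond simple markup.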
