I am having trouble writing a regular expression that matches only a div with the class name "classBig1" that has a single anchor link as its child.
Here is my code, but it doesn't work:
preg_match_all ("/<div class=\"headline9\"><a[\s]+[^>]*?href[\s]?=[\s\"\']+".
"(.*?)[\"\']+.*?>"."([^<]+|.*?)?<\/a></div>/",
$var, &$matches);
//example HTML: <div class="classBig1">Go Index99</div>
If the HTML is as well formed as your example then the following regex is enough to solve your problem:
<div class="classBig1"><a .*?</div>
The full PHP code would be:
preg_match_all('%<div class="classBig1"><a .*?</div>%', $html,
$result, PREG_PATTERN_ORDER);
for ($i = 0; $i < count($result[0]); $i++) {
$match = $result[0][$i];
}
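The regex above does not check that the anchor is the div's only child. If that matters, a DOM query can express it directly. This is only a minimal sketch: the sample markup, the href values and the variable names are hypothetical, since the anchor in the question's example HTML is not shown.

```php
<?php
// Hypothetical sample markup: only the first div has exactly one <a> child.
$html = '<div class="classBig1"><a href="index99.html">Go Index99</a></div>'
      . '<div class="classBig1"><a href="a.html">A</a><a href="b.html">B</a></div>'
      . '<div class="other"><a href="c.html">C</a></div>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// div elements with class "classBig1" that have exactly one element child, and that child is an <a>
$divs = $xpath->query('//div[@class="classBig1"][count(*) = 1 and a]');

$links = array();
foreach ($divs as $div) {
    $links[] = $div->getElementsByTagName('a')->item(0)->getAttribute('href');
}
print_r($links);
```

Note that @class="classBig1" compares the whole attribute value, so a div carrying additional classes would not be selected by this query.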
I guess you used the wrong class name in your code ("headline9"); I assume you meant "classBig1", so please take a look at the pattern I have given.
I believe:
You just want the "DIV" elements that have the class "classBig1".
Each of these "DIVs" should contain only one "A" tag.
If so, feel free to use this piece of code :-).
It worked for me when I tried it with some sample HTML.
Pattern:
"/<div class=\"classBig1\"><a (.*)<\/a><\/div>/"
Hope it helps.
I'm trying to use a regex to find and replace all URLs in a forum system. This works but it also selects anything that is within bbcode. This shouldn't be happening.
My code is as follows:
<?php
function make_links_clickable($text){
return preg_replace('!(([^=](f|ht)tp(s)?://)[-a-zA-Zа-яА-Я()0-9#:%_+.~#?&;//=]+)!i', '$1', $text);
}
//$text = "https://www.mcgamerzone.com<br>http://www.mcgamerzone.com/help/support<br>Just text<br>http://www.google.com/<br><b>More text</b>";
$text = "#Theareak We know this and [b][url=https://www.mcgamerzone.com/news/67/False-positive-proxy-bans-and-bot-attacks]here[/url] [/b]is an explanation, we are trying to fix this asap! https://www.mcgamerzone.com/news/67/False-positive-proxy-bans-and-bot-attacks aaa";
echo "<b>Unparsed text:</b><br>";
echo $text;
echo "<br><br>";
echo "<b>Parsed text:</b><br>";
echo make_links_clickable($text);
?>
All URLs that occur in BBCode follow an = character, so I don't want anything that is preceded by = to be selected.
I basically have that working, but it also selects one extra character in front of the string that should be selected.
I'm not very familiar with regex. The final output of my code is this:
<b>Unparsed text:</b><br>
#Theareak We know this and [b][url=https://www.mcgamerzone.com/news/67/False-positive-proxy-bans-and-bot-attacks]here[/url] [/b]is an explanation, we are trying to fix this asap! https://www.mcgamerzone.com/news/67/False-positive-proxy-bans-and-bot-attacks aaa<br>
<br>
<b>Parsed text:</b><br>
#Theareak We know this and [b][url=https://www.mcgamerzone.com/news/67/False-positive-proxy-bans-and-bot-attacks]here[/url] [/b]is an explanation, we are trying to fix this asap! https://www.mcgamerzone.com/news/67/False-positive-proxy-bans-and-bot-attacks aaa
You can match and skip [url=...] like this:
\[url=[^\]]*](*SKIP)(?!)|(((f|ht)tps?://)[-a-zA-Zа-яёЁА-Я()0-9#:%_+.\~#?&;/=]+)
That way, you will only match the URLs outside the [url=...] tag.
IDEONE demo:
function make_links_clickable($text){
return preg_replace('~\[url=[^\]]*](*SKIP)(?!)|(((f|ht)tps?://)[-a-zA-Zа-яёЁА-Я()0-9#:%_+.\~#?&;/=]+)~iu', '$1', $text);
}
$text = "#Theareak We know this and [b][url=https://www.mcgamerzone.com/news/67/False-positive-proxy-bans-and-bot-attacks]here[/url] [/b]is an explanation, we are trying to fix this asap! https://www.mcgamerzone.com/news/67/False-positive-proxy-bans-and-bot-attacks aaa";
echo "<b>Parsed text:</b><br>";
echo make_links_clickable($text);
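You can use a negative lookbehind (?<!=) instead of your negated class. It asserts that the text about to be matched is not immediately preceded by =, and because a lookbehind consumes no characters, the extra leading character is no longer part of the match.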
Example
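A minimal sketch of the lookbehind variant. The function name, the shortened sample text and the <a href="..."> replacement template are assumptions made for illustration; the replacement string in the question appears only as '$1'.

```php
<?php
// Hypothetical helper: wraps bare URLs in <a> tags, skipping URLs that directly follow "=".
function make_links_clickable_lookbehind($text)
{
    // (?<!=) is a zero-width check: the URL must not be preceded by "=",
    // and no extra character is consumed in front of it.
    $pattern = '~(?<!=)((?:f|ht)tps?://[-a-zA-Z0-9()#:%_+.\~?&;/=]+)~i';
    return preg_replace($pattern, '<a href="$1">$1</a>', $text);
}

$text = '[url=https://a.example/x]here[/url] see https://b.example/y now';
$parsed = make_links_clickable_lookbehind($text);
echo $parsed;
```

Only the second URL is wrapped here; the one inside [url=...] is left alone because it directly follows the = sign.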
I want to get all /farsi_persian/*/subtitle-*.aspx links from an HTML page.
I tried some regular expressions in PHP but could not find one that works.
Can you help me?
<span class="r0" title="Rating 0 out of 10">Farsi/Persian</span> Arrow - Third Season subtitle<br><small> Arrow.S03E03.HDTV.480p.x264-LOL </small>
Try
/farsi_persian/[^/]+/subtitle-[^.]+.aspx
Try using preg_match_all(); this might work:
preg_match_all('/\/farsi_persian\/[\w-]+\/subtitle-\d+\.aspx/', $str, $matches);
assuming there are always digits after subtitle-
$myPattern = "/farsi_persian/[^/]+/subtitle-[^.]+.aspx";
preg_match_all($myPattern,$myText,$match);
var_dump($match);
This shows null.
This worked:
$myPattern = "/farsi_persian\/(.*.)\/subtitle-\d*.aspx/";
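The null result above comes from the missing delimiters: PHP takes the first / as the delimiter, so everything after the second / is read as pattern modifiers and the call fails. A minimal sketch using ~ as the delimiter, so the slashes need no escaping. The sample markup and href values are hypothetical, since the real links are not shown in the question.

```php
<?php
// Hypothetical markup; only the /farsi_persian/.../subtitle-....aspx shape comes from the question.
$html = '<a href="/subtitles/farsi_persian/arrow-third-season/subtitle-1012345.aspx">Farsi/Persian</a>'
      . '<a href="/subtitles/english/arrow-third-season/subtitle-1012346.aspx">English</a>';

// "~" delimiters avoid escaping "/"; the dot before "aspx" is escaped so it matches literally.
$pattern = '~/farsi_persian/[^/"]+/subtitle-[^."]+\.aspx~';
preg_match_all($pattern, $html, $matches);
print_r($matches[0]);
```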
I need to find and replace http links to hyperlinks. These http links are inside span tags.
$text has html page. One of the span tags has something like
<span class="styleonetwo" >http://www.cnn.com/live-event</span>
Here is my code:
$doc = new DOMDocument();
$doc->loadHTML($text);
foreach($doc->getElementsByTagName('span') as $anchor) {
$link = $anchor->nodeValue;
if(substr($link, 0, 4) == "http")
{
$link = "$link";
}
if(substr($link, 0, 3) == "www")
{
$link = "$link";
}
$anchor->nodeValue = $link;
}
echo $doc->saveHTML();
It works OK. However, I want this to work even if the data inside the span is something like:
<span class="styleonetwo" > sometexthere http://www.cnn.com/live-event somemoretexthere</span>
Obviously, the above code won't work in this situation. Is there a way to search for and replace a pattern using DOMDocument without using preg_replace?
Update: To answer phil's question regarding preg_replace:
I used regexpal.com to test the following pattern matching:
\b(?:(?:https?|ftp|file)://|(www|ftp)\.)[-A-Z0-9+&##/%?=~_|$!:,.;]*[-A-Z0-9+&##/%=~_|$]
It works great in the regex tester provided on regexpal. When I use the same pattern in PHP code, I get tons of weird errors. I even got an "unknown modifier" error for an escape character! The following is my code for preg_replace:
$httpRegex = '/\b(\?:(\?:https?|ftp|file):\/\/|(www|ftp)\.)[-A-Z0-9+&##/%\?=~_|$!:,.;]*[-A-Z0-9+&##/%=~_|$]/';
$cleanText = preg_replace($httpRegex, "<a href='$0'>$0</a>", $text);
I was so frustrated with the "unknown modifier" errors that I turned to DOMDocument to solve my problem.
Regular expressions suit this problem well, so it is better to use preg_replace.
You just have several unescaped delimiters in your pattern, so escape them or choose another character as the delimiter, for instance ^. Thus, the correct pattern would be:
$httpRegex = '^\b(?:(?:https?|ftp|file):\/\/|(www|ftp)\.)[-A-Z0-9+&##\/%\?=~_|$!:,.;]*[-A-Z0-9+&##\/%=~_|$]^i';
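A minimal sketch of the preg_replace approach applied to the span from the question. It assumes ~ as the delimiter (escaped inside the character classes) and writes @ in the character classes, which is an assumption about the intended pattern; the replacement template is the one from the question's own code.

```php
<?php
$text = '<span class="styleonetwo" > sometexthere http://www.cnn.com/live-event somemoretexthere</span>';

// Single quotes keep PHP from interpolating "$"; "\~" is a literal "~" because "~" is the delimiter.
$httpRegex = '~\b(?:(?:https?|ftp|file)://|(?:www|ftp)\.)[-A-Z0-9+&@#/%?=\~_|$!:,.;]*[-A-Z0-9+&@#/%=\~_|$]~i';

// $0 is the whole match, so the URL becomes both the href and the link text.
$cleanText = preg_replace($httpRegex, '<a href=\'$0\'>$0</a>', $text);
echo $cleanText;
```

Because the pattern is applied to the raw HTML string, it would also rewrite URLs that sit inside attributes, so this sketch only fits input like the span above, where URLs appear in text.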
Let's say I have a page I want to scrape for words with "ice" in them, how can I do this easily? I see a lot of scrapers breaking things down into source code, but I don't need this. I just need something that searches through the plain text on the webpage.
Edit: I basically need something to search for .jpeg and find the entire file name. (it is in plain text on the website, not hidden in a tag)
Anything that matches the following is a word with ice in it:
/(\w*)ice(\w*)/i
(Do note that \w matches 0-9 and _ too. If only letters should count, the following might give better results: /\b[a-z]*ice[a-z]*\b/i)
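A minimal sketch of a letters-only word match, assuming that only ASCII letters count as part of a word. The sample sentence is made up for illustration.

```php
<?php
$text = 'The police served icecream and rice.';

// \b anchors the match at word boundaries; [a-z]* allows letters on either side of "ice".
preg_match_all('/\b[a-z]*ice[a-z]*\b/i', $text, $matches);
print_r($matches[0]);
```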
UPDATE
To match file names (must not contain whitespace):
/\S+\.jpeg/i
Example:
<?php
$str = 'Picture of me: 238484534.jpeg and someone else img-of-someone.jpeg here';
$cnt = preg_match_all('/\S+\.jpeg/i', $str, $matches);
print_r($matches);
1. Do you want to read the words inside the HTML tags too, such as attributes and tag names?
2. Or only the visible part of the webpage?
For #1: the solutions are simple and already given in the other answers.
For #2:
Use the PHP DOMDocument class, and extract and search the text content only.
Documentation here:
http://php.net/manual/en/class.domdocument.php
See this for an example:
PHP DOMDocument stripping HTML tags
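A minimal sketch of option #2, assuming the page is already available as a string. The sample markup is hypothetical; the pattern is the \S+\.jpeg one from the earlier answer.

```php
<?php
// Hypothetical page: one file name sits in an attribute, two sit in the visible text.
$html = '<html><body><p title="hidden.jpeg">Files: photo1.jpeg and <b>cat.jpeg</b></p></body></html>';

$doc = new DOMDocument();
$doc->loadHTML($html);

// textContent concatenates the text nodes only, so tags and attributes are excluded.
$visibleText = $doc->documentElement->textContent;

preg_match_all('/\S+\.jpeg/i', $visibleText, $matches);
print_r($matches[0]);
```

The file name in the title attribute is not reported, because attributes are not part of the text content.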
Some regex use will be needed for this. Below I use PCRE http://www.php.net/manual/en/ref.pcre.php and the function preg_match_all http://www.php.net/manual/en/function.preg-match-all.php
<?php
$html = <<<EOF
<html>
<head>
<title>Test</title>
</head>
<body>List of files:
<ul>
<li>test1.jpeg</li>
<li>test2.jpeg</li>
</ul>
</body>
</html>
EOF;
$matches = array();
// The pattern has no capture group, so the full matches are in $matches[0].
$count = preg_match_all('/[0-9a-zA-Z_-]+\.jpeg/', $html, $matches);
if ($count > 0) {
    foreach ($matches[0] as $filename) {
        print "Filename: {$filename}\n";
    }
}
?>
try this:
preg_match_all('/\w*ice\w*/', 'abc icecream lice', $matches);
print_r($matches);
I'm trying to get the HTML inside specific tags of a document.
My method works for some tags, but not all, and it does not work for the tag whose content I want to get.
Here is my code:
<html>
<head></head>
<body>
<?php
$url = "http://sf.backpage.com/MusicInstruction/";
$data = file_get_contents($url);
$pattern = "/<div class=\"cat\">(.*)<\/div>/";
preg_match_all($pattern, $data, $adsLinks, PREG_SET_ORDER);
var_dump($adsLinks);
foreach ($adsLinks as $i) {
echo "<div class='ads'>".$i[0]."</div>";
}
?>
</body>
</html>
The above code doesn't work, but it works when I change the $pattern into:
$pattern = "/<div class=\"date\">(.*)<\/div>/";
or
$pattern = "/<div class=\"sponsorBoxPlusImages\">(.*)<\/div>/";
I can't see any difference between these patterns. Please help me find the error.
Thanks.
Use PHP DOM to parse HTML instead of regex.
For example in your case (code updated to show HTML):
$doc = new DOMDocument();
// @ suppresses the warnings libxml raises on malformed real-world HTML
@$doc->loadHTML(file_get_contents("http://sf.backpage.com/MusicInstruction/"));
$nodes = $doc->getElementsByTagName('div');
for ($i = 0; $i < $nodes->length; $i++)
{
    $x = $nodes->item($i);
    if ($x->getAttribute('class') == 'cat') { // no stray ";" after the condition
        echo htmlspecialchars($x->nodeValue) . "<hr/>"; // this is the element that you want
    }
}
The reason your regex fails is that you are expecting . to match newlines, and it won't unless you use the s modifier, so try
$pattern = "/<div class=\"cat\">(.*)<\/div>/s";
When you do this, you might find the pattern a little too greedy, as it will try to capture everything up to the last closing div element. To make it non-greedy, so that it matches only up to the very next closing div, add a ? after the *
$pattern = "/<div class=\"cat\">(.*?)<\/div>/s";
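The effect of the s modifier and of the lazy quantifier can be seen on a small multi-line sample. This is a sketch only: the markup below is hypothetical and much simpler than the real page.

```php
<?php
// Hypothetical two-block sample with newlines inside each div.
$data = "<div class=\"cat\">\n<a href=\"/a\">First</a>\n</div>\n"
      . "<div class=\"cat\">\n<a href=\"/b\">Second</a>\n</div>";

// Without "s", "." stops at the newline right after the opening tag: no match.
$plain  = preg_match_all('/<div class="cat">(.*)<\/div>/', $data, $m1);

// With "s" but a greedy ".*", one match runs from the first <div> to the last </div>.
$greedy = preg_match_all('/<div class="cat">(.*)<\/div>/s', $data, $m2);

// With "s" and a lazy ".*?", each div is matched separately.
$lazy   = preg_match_all('/<div class="cat">(.*?)<\/div>/s', $data, $m3);

echo "$plain $greedy $lazy\n";
```

Even the lazy version stops at the first closing div it meets, so it breaks as soon as the divs are nested.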
This just serves to illustrate that for all but the simplest cases, parsing HTML with regexes is the road to madness. So try using DOM functions for parsing HTML.