I have a string that looks like:
">ANY CONTENT</span>(<a id="show
I need to fetch ANY CONTENT. However, there are spaces in between
</span> and (<a id="show
Here is my preg_match:
$success = preg_match('#">(.*?)</span>\s*\(<a id="show#s', $basicPage, $content);
\s* represents spaces. I get an empty array!
Any idea how to fetch CONTENT?
Use a real HTML parser. Regular expressions are not really suitable for the job. See this answer for more detail.
You can use DOMDocument::loadHTML() to parse into a structured DOM object that you can then query, like this very basic example (you need to do error checking though):
$dom = new DOMDocument;
$dom->loadHTML($data);
$span = $dom->getElementsByTagName('span');
$content = $span->item(0)->textContent;
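As a rough sketch of the error checking mentioned above, something like the following could work. The sample markup and the XPath expression are my own assumptions: the real page isn't shown, and the id in the question is cut off after show, hence starts-with().

```php
<?php
// Hypothetical markup modelled on the question; the real page is not shown.
$data = '<div><span class="label">ANY CONTENT</span>   (<a id="show_more" href="#">more</a>)</div>';

libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$dom = new DOMDocument;
if (!$dom->loadHTML($data)) {
    throw new RuntimeException('The HTML could not be parsed.');
}
libxml_clear_errors();

$xpath = new DOMXPath($dom);
// Nearest <span> that precedes a link whose id starts with "show".
$spans = $xpath->query('//a[starts-with(@id, "show")]/preceding-sibling::span[1]');

// item(0) would be null for an empty result, so check the length first.
$content = $spans->length > 0 ? $spans->item(0)->textContent : null;
var_dump($content);
```

Checking the length before calling item(0) avoids a fatal error when the page contains no matching span.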
I just had to make the opening
">
more specific. The page contains many "> sequences, so the regex started at the first one it found rather than the one I meant, and it returned everything from that point up to the (
Solution:
.">
Sample:
$success = preg_match('#\.">(.*?)</span>\s*\(<a id="show#s', $basicPage, $content);
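To illustrate why the lazy .*? did not help here, a small sketch with a made-up page fragment. The fragment and the [^<]* alternative are my assumptions, not something taken from the original page; [^<]* only works when the wanted text contains no nested tags.

```php
<?php
// Hypothetical page fragment with an earlier "> before the one we want.
$basicPage = '<a href="x">link</a><span class="c">ANY CONTENT</span>  (<a id="show1">';

// Original pattern: the match starts at the FIRST "> in the page.
preg_match('#">(.*?)</span>\s*\(<a id="show#s', $basicPage, $loose);

// [^<]* cannot cross a tag boundary, so only the "> right before the text fits.
preg_match('#">([^<]*)</span>\s*\(<a id="show#', $basicPage, $strict);

var_dump($loose[1], $strict[1]);
```

A lazy quantifier only shortens the end of a match; the regex engine still begins at the leftmost position where any match is possible.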
Let's use 3 string examples:
Example 1:
<div id="something">I have a really nice signature, it goes like this</div>
Example 2:
<div>I like balloons</div><div id="signature-xyz">Sent from my iPhone</div>
Example 3:
<div>I like balloons</div><div class="my_signature-xyz">Get iOS</div>
I'd like to remove the entire contents of the "signature" div in examples 2 and 3. Example 1 should not be affected. I don't know ahead of time what the div's exact class or ID will be, but I do know it will contain the string 'signature'.
I'm using the code below, which gets me halfway there.
$pm = "/signature/i";
if (preg_match($pm, $message, $matches) == 1) {
$message = preg_split($pm, $message, 2)[0];
}
What should I do to achieve the above? Thanks
You can build your code on the following sample:
$dom = new DOMDocument();
$dom->loadHTML($inputHTML);
$xpathsearch = new DOMXPath($dom);
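$nodes = $xpathsearch->query("//div[not(@*[contains(., 'signature')])]");
foreach($nodes as $node) {
    //do your stuff
}
Where the xpath:
//div[not(@*[contains(., 'signature')])]
will allow you to extract all div nodes for which there is no attribute that contains the string signature. The predicate @*[contains(., 'signature')] tests every attribute; contains(@*, 'signature') would only test the first one, because XPath 1.0 converts a node-set to the string value of its first node.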
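If the goal is to drop the signature divs from the message rather than to select the other ones, the same idea can be turned around. This is only a sketch: removeSignatureDivs() is a name I made up, and the match is case-sensitive, unlike the /signature/i regex in the question.

```php
<?php
// removeSignatureDivs() is a hypothetical helper name, not part of any library.
function removeSignatureDivs(string $html): string
{
    libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    // <div> elements with at least one attribute value containing "signature".
    $hits = $xpath->query("//div[@*[contains(., 'signature')]]");

    // Iterate over a copy so that removing nodes cannot disturb the loop.
    foreach (iterator_to_array($hits) as $div) {
        $div->parentNode->removeChild($div);
    }

    // loadHTML() wraps the fragment in <html><body>; serialize only the body's children.
    $out = '';
    foreach ($dom->getElementsByTagName('body')->item(0)->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }
    return $out;
}

echo removeSignatureDivs('<div>I like balloons</div><div id="signature-xyz">Sent from my iPhone</div>');
```

Example 1 is left alone because the word signature only appears in its text, not in an attribute value.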
Regex should never be used to parse HTML/XML/JSON, where the structure can in theory be nested to arbitrary depth.
Ref: Regular Expression Vs. String Parsing
I am using explode to manipulate information I am scraping from a website. I am trying to eliminate something specific from the string so that it will return what I want and also add the rest of the items to the array.
$pageArray = explode('<td class="player-label"><a href="/nfl/players/antonio-brown.php?type=overall&week=draft">', $fantasyPros);
I would like to skip the antonio-brown section and use a regular expression or whatever is best to replace it so that it will not look for a specific name but every name on the list and add them to my array. Do you have any suggestions on what I should use here? I appreciate any assistance.
Seems like a parser job to me with appropriate xpath functions, e.g. not().
Consider the following code:
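<?php
// Sample markup: the first href is the one from the question, the second is a made-up example.
$data = <<<DATA
<td class="player-label">
    <a href="/nfl/players/antonio-brown.php?type=overall&amp;week=draft">Some brown link here</a>
    <a href="/nfl/players/a-green-player.php?type=overall&amp;week=draft">Some green link here</a>
</td>
DATA;
$dom = new DOMDocument();
$dom->loadHTML($data);
$xpath = new DOMXPath($dom);
// every <a> whose href does not contain "antonio-brown"
$green_links = $xpath->query("//a[not(contains(@href, 'antonio-brown'))]");
foreach ($green_links as $link) {
    // do sth. useful here, e.g. print the href
    echo $link->getAttribute('href'), PHP_EOL;
}
?>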
This selects every link whose href does not contain antonio-brown.
You can easily adjust this to td or any other element.
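For the original goal (collecting every player rather than skipping one), a sketch along these lines might do. The table markup and the second player are invented for illustration; only the player-label class and the first href come from the question.

```php
<?php
// Hypothetical excerpt of the scraped page; the second player is made up.
$fantasyPros = '<table>'
    . '<tr><td class="player-label"><a href="/nfl/players/antonio-brown.php?type=overall&amp;week=draft">Antonio Brown</a></td></tr>'
    . '<tr><td class="player-label"><a href="/nfl/players/example-player.php?type=overall&amp;week=draft">Example Player</a></td></tr>'
    . '</table>';

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($fantasyPros);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$players = [];
// Every link inside a player-label cell, whatever the player's name is.
foreach ($xpath->query('//td[@class="player-label"]/a') as $link) {
    $players[] = trim($link->textContent);
}
print_r($players);
```

Because the query keys on the cell's class instead of a specific href, no player name has to be hard-coded.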
I'm currently working on a script to archive an imageboard.
I'm kinda stuck on making links reference correctly, so I could use some help.
I receive this string:
<a href="10028949#p10028949">>>10028949</a><br><br>who that guy???
In said string, I need to alter this part:
<a href="10028949#p10028949"
to become this:
<a href="#p10028949"
using PHP.
This part may appear more than once in the string, or might not appear at all.
I'd really appreciate it if you had a code snippet I could use for this purpose.
Thanks in advance!
Kenny
Disclaimer: as the comments will no doubt point out, a DOM parser is the better tool for parsing HTML.
That being said:
"/(<a[^>]*?href=")\d+(#[^"]+")/"
replaced by $1$2
So...
$myString = preg_replace("/(<a[^>]*?href=\")\d+(#[^\"]+\")/", "$1$2", $myString);
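A quick way to check the cases from the question (one link, several links, no link at all). The post bodies are made up, and class="quotelink" is just an assumed extra attribute to show that other attributes survive:

```php
<?php
// Hypothetical post bodies; class="quotelink" is an assumed extra attribute.
$posts = [
    '<a href="10028949#p10028949" class="quotelink">&gt;&gt;10028949</a><br><br>who that guy???',
    '<a href="1#p1">&gt;&gt;1</a> <a href="2#p2">&gt;&gt;2</a> two links',
    'no links at all',
];

$pattern = '/(<a[^>]*?href=")\d+(#[^"]+")/';

$fixed = [];
foreach ($posts as $post) {
    // preg_replace() rewrites every occurrence and returns the input unchanged if nothing matches.
    $fixed[] = preg_replace($pattern, '$1$2', $post);
}
print_r($fixed);
```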
try this
>>10028949<br><br>who that guy???
Although you have the question already answered I invite you to see what would (approximately xD) be the correct approach, parsing it with DOM:
// sample input, assuming the quote link looks like the one in the question
$string = '<a href="10028949#p10028949">&gt;&gt;10028949</a><br><br>who that guy???';
$dom = new DOMDocument();
$dom->loadHTML($string);
$links = $dom->getElementsByTagName('a'); // all the links, as a DOMNodeList (not an array)
foreach ($links as $link) {
    $href = $link->getAttribute('href'); // getting the href
    $cut = strpos($href, '#');
    if ($cut === false) {
        continue; // no # in this href, leave it alone
    }
    $new_href = substr($href, $cut); // keeping everything from the # onwards
    $link->setAttribute('href', $new_href); // setting the good href
}
$body = $dom->getElementsByTagName('body')->item(0); // selecting everything
$output = $dom->saveHTML($body); // passing it into a string (wrapped in <body>...</body>)
echo $output;
The advantages of doing it this way are:
More organized / Cleaner
Easier to read by others
You could, for example, have mixed links and only want to modify some of them. Using DOM you can actually select certain classes only
You can change other attributes as well, or the selected tag's siblings, parents, children, etc...
Of course you could achieve the last 2 points with regex as well but it would be a complete mess...
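As a sketch of the "certain classes only" point, here is how the selection could look with DOMXPath. The class names quotelink and external are assumptions on my part; the question doesn't show the real markup.

```php
<?php
// Hypothetical markup: only links carrying the (assumed) class "quotelink" should change.
$string = '<a class="quotelink" href="10028949#p10028949">&gt;&gt;10028949</a> '
        . '<a class="external" href="10028949#p10028949">keep me</a>';

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($string);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
// Matches "quotelink" as a whole word inside a space-separated class attribute.
$query = '//a[contains(concat(" ", normalize-space(@class), " "), " quotelink ")]';

foreach ($xpath->query($query) as $link) {
    $href = $link->getAttribute('href');
    $cut = strpos($href, '#');
    if ($cut !== false) {
        $link->setAttribute('href', substr($href, $cut));
    }
}

// Collect the resulting hrefs to see which links were touched.
$hrefs = [];
foreach ($dom->getElementsByTagName('a') as $a) {
    $hrefs[] = $a->getAttribute('href');
}
print_r($hrefs);
```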
I have a string of data that is set as $content, an example of this data is as follows
This is some sample data which is going to contain an image in the format <img src="http://www.randomdomain.com/randomfolder/randomimagename.jpg">. It will also contain lots of other text and maybe another image or two.
I am trying to grab just the <img src="http://www.randomdomain.com/randomfolder/randomimagename.jpg"> and save it as another string for example $extracted_image
I have this so far....
if( preg_match_all( '/<img[^>]+src\s*=\s*["\']?([^"\' ]+)[^>]*>/', $content, $extracted_image ) ) {
$new_content .= 'NEW CONTENT IS '.$extracted_image.'';
All it is returning is...
NEW CONTENT IS Array
I realise my attempt is probably completely wrong but can someone tell me where I am going wrong?
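Your first problem is that preg_match_all() (http://php.net/manual/en/function.preg-match-all.php) places an array into $matches, so you should be outputting the individual item(s) from the array. With the default flags, $extracted_image[0] is itself an array of every full match, so try $extracted_image[0][0] to start.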
You need to use a different function, if you only want one result:
preg_match() returns the first and only the first match.
preg_match_all() returns an array with all the matches.
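To make the difference concrete, here is a small sketch that runs the pattern from the question through both functions. The surrounding sample text is shortened, and $first and $all are just names I picked:

```php
<?php
$content = 'Some text <img src="http://www.randomdomain.com/randomfolder/randomimagename.jpg"> and more text.';
$pattern = '/<img[^>]+src\s*=\s*["\']?([^"\' ]+)[^>]*>/';

// preg_match(): $first[0] is the whole tag, $first[1] the captured src.
$extracted_image = preg_match($pattern, $content, $first) ? $first[0] : '';

// preg_match_all(): $all[0] lists every whole tag, $all[1] every captured src.
preg_match_all($pattern, $content, $all);

echo 'NEW CONTENT IS ' . $extracted_image . PHP_EOL;
print_r($all[1]);
```

Concatenating $all itself into a string is what produces the "Array" output from the question.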
Using regex to parse valid html is ill-advised. Because there can be unexpected attributes before the src attribute, because non-img tags can trick the regular expression into false-positive matching, and because attribute values can be quoted with single or double quotes, you should use a dom parser. It is clean, reliable, and easy to read.
Code: (Demo)
$string = <<<HTML
This is some sample data which is going to contain an image
in the format <img src="http://www.randomdomain.com/randomfolder/randomimagename.jpg">.
It will also contain lots of other text and maybe another image or two
like this: <img alt='another image' src='http://www.example.com/randomfolder/randomimagename.jpg'>
HTML;
$srcs = [];
$dom=new DOMDocument;
$dom->loadHTML($string);
foreach ($dom->getElementsByTagName('img') as $img) {
$srcs[] = $img->getAttribute('src');
}
var_export($srcs);
Output:
array (
0 => 'http://www.randomdomain.com/randomfolder/randomimagename.jpg',
1 => 'http://www.example.com/randomfolder/randomimagename.jpg',
)
How do I ignore HTML tags in this preg_replace?
I have a foreach function for a search, so if someone searches for "apple span" the preg_replace also applies a span to the span and the html breaks:
preg_replace("/($keyword)/i","<span class=\"search_hightlight\">$1</span>",$str);
Thanks in advance!
I think you should base your function on DOMDocument and DOMXPath rather than on regular expressions. Even though those are quite powerful, you run into problems like the one you describe, which are not (always) easy to solve robustly with regular expressions.
The general saying is: Don't parse HTML with regular expressions.
It's a good rule to keep in mind and, although like any rule it does not always apply, it's worth making up your own mind about it.
XPath allows you to find all texts that contain the search terms within text nodes only, ignoring all XML elements.
Then you only need to wrap those texts into the <span> and you're done.
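For the simple case where each keyword sits inside a single text node, a minimal sketch could look like this. The sample markup is invented, the search is case-sensitive (unlike the /i regex in the question), and $keyword is assumed to contain no double quote:

```php
<?php
// Hypothetical input: the word "span" also appears as a tag name and must not be touched there.
$str = '<p>An apple in a <span class="note">span</span> is still an apple.</p>';
$keyword = 'apple';

libxml_use_internal_errors(true);
$doc = new DOMDocument;
$doc->loadHTML($str);
libxml_clear_errors();

$xp = new DOMXPath($doc);
// Only text nodes are searched, so tag names and attributes are never matched.
$textNodes = $xp->query('//body//text()[contains(., "' . $keyword . '")]');

foreach (iterator_to_array($textNodes) as $text) {
    // Walk through every occurrence inside this text node.
    while (($pos = mb_strpos($text->data, $keyword, 0, 'UTF-8')) !== false) {
        $match = $text->splitText($pos);                          // $match starts at the keyword
        $rest  = $match->splitText(mb_strlen($keyword, 'UTF-8')); // $rest is what follows it

        $span = $doc->createElement('span');
        $span->setAttribute('class', 'search_hightlight');
        $match->parentNode->replaceChild($span, $match);
        $span->appendChild($match);

        $text = $rest; // continue after the wrapped keyword
    }
}

$html = $doc->saveHTML($doc->getElementsByTagName('p')->item(0));
echo $html;
```

This does not handle a search term that is spread over several tags; that case needs more machinery.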
Edit: Finally some code ;)
First it makes use of XPath to locate elements that contain the search text. My query looks like this (it could probably be written better, I'm not a super XPath pro):
'//*[contains(., "'.$search.'")]/*[FALSE = contains(., "'.$search.'")]/..'
$search contains the text to search for, not containing any " (quote) character (this would break it, see Cleaning/sanitizing xpath attributes for a workaround if you need quotes).
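On the quoting caveat: one common workaround is to build the XPath string literal with concat() when the value contains both kinds of quotes. A sketch, where xpathLiteral() is a helper name I made up:

```php
<?php
// xpathLiteral() is a hypothetical helper, not a PHP or DOM built-in.
function xpathLiteral(string $value): string
{
    if (strpos($value, '"') === false) {
        return '"' . $value . '"';          // no double quote: wrap in double quotes
    }
    if (strpos($value, "'") === false) {
        return "'" . $value . "'";          // no single quote: wrap in single quotes
    }
    // Both kinds present: split on " and glue the pieces back with concat().
    $parts = explode('"', $value);
    return 'concat("' . implode('", \'"\', "', $parts) . '")';
}

// Round-trip check: evaluating the literal should give back the original string.
$doc = new DOMDocument;
$doc->loadXML('<r/>');
$xp = new DOMXPath($doc);

$samples = ['plain', 'say "hi"', "it's", "both \" and '"];
$roundTrip = [];
foreach ($samples as $sample) {
    $roundTrip[] = $xp->evaluate('string(' . xpathLiteral($sample) . ')');
}
var_dump($roundTrip === $samples);
```

The result of xpathLiteral($search) would then replace the hand-written "'.$search.'" part of the query.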
The XPath query returns all parent elements whose text nodes, put together, form a string that contains your search term.
As such a list is not easy to process further as-is, I created a TextRange class that represents a list of DOMText nodes. It is useful to do string-operations on a list of textnodes as if they were one string.
This is the base skeleton of the routine:
$str = '...'; # some XML
$search = 'text that span';
printf("Searching for: (%d) '%s'\n", strlen($search), $search);
$doc = new DOMDocument;
$doc->loadXML($str);
$xp = new DOMXPath($doc);
$anchor = $doc->getElementsByTagName('body')->item(0);
if (!$anchor)
{
    throw new Exception('Anchor element not found.');
}
// search elements that contain the search-text
$r = $xp->query('//*[contains(., "'.$search.'")]/*[FALSE = contains(., "'.$search.'")]/..', $anchor);
if (!$r)
{
    throw new Exception('XPath failed.');
}
// process search results
foreach($r as $i => $node)
{
    $textNodes = $xp->query('.//child::text()', $node);
    // extract $search textnode ranges, create fitting nodes if necessary
    // (TextRange is the helper class described above, not a PHP built-in)
    $range = new TextRange($textNodes);
    $ranges = array();
    while(FALSE !== $start = strpos($range, $search))
    {
        $base = $range->split($start);
        $range = $base->split(strlen($search));
        $ranges[] = $base;
    }
    // wrap each matching textnode
    foreach($ranges as $range)
    {
        foreach($range->getNodes() as $node)
        {
            $span = $doc->createElement('span');
            $span->setAttribute('class', 'search_hightlight');
            $node = $node->parentNode->replaceChild($span, $node);
            $span->appendChild($node);
        }
    }
}
For my example XML:
<html>
<body>
This is some <span>text</span> that span across a page to search in.
and more text that span</body>
</html>
It produces the following result:
<html>
<body>
This is some <span><span class="search_hightlight">text</span></span><span class="search_hightlight"> that span</span> across a page to search in.
and more <span class="search_hightlight">text that span</span></body>
</html>
This shows that the approach can even find text that is distributed across multiple tags. That is not easily possible with regular expressions.
You can find the full code here: http://codepad.viper-7.com/U4bxbe (including the TextRange class, which I have left out of this answer's example).
It's not working properly on the viper codepad because of an older LIBXML version that site is using. It works fine for my LIBXML version 20707. I created a related question about this issue: XPath query result order.
A note of warning: This example uses binary string search (strpos) and the related offsets for splitting textnodes with the DOMText::splitText function. That can lead to wrong offsets, as the function needs the UTF-8 character offset. The correct method is to use mb_strpos to obtain the UTF-8 based value.
The example works anyway because it's only making use of US-ASCII which has the same offsets as UTF-8 for the example-data.
For a real life situation, the $search string should be UTF-8 encoded and mb_strpos should be used instead of strpos:
while(FALSE !== $start = mb_strpos($range, $search, 0, 'UTF-8'))
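A small sketch of how the two offsets drift apart once non-ASCII characters are involved. The sample string is my own, and the mbstring extension is assumed to be available:

```php
<?php
// "Grüße an alle", written with escapes so the file encoding does not matter.
$text = "Gr\u{00FC}\u{00DF}e an alle";
$search = 'alle';

$byteOffset = strpos($text, $search);                 // counts bytes
$charOffset = mb_strpos($text, $search, 0, 'UTF-8');  // counts characters

var_dump($byteOffset, $charOffset);
```

The byte offset is larger by the two extra bytes that ü and ß take up in UTF-8, so passing it to DOMText::splitText would split two characters too late.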