Locating specific string and capturing data following it - php

I built a site a long time ago and now I want to place the data into a database without copying and pasting the 400+ pages that it has grown to so that I can make the site database driven.
My site has meta tags like this (each page different):
<meta name="clan_name" content="Dark Mage" />
So what I'm doing is using cURL to place the entire HTML page in a variable as a string. I can also do it with fopen etc..., but I don't think it matters.
I need to sift through the string to find 'Dark Mage' and store it in a variable (so I can put it into SQL).
Any ideas on the best way to find Dark Mage and store it in a variable? I was trying to use substr and then just subtracting the number of characters from the e in clan_name, but that was a bust.

Just parse the page using the PHP DOM functions, specifically loadHTML(). You can then walk the tree or use xpath to find the nodes you are looking for.
<?php
$doc = new DOMDocument;
$doc->loadHTML($html);
$meta = $doc->getElementsByTagName('meta');
foreach ($meta as $data) {
    // $data is a single <meta> element; read attributes from it, not from the node list
    $name = $data->getAttribute('name');
    if ($name == 'clan_name') {
        $content = $data->getAttribute('content');
        // TODO handle content for clan_name
    }
}
?>
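The XPath route mentioned above could look roughly like this. It is only a sketch: the sample $html string stands in for whatever cURL returned, and the variable names are made up for illustration.

```php
<?php
// Sample page standing in for the HTML fetched with cURL (assumed content).
$html = '<html><head><meta name="clan_name" content="Dark Mage" /></head><body></body></html>';

$doc = new DOMDocument();
// Collect parser warnings internally instead of printing them for imperfect markup.
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
// Select the content attribute of the meta tag whose name is clan_name.
$nodes = $xpath->query('//meta[@name="clan_name"]/@content');

// null when the page has no such meta tag.
$clanName = $nodes->length > 0 ? $nodes->item(0)->nodeValue : null;
echo $clanName, "\n";
```

With 400+ pages you would run this in a loop and insert $clanName into the database for each page.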
EDIT If you want to remove certain tags (such as <script>) before you load your HTML string into memory, try using the strip_tags() function. Something like this strips every tag except meta (the text inside the removed tags is left in place):
<?php
$html = strip_tags($html, '<meta>');
?>

Use a regular expression like the following, with PHP's preg_match():
/<meta name="clan_name" content="([^"]+)"/
If you're not familiar with regular expressions, read on.
The forward-slashes at the beginning and end delimit the regular expression. The stuff inside the delimiters is pretty straightforward except toward the end.
The square-brackets delimit a character class, and the caret at the beginning of the character-class is a negation-operator; taken together, then, this character class:
[^"]
means "match any character that is not a double-quote".
The + is a quantifier which requires that the preceding item occur at least once, and matches as many of the preceding item as appear adjacent to the first. So this:
[^"]+
means "match one or more characters that are not double-quotes".
Finally, the parentheses cause the regular-expression engine to store anything between them in a subpattern. So this:
([^"]+)
means "match one or more characters that are not double-quotes and store them as a matched subpattern".
In PHP, preg_match() stores matches in an array that you pass by reference. The full pattern is stored in the first element of the array, the first sub-pattern in the second element, and so forth if there are additional sub-patterns.
So, assuming your HTML page is in the variable "$page", the following code:
$matches = array();
$found = preg_match('/<meta name="clan_name" content="([^"]+)"/', $page, $matches);
if ($found) {
$clan_name = $matches[1];
}
Should get you what you want.
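If you end up pulling several different meta tags out of each page, the same pattern can be wrapped in a small helper. This is a sketch only: extract_meta_content() is a made-up name, it assumes PHP 7.1+ for the nullable return type, and it still assumes the name attribute comes before content, as in your example.

```php
<?php
// Hypothetical helper: returns the content of a named meta tag, or null if absent.
function extract_meta_content(string $page, string $name): ?string
{
    // preg_quote() escapes any regex metacharacters in the tag name.
    $pattern = '/<meta\s+name="' . preg_quote($name, '/') . '"\s+content="([^"]+)"/i';
    if (preg_match($pattern, $page, $matches) === 1) {
        // Decode entities such as &amp; before storing the value.
        return html_entity_decode($matches[1], ENT_QUOTES, 'UTF-8');
    }
    return null;
}

$page = '<head><meta name="clan_name" content="Dark Mage" /></head>';
$clanName = extract_meta_content($page, 'clan_name');
```

Returning null for a missing tag lets the calling loop skip or log pages that do not have the tag instead of inserting empty rows.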

Use preg_match. A possible regular expression pattern is /clan_name.+content="([^"]+)"/

Related

Different results between preg_replace & preg_match_all

I have a forum that supports hashtags. I'm using the following line to convert all hashtags into links. I'm using the (^|\(|\s|>) pattern to avoid picking up named anchors in URLs.
$str=preg_replace("/(^|\(|\s|>)(#(\w+))/","$1$2",$str);
I'm using this line to pick up hashtags to store them in a separate field when the user posts their message, this picks up all hashtags EXCEPT those at the start of a new line.
preg_match_all("/(^|\(|\s|>)(#(\w+))/",$Content,$Matches);
Using the m & s modifiers doesn't make any difference. What am I doing wrong in the second instance?
Edit: the input text could be plain text or HTML. Example of problem input:
#startoftextreplacesandmatches #afterwhitespacereplacesandmatches <b>#insidehtmltagreplacesandmatches</b> :)
#startofnewlinereplacesbutdoesnotmatch :(
Your replace operation has a problem which you have evidently not yet come across - it will allow unescaped HTML special characters through. The reason I know this is because your regex allows hashtags to be prefixed with >, which is a special character.
For that reason, I recommend you use this code to do the replacement, which will double up as the code for extracting the tags to be inserted into the database:
$hashtags = array();
$expr = '/(?:(?:(^|[(>\s])#(\w+))|(?P<notag>.+?))/';
$str = preg_replace_callback($expr, function($matches) use (&$hashtags) {
if (!empty($matches['notag'])) {
// This takes care of HTML special characters outside hashtags
return htmlspecialchars($matches['notag']);
} else {
// Handle hashtags
$hashtags[] = $matches[2];
return htmlspecialchars($matches[1]).'#'.htmlspecialchars($matches[2]).'';
}
}, $str);
After the above code has been run, $str will contain the modified string, properly escaped for direct output, and $hashtags will be populated with all the tags matched.

replace special strings in a html page by php

I am looking for a way to replace all similar-looking placeholder strings across an entire page with their defined values.
Please do not recommend me other methods of including language constants.
Strings like this :
[_HOME]
[_NEWS]
All of them share the same [_*] form.
Now the big issue is how to scan an HTML page and replace the placeholders with their defined values.
One way to parse the HTML page is to use DOMDocument and then preg_replace() it,
but my main problem is writing a pattern for the replacement:
$pattern = "/[_i]/";
$replacement= custom_lang("/i/");
$doc = new DOMDocument();
$htmlPage = $doc->loadHTML($html);
preg_replace($pattern, $replacement, $htmlPage);
In a regex, [ and ] delimit a character class, so to match them literally you need to escape them.
The other problem with your expression is _*, which matches zero or more underscores. You need to replace it with a meaningful match, such as _.*, which matches an underscore followed by any other characters. So your full expression becomes:
/\[_.*?\]/
Why the ?, you might be tempted to ask. It makes the match non-greedy.
If [_foo] [_bar] is the subject string, a greedy match returns a single match spanning the whole string, because the expression is valid for all of it, whereas a non-greedy match gives you two separate matches.
You might be better off being more restrictive and requiring an _ followed by capital letters, like:
/\[_[A-Z]+\]/
Update: using the matched strings and replacing them. To do so we use a concept called back-referencing.
Consider modifying the above expression by enclosing the name in parentheses, like /\[_([A-Z]+)\]/
Now in the preg_replace() arguments we can refer to the parenthesized expression with $1. So what you can use is:
preg_replace("/\[_([A-Z]+)\]/e", "my_wonderful_replacer('$1')", $html);
Note: the e modifier is needed to treat the second parameter as PHP code. It was deprecated in PHP 5.5 and removed in PHP 7.0, where preg_replace_callback() is the replacement.
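On PHP 7 and later the e modifier is gone, so the same idea can be sketched with preg_replace_callback() instead. The $translations table below is made up for illustration; in practice the values would come from your own language function.

```php
<?php
// Hypothetical translation table standing in for the site's language constants.
$translations = ['HOME' => 'Home', 'NEWS' => 'News'];

$html = '<a href="/">[_HOME]</a> | <a href="/news">[_NEWS]</a> | [_UNKNOWN]';

$result = preg_replace_callback(
    '/\[_([A-Z]+)\]/',
    function (array $m) use ($translations) {
        // $m[0] is the whole placeholder, $m[1] the name captured inside it.
        // Placeholders without a translation are left untouched.
        return $translations[$m[1]] ?? $m[0];
    },
    $html
);

echo $result, "\n";
```

The callback receives the match array directly, so no code has to be built from strings the way the e modifier required.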
If you know the full keyword you are trying to replace (e.g. [_HOME]), then you can just use str_replace() to replace all instances.
No need to make things like this more complex by introducing regex.
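For a fixed set of keywords, a minimal sketch of the str_replace() approach might be the following (the $map contents are invented for the example):

```php
<?php
// Hypothetical placeholder => value map.
$map = ['[_HOME]' => 'Home', '[_NEWS]' => 'News'];

$html = '<li>[_HOME]</li><li>[_NEWS]</li>';

// str_replace() accepts arrays and replaces each search string with the
// replacement at the same position.
$result = str_replace(array_keys($map), array_values($map), $html);

echo $result, "\n";
```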

Regular expression to put <P> AND <ul>/<ol> into array

I'm searching for a function in PHP to put every block element like <p>, <ul> and <ol> into an array, so that I can manipulate the blocks, for example displaying the first two paragraphs and hiding the others.
This function does the trick for the p element. How can I adjust the regexp to also match ul and ol? My attempt gives an error complaining that < is not an operator...
function aantalP($in){
preg_match_all("|<p>(.*)</p>|U",
$in,
$out, PREG_PATTERN_ORDER);
return $out;
}
//tryout:
function aantalPT($in){
preg_match_all("|(<p> | <ol>)(.*)(</p>|</o>)|U",
$in,
$out, PREG_PATTERN_ORDER);
return $out;
}
Can anyone help me?
You can't do this reliably with regular expressions. Paragraphs are mostly OK because they're not nested generally (although they can be). Lists however are routinely nested and that's one area where regular expressions fall down.
PHP has multiple ways of parsing HTML and retrieving selected elements. Just use one of those. It'll be far more robust.
Start with Parse HTML With PHP And DOM.
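As a rough sketch of the DOM route, assuming you only want top-level blocks and not lists nested inside other lists, something like this would collect them (the sample $in string and the variable names are made up):

```php
<?php
$in = '<div><p>First</p><ul><li>a<ul><li>nested</li></ul></li></ul><p>Second</p><ol><li>b</li></ol></div>';

$doc = new DOMDocument();
// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);
$doc->loadHTML($in);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
// Every p, ul or ol that is not itself inside a list, in document order.
$nodes = $xpath->query('//*[self::p or self::ul or self::ol][not(ancestor::ul or ancestor::ol)]');

$blocks = [];
foreach ($nodes as $node) {
    // Serialize each block, including its children, back to an HTML string.
    $blocks[] = $doc->saveHTML($node);
}

// Show the first two blocks; the rest could be hidden or dropped.
$visible = array_slice($blocks, 0, 2);
```

The nested list stays inside its parent block instead of showing up as a separate entry, which is exactly the case the regex cannot handle.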
If you really want to go down the regex route, start with:
function aantalPT($in){
preg_match_all('!<(p|ol)>(.*)</\1>!Us', $in, $out);
return $out;
}
Note: PREG_PATTERN_ORDER is not required as it is the default value.
Basically, use a backreference to find the matching tag. That will fail for many reasons such as nested lists and paragraphs nested within lists. And no, those problems are not solvable (reliably) with regular expressions.
Edit: as (correctly) pointed out, the original regex is also flawed in that it used a pipe delimiter while also using a pipe character inside the pattern. I generally use ! as that doesn't normally occur in the pattern (not in my patterns anyway). Some use forward slashes but they appear in this pattern too. Tilde (~) is another reasonably common choice.
First of all, you use | as delimiter to mark the beginning and end of the regular expression. But you also use | as the or sign. I suggest you replace the first and last | with #.
Secondly, you should use backreferences with capture of the start and end tag like such: <(p|ul)>(.*?)</\1>

Get last <li> element from a string

I have a string variable that contains a lot of HTML markup and I want to get the last <li> element from it.
I'm using something like:
$markup = "<body><div><li id='first'>One</li><li id='second'>Two</li><li id='third'>Three</li></div></body>";
preg_match('#<li(.*?)>(.*)</li>#ims', $markup, $matches);
$lis = "<li ".$matches[1].">".$matches[2]."</li>";
$total = explode("</li>",$lis);
$num = count($total)-2;
echo $total[$num]."</li>";
This works and I get the last <li> element printed. But I can't understand why I have to subtract 2 from the count of the array $total. Normally I would only subtract 1, since indexing starts at 0. What am I missing?
Is there a better way of getting the last <li> element from the string?
HTML is not regular, and so can't be parsed with a regular expression. Use a proper HTML parser.
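A minimal sketch with PHP's built-in DOM extension, using the markup from the question:

```php
<?php
$markup = "<body><div><li id='first'>One</li><li id='second'>Two</li><li id='third'>Three</li></div></body>";

$doc = new DOMDocument();
// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);
$doc->loadHTML($markup);
libxml_clear_errors();

$items = $doc->getElementsByTagName('li');
// The node list is in document order, so the last index is the last <li>.
$last = $items->length > 0 ? $items->item($items->length - 1) : null;

if ($last !== null) {
    echo $last->getAttribute('id'), ': ', $last->textContent, "\n";
}
```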
@OP, your requirement looks simple, so no need for parsers or regex.
$markup = "<body><div><li id='first'>One</li><li id='second'>Two</li><li id='third'>Three</li></div></body>";
$s = explode("</li>",$markup,-1);
$t = explode(">",end($s));
print end($t);
output
$ php test.php
Three
If you already know how to use jQuery, you could also take a look at phpQuery. It's a PHP library that allows you to easily access dom elements, just like in jQuery.
From the PHP.net documentation:
If matches is provided, then it is filled with the results of search. $matches[0] will contain the text that matched the full pattern, $matches[1] will have the text that matched the first captured parenthesized subpattern, and so on.
$matches[0] is the complete match (not just the captured bits)
You read $matches[1] and $matches[2] because the pattern has 2 capturing groups:
$matches[0]; // Contains the text matched by the whole pattern
$matches[1]; // Contains the attributes of the LI start-tag (.*?)
$matches[2]; // Contains the string contained by the LI tags (.*)
As for count($total)-2: $lis ends with </li>, so explode() returns an empty string as its final element, and the last real item sits one index before it.
'Parsing' (X)HTML strings with regular expressions is hard and can be full of unexpected problems. Parsing more than simple tagged strings is not possible because (X)HTML is not a regular language.
You could improve your regex by using (not tested):
#<li([^>]*)>(.+?)</li>#ims
strrpos — Find position of last occurrence of a char in a string
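A sketch of how strrpos() could be applied here. It assumes the markup contains no other tag starting with <li (such as <link>) after the last list item, and the variable names are made up:

```php
<?php
$markup = "<body><div><li id='first'>One</li><li id='second'>Two</li><li id='third'>Three</li></div></body>";

// Position of the last opening tag and of the last closing tag.
$start = strrpos($markup, '<li');
$end = strrpos($markup, '</li>');

$lastLi = null;
if ($start !== false && $end !== false && $end > $start) {
    // Cut from the last "<li" up to and including the last "</li>".
    $lastLi = substr($markup, $start, $end - $start + strlen('</li>'));
}

echo $lastLi, "\n";
```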

Replacing Tags with Includes in PHP with RegExps

I need to read a string, detect a {VAR}, and then do a file_get_contents('VAR.php') in place of {VAR}. The "VAR" can be named anything, like TEST, or CONTACT-FORM, etc. I don't want to know what VAR is -- not to do a hard-coded condition, but to just see an uppercase alphanumeric tag surrounded by curly braces and just do a file_get_contents() to load it.
I know I need to use preg_match and preg_replace, but I'm stumbling through the RegExps on this.
How is this useful? It's useful in hooking WordPress.
Orion above has a right solution, but it's not really necessary to use a callback function in your simple case.
Assuming that the filenames are A-Z + hyphens you can do it in 1 line using PHP's /e flag in the regex (deprecated in PHP 5.5 and removed in PHP 7.0, so this only works on older versions):
$str = preg_replace('/{([-A-Z]+)}/e', 'file_get_contents(\'$1.html\')', $str);
This'll replace any instance of {VAR} with the contents of VAR.html. You could prefix a path into the second term if you need to specify a particular directory.
There are the same vague security worries as outlined above, but I can't think of anything specific.
You'll need to do a number of things. I'm assuming you can do the legwork to get the page data you want to preprocess into a string.
First, you'll need the regular expression to match correctly. That should be fairly easy with something like /{\w+}/.
Next you'll need to use all of the flags to preg_match to get the offset location in the page data. This offset will let you divide the string into the before, matching, and after parts of the match.
Once you have the 3 parts, you'll need to run your include, and stick them back together.
Lather, rinse, repeat.
Stop when you find no more variables.
This isn't terribly efficient, and there are probably better ways. You may wish to consider doing a preg_split instead, splitting on /[{}]/. No matter how you slice it you're assuming that you can trust your incoming data, and this will simplify the whole process a lot. To do this, I'd lay out the code like so:
Take your content and split it like so: $parts = preg_split('/[{}]/', $page_string);
Write a recursive function over the parts with the following criteria:
Halt when length of arg is < 3
Else, return a new array composed of
$arg[0] . load_data($arg[1]) . $arg[2]
plus whatever is left in $arg[3...]
Run your function over $parts.
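The recursive outline above might be sketched like this. expand_parts() is a made-up name, the loader is passed in as a callable so the example does not touch the filesystem, and balanced braces are assumed:

```php
<?php
// Hypothetical implementation of the outline: $load_data maps a tag name to its content.
function expand_parts(array $parts, callable $load_data): string
{
    // Fewer than three parts means no complete {TAG} remains.
    if (count($parts) < 3) {
        return implode('', $parts);
    }
    // $parts[0] is the text before the tag, $parts[1] the tag name, $parts[2] the text after it.
    $head = $parts[0] . $load_data($parts[1]) . $parts[2];
    // Recurse with the merged head followed by whatever parts are left.
    return expand_parts(array_merge([$head], array_slice($parts, 3)), $load_data);
}

$page_string = 'Hi {NAME}, see {PAGE}.';
$parts = preg_split('/[{}]/', $page_string);

// Stand-in loader; a real one might call file_get_contents($name . '.php').
$output = expand_parts($parts, function ($name) {
    return '<' . $name . '>';
});

echo $output, "\n";
```

Because the split treats every brace as a delimiter, a stray { or } in the page text would shift the parts, which is the trust assumption mentioned above.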
You can do it without regexes (god forbid), something like:
// return true if $str ends with $sub
function endsWith($str, $sub) {
    return ( substr( $str, strlen( $str ) - strlen( $sub ) ) === $sub );
}
$theStringWithVars = "blah.php cool.php awesome.php";
$sub = '.php';
// explode() replaces split(), which was removed in PHP 7
$splitStr = explode(" ", $theStringWithVars);
for ($i = 0; $i < count($splitStr); $i++) {
    if (endsWith(trim($splitStr[$i]), $sub)) {
        //file_get_contents($splitStr[$i]) etc...
    }
}
Off the top of my head, you want this:
// load the "template" file
$input = file_get_contents($template_file_name);
// define a callback. Each time the regex matches something, it will call this function.
// whatever this function returns will be inserted as the replacement
function replaceCallback($matches){
// match zero will be the entire match - eg {FOO}.
// match 1 will be just the bits inside the curly braces because of the grouping parens in the regex - eg FOO
// convert it to lowercase and append ".html", so you're loading foo.html
// then return the contents of that file.
// BEWARE. GIANT MASSIVE SECURITY HOLES ABOUND. DO NOT DO THIS
return file_get_contents( strtolower($matches[1]) . ".html" );
}
// run the actual replace method giving it our pattern (with / delimiters), the callback name as a string, and the input file contents
$output = preg_replace_callback('/\{([-A-Z]+)\}/', 'replaceCallback', $input);
// todo: print the output
Now I'll explain the regex
\{([-A-Z]+)\}
The \{ and \} just tell it to match the curly braces. You need the backslashes, as { and } are special characters, so they need escaping.
The ( and ) create a grouping. Basically this lets you extract particular parts of the match. I use it in the function above to just match the things inside the braces, without matching the braces themselves. If I didn't do this, then I'd need to strip the { and } out of the match, which would be annoying.
The [-A-Z] says "match any uppercase character, or a -".
The + after the [-A-Z] means we need to have at least 1 character, but we can have up to any number.
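To take some of the edge off the security warning, here is a sketch that confines lookups to one directory and leaves unknown tags alone. render_template() and the directory layout are assumptions for illustration, not part of the original answer:

```php
<?php
// Hypothetical helper: replaces {TAG} with the contents of $dir/tag.html when that file exists.
function render_template(string $input, string $dir): string
{
    return preg_replace_callback(
        '/\{([-A-Z]+)\}/',
        function (array $m) use ($dir) {
            // The pattern only admits A-Z and hyphens, so the name cannot
            // contain slashes or dots that would escape $dir.
            $path = $dir . '/' . strtolower($m[1]) . '.html';
            // Tags without a matching file are left in place.
            return is_file($path) ? file_get_contents($path) : $m[0];
        },
        $input
    );
}
```

Calling render_template($input, '/path/to/includes') would then behave like the callback above, except that a tag such as {MISSING} stays visible in the output instead of triggering a warning.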
Comparatively speaking, regular expressions are expensive. While you may need them to figure out which files to load, you certainly don't need them for doing the replace, and probably shouldn't use them there. After all, you know exactly what you are replacing, so why do you need a fuzzy search?
Use an associative array and str_replace to do your replacements. str_replace supports arrays for doing multiple substitutions at once. One line substitution, no loops.
For example:
$substitutions = array('{VAR}'=>file_get_contents('VAR.php'),
'{TEST}'=>file_get_contents('TEST.php'),
...
);
$outputContents = str_replace( array_keys($substitutions), $substitutions, $outputContents);
