String parsing help - php

I have a paragraph of text in the following format:
text text text <age>23</age>. text text <hobbies>...</hobbies>
I want to be able to
1) Extract the text found between each <age> and <hobbies> tag found in the string. So for example, I would have an array called $ages which will contain all ages found between all the <age></age> tags, and then another array $hobbies which will have the text between the <hobbies></hobbies> tags found throughout the string.
2) Be able to replace the tags which are extracted with a marker, such as {age_444}, so e.g the above text would become
text text text {age_444}. text text {hobbies_555}
How can this be done?

//Extract the age
preg_match_all("#<age>(.*?)</age>#",$string,$match);
$ages=$match[1];
//Extract the hobby
preg_match_all("#<hobbies>(.*?)</hobbies>#",$string,$match);
$hobbies=$match[1];
//Replace the age
$agefn=create_function('$match','$query=mysql_query("select ageid...where age=".$match[1]); return "<age>{age_".mysql_fetch_object($query)->ageid."}</age>"');
$string=preg_replace_callback("#<age>(.*?)</age>#",$agefn,$string);
//Replace the hobby
$hobfn=create_function('$match','$query=mysql_query("select hobid...where hobby=".$match[1]); return "<hobbies>{hobbies_".mysql_fetch_object($query)->hobid."}</hobbies>"');
$string=preg_replace_callback("#<hobbies>(.*?)</hobbies>#",$hobfn,$string);

If your source document is a kind of well-formed XML (or if it can easily be brought into this shape at least), you can use XSLT/XSL-FO to transform your document.
Finding informations enclosed by <> tags and rearranging/extracting them is one of the main features. You can use XSLT/XSL-FO stand-alone or within various languages (Java, C, even Visual Basic)
What you need is your source document and a document describing the transformation rules. The rendering machine or library function will do the rest.
Hope that helps. Good luck

$string = '<age>23</age><hobbies>hobbietext</hobbies>';
$ageTemp = explode('<age>', $string );
foreach($ageTemp as $key=>$value)
{
$age = explode('</age>', $value);
if(isset($age[0])) $ages[] = $age[0];
}
$hobbiesTemp = explode('<hobbies>', $string );
foreach($hobbiesTemp as $key=>$value)
{
$hobbie = explode('</hobbies>', $value);
if(isset($hobbie[0])) $hobbies[] = $hobbie[0];
}
final arrays are $hobbies and $ages
after that you just replace your sting like this:
foreach($ages as $key=>$value)
{
$string = str_replace('<age>'.$value.'</age>', '{age_'.$yourId.'}', $string);
}
foreach($hobbies as $key=>$value)
{
$string = str_replace('<hobbies>'.$value.'</hobbies>', '{hobbie_'.$yourId.'}', $string);
}

Related

Conversion of text within delimeters to valid url

I have to convert an old website to a CMS and one of the challenges I have is at present there are over 900 folders that contain up to 9 text files in each folder. I need to combine the up to 9 text files into one and then use that file as the import into the CMS.
The file concatenation and import are working perfectly.
The challenge that I have is parsing some of the text in the text file.
The text file contains a url in the form of
Some text [http://xxxxx.com|About something] some more text
I am converting this with this code
if (substr ($line1, 0, 7) !=="Replace") {
$pattern = '/\\[/';
$pattern2 = '/\\]/';
$pattern3 = '/\\|/';
$replacement = '<a href="';
$replacement3 = '">';
$replacement2='</a><br>';
$subject = $line1;
$i=preg_replace($pattern, $replacement, $subject, -1 );
$i=preg_replace($pattern3, $replacement3, $i, -1 );
$i=preg_replace($pattern2, $replacement2, $i, -1 );
$line .= '<div class="'.$folders[$x].'">'.$i.'</div>' ;
}
It may not be the most efficient code but it works and as this is a one off exercise execution time etc is not an issue.
Now to the problem that I cannot seem to code around. Some of the urls in the text files are in this format
Some text [http://xxxx.com] some more text
The pattern matching that I have above finds pattern and pattern2 but as there is no pattern3 the url is malformed in the output.
Regular expressions are not my forte is there a way to modify what I have above or is there another way to get the correctly formatted url in my output or will I need to parse the output a second time looking for the malformed url and correct it before writing it to the output file?
You can use preg_replace_callback() to achieve this:
Find any string of the format [...]
Try to split them by the delimiter | using explode()
If the split array contains two pieces, then it means the [...] string contains two pieces: the link href and the link anchor text
If not, then it means the the [...] string contains only the link href part
Format and return the link
Code:
$input = <<<EOD
Some text [http://xxxxx.com|About something] some more text
Some text [http://xxxx.com] some more text
EOD;
$output = preg_replace_callback('#\[([^\]]+)\]#', function($m)
{
$parts = explode('|', $m[1]);
if (count($parts) == 2)
{
return sprintf('%s', $parts[0], $parts[1]);
}
else
{
return sprintf('%1$s', $m[1]);
}
}, $input);
echo $output;
Output:
Some text About something some more text
Some text http://xxxx.com some more text
Live demo

I need to find a string in a string then replace that and text around it

i have a string that has markers and I need to replace with text from a database. this text string is stored in a database and the markers are for auto fill with data from a different part of the database.
$text = '<span data-field="la_lname" data-table="user_properties">
{Listing Agent Last Name}
</span>
<br>RE: The new offer<br>Please find attached....'
if i can find the data marker by:
strpos($text, 'la_lname');
can i use that to select everything in and between the <span> and </span> tags..
so the new string looks like:
'Sommers<br>RE: The new offer<br>Please find attached....'
I thought I could explode the string based on the <span> tags but that opens up a lot of problems as I need to keep the text intact and formated as it is. I just want to insert the data and leave everything else untouched.
To get what's between two parts of a string
for example if you have
<span>SomeText</span>
If you want to get SomeText then I suggest using a function that gets whatever is between two parts that you put as parameters
<?php
function getbetween($content,$start,$end) {
$r = explode($start, $content);
if (isset($r[1])){
$r = explode($end, $r[1]);
return $r[0];
}
return '';
}
$text = '<span>SomeText</span>';
$start = '<span>';
$end = '</span>';
$required_text = getbetween($text,$start,$end);
$full_line = $start.$required_text.$end;
$text = str_replace($full_line, 'WHAT TO REPLACE IT WITH HERE',$text);
You could try preg_replace or use a DOM Parser, which is far more useful for navigating HTML-like-structure.
I should add that while regular expressions should work just fine in this example, you may need to do more complex things in the future or traverse more intrincate DOM structures for your replacements, so a DOM Parser is the way to go in this case.
Using PHP Simple HTML DOM Parser
$html = str_get_html('<span data-field="la_lname" data-table="user_properties">{Listing Agent Last Name}</span><br>RE: The new offer<br>Please find attached....');
$html->find('span')->innerText = 'New value of span';

Finding and replacing attributes using preg_replace

I am trying to redo some forms that have uppercase field names and spaces, there are hundreds of fields and 50 + forms... I decided to try to write a PHP script that parses through the HTML of the form.
So now I have a textarea that I will post the html into and I want to change all the field names from
name="Here is a form field name"
to
name="here_is_a_form_field_name"
How in one command could I parse through and change it so all in the name tags would be lowercase and spaces replace with underscores
I am assuming preg_replace with an expression?
Thanks!
I would suggest not using regex for manipulation of HTML .. I would use DOMDocument instead, something like the following
$dom = new DOMDocument();
$dom->loadHTMLFile('filename.html');
// loop each textarea
foreach ($dom->getElementsByTagName('textarea') as $item) {
// setup new values ie lowercase and replacing space with underscore
$newval = $item->getAttribute('name');
$newval = str_replace(' ','_',$newval);
$newval = strtolower($newval);
// change attribute
$item->setAttribute('name', $newval);
}
// save the document
$dom->saveHTML();
An alternative would be to use something like Simple HTML DOM Parser for the job - there are some good examples on the linked site
I agree that preg_replace() or rather preg_replace_callback() is the right tool for the job, here's an example of how to use it for your task:
preg_replace_callback('/ name="[^"]"/', function ($matches) {
return str_replace(' ', '_', strtolower($matches[0]))
}, $file_contents);
You should, however, check the results afterwards using a diff tool and fine-tune the pattern if necessary.
The reason why I would recommend against a DOM parser is that they usually choke on invalid HTML or files that contain for example tags for templating engines.
This is your Solution:
<?php
$nameStr = "Here is a form field name";
while (strpos($nameStr, ' ') !== FALSE) {
$nameStr = str_replace(' ', '_', $nameStr);
}
echo $nameStr;
?>

PHP Search Text Highlight Function

I have a PHP highlighting function which makes certain words bold.
Below is the function, and it works great, except when the array: $words contains a single value that is: b
For example someone searches for: jessie j price tag feat b o b
This will have the following entries in the array $words: jessie,j,price,tag,feat,b,o,b
When a 'b' shows up, my whole function goes wrong, and it displays a whole bunch of wrong html tags. Of course I can strip out any 'b' values from the array, but this isn't ideal, as the highlighting isnt working as it should with certain queries.
This sample script:
function highlightWords2($text, $words)
{
$text = ($text);
foreach ($words as $word)
{
$word = preg_quote($word);
$text = preg_replace("/\b($word)\b/i", '<b>$1</b>', $text);
}
return $text;
}
$string = 'jessie j price tag feat b o b';
$words = array('jessie','tag','b','o','b');
echo highlightWords2($string, $words);
Will output:
<<<b>b</b>><b>b</b></<b>b</b>>>jessie</<<b>b</b>><b>b</b></<b>b</b>>> j price <<<b>b</b>><b>b</b></<b>b</b>>>tag</<<b>b</b>><b>b</b></<b>b</b>>> feat <<b>b</b>><b>b</b></<b>b</b>> <<b>b</b>>o</<b>b</b>> <<b>b</b>><b>b</b></<b>b</b>>
And this only happens because there are "b"'s in the array.
Can you guys see anything that I could change to make it work properly?
You problem is that when your function goes through and looks for all the b's to bold it sees the bold tags and also tries to bold them as well.
#symcbean was close but forgot one thing.
$string = 'jessie j price tag feat b o b';
$words = array('jessie','tag','b','o','b');
print hl($string, $words);
function hl($inp, $words)
{
$replace=array_flip(array_flip($words)); // remove duplicates
$pattern=array();
foreach ($replace as $k=>$fword) {
$pattern[]='/\b(' . $fword . ')(?!>)\b/i';
$replace[$k]='<b>$1</b>';
}
return preg_replace($pattern, $replace, $inp);
}
Do you see this added "(?!>)" that is a negative look ahead assertion, basically it says only match if the string is not followed by a ">" which is what would be seen is opening bold and closing bold tags. Notice I only check for ">" after the string in order to exclude both the opening and closing bold tag as looking for it at the start of the string would not catch the closing bold tag. The above code works exactly as expected.
Your base problem is that you quite wildly replace plain text strings inside HTML. That does cause your problem for small strings as you replace text in tags and attributes as well.
Instead you need to apply your search and replace to the text between HTML texts only. Additionally you don't want to highlight inside another highlight as well.
To do such things, regular expressions are quite limited. Instead use a HTML parser, in PHP this is for example DOMDocument. With a HTML parser it is possible to search only inside the HTML text elements (and not other things like tags, attributes and comments).
You find a highlighter for text in a previous answer of mine with a detailed description how it works. The question is Ignore html tags in preg_replace and it is quite similar to your question so probably this snippet is helpful, it uses <span> instead of <b> tags:
$doc = new DOMDocument;
$doc->loadXML($str);
$xp = new DOMXPath($doc);
$anchor = $doc->getElementsByTagName('body')->item(0);
if (!$anchor)
{
throw new Exception('Anchor element not found.');
}
// search elements that contain the search-text
$r = $xp->query('//*[contains(., "'.$search.'")]/*[FALSE = contains(., "'.$search.'")]/..', $anchor);
if (!$r)
{
throw new Exception('XPath failed.');
}
// process search results
foreach($r as $i => $node)
{
$textNodes = $xp->query('.//child::text()', $node);
// extract $search textnode ranges, create fitting nodes if necessary
$range = new TextRange($textNodes);
$ranges = array();
while(FALSE !== $start = strpos($range, $search))
{
$base = $range->split($start);
$range = $base->split(strlen($search));
$ranges[] = $base;
};
// wrap every each matching textnode
foreach($ranges as $range)
{
foreach($range->getNodes() as $node)
{
$span = $doc->createElement('span');
$span->setAttribute('class', 'search_hightlight');
$node = $node->parentNode->replaceChild($span, $node);
$span->appendChild($node);
}
}
}
If you adopt it for multiple search terms, I would add an additional class with a number depending on the search term so you can nicely style it with CSS in different colors.
Additionally you should remove duplicate search terms and make the xpath expression aware to not look for text that is already part of an element that has the highlight span assigned.
If it were me I'd have used javascript.
But using PHP, since the problem only seems to be duplicate entries in the search, just remove them, also you can run preg_replace just once rather than multiple times....
$string = 'jessie j price tag feat b o b';
$words = array('jessie','tag','b','o','b');
print hl($string, $words);
function hl($inp, $words)
{
$replace=array_flip(array_flip($words)); // remove duplicates
$pattern=array();
foreach ($replace as $k=>$fword) {
$pattern[]='/\b(' . $fword . ')\b/i';
$replace[$k]='<b>$1<b>';
}
return preg_replace($pattern, $replace, $inp);
}

PHP RegEx (or Alt Method) for Anchor tags

Ok I have to parse out a SOAP request and in the request some of the values are passed with (or inside) a Anchor tag. Looking for a RegEx (or alt method) to strip the tag and just return the value.
// But item needs to be a RegEx of some sort, it's a field right now
if($sObject->list == 'item') {
// Split on > this should be the end of the right side of the anchor tag
$pieces = explode(">", $sObject->fields->$field);
// Split on < this should be the closing anchor tag
$piece = explode("<", $pieces[1]);
$fields_string .= $piece[0] . "\n";
}
item is a field name but I would like to make this a RegEx to check for the Anchor tag instead of a specific field.
PHP has a strip_tags() function.
Alternatively you can use filter_var() with FILTER_SANITIZE_STRING.
Whatever you do don't parse HTML/XML with regular expressions. It's really error-prone and flaky. PHP has at least 3 different parsers as standard (SimpleXML, DOMDocument and XMLReader spring to mind).
I agree with cletus, using RegEx on HTML is bad practice because of how loose HTML is as a language (and I moan about PHP being too loose...). There are just so many ways you can variate a tag that unless you know that the document is standards-compliant / strict, it is sometimes just impossible to do. However, because I like a challenge that distracts me from work, here's how you might do it in RegEx!
I'll split this up into sections, no point if all you see is a string and say, "Meh... It'll do..."! First we have the main RegEx for an anchor tag:
'#<a></a>#'
Then we add in the text that could be between the tags.
We want to group this is parenthesis, so we can extract the string, and the question mark makes the asterix wildcard "un-greedy", meaning that the first </a> that it comes accross will be the one it uses to end the RegEx.
'#<a>(.*?)</a>#'
Next we add in the RegEx for href="". We match the href=" as plain text, then an any-length string that does not contain a quotation mark, then the ending quotation mark.
'#<a href\="([^"]*)">(.*?)</a>#'
Now we just need to say that the tag is allowed other attributes. According to the specification, an attribute can contain the following characters: [a-zA-Z_\:][a-zA-Z0-9_\:\.-]*.
Allow an attribute multiple times, and with a value, we get: ( [a-zA-Z_\:][a-zA-Z0-9_\:\.-]*\="[^"]*")*.
The resulting RegEx (PCRE) is as following:
'#<a( [a-zA-Z_\:][a-zA-Z0-9_\:\.-]*\="[^"]*")* href\="([^"]*)"( [a-zA-Z_\:][a-zA-Z0-9_\:\.-]*\="[^"]*")*>(.*?)</a>#'
Now, in PHP, use the preg_match_all() function to grab all occurances in the string.
$regex = '#<a( [a-zA-Z_\:][a-zA-Z0-9_\:\.-]*\="[^"]*")* href\="([^"]*)"( [a-zA-Z_\:][a-zA-Z0-9_\:\.-]*\="[^"]*")*>(.*?)</a>#';
preg_match_all($regex, $str_containing_anchors, $result);
foreach($result as $link)
{
$href = $link[2];
$text = $link[4];
}
use simplexml and xpath to retrieve the desired nodes
If you don't have some kind of request<->class mapping you can extract the information with the DOM extension. The property textConent contains all the text of the context node and its descendants.
$sr = '<?xml version="1.0"?>
<SOAP:Envelope xmlns:SOAP="urn:schemas-xmlsoap-org:soap.v1">
<SOAP:Body>
<foo:bar xmlns:foo="urn:yaddayadda">
<fragment>
Mary had a
little lamb
</fragment>
</foo:bar>
</SOAP:Body>
</SOAP:Envelope>';
$doc = new DOMDocument;
$doc->loadxml($sr);
$xpath = new DOMXPath($doc);
$ns = $xpath->query('//fragment');
if ( 0 < $ns->length ) {
echo $ns->item(0)->nodeValue;
}
prints
Mary had a
little lamb
If you want to strip or extract properties from only specific tag, you should try DOMDocument.
Something like this:
$TagWhiteList = array(
// Example of WhiteList
'b', 'i', 'u', 'strong', 'em', 'a', 'img'
);
function getTextFromNode($Node, $Text = "") {
// No tag, so it is a text
if ($Node->tagName == null)
return $Text.$Node->textContent;
// You may select a tag here
// Like:
// if (in_array($TextName, $TagWhiteList))
// DoSomthingWithIt($Text,$Node);
// Recursive to child
$Node = $Node->firstChild;
if ($Node != null)
$Text = getTextFromNode($Node, $Text);
// Recursive to sibling
while($Node->nextSibling != null) {
$Text = getTextFromNode($Node->nextSibling, $Text);
$Node = $Node->nextSibling;
}
return $Text;
}
function getTextFromDocument($DOMDoc) {
return getTextFromNode($DOMDoc->documentElement);
}
To use:
$Doc = new DOMDocument();
$Doc->loadHTMLFile("Test.html");
$Text = getTextFromDocument($Doc);
echo "Text from HTML: ".$Text."\n";
The above function is how to strip tags. But you can modify it a bit to manipulate the element. For example, if the tag is 'a' of archor, you can extract its target and display it instead of the text inside.
Hope this help.

Categories