How to parse this kind of HTML code using PHP? - php

First of all, I found some threads here on SO, for example here, but it's not exactly what I am looking for.
Here is a sample of text that I have:
Some text bla bla bla bla<br /><b>Date</b>: 2012-12-13<br /><br /><b>Name</b>: Peter Novak<br /><b>Hobby</b>: books,cinema,facebook
The desired output:
2012-12-13
Peter Novak
books,cinema,facebook
I need to save this information into our database, but I don't know how to detect the label between the <b> tags (e.g. Date) and then the value that immediately follows it (in this case: 2012-12-13).
I would be grateful for any help with this, thank you!

Since there's not much DOM to traverse, there's not much a DOM traversal tool can do with this.
This should work:
1) Remove the leading text, i.e. everything before the first tag.
2) Remove the b tags. A DOM traversal tool can do this, but if they are pure text, even a regex can do it, and it can remove the colon and the subsequent whitespace in the same pass: <b\s*>[^<]+</b\s*>:\s*
3) Change sequences of br tags to bare newlines (do you really want to?). The DOM traversal tool can do this, but so can regexes: (?:<br\s*/?>)+
$html = preg_replace('#^[^<]+#', "", $html);
$html = preg_replace('#<b\s*>[^<]+</b\s*>:\s*#', "", $html);
$html = preg_replace('#(?:<br\s*/?>)+#', "\n", $html);
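If the labels are needed as well (for example to map each value to a database column), the same kind of pattern can capture label and value together. This is only a sketch under the assumption that every field looks like `<b>Label</b>: value` and that values never contain a `<`; the variable names are made up for illustration.

```php
<?php
// Sample input copied from the question.
$html = 'Some text bla bla bla bla<br /><b>Date</b>: 2012-12-13<br /><br /><b>Name</b>: Peter Novak<br /><b>Hobby</b>: books,cinema,facebook';

// Group 1 is the label inside <b>...</b>, group 2 is the value,
// which runs up to the next tag or the end of the string.
preg_match_all('#<b\s*>([^<]+)</b\s*>:\s*([^<]*)#', $html, $matches, PREG_SET_ORDER);

$fields = [];
foreach ($matches as $m) {
    $fields[trim($m[1])] = trim($m[2]);
}

print_r($fields);
```

This gives an associative array keyed by label, so `$fields['Date']` can be passed straight to a prepared statement.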

If <b>Date</b>, <b>Name</b>, <b>Hobby</b> and the <br />'s always appear in that form, I suggest you use strpos() and substr().
For instance, to get the date:
// Get start position, +13 because of "<b>Date</b>: "
$dateStartPos = strpos($yourText, "<b>Date</b>") + 13;
// Get end position, use dateStartPos as offset
$dateEndPos = strpos($yourText, "<br />", $dateStartPos);
// Cut out the date, the length is the end position minus the start position
$date = substr($yourText, $dateStartPos, ($dateEndPos - $dateStartPos));
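The same three steps can be wrapped in a small function so they work for any label. A rough sketch: `extractField` is a hypothetical helper name, and it assumes the exact `<b>Label</b>: ` and `<br />` spelling from the question. It also covers the last field, which has no trailing `<br />`.

```php
<?php
// Hypothetical helper: returns the text between "<b>$label</b>: " and the
// next "<br />", or null when the label does not occur in $text.
function extractField(string $text, string $label): ?string
{
    $marker = "<b>{$label}</b>: ";
    $start = strpos($text, $marker);
    if ($start === false) {
        return null;
    }
    $start += strlen($marker);
    $end = strpos($text, '<br />', $start);
    // The last field has no trailing <br />, so read to the end of the string.
    return $end === false ? substr($text, $start) : substr($text, $start, $end - $start);
}

$text = 'Some text bla bla bla bla<br /><b>Date</b>: 2012-12-13<br /><br /><b>Name</b>: Peter Novak<br /><b>Hobby</b>: books,cinema,facebook';
echo extractField($text, 'Date'), "\n";
echo extractField($text, 'Name'), "\n";
echo extractField($text, 'Hobby'), "\n";
```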

Assuming that the format is consistent, then explode can work for you:
<?php
$text = "Some text bla bla bla bla<br /><b>Date</b>: 2012-12-13<br /><br /><b>Name</b>: Peter Novak<br /><b>Hobby</b>: books,cinema,facebook";
$tokenized = explode(': ', $text);
$tokenized[1] = explode("<br", $tokenized[1]);
$tokenized[2] = explode("<br", $tokenized[2]);
$tokenized[3] = explode("<br", $tokenized[3]);
$date = $tokenized[1][0];
$name = $tokenized[2][0];
$hobby = $tokenized[3][0];
echo $date;
echo $name;
echo $hobby;
?>

Using PHP Simple HTML DOM Parser you can achieve this easily (just like jQuery)
include('simple_html_dom.php');
$html = str_get_html('Some text bla bla bla bla<br /><b>Date</b>: 2012-12-13<br /><br /><b>Name</b>: Peter Novak<br /><b>Hobby</b>: books,cinema,facebook');
Or
$html = file_get_html('http://your_page.com/');
then
foreach($html->find('text') as $t){
if(substr($t, 0, 1)==':')
{
// do whatever you want
echo substr($t, 1).'<br />';
}
}
The output of the example is given below
2012-12-13
Peter Novak
books,cinema,facebook
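If adding a third-party library is not an option, PHP's built-in DOMDocument can do roughly the same thing. This is a sketch, not a drop-in replacement: it assumes each value is the text node directly after its <b> element, as in the sample.

```php
<?php
$html = 'Some text bla bla bla bla<br /><b>Date</b>: 2012-12-13<br /><br /><b>Name</b>: Peter Novak<br /><b>Hobby</b>: books,cinema,facebook';

$dom = new DOMDocument();
libxml_use_internal_errors(true);   // keep parser warnings out of the output
$dom->loadHTML($html);
libxml_clear_errors();

$fields = [];
foreach ($dom->getElementsByTagName('b') as $b) {
    $next = $b->nextSibling;
    // The value is expected in the text node right after </b>, e.g. ": 2012-12-13".
    if ($next instanceof DOMText) {
        $fields[trim($b->textContent)] = trim(ltrim($next->nodeValue, ':'));
    }
}

print_r($fields);
```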

Related

preg_replace href anchor with anchor text

How can I replace all the anchors with their anchor text? My code is:
$body='<p>The man was dancing like a little boy while all kids were watching ... </p>';
I want the result to be:
<p>The man was dancing like a little boy while all kids were watching ... </p>
I used:
$body= preg_replace('#<a href="https?://(?:.+\.)?ok.co.*?>.*?</a>#i', '$1', $body);
and the result is:
<p>The man was while all kids were watching ... </p>
Try this
$body='<p>The man was dancing like a little boy while all kids were watching ... </p>';
echo preg_replace('#<a.*?>([^>]*)</a>#i', '$1', $body);
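To see why the capture group matters: a replacement of '$1' with no group in the pattern inserts an empty string, which is why the anchor text vanished in the question. The sketch below uses an assumed input, since the anchor markup is missing from the sample above; the URL is made up.

```php
<?php
// Assumed input: the href below is invented for illustration.
$body = '<p>The man was <a href="http://www.example.com/video">dancing like a little boy</a> while all kids were watching ... </p>';

// No capture group: "$1" is empty, so the whole anchor (text included) disappears.
$without = preg_replace('#<a\b[^>]*>.*?</a>#i', '$1', $body);

// With a group around the anchor text, "$1" puts the text back.
$with = preg_replace('#<a\b[^>]*>(.*?)</a>#i', '$1', $body);

echo $without, "\n";
echo $with, "\n";
```

The `\b` after `<a` keeps the pattern from also matching tags such as <abbr>.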
Without regexes.....
<?php
$d = new DOMDocument();
$d->loadHTML('<p>The man was dancing like a little boy while all kids were watching ... </p>');
$x = new DOMXPath($d);
foreach($x->query('//a') as $anchor){
$url = $anchor->getAttribute('href');
$domain = parse_url($url,PHP_URL_HOST);
if($domain == 'www.example.com'){
$anchor->parentNode->replaceChild(new DOMText($anchor->textContent),$anchor);
}
}
function get_inner_html( $node ) {
$innerHTML= '';
$children = $node->childNodes;
foreach ($children as $child) {
$innerHTML .= $child->ownerDocument->saveXML( $child );
}
return $innerHTML;
}
echo get_inner_html($x->query('//body')[0]);
You can use this code:
regex : /< a.*?>|<a.*?>|<\/a>/g
$body='<p>The man was dancing like a little boy while all kids were watching ... </p>';
echo preg_replace('/< a.*?>|<a.*?>|<\/a>/', ' ', $body);
Test it and see example matches: https://regex101.com/r/mgYjoB/1
You could simply use strip_tags() and htmlspecialchars() here.
strip_tags - Strip HTML and PHP tags from a string
htmlspecialchars - Convert special characters to HTML entities
Step 1: Use strip_tags() to strip all tags except the <p> tag.
Step 2: Since we need to obtain the string along with the HTML tags, we need to use htmlspecialchars().
echo htmlspecialchars(strip_tags($body, '<p>'));
When there's already an in-built PHP function, I think it's better and more compact to use that instead of using preg_replace

How to wrap every word in spans with PHP?

I have some HTML paragraphs and I want to wrap every word in a <span>. Now I have:
$paragraph = "This is a paragraph.";
$contents = explode(' ', $paragraph);
$i = 0;
$span_content = '';
foreach ($contents as $c){
$span_content .= '<span>'.$c.'</span> ';
$i++;
}
$result = $span_content;
The above code works just fine for normal cases, but sometimes $paragraph contains some HTML tags, for example:
$paragraph = "This is an image: <img src='/img.jpeg' /> This is a <a href='/abc.htm'/>Link</a>'";
How can I avoid wrapping "words" inside an HTML tag, so that the HTML tags still work but the other words are wrapped in spans? Thanks a lot!
Some (*SKIP)(*FAIL) mechanism?
<?php
$content = "This is an image: <img src='/img.jpeg' /> ";
$content .= "This is a <a href='/abc.htm'/>Link</a>";
$regex = '~<[^>]+>(*SKIP)(*FAIL)|\b\w+\b~';
$wrapped_content = preg_replace($regex, "<span>\\0</span>", $content);
echo $wrapped_content;
See a demo on ideone.com as well as on regex101.com.
To leave out the Link as well, you could go for:
(?:<[^>]+> # same pattern as above
| # or
(?<=>)\w+(?=<) # lookarounds with a word
)
(*SKIP)(*FAIL) # all of these alternatives shall fail
|
(\b\w+\b)
See a demo for this on regex101.com.
The short version is you really do not want to attempt this.
The longer version: if you are dealing with HTML then you need an HTML parser. You can't use regexes. Where it becomes even messier is that you are not starting with HTML but with an HTML fragment (which may or may not be well-formed. It might work if …). Hence you need to use an HTML parser to identify the non-HTML extents, separate them out and feed them into a secondary parser (which might well use regexes) for translation, then replace the translated content back into the DOM before serializing the document.
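The parser-based route described above could look roughly like this with the built-in DOMDocument. Treat it as a sketch: `wrapWordsInSpans` and the `wrap-root` id are names made up here, a "word" is assumed to be any run of non-whitespace characters, and the stray slash in the question's <a> tag is dropped from the sample input.

```php
<?php
// Hypothetical helper: wraps each word found in text nodes in <span>,
// leaving the tags themselves untouched.
function wrapWordsInSpans(string $fragment): string
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true);   // fragments often trigger parser warnings
    $dom->loadHTML('<div id="wrap-root">' . $fragment . '</div>');
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    $root = $xpath->query('//div[@id="wrap-root"]')->item(0);

    // Copy the node list first, because the loop below modifies the tree.
    $textNodes = iterator_to_array($xpath->query('.//text()', $root));
    foreach ($textNodes as $textNode) {
        // Split into words and whitespace runs, keeping the whitespace.
        $parts = preg_split('/(\s+)/', $textNode->nodeValue, -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
        $replacement = $dom->createDocumentFragment();
        foreach ($parts as $part) {
            if (trim($part) === '') {
                $replacement->appendChild($dom->createTextNode($part));
            } else {
                $span = $dom->createElement('span');
                $span->appendChild($dom->createTextNode($part));
                $replacement->appendChild($span);
            }
        }
        $textNode->parentNode->replaceChild($replacement, $textNode);
    }

    // Serialize only the children of the wrapper, not the added html/body tags.
    $out = '';
    foreach ($root->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }
    return $out;
}

echo wrapWordsInSpans("This is an image: <img src='/img.jpeg' /> This is a <a href='/abc.htm'>Link</a>");
```

Because only text nodes are touched, attribute values and tag names are never split, which is the part the plain explode() approach gets wrong.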

PHP - Replace Word in String (while ignore HTML tags)

I have a string with HTML tags, $paragraph:
$paragraph = '
<p class="instruction">
<sup id="num1" class="s">80</sup>
Hello there and welcome to Stackoverflow! You are welcome indeed.
</p>
';
$replaceIndex = array(0, 4);
$word = 'dingo';
I'd like to replace the words at indices defined by $replaceIndex (0 and 4) of $paragraph. By this, I mean I want to replace the words "80" and "welcome" (only the first instance) with $word. The paragraph itself may be formatted with different HTML tags in different places.
Is there a way to locate and replace certain words of the string while virtually ignoring (but not stripping) HTML tags?
Thanks!
Edit: Words are separated by (multiple) tags and (multiple) whitespace characters, while not including anything within the tags.
Thanks for all the tips. I figured it out! Since I'm new to PHP, I'd appreciate it if any PHP veterans have any tips on simplifying the code. Thanks!
$paragraph = '
<p class="instruction">
<sup id="num1" class="s">80</sup>
Hello there and welcome to Stackoverflow! You are welcome indeed.
</p>
';
// Split up $paragraph into an array of tags and words
$paragraphArray = preg_split('/(<.*?>)|\s/', $paragraph, -1, PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE);
$wordIndicies = array(0, 4);
$replaceWith = 'REPLACED';
foreach ($wordIndicies as $wordIndex) {
for ($i = 0; $i <= $wordIndex; $i++) {
// if this element starts with '<', element is a tag.
if ($paragraphArray[$i][0] == '<') {
// push wordIndex forward to compensate for found tag element
$wordIndex++;
}
// when we reach the word we want, replace it!
elseif ($i == $wordIndex) {
$paragraphArray[$i] = $replaceWith;
}
}
}
// Put the string back together
$newParagraph = implode(' ', $paragraphArray);
// Test output!
echo(htmlspecialchars($newParagraph));
*Only caveat is that this may potentially produce unwanted spaces in $newParagraph, but I'll see if that actually causes any issues when I implement the code.
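One possible simplification that avoids the split/implode round trip (and with it the extra spaces) is to count words inside preg_replace_callback while skipping tags. A sketch under assumptions: `replaceWordsAt` is a hypothetical name, a word is any run of characters that is neither whitespace nor part of a tag, and attribute values are assumed not to contain `>`.

```php
<?php
// Hypothetical helper: replaces the words at the given zero-based indices,
// counting only text outside of tags.
function replaceWordsAt(string $html, array $indices, string $replacement): string
{
    $i = 0;
    return preg_replace_callback(
        // Tags are matched and discarded via (*SKIP)(*FAIL); everything else
        // that is not whitespace counts as one word.
        '~<[^>]*>(*SKIP)(*FAIL)|[^\s<>]+~',
        function (array $m) use (&$i, $indices, $replacement) {
            return in_array($i++, $indices, true) ? $replacement : $m[0];
        },
        $html
    );
}

$paragraph = '
<p class="instruction">
<sup id="num1" class="s">80</sup>
Hello there and welcome to Stackoverflow! You are welcome indeed.
</p>
';

echo replaceWordsAt($paragraph, [0, 4], 'dingo');
```

The original whitespace and markup are left exactly as they were; only the matched words change.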
$text = preg_replace('/\b80\b|\bwelcome\b/', $word, $paragraph);
Hope this will help you :)
SimpleXML could come in handy as well:
$paragraph = '
<p class="instruction">
<sup id="num1" class="s">80</sup>
Hello there and welcome to Stackoverflow! You are welcome indeed.
</p>
';
$xml = simplexml_load_string($paragraph);
$xml->sup = $word;

Remove from string

I have the following that I need removed from string in loop.
<comment>Some comment here</comment>
The result is from a database, so the content inside the comment tag is different each time.
Thanks for the help.
Figured it out. The following seems to do the trick.
echo preg_replace('~\<comment>.*?\</comment>~', '', $blog->comment);
This may be overkill, but you can use DOMDocument to parse the string as HTML, then remove the tags.
$str = 'Test 123 <comment>Some comment here</comment> abc 456';
$dom = new DOMDocument;
// Wrap $str in a div, so we can easily extract the HTML from the DOMDocument
@$dom->loadHTML("<div id='string'>$str</div>"); // It yells about <comment> not being valid
$comments = $dom->getElementsByTagName('comment');
foreach($comments as $c){
$c->parentNode->removeChild($c);
}
$domXPath = new DOMXPath($dom);
// $dom->getElementById requires the HTML be valid, and it's not here
// $dom->saveHTML() adds a DOCTYPE and HTML tag, which we don't need
echo $domXPath->query('//div[@id="string"]')->item(0)->nodeValue; // "Test 123 abc 456"
DEMO: http://codepad.org/wfzsmpAW
If this is only a matter of removing the <comment /> tag, a simple preg_replace() or a str_replace() will do:
$input = "<comment>Some comment here</comment>";
// Probably the best method str_replace()
echo str_replace(array("<comment>","</comment>"), "", $input);
// Some comment here
// Or by regular expression...
echo preg_replace("/<\/?comment>/", "", $input);
// Some comment here
Or if there are other tags in there and you want to strip out all but a few, use strip_tags() with its optional second parameter to specify allowable tags.
echo strip_tags($input, "<a><p><other_allowed_tag>");
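Note the difference between removing the element and removing only its tags, since the answers in this thread do one or the other. A short side-by-side sketch, reusing the sample string from the DOMDocument answer; the variable names are just for illustration.

```php
<?php
$str = 'Test 123 <comment>Some comment here</comment> abc 456';

// Removes the element together with its content; the "s" flag lets "." span newlines.
$withoutElement = preg_replace('~<comment>.*?</comment>~s', '', $str);

// Removes only the tags and keeps the text between them.
$withoutTags = preg_replace('~</?comment>~', '', $str);

echo $withoutElement, "\n";
echo $withoutTags, "\n";
```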

I need to split text delimited by paragraph tag

$text = "<p>this is the first paragraph</p><p>this is the first paragraph</p>";
I need to split the above into an array delimited by the paragraph tags. That is, I need to split the above into an array with two elements:
array ([0] = "this is the first paragraph", [1] = "this is the first paragraph")
Remove the closing </p> tags as we don't need them and then explode the string into an array on the opening <p> tags.
$text = "<p>this is the first paragraph</p><p>this is the first paragraph</p>";
$text = str_replace('</p>', '', $text);
$array = explode('<p>', $text);
To see the code run please see the following codepad entry. As you can see this code will leave you with an empty array entry at index 0. If this is a problem then it can easily be removed by calling array_shift($array) before using the array.
For anyone else who finds this, don't forget that a P tag may have styles, id's or any other possible attributes so you should probably look at something like this:
$ps = preg_split('#<p([^>])*>#',$input);
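If the goal is the paragraph contents rather than the pieces between opening tags, preg_match_all with the same attribute-tolerant opening tag may be simpler. A sketch, assuming paragraphs are not nested and attribute values contain no `>`; the second paragraph's text and the class attribute are changed here only to make the result easier to read.

```php
<?php
$text = '<p class="intro">this is the first paragraph</p><p>this is the second paragraph</p>';

// \b stops <p from matching <pre>; [^>]* allows any attributes;
// (.*?) lazily captures the content up to the nearest </p>.
preg_match_all('~<p\b[^>]*>(.*?)</p>~is', $text, $matches);
$paragraphs = $matches[1];

print_r($paragraphs);
```

Unlike the split, this leaves no empty first element to shift off.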
This is an old question, but I was not able to find any reasonable solution in an hour of looking through Stack Overflow answers. If you have a string full of HTML tags (p tags) and you want to get the paragraphs (or the first paragraph), use DOMDocument.
$long_description is a string that has <p> tags in it.
$long_descriptionDOM = new DOMDocument();
// This is how you use it with UTF-8
$long_descriptionDOM->loadHTML((mb_convert_encoding($long_description, 'HTML-ENTITIES', 'UTF-8')));
$paragraphs = $long_descriptionDOM->getElementsByTagName('p');
$first_paragraph = $paragraphs->item(0)->textContent; // textContent is a property, not a method
I guess that this is the right solution. No need for regex.
edit: YOU SHOULD NOT USE REGEX TO PARSE HTML.
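To get the array the question asks for, the same DOMDocument approach can collect every paragraph. A minimal sketch, assuming plain ASCII input (for UTF-8, the encoding step from this answer would still be needed); the second paragraph's text is changed only to make the result easier to read.

```php
<?php
$text = "<p>this is the first paragraph</p><p>this is the second paragraph</p>";

$dom = new DOMDocument();
libxml_use_internal_errors(true);   // keep parser warnings out of the output
$dom->loadHTML($text);
libxml_clear_errors();

$paragraphs = [];
foreach ($dom->getElementsByTagName('p') as $p) {
    $paragraphs[] = $p->textContent;
}

print_r($paragraphs);
```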
$text = "<p>this is the first paragraph</p><p>this is the first paragraph</p>";
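// Drop the closing tags first; index 0 is the empty string before the first <p>.
$exptext = explode("<p>", str_replace("</p>", "", $text));
echo $exptext[1];
echo "<br>";
echo $exptext[2];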
//////////////// OUTPUT /////////////////
this is the first paragraph
this is the first paragraph
Try this code:
<?php
$textArray = explode("<p>", $text);
for ($i = 0; $i < sizeof($textArray); $i++) {
$textArray[$i] = strip_tags($textArray[$i]);
}
If your input is somewhat consistent you can use a simple split method as:
$paragraphs = preg_split('~(</?p>\s*)+~', $text, -1, PREG_SPLIT_NO_EMPTY); // flags are the fourth argument; the third is the limit
Where the preg_split will look for combinations of <p> and </p> plus possible whitespace and separate the string there.
As an alternative (unnecessary here), you can also use querypath or phpquery to extract only complete paragraph contents using:
foreach (htmlqp($text)->find("p") as $p) { print $p->text(); }
Try the following:
<?php
$text = "<p>this is the first paragraph</p><p>this is the first paragraph</p>";
$array = [];
preg_replace_callback("`<p>(.+)</p>`isU", function ($matches) {
global $array;
$array[] = $matches[1];
}, $text);
var_dump($array);
?>
This can be modified by putting the array in a class that manages it with an add-value method and a getter.
Try this.
<?php
$text = "<p>this is the first paragraph</p><p>this is the first paragraph</p>";
$array = json_decode(json_encode((array) simplexml_load_string('<data>'.$text.'</data>')),1);
print_r($array['p']);
?>
