Pretty simple I'm sure, but...
I've got a file that is guaranteed to have only an <h1>some text</h1> and a <p>some more text</p> in it.
How would I go about returning these two elements as separate variables?
If your file is an HTML one, the general solution would be to :
Load it to a DOMDocument, with
DOMDocument::loadHTML if you have your HTML content as a string
or DOMDocument::loadHTMLFile
Use DOM methods to access your nodes
Here, DOMDocument::getElementsByTagName should be perfect
And then, once you have your node, the work's done ;-)
Note: if your HTML elements contain sub-elements, and you want the whole content, including sub-tags, as a string, take a look at, for example, this user note
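The steps above can be sketched roughly as follows. This is a minimal illustration, not the answerer's own code: the helper name extractHeadingAndParagraph is hypothetical, and it assumes the markup is already in a string and that only the first <h1> and first <p> matter.

```php
<?php
// Hypothetical helper: returns the text of the first <h1> and the first <p> in $html.
function extractHeadingAndParagraph(string $html): array
{
    $dom = new DOMDocument();

    // Collect libxml warnings internally instead of printing them
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // item(0) returns null when no such element exists
    $h1 = $dom->getElementsByTagName('h1')->item(0);
    $p  = $dom->getElementsByTagName('p')->item(0);

    return [
        $h1 !== null ? $h1->textContent : null,
        $p !== null ? $p->textContent : null,
    ];
}

[$h1, $p] = extractHeadingAndParagraph('<h1>some text</h1><p>some more text</p>');
echo $h1, "\n", $p, "\n";
```

For a file on disk, DOMDocument::loadHTMLFile can be swapped in for loadHTML.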
Your file is just text, so you're going to have to parse it. Generally HTML isn't all that suitable for parsing with normal operations, but if you know the exact contents you shouldn't have a problem.
Depending on what your separator is between the two tag blocks (let's pretend it's a \n), you could do something like this:
$contents = file_get_contents("yourfile.html");
list($h1,$p) = explode("\n",$contents);
That would give you the two text blocks in $h1 and $p. You could parse the rest from there if you needed to do more work.
You can use something like this:
function strBetween($au, $au2, $text) { // returns the substring of $text between $au and $au2
    $pau = strpos($text, $au);
    if ($pau === false) {
        return ''; // start marker not found
    }
    $start = $pau + strlen($au);
    if ($au2 === '') {
        return substr($text, $start); // no end marker given: return the rest of the text
    }
    $pau2 = strpos($text, $au2, $start); // search for the end marker after the start marker
    return $pau2 === false ? '' : substr($text, $start, $pau2 - $start);
}
$contents = file_get_contents("yourfile.html");
$h1 = strBetween('<h1>', '</h1>', $contents);
$p = strBetween('<p>', '</p>', $contents);
Related
I am trying to redo some forms that have uppercase field names and spaces, there are hundreds of fields and 50 + forms... I decided to try to write a PHP script that parses through the HTML of the form.
So now I have a textarea that I will post the html into and I want to change all the field names from
name="Here is a form field name"
to
name="here_is_a_form_field_name"
How could I, in one command, parse through and change it so that every name attribute is lowercase with spaces replaced by underscores?
I am assuming preg_replace with an expression?
Thanks!
I would suggest not using regex for manipulation of HTML .. I would use DOMDocument instead, something like the following
$dom = new DOMDocument();
$dom->loadHTMLFile('filename.html');
// loop each textarea; repeat for 'input' and 'select' to cover the other form fields
foreach ($dom->getElementsByTagName('textarea') as $item) {
// setup new values ie lowercase and replacing space with underscore
$newval = $item->getAttribute('name');
$newval = str_replace(' ','_',$newval);
$newval = strtolower($newval);
// change attribute
$item->setAttribute('name', $newval);
}
// save the document; saveHTML() returns the markup as a string
$html = $dom->saveHTML();
An alternative would be to use something like Simple HTML DOM Parser for the job - there are some good examples on the linked site
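Since the question is about every form field rather than one tag, here is a sketch of the same DOMDocument idea using DOMXPath to select all common form controls that carry a name attribute. The function name normaliseFieldNames is hypothetical, and the list of element types in the XPath expression is an assumption about which fields matter.

```php
<?php
// Hypothetical helper: lowercases name attributes and replaces spaces with underscores.
function normaliseFieldNames(string $html): string
{
    $dom = new DOMDocument();

    // Collect libxml warnings internally instead of printing them
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    // Assumption: these four element types cover the form fields of interest
    $query = '//input[@name] | //select[@name] | //textarea[@name] | //button[@name]';

    foreach ($xpath->query($query) as $element) {
        $name = $element->getAttribute('name');
        $element->setAttribute('name', str_replace(' ', '_', strtolower($name)));
    }

    // Note: the result is a full document, including the doctype and <html>/<body> wrappers
    return $dom->saveHTML();
}

$out = normaliseFieldNames(
    '<form><input name="Here is a form field name"><select name="Another Field"></select></form>'
);
echo $out;
```

Because saveHTML() re-serialises the whole document, the output may differ from the input in whitespace and wrapper tags, so it is worth diffing the result against the original form.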
I agree that preg_replace() or rather preg_replace_callback() is the right tool for the job, here's an example of how to use it for your task:
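$file_contents = preg_replace_callback('/ name="([^"]*)"/', function ($matches) {
    // $matches[1] is the attribute value only, so the space before name= is left untouched
    return ' name="' . str_replace(' ', '_', strtolower($matches[1])) . '"';
}, $file_contents);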
You should, however, check the results afterwards using a diff tool and fine-tune the pattern if necessary.
The reason why I would recommend against a DOM parser is that they usually choke on invalid HTML or files that contain for example tags for templating engines.
This is your solution:
<?php
$nameStr = "Here is a form field name";
// str_replace() already replaces every occurrence, so no loop is needed
$nameStr = str_replace(' ', '_', strtolower($nameStr));
echo $nameStr; // here_is_a_form_field_name
?>
I'm pretty new to PHP.
I have the text of a body tag of some page in a string variable.
I'd like to know if it contains some tag <tag1>...</tag1>, where the tag name tag1 is given, and if so, take only that tag from the string.
How can I do that simply in PHP?
Thanks!!
You would be looking at something like this:
<?php
$content = "";
$doc = new DOMDocument();
$doc->loadHTMLFile("example.html"); // load() expects XML; use loadHTML($string) if the markup is already in a variable
$items = $doc->getElementsByTagName('tag1');
if ($items->length > 0) //Only if tag1 items are found
{
foreach ($items as $tag1)
{
// Do something with $tag1->nodeValue and save your modifications
$content .= $tag1->nodeValue;
}
}
else
{
$content = $doc->saveHTML();
}
echo $content;
?>
DOMDocument represents an entire HTML or XML document and serves as the root of the document tree. So you will be working with valid markup, and getElementsByTagName won't pick up tags that only appear inside comments.
Another possibility is regex.
$matches = null;
$returnValue = preg_match_all('#<li.*?>(.*?)</li>#', $html, $matches); // $html holds the markup to search
$matches[0][x] contains the whole matches, such as <li class="small">list entry</li>; $matches[1][x] contains only the inner HTML, such as list entry.
Fast way:
Look for the index position of <tag1>, then look for the index position of </tag1>. Then cut the string between those two indexes. Look up strpos and substr on php.net
Also this might not work if your string is too long.
$pos1 = strpos($bigString, '<tag1>');
$pos2 = strpos($bigString, '</tag1>');
// substr() takes a start offset and a length, not an end position
$resultingString = substr($bigString, $pos1, $pos2 - $pos1 + strlen('</tag1>'));
This returns the element including its tags, and assumes both strpos calls succeeded (check them against false first). To get only the inner text, start at $pos1 + strlen('<tag1>') and shorten the length accordingly.
(if you don't have comments with tag1 inside of them sigh)
The right way:
Look up html parsers
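As one possible reading of "the right way", here is a sketch using PHP's bundled DOMDocument. The function name extractFirstTag is hypothetical; it assumes the body markup is in a string and that only the first matching element is wanted, markup included.

```php
<?php
// Hypothetical helper: returns the first <$tagName> element of $body as a string, or null.
function extractFirstTag(string $body, string $tagName): ?string
{
    $dom = new DOMDocument();

    // Collect libxml warnings internally instead of printing them
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($body);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $node = $dom->getElementsByTagName($tagName)->item(0);

    // saveHTML($node) serialises just that node and its descendants (PHP >= 5.3.6)
    return $node !== null ? $dom->saveHTML($node) : null;
}

$found = extractFirstTag('<p>intro</p><div class="x">hello <b>world</b></div>', 'div');
var_dump($found);
```

Unlike the strpos approach, this ignores occurrences of the tag name inside comments and copes with attributes on the opening tag.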
$regex = '#<p.+</p>#s';
My objective is to return the large string that occurs between the first paragraph tag, and the last paragraph tag. This is to include everything, even other paragraphs.
My regex above works for everything EXCEPT the paragraph tags. I tested it replacing the 'p' with 'html' and returned success, replaced with 'script' and returned success... Why would this return true for those cases but not for the paragraph?
I am still working on this, and relatively convinced that there is no strange escape sequence that is causing the regex to stop... I think this because I can extract everything between the first and last 'html' tag. The text between the 'html' tags also contains all of the 'p' tags that I am failing to extract. If there were some kind of escape or error, I think it would also throw the same error when extracting for the 'html' tags. I have tried preg_quote() with no success.
Perhaps I need to set memory devoted to regex processing higher so that it can process the whole document?
Update: In most cases the leading 'p' tag will NOT belong to the same paragraph as the ending '/p' tag.
Update: The returned results will be something akin to:
<p>this is the first tag</p>this is a bunch of text from the document, could be all manner of tags <p>this is the last paragraph tag</p>
Update: Code example
$htmlArticle = <<< 'ENDOFHTML'
Insert data from pastebin here
http://pastebin.com/4A3FYGc8
ENDOFHTML;
$pattern = '#<html.+/html>#s'; // Works fine, returns all characters between first <html and last /html
$pattern = '#<script.+/script>#s'; // Works fine, same as above
$pattern = '#<p.+/p>#s'; // Returns nothing, nothing at all. :'(
preg_match($pattern, $htmlArticle, $matches);
var_dump($matches);
?>
Solution:
ini_set('pcre.backtrack_limit', '1000000');
I had exhausted my backtrack limit. This is a setting in your php.ini file, and can be set in code with ini_set(). Curiously, I set the value with ini_set() to match that in my php.ini file... So it should have worked from the start. --- Thanks coming as soon as I can post a solution.
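To tell a genuine non-match from a PCRE failure such as an exhausted backtrack limit, the return value of preg_match() can be compared against false and preg_last_error() consulted. The sketch below is an illustration only; the function name matchOrExplain is hypothetical.

```php
<?php
// Hypothetical helper: returns the matched text, null when there is no match,
// and throws when PCRE itself failed (for example on an exhausted backtrack limit).
function matchOrExplain(string $pattern, string $subject): ?string
{
    $result = preg_match($pattern, $subject, $matches);

    if ($result === false) {
        // preg_last_error() reports why the last PCRE call gave up
        $code = preg_last_error();
        if ($code === PREG_BACKTRACK_LIMIT_ERROR) {
            throw new RuntimeException('pcre.backtrack_limit exhausted');
        }
        throw new RuntimeException('PCRE error code ' . $code);
    }

    return $result === 1 ? $matches[0] : null;
}

// The greedy .+ runs from the first <p to the last </p>
$span = matchOrExplain('#<p.+</p>#s', '<p>a</p>x<p>b</p>');
var_dump($span);
```

If the exception fires on a large document, raising pcre.backtrack_limit with ini_set() before the call, as in the solution above, is one way forward.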
That is very curious. It's not returning an error, and using a shorter document seems to return a match. I can't understand why this would happen. I've used regexes on enormous documents without trouble.
Note that this produces a match: #<p\b.+<\#s
Perhaps try playing with the backtrack limit, since there are many </p> matches. However if the limit were too low I would expect preg_match to return False, not 0!
As a workaround, try this instead:
function extractBetweenPs($data) {
$startoffset = null;
$endoffset = null;
if (preg_match('/<p\b/', $data, $matches, PREG_OFFSET_CAPTURE)) {
$startoffset = $matches[0][1];
$needle = '</p>';
$endoffset = strrpos($data, $needle);
if ($endoffset !== FALSE) {
$endoffset += strlen($needle);
} else {
// this will return everything from '<p' to the end of the doc
// if there is no '</p>'
// maybe not what you want?
$endoffset = strlen($data);
}
return substr($data, $startoffset, $endoffset-$startoffset);
}
return '';
}
That said, this is a very strange requirement--treating an arbitrary section of a structured document as a blob. Maybe you could step back and say what your broader goal is and we can suggest another approach?
Regex is not a tool that can be used to correctly parse HTML.
All you need is DOMDocument
$dom = new DOMDocument();
$dom->loadHTML($your_html);
$node = $dom->getElementsByTagName('p')->item(0);
$dom2 = new DOMDocument();
$node = $dom2->importNode($node, true);
$dom2->appendChild($node);
echo $dom2->saveHTML();
I have code with several lines like this
<p> <inset></p>
Where there may be any number of spaces or tabs (or none) between the opening <p> tag and the rest of the string. I need to replace these, but I can't get it to work.
I thought this would do it, but it doesn't work:
<p>[ \t]+<inset></p>
Try this:
$html = preg_replace('#(<p>)\s+(<inset></p>)#', '$1$2', $html);
If you want true text-trimming for HTML, including everything you can encounter such as entities, comments, child elements and all that stuff, you can make use of a TextRangeTrimmer and TextRange:
$htmlFragment = '<p> <inset></p>';
$dom = new DOMDocument();
$dom->loadHTML($htmlFragment);
$parent = $dom->getElementsByTagName('body')->item(0);
if (!$parent)
{
throw new Exception('Parent element not found.');
}
$range = new TextRange($parent);
$trimmer = new TextRangeTrimmer($range);
$trimmer->ltrim();
// inner HTML (PHP >= 5.3.6)
foreach($parent->childNodes as $node)
{
echo $dom->saveHTML($node);
}
Output:
<p><inset></p>
I've both classes in a gist: https://gist.github.com/1894360/ (codepad viper is down).
See as well the related questions / answers:
Wordwrap / Cut Text in HTML string
Ignore html tags in preg_replace
Try to load your HTML string into a DOM tree instead, and then trim all the text values in the tree.
http://php.net/domdocument.loadhtml
http://php.net/trim
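A rough sketch of that suggestion, assuming every text node should be trimmed on both sides (which also removes the spaces between words that sit in adjacent nodes, so it may be too aggressive for some documents). The helper name trimTextNodes is hypothetical.

```php
<?php
// Hypothetical helper: trims every text node in an HTML fragment.
function trimTextNodes(string $fragment): string
{
    $dom = new DOMDocument();

    // Collect libxml warnings internally instead of printing them
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($fragment);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Select every text node in the tree and trim its value in place
    $xpath = new DOMXPath($dom);
    foreach ($xpath->query('//text()') as $textNode) {
        $textNode->nodeValue = trim($textNode->nodeValue);
    }

    // loadHTML() wraps the fragment in <html><body>; emit only the body's children
    $body = $dom->getElementsByTagName('body')->item(0);
    $out = '';
    if ($body !== null) {
        foreach ($body->childNodes as $child) {
            $out .= $dom->saveHTML($child);
        }
    }
    return $out;
}

echo trimTextNodes('<p>  hello  <b> world </b></p>'), "\n";
```

Using ltrim() or rtrim() instead of trim() restricts the change to one side of each text node.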
OK, I have to parse out a SOAP request, and in the request some of the values are passed with (or inside) an anchor tag. I'm looking for a RegEx (or alternative method) to strip the tag and just return the value.
// But item needs to be a RegEx of some sort, it's a field right now
if($sObject->list == 'item') {
// Split on > this should be the end of the right side of the anchor tag
$pieces = explode(">", $sObject->fields->$field);
// Split on < this should be the closing anchor tag
$piece = explode("<", $pieces[1]);
$fields_string .= $piece[0] . "\n";
}
item is a field name but I would like to make this a RegEx to check for the Anchor tag instead of a specific field.
PHP has a strip_tags() function.
Alternatively you can use filter_var() with FILTER_SANITIZE_STRING (note that this filter is deprecated as of PHP 8.1).
Whatever you do don't parse HTML/XML with regular expressions. It's really error-prone and flaky. PHP has at least 3 different parsers as standard (SimpleXML, DOMDocument and XMLReader spring to mind).
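For the simple case in the question, strip_tags() is enough. A minimal sketch, where the sample field value is made up for illustration:

```php
<?php
// Made-up sample value: a field whose content is wrapped in an anchor tag
$field = '<a href="http://example.com/item/42">Widget 42</a>';

// strip_tags() removes the markup and keeps the text between the tags
$value = trim(strip_tags($field));

echo $value, "\n";
```

This works on any field value, with or without an anchor, so no check for a specific field name is needed.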
I agree with cletus, using RegEx on HTML is bad practice because of how loose HTML is as a language (and I moan about PHP being too loose...). There are just so many ways you can vary a tag that unless you know that the document is standards-compliant / strict, it is sometimes just impossible to do. However, because I like a challenge that distracts me from work, here's how you might do it in RegEx!
I'll split this up into sections, no point if all you see is a string and say, "Meh... It'll do..."! First we have the main RegEx for an anchor tag:
'#<a></a>#'
Then we add in the text that could be between the tags.
We want to group this in parentheses, so we can extract the string, and the question mark makes the asterisk wildcard "un-greedy", meaning that the first </a> that it comes across will be the one it uses to end the RegEx.
'#<a>(.*?)</a>#'
Next we add in the RegEx for href="". We match the href=" as plain text, then an any-length string that does not contain a quotation mark, then the ending quotation mark.
'#<a href\="([^"]*)">(.*?)</a>#'
Now we just need to say that the tag is allowed other attributes. According to the specification, an attribute name must match the following: [a-zA-Z_\:][a-zA-Z0-9_\:\.-]*.
Allowing an attribute multiple times, each with a value, we get: ( [a-zA-Z_\:][a-zA-Z0-9_\:\.-]*\="[^"]*")*.
The resulting RegEx (PCRE) is as following:
'#<a( [a-zA-Z_\:][a-zA-Z0-9_\:\.-]*\="[^"]*")* href\="([^"]*)"( [a-zA-Z_\:][a-zA-Z0-9_\:\.-]*\="[^"]*")*>(.*?)</a>#'
Now, in PHP, use the preg_match_all() function to grab all occurrences in the string. Pass PREG_SET_ORDER so that each element of $result is one complete match with its groups.
$regex = '#<a( [a-zA-Z_\:][a-zA-Z0-9_\:\.-]*\="[^"]*")* href\="([^"]*)"( [a-zA-Z_\:][a-zA-Z0-9_\:\.-]*\="[^"]*")*>(.*?)</a>#';
preg_match_all($regex, $str_containing_anchors, $result, PREG_SET_ORDER);
foreach($result as $link)
{
$href = $link[2];
$text = $link[4];
}
use simplexml and xpath to retrieve the desired nodes
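A sketch of what that could look like. The XML structure below is made up for illustration, since the actual SOAP payload is not shown in the question, and the function name anchorValues is hypothetical. It also assumes the anchors arrive as real child elements rather than as escaped text.

```php
<?php
// Hypothetical helper: returns the text of every <a> element found in $xml.
function anchorValues(string $xml): array
{
    $root = simplexml_load_string($xml);
    if ($root === false) {
        return []; // not well-formed XML
    }

    $values = [];
    // xpath() returns an array of SimpleXMLElement objects
    foreach ($root->xpath('//a') ?: [] as $anchor) {
        $values[] = (string) $anchor; // casting yields the element's text
    }
    return $values;
}

// Made-up payload for illustration
$xml = '<fields><item><a href="http://example.com/item/42">Widget 42</a></item></fields>';
print_r(anchorValues($xml));
```

The href is available as (string) $anchor['href'] if the link target is needed as well.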
If you don't have some kind of request<->class mapping you can extract the information with the DOM extension. The property textContent contains all the text of the context node and its descendants.
$sr = '<?xml version="1.0"?>
<SOAP:Envelope xmlns:SOAP="urn:schemas-xmlsoap-org:soap.v1">
<SOAP:Body>
<foo:bar xmlns:foo="urn:yaddayadda">
<fragment>
Mary had a
little lamb
</fragment>
</foo:bar>
</SOAP:Body>
</SOAP:Envelope>';
$doc = new DOMDocument;
$doc->loadXML($sr);
$xpath = new DOMXPath($doc);
$ns = $xpath->query('//fragment');
if ( 0 < $ns->length ) {
echo $ns->item(0)->nodeValue;
}
prints
Mary had a
little lamb
If you want to strip or extract properties from only specific tag, you should try DOMDocument.
Something like this:
$TagWhiteList = array(
// Example of WhiteList
'b', 'i', 'u', 'strong', 'em', 'a', 'img'
);
function getTextFromNode($Node, $Text = "") {
    // Not an element, so it is text
    if (!($Node instanceof DOMElement))
        return $Text.$Node->textContent;
    // You may select a tag here
    // Like:
    // if (in_array($Node->tagName, $TagWhiteList))
    //     DoSomethingWithIt($Text, $Node);
    // Recurse into each child in document order
    for ($Child = $Node->firstChild; $Child != null; $Child = $Child->nextSibling)
        $Text = getTextFromNode($Child, $Text);
    return $Text;
}
function getTextFromDocument($DOMDoc) {
return getTextFromNode($DOMDoc->documentElement);
}
To use:
$Doc = new DOMDocument();
$Doc->loadHTMLFile("Test.html");
$Text = getTextFromDocument($Doc);
echo "Text from HTML: ".$Text."\n";
The above function shows how to strip tags, but you can modify it a bit to manipulate the elements. For example, if the tag is 'a' (an anchor), you can extract its target and display it instead of the text inside.
Hope this helps.
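The anchor variation mentioned above might look something like this. It is a sketch only: the function name textWithLinkTargets is hypothetical, and it assumes the link target lives in the href attribute.

```php
<?php
// Hypothetical variation: collects the document text, but shows each link's
// href in place of the text inside the anchor.
function textWithLinkTargets(DOMNode $node): string
{
    // Anything that is not an element (text, comments) contributes its own content
    if (!($node instanceof DOMElement)) {
        return $node->textContent;
    }

    // For anchors with a target, emit the target instead of the inner text
    if ($node->tagName === 'a' && $node->hasAttribute('href')) {
        return $node->getAttribute('href');
    }

    // Otherwise recurse into the children in document order
    $text = '';
    foreach ($node->childNodes as $child) {
        $text .= textWithLinkTargets($child);
    }
    return $text;
}

$doc = new DOMDocument();
$doc->loadHTML('<p>See <a href="http://example.com/">this page</a> now</p>');
$linkText = textWithLinkTargets($doc->documentElement);
echo $linkText, "\n";
```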