I have the following HTML:
<!-- START: .paragraph-content -->
<div class="paragraph-content">
<div class="container"><div class="row"><div class="col-sm-10">
<!-- START: .paragraph-columns -->
<div class="paragraph-columns">
<div class="field-wysiwyg">
<div data-quickedit-field-id="paragraph/167/field_mt_body/en/default" class="field field--name-field-mt-body field--type-text-long field--label-hidden field__items">
<div class="field__item">
<h2> </h2>
<h2> </h2>
<h2>INNOVATION.</h2>
<p> </p>
<p> </p>
<p> </p>
<p> </p>
</div>
</div>
</div>
</div>
<!-- END: .paragraph-columns -->
</div></div></div>
</div>
<!-- END: .paragraph-content -->
I want to capture the block that begins with <div class="paragraph-content">.
Within that block, I want to change the <h2> that contains text to <h1>,
so that the end result looks like this:
<!-- START: .paragraph-content -->
<div class="paragraph-content">
<div class="container"><div class="row"><div class="col-sm-10">
<!-- START: .paragraph-columns -->
<div class="paragraph-columns">
<div class="field-wysiwyg">
<div data-quickedit-field-id="paragraph/167/field_mt_body/en/default" class="field field--name-field-mt-body field--type-text-long field--label-hidden field__items">
<div class="field__item">
<h2> </h2>
<h2> </h2>
<h1>INNOVATION.</h1>
<p> </p>
<p> </p>
<p> </p>
<p> </p>
</div>
</div>
</div>
</div>
<!-- END: .paragraph-columns -->
</div></div></div>
</div>
<!-- END: .paragraph-content -->
I have tried it with this regex pattern but nothing works:
'/(?:<h2((?!\s").*?)?>)(.*?)(?:<\/h2>)/si'
If you have the HTML page as a string variable, accomplished by:
$fileStr = file_get_contents('HTML_FILE.htm');
You can then find the start of the section you are after by using the text "<!-- START: .paragraph-content -->" and the end of the section of the string by using the text "<!-- END: .paragraph-content -->".
Having the start and end of the section, we can extract the portion of $fileStr against which we want to run our regular expression.
The regular expression required to find the string you want to change is:
<h2>.{2,}<\/h2>
The issue is that you have to extract and replace the <h2> and </h2> with <h1> and </h1> whilst retaining everything in between them.
That isn't going to be a simple, neat solution. I would write a loop that looks for <h2>, checks whether there are any alphanumerics between it and the closing </h2>, and if so extracts the contents between the two and replaces the tags appropriately.
Whilst not providing you with code to cut and paste, I hope I've given you something to ponder.
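As a rough illustration of the steps described in this answer, here is a minimal sketch. It assumes the two comment markers appear exactly once and in order, and it uses a capture group rather than an explicit loop, which is a substitution on my part. The inline $fileStr is a stand-in for the real file, and the pattern only handles <h2> tags without attributes or nested markup.

```php
<?php
// Stand-in for file_get_contents('HTML_FILE.htm') so the sketch runs on its own.
$fileStr = <<<'HTML'
<h2>Outside the section</h2>
<!-- START: .paragraph-content -->
<div class="paragraph-content">
<h2> </h2>
<h2>INNOVATION.</h2>
</div>
<!-- END: .paragraph-content -->
HTML;

$startMarker = '<!-- START: .paragraph-content -->';
$endMarker   = '<!-- END: .paragraph-content -->';

// Locate the section boundaries.
$from = strpos($fileStr, $startMarker);
$to   = $from === false ? false : strpos($fileStr, $endMarker, $from);

if ($from !== false && $to !== false) {
    $length  = $to + strlen($endMarker) - $from;
    $section = substr($fileStr, $from, $length);

    // Only rewrite <h2> elements whose text contains at least one alphanumeric.
    $section = preg_replace(
        '~<h2>([^<]*[A-Za-z0-9][^<]*)</h2>~',
        '<h1>$1</h1>',
        $section
    );

    // Put the modified section back into the full document.
    $fileStr = substr_replace($fileStr, $section, $from, $length);
}

echo $fileStr, "\n";
```

Headings outside the markers and the empty <h2> elements inside them are left untouched.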
Regex works as a finite state machine; it has no way to parse recursive structures such as XML tags that might contain other XML tags.
Basically, you can't match exactly the closing tag that corresponds to a given opening tag, because that requires recursion, which is not possible in a finite state machine (the Python module regex and some other implementations do support recursion, but those are not true regular expressions).
For your exact problem you need a full top-down recursive parser or a tool that works with XML/HTML specifically.
Just replacing the h2 tags with h1 in the whole regex'ed string is as simple as <(/?)h2> -> <$1h1>, though.
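For the "tool that works with XML/HTML specifically" route, a minimal sketch using PHP's DOM extension might look like the following. The sample markup is a shortened stand-in for the question's HTML, and the rule "only convert headings that contain a letter or digit" is my reading of the desired output, not something the asker stated.

```php
<?php
$html = <<<'HTML'
<div class="paragraph-content">
<div class="field__item">
<h2> </h2>
<h2>INNOVATION.</h2>
<p> </p>
</div>
</div>
<h2>Outside</h2>
HTML;

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // real-world HTML often triggers parser warnings
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
// All <h2> elements inside a div whose class list includes "paragraph-content".
$query = '//div[contains(concat(" ", normalize-space(@class), " "), " paragraph-content ")]//h2';

// Copy the node list to an array because the tree is modified while iterating.
foreach (iterator_to_array($xpath->query($query)) as $h2) {
    // Skip headings that hold no letters or digits (empty or whitespace only).
    if (!preg_match('/[\p{L}\p{N}]/u', $h2->textContent)) {
        continue;
    }
    $h1 = $doc->createElement('h1');
    while ($h2->firstChild) {
        $h1->appendChild($h2->firstChild);   // move the children across
    }
    $h2->parentNode->replaceChild($h1, $h2);
}

$result = $doc->saveHTML();
echo $result;
```

Note that saveHTML() on the whole document adds a doctype and html/body wrappers around the fragment.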
Related
I'm writing some PHP that will scrape a webpage and return a very small value from it when it exists.
The HTML that I will receive sometimes looks like:
<!-- message -->
<div id="post_message_5400147">
<!-- BEGIN TEMPLATE: ad_showthread_firstpost_start -->
<!-- END TEMPLATE: ad_showthread_firstpost_start -->
ss:<font size="5"><b><font size="5"><font size="5"> U71</font></font></b></font>
</div>
<!-- / message -->
Sometimes it will look like:
<!-- message -->
<div id="post_message_5400147">
ss:<font size="5"><b><font size="5"><font size="5"> U71</font></font></b></font>
</div>
<!-- / message -->
And sometimes it will look like:
<div id="post_message_5400752">
Bonus code: SKATE
</div>
<!-- / message -->
The differences are the '<!-- BEGIN TEMPLATE....' comments in the first example and the "Bonus code: ..." text in the third.
What I want the regex to do is return only the '<!-- message' blocks where the text is "ss:[...]" (sometimes it can also be "ss=[...]"). Ideally it would strip out all of the extraneous HTML and just return the 3-character seat ("U71" in the example; always in the form LETTERnumnum), but I don't care too much about that as I can always strip_tags() it out later.
So far, this is what I've been able to figure out (I'm very new to regex), but it doesn't ignore the "Bonus code:[...]" entries:
preg_match('/.*<!-- message -->\s*<div id="post_message_[0-9]{7}">\s*(.*?)<!-- \/ message -->/s', $html, $matches);
Can anyone tell me how to do this more elegantly since obviously I'm not doing it right?
You could use something simple like this:
$p = '/> (\w{1}\d{2})</';
example:
$s = <<<EOT
<!-- message -->
<div id="post_message_5400147">
<!-- BEGIN TEMPLATE: ad_showthread_firstpost_start -->
<!-- END TEMPLATE: ad_showthread_firstpost_start -->
ss:<font size="5"><b><font size="5"><font size="5"> U71</font></font></b></font>
</div>
<!-- / message -->
EOT;
$p = '/> (\w{1}\d{2})</';
preg_match($p,$s,$m);
var_dump($m[1]);
output:
string(3) "U71"
To find all matches and get the last one you can do something like:
$s = "<test> U71</test>some junk here <span> Z23</span>";
$p = '/> (\w{1}\d{2})</';
preg_match_all($p,$s,$m);
var_dump(end($m[1]));
output:
string(3) "Z23"
The end() function advances the array's internal pointer to the last element and returns its value.
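If the match should be limited to posts that actually start with "ss:" or "ss=", one possible variation is to strip the tags first and then match the prefix together with the seat. This is only a sketch: the function name extractSeat is hypothetical, and it assumes the seat is always one ASCII letter followed by two digits, as described in the question.

```php
<?php
/**
 * Hypothetical helper: returns the seat (e.g. "U71") from a post,
 * or null when the post does not contain an "ss:" / "ss=" entry.
 */
function extractSeat(string $html): ?string
{
    // strip_tags() removes the nested <font>/<b> tags and the HTML comments.
    $text = strip_tags($html);

    // "ss", then ":" or "=", optional whitespace, one letter and two digits.
    if (preg_match('/\bss\s*[:=]\s*([A-Z]\d{2})\b/i', $text, $m)) {
        return strtoupper($m[1]);
    }
    return null;
}

var_dump(extractSeat('<div>ss:<font size="5"><b> U71</b></font></div>'));
var_dump(extractSeat('<div>Bonus code: SKATE</div>'));
```

Unlike \w, the class [A-Z] does not accept digits or underscores in the first position.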
I need a way to wrap each run of consecutive <p> tags, up to the last closing </p> tag before a different tag appears. So every <p> tag that has either no tag before it, or a tag other than <p> before it, starts a match. Every closing </p> that is followed by a tag that is not a paragraph marks the end of that match.
I tried using this:
$content = preg_replace( "/(<(p|ul)>[\s\S]*?(?=<h\d.*?>|<\/ul>))/Si", '<div class="content-block">$0</div>', $content );
but that only works if the paragraph tag is between header tags. I need something more flexible. Here is an example of what I mean (sorry if this is rough, not sure how to visually portray what I need):
<div class="wrapper">
<p></p>
<p></p>
</div>
<h2>Information</h2>
<div class="wrapper">
<p></p>
<p></p>
<p></p>
</div>
<h2>Another Header Here</h2>
<div class="wrapper">
<p></p>
<p></p>
</div>
<h3>Header Three</h3>
<div class="wrapper">
<p></p>
<ul>List Item</ul>
<p></p>
</div>
Figured it out myself. I needed a more generic selector, to select everything except specific tags. Here is what I came up with:
$content = preg_replace( "/((<p|<ul|<ol|<blockquote)(?:(?!\n<h|\n<table).)*)/sm", '<div class="content-block">$0</div>', $content );
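To see what this pattern does, here is a small check on made-up input (the sample $content is an assumption, not the asker's real markup). Two caveats worth keeping in mind: the pattern only ends a block at a heading or table that starts on a new line, and "<p" also matches tags such as <pre>.

```php
<?php
// Made-up sample: two paragraphs, a heading, then one more paragraph.
$content = "<p>one</p>\n<p>two</p>\n<h2>Title</h2>\n<p>three</p>";

// Same pattern as in the answer above.
$pattern = "/((<p|<ul|<ol|<blockquote)(?:(?!\n<h|\n<table).)*)/sm";

$wrapped = preg_replace($pattern, '<div class="content-block">$0</div>', $content);

echo $wrapped, "\n";
```

Each run of block elements ends up in its own wrapper, and the heading stays outside.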
I have this type of code:
<div class="content">
<p></p>
<p></p>
<p></p>
</div>
<div class="content">
<p></p>
<p></p>
<p></p>
</div>
I wish to select all p elements from the first element with the class content.
I managed to select the first element with that class by using:
(//div[@class="content"])[1]
but with (//div[@class="content"])[1]/p it still shows the p elements of both divs.
Here's a working example using PHP's SimpleXML. I've made some small changes to the HTML code you provided so the output would be more meaningful.
Regarding the XPath expression you provided, I just removed the parentheses and it all worked as expected.
NOTE: Following @LarsH's comment, I reverted the XPath expression, as it was OK to begin with. I took the liberty of updating the example based on that comment.
<?php
$html = <<<HTML
<body>
<div class="content">
<p>1</p>
<p>2</p>
<p>3</p>
</div>
<div class="content">
<p>4</p>
<p>5</p>
<p>6</p>
</div>
<div>
<div class="content">
<p>7</p>
<p>8</p>
<p>9</p>
</div>
</div>
</body>
HTML;
$sxe = new SimpleXMLElement($html);
foreach ($sxe->xpath('(//div[@class="content"])[1]/p') as $p) {
echo "$p\n";
}
Output:
1
2
3
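The parentheses matter because a [1] predicate without them applies per parent rather than to the whole result set. A small sketch comparing the two forms on the same sample markup (the variable names $without and $with are just for illustration):

```php
<?php
$html = <<<'HTML'
<body>
<div class="content"><p>1</p><p>2</p><p>3</p></div>
<div class="content"><p>4</p><p>5</p><p>6</p></div>
<div>
<div class="content"><p>7</p><p>8</p><p>9</p></div>
</div>
</body>
HTML;

$sxe = new SimpleXMLElement($html);

// Convert each matched element to its text content.
$toText = function ($p) {
    return (string) $p;
};

// Without parentheses: the first matching div under EACH parent.
$without = array_map($toText, $sxe->xpath('//div[@class="content"][1]/p'));

// With parentheses: the first matching div in the whole document.
$with = array_map($toText, $sxe->xpath('(//div[@class="content"])[1]/p'));

echo implode(',', $without), "\n";
echo implode(',', $with), "\n";
```

The nested third div is the first matching child of its own parent, so the unparenthesized form picks it up as well.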
I'm having a bit of an issue with PHP that's contained inside HTML that's inside PHP.
I have a script running that uses an SQL query to obtain titles, stories, URLs, etc.
Then:
<?php
#Random code here to obtain SQL results
#$id = $row['id'];
#$story_Title["$id"] = $row['story_Title'];
#end of random code block for reference
echo '<!-- BEGIN content -->
<div id="content">
<div class="post">
<p class="details1">$date["{$id}"]/(echo $story_Title["{$id}"];)
The code:
$date["{$id}"]
is printed to the web page literally, as opposed to returning the requested value.
Is there any way to get around this, or is it not possible?
If it is not possible, what would be a better solution?
Concatenate the string with .:
<?php
echo '
<!-- BEGIN content -->
<div id="content">
<div class="post">
<p class="details1">'.$date[$row['id']].'/'.$story_Title[$row['id']].'</p>
</div>
</div>';
?>
Or you can break in and out of PHP like:
<!-- BEGIN content -->
<div id="content">
<div class="post">
<p class="details1">
<?php
echo $date[$row['id']].'/'.$story_Title[$row['id']];
?>
</p>
</div>
</div>
Read about String Operators:
There are two string operators. The first is the concatenation
operator ('.'), which returns the concatenation of its right and left
arguments. The second is the concatenating assignment operator ('.='),
which appends the argument on the right side to the argument on the
left side.
Anything wrong with this?
<?php
// some php code here...
?>
<!-- BEGIN content -->
<div id="content">
<div class="post">
<p class="details1">
<?php
echo $date[$row['id']] . '/' . $story_Title[$row['id']];
?>
</p>
</div>
</div>
PHP is an HTML 'templating' language as well as a programming language.
When you have sizable amounts of HTML to send, I suggest you start using PHP in that 'mode'. Never echo 'lots' of HTML in 'chunks'. Just drop out of PHP mode and 'switch' in and out of PHP mode as required. It is a lot easier...
You are also working rather hard when accessing stuff. All the PHP code in a script is linked together, so your '$row' variable is directly accessible from almost everywhere.
The '<?= ' tag is short for '<?php echo '; it is always available as of PHP 5.4, regardless of the short_open_tag setting.
If you want to use control statements such as 'if' or 'foreach', look at the alternative syntax for control structures.
Tested code:
<?php
#Random code here to obtain SQL results
$row = array('id' => 1, 'story_Title' => 'how to do stuff!', 'date' => '1960-04-01');
#end of random code block for reference
?>
<!-- BEGIN content -->
<html>
<body>
<div id="content">
<div class="post">
<p class="details1"><?= $row['date']?>/(<?= $row['story_Title']?>)</p>
</div>
</div>
</body>
</html>
Output from the above code:
1960-04-01/(how to do stuff!)
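As a sketch of how the alternative syntax for control structures combines with '<?= ', the template below loops over several rows. The $rows data and the renderPosts() wrapper are made up for illustration (the wrapper only exists so the output can be captured as a string), and htmlspecialchars() is an addition of mine to escape values before printing them into HTML.

```php
<?php
// Hypothetical helper: renders the posts and returns the HTML as a string.
function renderPosts(array $rows): string
{
    ob_start();   // capture everything printed in template mode below
    ?>
<div id="content">
<?php foreach ($rows as $row): ?>
  <div class="post">
    <p class="details1"><?= htmlspecialchars($row['date']) ?>/(<?= htmlspecialchars($row['story_Title']) ?>)</p>
  </div>
<?php endforeach; ?>
</div>
<?php
    return ob_get_clean();
}

// Made-up rows standing in for the SQL results.
$rows = [
    ['id' => 1, 'story_Title' => 'how to do stuff!', 'date' => '1960-04-01'],
    ['id' => 2, 'story_Title' => 'Tom & Jerry', 'date' => '1961-05-02'],
];

$out = renderPosts($rows);
echo $out;
```

In a plain template file you would normally skip the wrapper function and the output buffering and write the foreach block directly in the page.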
Variables are not interpolated inside single-quoted strings, so you need to close the string and concatenate the variable.
echo '<!-- BEGIN content -->
<div id="content">
<div class="post">
<p class="details1">' . $date["{$id}"] . '</p>
</div>
</div>';
See this for detailed explanation
// inside a loop run this
echo '<!-- BEGIN content -->
<div id="content">
<div class="post">
<p class="details1">',
$row['date'], $row['id'],
'<h1>', $row['story_Title'] , '</h1>',
'</p>
</div>
</div>';
I need to capture the whole div with class "1", but my match stops at the closing tag of the div with class "1.1". So from this:
<head>
</head>
<body>
<div class="1">
<p>blah blah blah</p>
<div class="1.1">
trolololol
</div>
<div class="1.2">
trolo2lolo
</div>
</div>
</body>
only this:
<div class="1">
<p>blah blah blah</p>
<div class="1.1">
trolololol
</div>
<div class="1.2">
trolo2lolo
</div>
</div>
but for now i get only:
<div class="1">
<p>blah blah blah</p>
<div class="1.1">
trolololol
</div>
Regular expressions are not intelligent enough to count how many tags have been opened and still need to be closed before stopping the match. The match stops at the first occurrence of </div>. Try to use a real HTML parser if you want to access tags as real tags and not as strings.
Regular expressions should not be used to parse documents like XML, HTML, "BBCode", JSON...
You should look for a real DOM parser, for example PHP's DOM extension.
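A minimal sketch of that approach with PHP's DOM extension, using the markup from the question. The XPath expression assumes the class attribute is exactly "1".

```php
<?php
$html = <<<'HTML'
<head>
</head>
<body>
<div class="1">
<p>blah blah blah</p>
<div class="1.1">
trolololol
</div>
<div class="1.2">
trolo2lolo
</div>
</div>
</body>
HTML;

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // suppress warnings on imperfect HTML
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$node  = $xpath->query('//div[@class="1"]')->item(0);

// saveHTML($node) serializes the element including all nested children.
$outer = $node !== null ? $doc->saveHTML($node) : null;

echo $outer, "\n";
```

Because the parser tracks nesting, the result includes both inner divs and ends at the closing tag that belongs to the outer div.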