Change one part of the selection with Regex - php

I would like to insert a class with regex and preg_replace
echo preg_replace("/<li\>\s*<p\>[a-z]\)\s/", "/<li class=\"inciso\"\>\s*<p\>[a-z]\)\s/", $documento);
This is the model of the lines in my document:
<li>
<p>a) long text</p>
</li>
<li>
<p>b) long text</p>
</li>
<li>
<p>c) long text</p>
</li>
New example: let's say it is not HTML, just a simple list, and you want to change it from this:
a) long text
b) long text
c) long text
To this:
a) new text long text
b) new text long text
c) new text long text
echo preg_replace("/[a-z]\)\s/", "/[a-z]\)\snew\stext/", $documento);
Is this correct?

IF, and I emphasize again IF, the input text always looks like the sample you posted here, then you can assume the pattern is safe to replace, because it will not appear anywhere else:
preg_replace("/<li>/", "<li class=\"inciso\">", $documento);
This will replace every occurrence of <li> with the modified version. If there are <li> elements that you do not want to replace, it becomes more difficult and you should use a DOM or SAX parser instead.
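UPDATE after your update: You can match a word and add something before it by capturing it in a group:
preg_replace('/(long)/', 'new text $1', $documento);
Here $1 refers to the text captured by the first pair of parentheses. Note that the pattern needs its own delimiters (the slashes); without them PHP treats the parentheses themselves as delimiters and nothing is captured. Have a look at backreferences.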
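To illustrate the backreference idea on both samples from the question, here is a minimal sketch. The variable names $withClass, $list and $newList are made up for the example, and it assumes the input really is as regular as the samples shown.

```php
<?php
// Sample input modelled on the question.
$documento = "<li>\n<p>a) long text</p>\n</li>\n<li>\n<p>b) long text</p>\n</li>";

// Group 1 captures everything after <li> that must be kept, so it can be reused as $1.
$withClass = preg_replace('/<li>(\s*<p>[a-z]\)\s)/', '<li class="inciso">$1', $documento);

// Plain list: capture the "a) " marker at the start of each line (m flag) and append text after it.
$list = "a) long text\nb) long text";
$newList = preg_replace('/^([a-z]\)\s)/m', '$1new text ', $list);

echo $withClass, "\n", $newList, "\n";
```

The replacement argument is a plain string, not a pattern, which is why the question's version (a second regex as the replacement) inserts the pattern text literally.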

Use str_replace instead:
$find = '<li>';
$replace = '<li class="inciso">';
echo str_replace($find, $replace, $documento);

Related

Convert all Relative urls to Absolute urls while maintaining contents

I am scraping site data using Simple HTML DOM, but I have a problem converting relative URLs to absolute URLs. Say the direct page link is http://www.example.com/tutorial.html; the content I extract contains relative links, and I want all of them to be absolute. For example:
$string = "<p>this is text within string</p> and more random strings which contains link like <a href='docs/555text.fileextension'>Download this file</a> <p>Other html follows where another relative link may exist like <a href='files/doc.doc'>This file</a>";
I want to get something like:
$string = "<p>this is text within string</p> and more random strings which contains link like <a href='http://www.example.com/docs/555text.fileextension'>Download this file</a> <p>Other html follows where another relative link may exist like <a href='http://www.example.com/files/doc.doc'>This file</a>";
I just want to convert all relative URLs to absolute URLs while keeping the rest of $string intact.
The solution given below does not work on the real scraped data:
// This is real data from the scraped HTML
//Base URL is http://www.zoomtanzania.com/
// [^>]* matches zero or more characters other than >
$regex = '~<a([^>]*)href=\'([^\']*)\'([^>]*)>~';
// replacement for each subpattern (3 in total)
// basically here we are adding missing baseurl to href
$replace = '<a$1href="http://www.zoomtanzania.com/$2"$3>';
$string =
'<div style="background-color: rgba(255, 255, 255, 0.8);">
<div style="font-size:17px; font-weight:bold; ">
MECHANICAL TECHNICIAN</div>
<hr style="margin:4px">
<div>
<p class="pull-right">Application Deadline: 24 Jul 2015<br></p>
<h5>Mechanical Technician POSITION DESCRIPTION:</h5><br>
Position Description Document (download)
<br>
<br>
<h5>APPLICATION INSTRUCTIONS:</h5><br>
<p>
All applications should be sent to the address below or via <strong>APPLY NOW</strong> below before 24th July 2015.</p>
<p>Eligible candidates are required to submit detailed CV with names of three referees and an application letter.</p>
<p>
<br>P.O.BOX 4955,<br>Dar es Salaam,</p>
<p>Tanzania.</p>
<br>
<br>
</div>
</div>';
$replaced = preg_replace($regex, $replace, $string);
echo $replaced;
//Method does not replace Position Description Document (download) to Position Description Document (download)
You were right to use preg_replace. For your example, you can try this code:
// [^>]* matches zero or more characters other than >
// single quote AND double quote support
$regex = '~<a([^>]*)href=["\']([^"\']*)["\']([^>]*)>~';
// replacement for each subpattern (3 in total)
// basically here we are adding missing baseurl to href
$replace = '<a$1href="http://www.example.com/$2"$3>';
$string = "<p>this is text within string</p> and more random strings which contains link like <a href='docs/555text.fileextension'>Download this file</a> <p>Other html follows where another relative link may exist like <a href='files/doc.doc'>This file</a>";
$replaced = preg_replace($regex, $replace, $string);
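Result:
<p>this is text within string</p> and more random strings which contains link like <a href="http://www.example.com/docs/555text.fileextension">Download this file</a> <p>Other html follows where another relative link may exist like <a href="http://www.example.com/files/doc.doc">This file</a>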
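One caveat with the plain replacement above is that it also prefixes links that are already absolute. A possible variation, sketched here with a hypothetical helper named absolutize_links(), uses preg_replace_callback to skip those. It assumes the base URL is the site root and does not resolve ../ segments, so treat it as a starting point only.

```php
<?php
// Hypothetical helper: prefixes $base to every href that is not already absolute.
function absolutize_links(string $html, string $base): string
{
    // Group 1: tag up to href=, group 2: the opening quote, group 3: the URL.
    $pattern = '~(<a\b[^>]*\bhref=)(["\'])(.*?)\2~i';

    return preg_replace_callback($pattern, function (array $m) use ($base): string {
        $url = $m[3];
        // Leave URLs with a scheme (http:, mailto:, ...), protocol-relative URLs and anchors alone.
        if (preg_match('~^(?:[a-z][a-z0-9+.-]*:|//|#)~i', $url)) {
            return $m[0];
        }
        return $m[1] . $m[2] . rtrim($base, '/') . '/' . ltrim($url, '/') . $m[2];
    }, $html);
}

$string = "<a href='docs/555text.fileextension'>Download</a> <a href=\"http://other.example/x\">Ext</a>";
$result = absolutize_links($string, 'http://www.example.com/');
echo $result, "\n";
```

The original quote character is captured and reused, so single-quoted and double-quoted attributes both survive unchanged.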

PHP cut text from a specific word in an HTML string

I would like to cut all text (image alt text included) in an HTML string from a specific word onward.
For example, this is the string:
<?php
$string = '<div><img src="img.jpg" alt="cut this text from here" />cut this text from here</div>';
?>
and this is what I would like to output
<div>
<a href="#">
<img src="img.jpg" alt="cut this text" />
cut this text
</a>
</div>
The $string is actually a property of an object, but I didn't want to post too much code here.
Obviously I can't use explode, because that would destroy the HTML markup.
str_replace and substr are also out, because the length of the text before and after the cut word is not constant.
So what can I do to achieve this?
OK, I solved my problem, and I am posting an answer to my own question only because it could help someone.
This is what I did:
<?php
$string = '<div><img src="img.jpg" alt="cut this text from here" />cut this text from here</div>';
$txt_only = strip_tags($string);
$explode = explode(' from', $txt_only);
$find_txt = array(' from', $explode[1]);
$new_str = str_replace($find_txt, '', $string);
echo $new_str;
?>
This might not be the best solution, but it was quick and did not involve DOM parsing.
If anybody wants to try this, make sure that your href, src, or any other attribute that must stay untouched does not contain any of the strings in $find_txt, or they will be replaced there too.
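For comparison, a DOM-based variant avoids the risk of touching src or href values, because it only edits text nodes and alt attributes. This is a rough sketch, not the approach used above: cut_from_word() is a hypothetical helper, and it assumes the fragment has a single root element and plain ASCII content.

```php
<?php
// Hypothetical helper: truncates every text node and alt attribute at $word.
function cut_from_word(string $html, string $word): string
{
    $doc = new DOMDocument();
    // These flags keep libxml from adding a doctype and <html><body> wrappers.
    $doc->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
    $xpath = new DOMXPath($doc);

    // Visit all text nodes and all alt attributes, nothing else.
    foreach ($xpath->query('//text() | //@alt') as $node) {
        $pos = strpos($node->nodeValue, $word);
        if ($pos !== false) {
            $node->nodeValue = substr($node->nodeValue, 0, $pos);
        }
    }
    return trim($doc->saveHTML());
}

$string = '<div><img src="img.jpg" alt="cut this text from here" />cut this text from here</div>';
$out = cut_from_word($string, ' from');
echo $out, "\n";
```

Note that saveHTML() serializes the markup as HTML, so the self-closing slash on the img tag is not preserved.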

SIMPLE HTML DOM - how to ignore nested elements?

My html code is as follows
<span class="phone">
i want this text
<span class="ignore-this-one">01234567890</span>
<span class="ignore-this-two" >01234567890</span>
<a class="also-ignore-me">some text</a>
</span>
What I want to do is extract the 'i want this text' leaving all of the other elements behind. I've tried several iterations of the following, but none return the text I need:
$name = trim($page->find('span[class!=ignore^] a[class!=also^] span[class=phone]',0)->innertext);
Some guidance would be appreciated, as the simple_html_dom documentation on filters is quite sparse.
What about using PHP's preg_match (http://php.net/manual/en/function.preg-match.php)?
Try the following:
<?php
$html = <<<EOF
<span class="phone">
i want this text
<span class="ignore-this-one">01234567890</span>
<span class="ignore-this-two" >01234567890</span>
<a class="also-ignore-me">some text</a>
</span>
EOF;
$result = preg_match('#class="phone".*\n(.*)#', $html, $matches);
echo $matches[1];
?>
Regex explained:
Find the text class="phone", then continue to the end of the line, matching any characters with .* (the dot does not match a line break by default). Then cross the line break with \n and capture everything on the next line by enclosing .* in parentheses.
The result is stored in the array $matches. $matches[0] holds the text matched by the whole regex, while $matches[1] holds the text captured by the parentheses.
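The regex above depends on the exact line layout of the markup. If that layout may vary, one alternative is PHP's built-in DOM extension instead of simple_html_dom (that substitution is an assumption on my part), where the XPath step text() selects only the text nodes that are direct children of the outer span:

```php
<?php
$html = '<span class="phone">
i want this text
<span class="ignore-this-one">01234567890</span>
<a class="also-ignore-me">some text</a>
</span>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Collect only the span's own text nodes; text inside nested elements is skipped.
$text = '';
foreach ($xpath->query('//span[@class="phone"]/text()') as $node) {
    $text .= $node->nodeValue;
}
$text = trim($text);

echo $text, "\n"; // prints "i want this text"
```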

Find <p> tag in a very long text

I have a very long HTML text in which I want to add an incrementing id value to each p tag in PHP. My original string:
$mystring="
<p> my very long text with a lot of words ....</p>
<p></p>
<p> my other paragraph with a very long text ...</p>
(...)
";
Result that I want:
$myparsestring= "
<p id=1>my very long text with a lot of words ....</p>
<p id=2> my other paragraph with a very long text ...</p>
";
As I see it, I could use getElementsByTagName() or a regex (maybe with a split).
What approach would you recommend for this job?
If you're planning on parsing HTML, try using DOM with XPath.
Here is a quick example:
$doc = new DOMDocument();
$doc->loadHTML($mystring); // DOMXPath needs a DOMDocument, not a string
$xpath = new DOMXPath($doc);
$query = '//*/p';
$entries = $xpath->query($query);
Don't use regex. If all you plan on doing is parsing HTML like this, use this method unless you have a specific reason to use regex.
You can go with regex like this:
$mystring="
<p> my very long text with a lot of words ....</p>
<p></p>
<p> my other paragraph with a very long text ...</p>
(...)
";
// This will give you all <p> tags, that have some information in it.
preg_match_all('/<p>(?<=^|>)[^><]+?(?=<|$)<\/p>/s', $mystring, $matches);
$myparsestring = '';
for( $k=0; $k<sizeof( $matches[0] ); $k++ )
{
$myparsestring .= str_replace( '<p', '<p id='.($k+1), $matches[0][$k] );
}
echo htmlspecialchars( $myparsestring );
And the output/result:
<p id=1> my very long text with a lot of words ....</p>
<p id=2> my other paragraph with a very long text ...</p>
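The same numbering can also be done with the DOM extension. The sketch below assumes the paragraphs sit inside one wrapper element (the div is added only for this example, because the LIBXML_HTML_NOIMPLIED flag expects a single root) and it drops paragraphs that contain nothing but whitespace.

```php
<?php
$mystring = '<div><p> first paragraph</p><p></p><p> second paragraph</p></div>';

$doc = new DOMDocument();
// The flags keep libxml from adding a doctype and <html><body> wrappers.
$doc->loadHTML($mystring, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$xpath = new DOMXPath($doc);

$id = 0;
foreach ($xpath->query('//p') as $p) {
    if (trim($p->textContent) === '') {
        $p->parentNode->removeChild($p); // drop empty paragraphs
        continue;
    }
    $p->setAttribute('id', (string) ++$id); // number the remaining ones from 1
}

$myparsestring = trim($doc->saveHTML());
echo $myparsestring, "\n";
```

Unlike the regex version, this also handles paragraphs that contain nested tags such as links or emphasis.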

How to grab the contents of HTML tags?

Hey so what I want to do is snag the content for the first paragraph. The string $blog_post contains a lot of paragraphs in the following format:
<p>Paragraph 1</p><p>Paragraph 2</p><p>Paragraph 3</p>
The problem I'm running into is that I am writing a regex to grab everything between the first <p> tag and the first closing </p> tag. However, it is grabbing the first <p> tag and the last closing </p> tag which results in me grabbing everything.
Here is my current code:
if (preg_match("/[\\s]*<p>[\\s]*(?<firstparagraph>[\\s\\S]+)[\\s]*<\\/p>[\\s\\S]*/",$blog_post,$blog_paragraph))
echo "<p>" . $blog_paragraph["firstparagraph"] . "</p>";
else
echo $blog_post;
Well, sysrqb's answer will let you match anything in the first paragraph, assuming there's no other HTML in the paragraph. You might want something more like this:
<p>.*?</p>
Placing the ? after the * makes it non-greedy, meaning it will match as little text as necessary before matching the </p>.
If you use preg_match, use the "U" flag to make it ungreedy.
preg_match("/<p>(.*)<\/p>/U", $blog_post, $matches);
$matches[1] will then contain the first paragraph.
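To see the difference between the greedy and non-greedy forms side by side, here is a small sketch using the sample string from the question. The names $greedy and $lazy are chosen just for this illustration.

```php
<?php
$blog_post = '<p>Paragraph 1</p><p>Paragraph 2</p><p>Paragraph 3</p>';

// Greedy: .* grabs as much as possible, so the match runs to the LAST </p>.
preg_match('/<p>(.*)<\/p>/s', $blog_post, $greedy);

// Lazy: .*? grabs as little as possible, so the match stops at the FIRST </p>.
preg_match('/<p>(.*?)<\/p>/s', $blog_post, $lazy);

echo $greedy[1], "\n"; // Paragraph 1</p><p>Paragraph 2</p><p>Paragraph 3
echo $lazy[1], "\n";   // Paragraph 1
```

The s flag lets the dot match line breaks too, which matters when a paragraph spans several lines.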
It would probably be easier and faster to use strpos() to find the position of the first <p> and the first </p>, then use substr() to extract the paragraph.
$paragraph_start = strpos($blog_post, '<p>');
$paragraph_end = strpos($blog_post, '</p>', $paragraph_start);
$paragraph = substr($blog_post, $paragraph_start + strlen('<p>'), $paragraph_end - $paragraph_start - strlen('<p>'));
Edit: Actually the regex in others' answers will be easier and faster... your big complex regex in the question confused me...
Using regular expressions for HTML parsing is never the right solution. You should be using XPath for this particular case:
$string = <<<XML
<div><p>Paragraph 1</p><p>Paragraph 2</p><p>Paragraph 3</p></div>
XML;
$xml = new SimpleXMLElement($string);
/* Finds the first <p>; the <div> wrapper only gives the XML a single root element */
$resultado = $xml->xpath('//p[1]');
