PHP preg_match_all - how to get content from HTML?

$contents contains an HTML document:
$contents = curl_exec($ch);
I need to get the content from:
<span class="Menu1">Artur €2000</span>
It's repeated several times, so I want to save the matches into an array.
I tried to do it this way:
preg_match_all('<span class=\"Menu1\">(.*?)</span>#si',$contents,$wynik2);
But I get an error:
Warning: preg_match_all() [function.preg-match-all]: Unknown modifier '('
Can you guys help me, please?
EDIT: $contents = curl_exec($ch);
SOLVED: The error was caused by invalid HTML on the cURLed website:
<span class="Menu1">Content</tr>
instead of:
<span class="Menu1">Content</span>
I didn't expect that someone could write invalid HTML. Thank you guys for the help!

You forgot the first delimiter (#):
$contents = '<span class="Menu1">Artur $2000</span> somehtml <span class="Menu1">Mark $1000</span>';
preg_match_all('#<span class="Menu1">(.*?)</span>#si', $contents, $wynik2);
print_r($wynik2);
/*
Array
(
[0] => Array
(
[0] => <span class="Menu1">Artur $2000</span>
[1] => <span class="Menu1">Mark $1000</span>
)
[1] => Array
(
[0] => Artur $2000
[1] => Mark $1000
)
)
*/
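As for where the "Unknown modifier '('" message comes from: PHP also accepts bracket pairs as delimiters, so a pattern that starts with < is read as delimited by < and >, and everything after the first > is parsed as modifiers. A minimal sketch with the sample string from above (the names $count and $loose are only illustrative):

```php
<?php
$contents = '<span class="Menu1">Artur $2000</span> somehtml <span class="Menu1">Mark $1000</span>';

// '<' and '>' act as the delimiter pair here, so the pattern is just "span".
$count = preg_match_all('<span>', $contents, $loose);

// With explicit '#' delimiters the whole tag is part of the pattern.
preg_match_all('#<span class="Menu1">(.*?)</span>#si', $contents, $wynik2);

echo $count, "\n";    // how many times the word "span" occurs in $contents
print_r($wynik2[1]);  // index 1 holds only the captured contents
```

If only the captured text is needed, $wynik2[1] is the array to work with; $wynik2[0] holds the full matches including the tags.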

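You should put a delimiter such as "|" at the start and the end of your regular expression. Because the U modifier already makes quantifiers lazy, the group is written as (.*) here rather than (.*?):
preg_match_all("|<span class=\"Menu1\">(.*)</span>|U",$contents,$wynik2);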
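One thing to watch with the U modifier: it inverts greediness, so a quantifier followed by ? becomes greedy again. The sketch below compares both spellings on a made-up input with two spans on one line (the sample string and the variable names are assumptions for illustration):

```php
<?php
$contents = '<span class="Menu1">A</span><span class="Menu1">B</span>';

// U modifier plus "?": the quantifier is greedy again and swallows both spans.
preg_match_all('|<span class="Menu1">(.*?)</span>|U', $contents, $withQuestionMark);

// U modifier without "?": the quantifier is lazy, one match per span.
preg_match_all('|<span class="Menu1">(.*)</span>|U', $contents, $plain);

print_r($withQuestionMark[1]);
print_r($plain[1]);
```

Using either (.*?) without U or (.*) with U gives one match per span; combining both does not.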

Related

How to remove unwanted ";" before explode php [duplicate]

I am trying to explode a text file into one array per line. The file was provided through a URL in this format:
Ville:Montréal; Fichier:montreal.txt
Ville:Québec; Fichier:quebec.txt
The problem is that the separator character ";" also appears in other parts of the string.
The wanted result is:
[0] => Ville:Québec [1] => Fichier:quebec.txt
[0] => Ville:Montréal [1] => Fichier:montreal.txt
I am using this code:
<?php
$tabCities = file('redacted');
// $oneLine = utf8_decode($tabCities[0]);
$oneLine = $tabCities[0];
$arrayLine = explode(";", $oneLine);
print_r($oneLine);
print_r($arrayLine);
?>
It outputs
Ville:Montréal; Fichier:montreal.txt Array ( [0] => Ville:Montré [1] => al [2] => Fichier:montreal.txt )
utf8_decode does not help. Is there any other function or strategy I can try?

You are looking for html_entity_decode.
It will convert the HTML entity (&eacute;) into the Unicode character é, so the ";" that terminates the entity is gone before you call explode.
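A minimal sketch of that approach, assuming the line in the file really contains the entity &eacute; (the hard-coded $oneLine below stands in for $tabCities[0]; the trim step is an extra to drop the space after the separator):

```php
<?php
// Assumed raw line as stored in the file, with the accent encoded as an entity.
$oneLine = 'Ville:Montr&eacute;al; Fichier:montreal.txt';

// Decode entities first, so the only ";" left is the real separator.
$decoded = html_entity_decode($oneLine, ENT_QUOTES | ENT_HTML5, 'UTF-8');

// Split on the separator and trim the whitespace around each field.
$arrayLine = array_map('trim', explode(';', $decoded));

print_r($arrayLine);
```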

php, strpos extract digit from string

I have a huge HTML document to scan. Until now I have been using preg_match_all to extract the desired parts from it. The problem from the start was that it consumed a lot of CPU time, so we finally decided to use some other method for extraction. I read in some articles that preg_match can be compared in performance with strpos; they claim that strpos beats the regex scanner by up to 20 times in efficiency. I thought I would try this method, but I don't really know how to get started.
Let's say I have this HTML string:
<li id="ncc-nba-16451" class="che10">23 - Star</li>
<li id="ncd-bbt-5674" class="che10">54 - Moon</li>
<li id="ertw-cxda-c6543" class="che10">34,780 - Sun</li>
I want to extract only the number from each id and only the text (letters) from the content of the a tags, so I do this preg_match_all scan:
'/<li.*?id=".*?([\d]+)".*?<a.*?>.*?([\w]+)<\/a>/s'
here you can see the result: LINK
Now, if I wanted to replace my method with strpos, what would the approach look like? I understand that strpos returns the index at which the match starts. But how can I use it to:
get all possible matches, not just one
extract numbers or text from desired place in string
Thank you for all the help and tips ;)

Using DOM
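// Note: the <a> tags the question refers to seem to be missing from the sample;
// href="#" is a placeholder so that the second loop below has something to find.
$html = '
<html>
<head></head>
<body>
<li id="ncc-nba-16451" class="che10"><a href="#">23 - Star</a></li>
<li id="ncd-bbt-5674" class="che10"><a href="#">54 - Moon</a></li>
<li id="ertw-cxda-c6543" class="che10"><a href="#">34,780 - Sun</a></li>
</body>
</html>';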
$dom_document = new DOMDocument();
$dom_document->loadHTML($html);
$rootElement = $dom_document->documentElement;
$getId = $rootElement->getElementsByTagName('li');
$res = [];
foreach($getId as $tag)
{
$data = explode('-',$tag->getAttribute('id'));
$res['li_id'][] = end($data);
}
$getNode = $rootElement->getElementsByTagName('a');
foreach($getNode as $tag)
{
$res['a_node'][] = $tag->parentNode->textContent;
}
print_r($res);
Output :
Array
(
[li_id] => Array
(
[0] => 16451
[1] => 5674
[2] => c6543
)
[a_node] => Array
(
[0] => 23 - Star
[1] => 54 - Moon
[2] => 34,780 - Sun
)
)

This regex finds a match in 24 steps using 0 backtracks:
(?:id="[^\d]*(\d*))[^<]*(?:<a href="[^>]*>[^a-z]*([a-z]*))
The regex you posted requires 134 steps. Maybe you will notice a difference? Note that regex engines can optimize a pattern in order to minimize backtracking. I used the RegexBuddy debugger to get these numbers.
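For the strpos approach the question asks about, here is a minimal sketch that collects every match by moving an offset forward through the string. It assumes each id is written as id="..." with double quotes; the <a href="#"> wrappers, the $needle variable and the trailing-digits rule are illustrative assumptions, not part of the original markup:

```php
<?php
$html = '<li id="ncc-nba-16451" class="che10"><a href="#">23 - Star</a></li>' . "\n"
      . '<li id="ncd-bbt-5674" class="che10"><a href="#">54 - Moon</a></li>';

$ids    = [];
$needle = 'id="';
$offset = 0;

// Each pass finds the next id="..." attribute, starting after the previous one.
while (($start = strpos($html, $needle, $offset)) !== false) {
    $start += strlen($needle);          // first character of the attribute value
    $end = strpos($html, '"', $start);  // closing quote of the attribute
    if ($end === false) {
        break;                          // malformed attribute, stop scanning
    }
    $id = substr($html, $start, $end - $start);

    // Keep only the run of digits at the end of the id.
    $prefixLength = strlen(rtrim($id, '0123456789'));
    $ids[] = substr($id, $prefixLength);

    $offset = $end + 1;                 // continue after this attribute
}

print_r($ids);
```

The text inside each a tag can be pulled out with the same pattern: locate the opening marker, locate the closing marker after it, and take the substr in between. Whether this actually beats the regex on the real document is something to measure rather than assume.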

PHP - Advanced Regex Help needed

So I have many large text paragraphs to parse.
The end goal is to separate the paragraphs into smaller postings, so I can insert them into mysql.
Here's a very short example of one of the paragraphs in a string:
<?php
$longstring = '
(<b>John Smith</b>) at <b class="datetimeGMT">2011-01-10 22:13:01 GMT</b><hr>
Lots of text entered here under the first line.<br>And most of it is html, since it is for displaying in a web browser.<br></br></br>
(<b>Alan Slappy</b>) at <b class="datetimeGMT">2011-01-11 13:12:00 GMT</b><hr>
Forgot to put one more thing in the notes.........<br>blah blah blah
(<b>Joe Mama</b>) at <b class="datetimeGMT">2011-01-13 10:15:00 GMT</b><hr>
Groceries list:<br>Watermelons<br>Floss<br><br>email doctor
';
?>
Yep, I have a freaky project of parsing these strings for each entry.
Yes, I agree with anyone who says this is not a cool task. The original developer allowed new text to be appended to the original text. Not a bad idea on some occasions, but for me it is.
I do need help with how to RegEx this beast and place it into a foreach loop so I can start cleaning it up.
Here's how far I got:
<?php
if(preg_match_all('/\(<b>.*?<hr>/', $longstring, $matches)){
print_r($matches);
}
/* output:
Array
(
[0] => Array
(
[0] => (<b>John Smith</b>) at <b class="datetimeGMT">2011-01-10 22:13:01 GMT</b><hr>
[1] => (<b>Alan Slappy</b>) at <b class="datetimeGMT">2011-01-11 13:12:00 GMT</b><hr>
[2] => (<b>Joe Mama</b>) at <b class="datetimeGMT">2011-01-13 10:15:00 GMT</b><hr>
)
)
*/
?>
So, I'm actually doing pretty good with looping through the tops of each entry. I'm kinda proud I figured that out. (regex is my nemesis)
So now I'm stuck figuring out how to include the actual text below each iteration.
Anyone have an idea on how I can adjust the preg_match_all to account for the text below each "header"?

Try to use preg_split instead:
$matches = preg_split("/\s*(\(<b>.*?<hr>)\s*/s", trim($longstring), -1, PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE);
print_r($matches);
Note: trim is applied on your string to cut leading and trailing spaces.
Result will be something like
Array
(
[0] => (<b>John Smith</b>) at <b class="datetimeGMT">2011-01-10 22:13:01 GMT</b><hr>
[1] => Lots of text entered here under the first line.<br>And most of it is html, since it is for displaying in a web browser.<br></br></br>
[2] => (<b>Alan Slappy</b>) at <b class="datetimeGMT">2011-01-11 13:12:00 GMT</b><hr>
[3] => Forgot to put one more thing in the notes.........<br>blah blah blah
[4] => (<b>Joe Mama</b>) at <b class="datetimeGMT">2011-01-13 10:15:00 GMT</b><hr>
[5] => Groceries list:<br>Watermelons<br>Floss<br><br>email doctor
)
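Since the result alternates between header and text, the pieces can be paired up before inserting them into MySQL. A rough sketch, assuming every header is followed by a non-empty text (otherwise PREG_SPLIT_NO_EMPTY would shift the pairs); the shortened sample string and the array keys author, posted_at and text are made up for illustration:

```php
<?php
$longstring = '
(<b>John Smith</b>) at <b class="datetimeGMT">2011-01-10 22:13:01 GMT</b><hr>
Lots of text.<br>
(<b>Alan Slappy</b>) at <b class="datetimeGMT">2011-01-11 13:12:00 GMT</b><hr>
Forgot one thing
';

$parts = preg_split(
    '/\s*(\(<b>.*?<hr>)\s*/s',
    trim($longstring),
    -1,
    PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE
);

$posts = [];
// Walk the list two items at a time: [header, text].
foreach (array_chunk($parts, 2) as $chunk) {
    $header = $chunk[0];
    $body   = $chunk[1] ?? '';

    // Pull the author and the timestamp out of the header line.
    if (preg_match('#\(<b>(.*?)</b>\) at <b class="datetimeGMT">(.*?)</b>#', $header, $m)) {
        $posts[] = ['author' => $m[1], 'posted_at' => $m[2], 'text' => $body];
    }
}

print_r($posts);
```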

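Try this. The negative lookahead (?!\(<b>) lets each match run from one "(<b>" up to, but not including, the next one, so every header is captured together with the text below it: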
if(preg_match_all('/\(<b>(?:(?!\(<b>).)*/s', $longstring, $matches)){
print_r($matches);
}

This is going to be easier if you parse the HTML rather than just trying to regex it, unless you can guarantee the format of the HTML.
You might want to look at Robust and Mature HTML Parser for PHP.

Regex to strip string inside specific HTML tag

I'm trying to strip out a string, which occurs only once on a page obtained using cURL. Example:
<h3 class=" ">STRING IN QUESTION</h3>
or
<h3 class="active">STRING IN QUESTION</h3>
or
<h3 class=" active">STRING IN QUESTION</h3>
I would like to do this using preg_match, unless it can be accomplished with a less resource-intensive method.
Here is the regex I'm using, which is producing zero results:
<h3\sclass="\s">(.*?)</h3>
EDIT:
Here is the actual code (an actual URL is used here in place of the dynamic one). I discovered that when the page is pulled via cURL, the class attribute does not exist, but the code still does not work as shown:
$ch = curl_init ("URL IN QUESTION");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$page = curl_exec($ch);
preg_match('<h3>(.*?)</h3>', $page, $match);
print_r($match);
This prints nothing.

This does the trick:
$str='<h3 class=" ">STRING IN QUESTION</h3>';
preg_match('/<h3.*?>(.*?)<\/h3>/',$str,$match);
print_r($match);
Output:
Array
(
[0] => <h3 class=" ">STRING IN QUESTION</h3>
[1] => STRING IN QUESTION
)
Explanation:
<h3.*?> # Match h3 tags (non-greedy)
(.*?) # Match everything after tag (non-greedy, captured)
<\/h3> # Match closing tag - Note the escaped forward slash!
However, that URL contains no <h3> tags. It does contain an <h1> tag, and to match it you need the trailing s modifier so that the dot also matches newlines:
preg_match('/<h1.*?>(.*?)<\/h1>/s',$page,$match);
Output:
Array
(
[0] => <h1 class="">
<span class="pageTitle ">Braman Motorcars</span>
</h1>
[1] =>
<span class="pageTitle ">Braman Motorcars</span>
)
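A variant of the pattern above, sketched under the assumption that the class attribute may be empty, padded with spaces, or missing entirely. It uses [^>]* instead of .*? so the opening tag cannot run past its own closing bracket (the $samples array and $found are only illustrative):

```php
<?php
$samples = [
    '<h3 class=" ">STRING IN QUESTION</h3>',
    '<h3 class="active">STRING IN QUESTION</h3>',
    '<h3 class=" active">STRING IN QUESTION</h3>',
    '<h3>STRING IN QUESTION</h3>',
];

$found = [];
foreach ($samples as $html) {
    // \b keeps the pattern from matching other tag names that merely start with "h3".
    if (preg_match('#<h3\b[^>]*>(.*?)</h3>#s', $html, $match)) {
        $found[] = trim($match[1]);
    }
}

print_r($found);
```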

Maybe:
<h3\s+class="\s*(active)?">(.*?)</h3>
and then use group 1 ($match[1] in PHP) to retrieve "active" or "" and group 2 ($match[2]) for "STRING IN QUESTION".
I've never done any PHP, but maybe this would work:
$result = "not found";
if (preg_match('#<h3\s+class="\s*(active)?">(.*?)</h3>#', $page, $match))
{
$result = $match;
}
print_r($result);

Try with:
preg_match('#<h3\s?class="\s?(active)?">(.+)</h3>#', $yourString, $match);
Remember that a PHP regex must always be wrapped in delimiters.

Unusual behaviour of regex

My Setup:
index.php:
<?php
$page = file_get_contents('a.html');
$arr = array();
preg_match('/<td class=\"myclass\">(.*)\<\/td>/s',$page,$arr);
print_r($arr);
?>
a.html:
...other content
<td class="myclass">
THE
CONTENT
</td>
other content...
Output:
Array
(
[0] => Array
(
)
)
If I change line 4 of index.php to:
preg_match('/<td class=\"myclass\">(.*)\<\/t/s',$page,$arr);
The output is:
Array
(
[0] => <td class="myclass">
THE
CONTENT
</t
[1] =>
THE
CONTENT
)
I can't make out what's wrong. Please help me match the content between <td class="myclass"> and </td>.

Your code appears to work. I edited the regex to use a different separator and get a clearer view. You may want to use the ungreedy modifier in case there is more than one myclass TD in your HTML.
I have not been able to reproduce the "array of array" behaviour you note, unless I manipulate the code to add an error -- see at bottom.
<?php
$page = <<<PAGE
...other content
<td class="myclass">
THE
CONTENT
</td>
other content...
PAGE;
preg_match('#<td class="myclass">(.*)</td>#s',$page,$arr);
print_r($arr);
?>
returns, as expected:
Array
(
[0] => <td class="myclass">
THE
CONTENT
</td>
[1] =>
THE
CONTENT
)
The code below is similar to yours but has been modified to cause an identical error. Doesn't seem likely you did this, though. The regexp is modified in order to not match, and the resulting empty array is stored into $arr[0] instead of $arr.
preg_match('#<td class="myclass">(.*)</ td>#s',$page,$arr[0]);
Returns the same error you observe:
Array
(
[0] => Array
(
)
)
I can duplicate the same behaviour you observe (works with </t, does not work with </td>) if I use your regexp, but modify the HTML to have </t d>. I still need to write to $arr[0] instead of $arr if I also want to get an identical output.
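The "array of array" effect can be reproduced in a few lines. This is only a sketch of the scenario described above, with a made-up $page that contains the broken closing tag </t d>: when nothing matches, preg_match returns 0 and fills its third argument with an empty array, and because that argument is $arr[0] here, the empty array ends up nested.

```php
<?php
// Assumed input: the closing tag is malformed, so "</td>" never occurs.
$page = "<td class=\"myclass\">\nTHE\nCONTENT\n</t d>";

$arr = array();
// Writing the matches into $arr[0] instead of $arr nests the (empty) result.
$found = preg_match('#<td class="myclass">(.*)</td>#s', $page, $arr[0]);

var_dump($found);
print_r($arr);
```

Checking the return value first (0 means no match, false means an error) makes this situation easier to tell apart from a successful match.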

Note that the third parameter of preg_match receives the matches: its first element contains the full match, and the following elements contain the captured subpatterns.
http://ca3.php.net/manual/en/function.preg-match.php
If matches is provided, then it is filled with the results of search. $matches[0] will contain the text that matched the full pattern, $matches[1] will have the text that matched the first captured parenthesized subpattern, and so on.
This code
preg_match('/<td class=\"myclass\">(.*)\<\/t/s',$page,$arr);
When applied on
...other content
<td class="myclass">
THE
CONTENT
</td>
other content...
Will return the match in $arr[0] and the result of (.*) in $arr[1]. This result is correct: your content is in [1].
Array
(
[0] => <td class="myclass">
THE
CONTENT
</t
[1] =>
THE
CONTENT
)
Example two
<?php
header('Content-Type: text/plain');
$page = 'A B C D E F';
$arr = array();
preg_match('/C (D) E/', $page, $arr);
print_r($arr);
Example output
Array
(
[0] => C D E // This is the string found
[1] => D // this is what I wanted to look for and extracted out of [0], the matched parenthesis
)
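Applied to the table cell case, the same indexing works with preg_match_all. The sketch below assumes a page with two myclass cells (made-up input) and compares a greedy with a lazy quantifier; the names $greedyCount, $greedy and $cells are illustrative:

```php
<?php
$page = "<td class=\"myclass\">\nTHE\nCONTENT\n</td>\n<td class=\"myclass\">SECOND</td>";

// Greedy: (.*) runs up to the last </td>, so both cells collapse into one match.
$greedyCount = preg_match_all('#<td class="myclass">(.*)</td>#s', $page, $greedy);

// Lazy: (.*?) stops at the first </td> after each opening tag.
preg_match_all('#<td class="myclass">(.*?)</td>#s', $page, $matches);

// $matches[1] holds the captured group of every match; trim the surrounding newlines.
$cells = array_map('trim', $matches[1]);

print_r($cells);
```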

Your regex seems correct. Isn't the syntax of preg_match as follows?
preg_match('/<td class=\"myclass\">(.*)\<\/td>/s',$page,$arr);
The | in the regex represents or
