This question already has answers here:
How do you parse and process HTML/XML in PHP?
(31 answers)
Closed 3 years ago.
I have the following PHP regex code. I want to extract the stock symbol from some HTML output.
The stock symbol I want to extract follows /q?s= as in /q?s=XXXX, where XXXX (the stock symbol) can be 1 to 5 characters long.
if(preg_match_all('~(?<=q\?s=)[-A-Z.]{1,5}~', $html, $out))
{
$out[0] = array_unique($out[0]);
} else {
echo "FAIL";
}
HTML code below (case 1 and case 2 that I applied this to)
case #1 (does *not* work)
Bellicum Pharmaceuticals, Inc.
case #2 (does work correctly)
NYLD
Looking for suggestions on how I can update my PHP regex code to make it work for both case 1 and case 2. Thanks.
Instead of using regex, make effective use of DOM and XPath to do this for you.
$doc = new DOMDocument;
@$doc->loadHTML($html); // load the HTML data, suppressing parser warnings
$xpath = new DOMXPath($doc);
$links = $xpath->query('//a[substring(@href, 1, 5) = "/q?s="]');
foreach ($links as $link) {
$results[] = str_replace('/q?s=', '', $link->getAttribute('href'));
}
print_r($results);
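If the href can carry extra query parameters, a variation on the same idea is to parse the query string instead of stripping the prefix by hand. This is only a sketch: the sample markup and the BLCM symbol are assumptions, since the real page structure is not shown in the question.

```php
<?php
// Hypothetical sample markup standing in for the real page.
$html = '<a href="/q?s=BLCM">Bellicum Pharmaceuticals, Inc.</a>'
      . '<a href="/q?s=NYLD">NYLD</a>'
      . '<a href="/q?s=NYLD&amp;ql=1">NYLD again</a>'
      . '<a href="/other">ignored</a>';

$doc = new DOMDocument;
libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$symbols = [];
foreach ($xpath->query('//a[starts-with(@href, "/q?s=")]') as $link) {
    // Let PHP split the query string rather than slicing the href manually.
    $query = parse_url($link->getAttribute('href'), PHP_URL_QUERY);
    parse_str((string) $query, $params);
    // Same character class as the question: 1 to 5 of A-Z, dot or hyphen.
    if (isset($params['s']) && preg_match('~^[-A-Z.]{1,5}$~', $params['s'])) {
        $symbols[] = $params['s'];
    }
}
$symbols = array_values(array_unique($symbols));
print_r($symbols);
```

The regex is still there, but it only validates a value that the DOM and parse_str() have already isolated.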
The answer seems nice, but it seems like a lot of work and code to maintain, no?
if (preg_match_all('/q\?s=(\S{1,5})\"/', $html, $match)) {
$symbols = array_unique($match[1]);
}
or even shorter... '/q\?s=(\S+)\"/'
This question already has answers here:
How do you parse and process HTML/XML in PHP?
(31 answers)
Replace URLs in text with HTML links
(17 answers)
Closed 2 years ago.
I am using the following regex to replace plain URLs with html links in a text:
preg_replace('/(http[s]{0,1}\:\/\/\S{4,})\s{0,}/ims', '$1 ', $text_msg);
Now I want to modify the regex so that it replaces the URL only if there is no double quote immediately before it, meaning it is not part of a tag (i.e. the URL is at the start of the string, at the start of a line, or after a space).
Examples:
This is the link <a href="http://test.com"> ... (URL should not be replaced)
http://test.com (at the beginning of a line or of the whole multi-line string; should be replaced)
This is the site: http://test.com (URL should be replaced)
Thanks.
Your question actually breaks down into two smaller problems. You've already solved one of them, which is parsing the URL with a regular expression. The second part is extracting text from HTML, which isn't easily solved by a regular expression at all. The confusion comes from trying to do both at the same time with one regular expression (parsing HTML and parsing the URL). See the parsing HTML with regex SO Answer for more details on why this is a bad idea.
So instead, let's just use an HTML parser (like DOMDocument) to extract text nodes from the HTML and parse URLs inside those text nodes.
Here's an example
<?php
$html = <<<'HTML'
<p>This is a URL http://abcd/ims in text</p>
HTML;
$dom = new DOMDocument;
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
// Let's walk the entire DOM tree looking for text nodes
function walk(DOMNode $node, $skipParent = false) {
if (!$skipParent) {
yield $node;
}
if ($node->hasChildNodes()) {
foreach ($node->childNodes as $n) {
yield from walk($n);
}
}
}
foreach (walk($dom->firstChild) as $node) {
if ($node instanceof DOMText) {
// lets find any links and change them to HTML
if (preg_match('/(http[s]{0,1}\:\/\/\S{4,})\s{0,}/ims', $node->nodeValue, $match)) {
$node->nodeValue = preg_replace('/(http[s]{0,1}\:\/\/\S{4,})\s{0,}/ims', "\xff ",
$node->nodeValue);
$nodeSplit = explode("\xff", $node->nodeValue, 2);
$node->nodeValue = $nodeSplit[1];
$newNode = $dom->createTextNode($nodeSplit[0]);
$href = $dom->createElement('a', $match[1]);
$href->setAttribute('href', $match[1]);
$node->parentNode->insertBefore($newNode, $node);
$node->parentNode->insertBefore($href, $node);
}
}
}
echo $dom->saveHTML();
Which gives you the desired HTML as output:
<p>This is a URL <a href="http://abcd/ims">http://abcd/ims</a> in text</p>
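The example above handles one URL per text node. A possible variation for several URLs is to split each text node with preg_split() and rebuild it as a document fragment, which also avoids the "\xff" placeholder. This is a sketch under assumptions: the sample HTML and the example.com URLs are made up, and text already inside an a element is skipped.

```php
<?php
// Hypothetical input: two bare URLs and one URL that is already a link.
$html = '<p>See http://example.com/a and <a href="http://example.com/b">this</a> or https://example.org/c</p>';

$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
// Copy the node list into an array first, because the loop modifies the tree.
$textNodes = iterator_to_array($xpath->query('//text()[not(ancestor::a)]'), false);

foreach ($textNodes as $node) {
    // With PREG_SPLIT_DELIM_CAPTURE the captured URLs sit at the odd indexes.
    $parts = preg_split('~(https?://\S{4,})~i', $node->nodeValue, -1, PREG_SPLIT_DELIM_CAPTURE);
    if (count($parts) < 2) {
        continue; // no URL in this text node
    }
    $fragment = $dom->createDocumentFragment();
    foreach ($parts as $i => $part) {
        if ($i % 2 === 1) {
            $a = $dom->createElement('a');
            $a->setAttribute('href', $part);
            $a->appendChild($dom->createTextNode($part));
            $fragment->appendChild($a);
        } elseif ($part !== '') {
            $fragment->appendChild($dom->createTextNode($part));
        }
    }
    $node->parentNode->replaceChild($fragment, $node);
}

// Serialize only the children of <body>, not the implied html/body wrapper.
$result = '';
foreach ($dom->getElementsByTagName('body')->item(0)->childNodes as $child) {
    $result .= $dom->saveHTML($child);
}
echo $result, "\n";
```

Note that \S{4,} will also swallow trailing punctuation such as a full stop directly after a URL, just like the original pattern.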
This question already has answers here:
How do you parse and process HTML/XML in PHP?
(31 answers)
Closed 5 years ago.
I am using a regular expression to extract the price on the right from the following HTML:
<p class="pricing ats-product-price"><em class="old_price">$99.99</em>$94.99</p>
Using preg match in PHP:
preg_match_all('!<p class="pricing ats-product-price"><em class="old_price">.*?<\/em>(.*?)<\/p>!', $output, $prices);
Except, I noticed that sometimes the HTML doesn't include an old price. So sometimes the HTML looks like this:
<p class="pricing ats-product-price">$129.99</p>
It seems like my goal should be to extract the last price from the expression, or in other words the text that directly follows after the last question mark and before the </p>. This sort of expression is way out of my league though - hoping for some help here. Thanks.
Use a regular expression in combination with a parser:
<?php
$data = <<<DATA
<p class="pricing ats-product-price">
<em class="old_price">$99.99</em>
$94.99
</p>
<p class="pricing ats-product-price">$129.99</p>
DATA;
# set up the dom
$dom = new DOMDocument();
$dom->loadHTML($data, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
# set up the xpath
$xpath = new DOMXPath($dom);
$regex = '~\$\d+[\d.]*\b\s*\Z~';
foreach ($xpath->query("//p") as $line) {
if (preg_match($regex, $line->nodeValue, $match)) {
echo $match[0] . "\n";
}
}
This yields
$129.99
$129.99
The snippet sets up the DOM, queries it for p tags and searches for the last price within.
See a demo for the expression on regex101.com.
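If you would rather not use a regex at all here, XPath can pick the last non-blank text node that is a direct child of each p. This is a sketch, assuming the price is always the last bare text inside the paragraph. It loads the HTML without LIBXML_HTML_NOIMPLIED so that the two paragraphs stay siblings.

```php
<?php
// Nowdoc, so the dollar signs are taken literally.
$data = <<<'DATA'
<p class="pricing ats-product-price"><em class="old_price">$99.99</em>$94.99</p>
<p class="pricing ats-product-price">$129.99</p>
DATA;

$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($data);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$prices = [];
foreach ($xpath->query('//p[contains(@class, "ats-product-price")]') as $p) {
    // Direct text children only, blank ones filtered out, then the last one.
    $last = $xpath->query('text()[normalize-space()][last()]', $p)->item(0);
    if ($last !== null) {
        $prices[] = trim($last->nodeValue);
    }
}
print_r($prices);
```

Because the old price sits inside the em element, it is never a direct text child of the p and is skipped without any special handling.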
This question already has answers here:
PHP parse/syntax errors; and how to solve them
(20 answers)
Closed 6 years ago.
I want to extract data from a web source, but I am getting an error in preg_match.
<?php
$html=file_get_contents("https://www.instagram.com/p/BJz4_yijmdJ/?taken-by=the.witty");
preg_match("("instapp:owner_user_id" content="(.*)")", $html, $match);
$title = $match[1];
echo $title;
?>
This is the error i get
Parse error: syntax error, unexpected 'instapp' (T_STRING) in
/home/ubuntu/workspace/test.php on line 4
How can I fix this? I also want to extract more data from the page with regex. Is it possible to extract everything at once with a single call, or do I need to call preg_match several times?
The main problem is that you did not form a valid string literal. Note that PHP supports both single- and double-quoted string literals, and you may use that to your advantage:
preg_match('~"instapp:owner_user_id" content="([^"]*)"~', $html, $match);
While it is OK to use paired (...) symbols as regex delimiters, I'd suggest using the more conventional /, ~ or # symbols.
Also, (.*) is too generic a pattern and may match more than you need, since . also matches " and * is a greedy quantifier. A negated character class is better: ([^"]*) matches zero or more characters other than ".
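To see the difference, here is a small sketch on a made-up one-line sample (the 12345 value and the second meta tag are assumptions for illustration only):

```php
<?php
// Hypothetical sample: a second quoted attribute follows on the same line.
$html = '<meta property="instapp:owner_user_id" content="12345" /><meta name="x" content="y" />';

preg_match('~"instapp:owner_user_id" content="(.*)"~', $html, $greedy);
preg_match('~"instapp:owner_user_id" content="([^"]*)"~', $html, $negated);

echo $greedy[1], "\n";   // runs on to the last double quote on the line
echo $negated[1], "\n";  // stops at the first closing quote: 12345
```

A lazy (.*?) would also stop early in this sample, but the negated class states the intent more directly and cannot cross a quote.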
HOWEVER, to parse HTML in PHP, you may use a DOM parser, like DOMDocument.
Here is a sample that gets all meta tags that have a content attribute, extracts the value of that attribute and saves it in an array:
$html = "<html><head><meta property=\"al:ios:url\" content=\"instagram://media?id=1329656989202933577\" /></head><body><span/></body></html>";
$dom = new DOMDocument('1.0', 'UTF-8');
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$xpath = new DOMXPath($dom);
$metas = $xpath->query('//meta[@content]');
$res = array();
foreach($metas as $m) {
array_push($res, $m->getAttribute('content'));
}
print_r($res);
See the PHP demo
And to only get the id in the content attribute value of a meta tag whose property attribute is equal to al:ios:url, use
$xpath = new DOMXPath($dom);
$metas = $xpath->query('//meta[@property="al:ios:url"]');
$id = "";
if (preg_match('~[?&]id=(\d+)~', $metas->item(0)->getAttribute('content'), $match))
{
$id = $match[1];
}
See another PHP demo
This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
RegEx match open tags except XHTML self-contained tags
How to parse and process HTML with PHP?
How can I convert ereg expressions to preg in PHP?
here is an example
echo "<div id='spaced' class='romaji'><span class='spaced orig word'>neko</span><span class='space'>";
Please ignore the "echo"s; it's the only way I could get the HTML to show.
I need a regular expression that can select whatever is between the
echo "<span class='spaced orig word'>";
tag and its ending tag
echo "</span>";
i tried
$pattern = "span class='spaced orig word'>(.+?)</s";
preg_match_all ($pattern, $jp_page, $result_ro);
if ($result_ro[1])
$results[] = implode(' ', $result_ro[1]);
else
return null; // Failed to retrieve Hiragana, so abort
and some other things, but I can't get it right. I get nothing most of the time because I don't really know what I'm doing with regular expressions.
currently getting a warning with this code
Warning: preg_match_all(): Delimiter must not be alphanumeric or backslash
THE PONY HE COMES!
Instead, try using a DOM parser:
$dom = new DOMDocument();
$dom->loadHTML($jp_page);
$xpath = new DOMXPath($dom);
$spans = $xpath->query("//span[@class='spaced orig word']");
$results = "";
foreach($spans as $span) {
// append each match; plain assignment would keep only the last span
$results .= " ".$span->textContent;
}
$results = trim($results);
return $results;
No delimiters
try this reg
<?php
$pattern = '#<span.*?>(.*?)</span>#';
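For completeness, here is a sketch of the asker's own pattern with delimiters added. The sample string is an assumption (a second word, inu, is added for illustration), and preg_quote() is used so that no character of the literal opening tag is read as regex syntax.

```php
<?php
// Hypothetical page fragment with two matching spans.
$jp_page = "<div id='spaced' class='romaji'><span class='spaced orig word'>neko</span>"
         . "<span class='space'> </span><span class='spaced orig word'>inu</span></div>";

$open = "<span class='spaced orig word'>";
// '~' is the delimiter; preg_quote() escapes regex metacharacters and the delimiter.
$pattern = '~' . preg_quote($open, '~') . '(.+?)</span>~';

preg_match_all($pattern, $jp_page, $result_ro);
$joined = implode(' ', $result_ro[1]);
echo $joined, "\n";
```

This only fixes the delimiter warning. It still depends on the exact attribute quoting and order, which is what the DOM approach avoids.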
This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
How to parse and process HTML with PHP?
I'm pretty new to PHP.
I have the text of a body tag of some page in a string variable.
I'd like to know if it contains some tag ... where the tag name tag1 is given, and if so, take only that tag from the string.
How can I do that simply in PHP?
Thanks!!
You would be looking at something like this:
<?php
$content = "";
$doc = new DOMDocument();
$doc->loadHTMLFile("example.html"); // load() expects XML; loadHTMLFile() parses HTML
$items = $doc->getElementsByTagName('tag1');
if($items->length > 0) //Only if tag1 items are found
{
foreach ($items as $tag1)
{
// Do something with $tag1->nodeValue and save your modifications
$content .= $tag1->nodeValue;
}
}
else
{
$content = $doc->saveHTML();
}
echo $content;
?>
DOMDocument represents an entire HTML or XML document and serves as the root of the document tree. So you will have valid markup, and because you find elements by tag name, you won't match anything inside comments.
Another possibility is regex.
$matches = null;
$returnValue = preg_match_all('#<li.*?>(.*?)</li>#', 'abc', $matches);
$matches[0][x] contains the whole match, such as <li class="small">list entry</li>; $matches[1][x] contains the inner HTML only, such as list entry.
Fast way:
Look for the index position of tag1 then look for the index position of /tag1. Then cut the string between those two indexes. Look up strpos and substr on php.net
Also this might not work if your string is too long.
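$pos1 = strpos($bigString, '<tag1>');
$pos2 = strpos($bigString, '</tag1>');
// substr() takes a start offset and a length, not an end offset
$resultingString = substr($bigString, $pos1, $pos2 - $pos1 + strlen('</tag1>'));
You might have to adjust the offsets to get $resultingString right (for example, drop the strlen() term to exclude the closing tag), and you should check that strpos() did not return false.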
(if you don't have comments with tag1 inside of them sigh)
The right way:
Look up html parsers
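As a rough sketch of the parser route, DOMDocument can return just the requested element, including its own tags, via saveHTML() with a node argument. The input string and the $tagName variable are assumptions for illustration.

```php
<?php
// Hypothetical input; "tag1" stands in for whatever tag name is given.
$body = '<div>intro</div><tag1 id="t">hello <b>world</b></tag1><p>outro</p>';
$tagName = 'tag1';

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // unknown tag names trigger parser warnings
$doc->loadHTML($body);
libxml_clear_errors();

$found = $doc->getElementsByTagName($tagName);
$result = null;
if ($found->length > 0) {
    // With a node argument, saveHTML() serializes only that element.
    $result = $doc->saveHTML($found->item(0));
}
var_dump($result);
```

Unlike the strpos() approach, this is not confused by a tag1 that appears inside a comment or by attributes on the opening tag.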