Fast way to get specific data from HTML string using PHP

I held off on posting my problem here for a long time. I have googled a lot and found some solutions, but none that I could confirm.
First, let me explain my problem.
I have a CKEditor on my site to let users post comments. Suppose a user clicks two posts to multi-quote them; the data in CKEditor will then look like this:
<div class="quote" user_name="david_sa" post_id="223423">
This is Quoted Text
</div>
<div class="quote" user_name="richard12" post_id="254555">
This is Quoted Text
</div>
<div class="original">
This is the Comment Text
</div>
I want to get all the elements separately in PHP, as below:
user_name = david_sa
post_id = 223423;
quote_text = This is Quoted Text
user_name = richard12
post_id = 254555;
quote_text = This is Quoted Text
original_comment = This is the Comment Text
I want to get the data in the above format in PHP. I googled and found that the preg_match_all() PHP function, which uses regex to match string patterns, comes close to what I need. But I am not sure whether it is a legitimate and efficient solution or whether there is something better. If you have a better solution, please suggest it.

You can use DOMDocument and DOMXPath for this. It takes very few lines of code to parse HTML and extract just about anything from it.
$doc = new DOMDocument();
$doc->loadHTML(
'<html><body>' . '
<div class="quote" user_name="david_sa" post_id="223423">
This is Quoted Text
</div>
<div class="quote" user_name="richard12" post_id="254555">
This is Quoted Text
</div>
<div class="original">
This is the Comment Text
</div>
' . '</body></html>');
$xpath = new DOMXPath($doc);
$quote = $xpath->query("//div[@class='quote']");
echo $quote->length; // 2
echo $quote->item(0)->getAttribute('user_name'); // david_sa
echo $quote->item(1)->getAttribute('post_id'); // 254555
// foreach($quote as $div) works as expected
$original = $xpath->query("//div[@class='original']");
echo $original->length; // 1
echo $original->item(0)->nodeValue; // This is the Comment Text
If you are not familiar with XPath syntax then here are a few examples to get you started.
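To get the exact structure asked for in the question, the same queries can be looped over. This is only a sketch: the variable names ($quotes, $originalComment) are made up for illustration, and it assumes the quote text should be trimmed of the surrounding line breaks.

```php
<?php
// Sample markup taken from the question.
$html = '
<div class="quote" user_name="david_sa" post_id="223423">
This is Quoted Text
</div>
<div class="quote" user_name="richard12" post_id="254555">
This is Quoted Text
</div>
<div class="original">
This is the Comment Text
</div>';

$doc = new DOMDocument();
libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$doc->loadHTML('<html><body>' . $html . '</body></html>');
libxml_clear_errors();
$xpath = new DOMXPath($doc);

// One array entry per quoted post.
$quotes = [];
foreach ($xpath->query("//div[@class='quote']") as $div) {
    $quotes[] = [
        'user_name'  => $div->getAttribute('user_name'),
        'post_id'    => $div->getAttribute('post_id'),
        'quote_text' => trim($div->nodeValue),
    ];
}

// The user's own comment, or null if the div is missing.
$originalNode = $xpath->query("//div[@class='original']")->item(0);
$originalComment = $originalNode ? trim($originalNode->nodeValue) : null;

print_r($quotes);
echo $originalComment, "\n";
```

Note that [@class='quote'] only matches when the class attribute is exactly "quote"; if the editor adds further classes, contains(@class, 'quote') may be a better fit.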

You should not be using regexes to process HTML/XML. This is what DOMDocument and SimpleXML are built for.
Your problem seems relatively simple, so you should be able to get away with using SimpleXML (aptly named, huh?)
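A rough sketch of the SimpleXML route, assuming the editor output is well-formed XML (SimpleXML will refuse things like an unclosed <br> or an &nbsp; entity). The <root> wrapper and the $rows array are my own additions, needed because XML requires a single root element.

```php
<?php
// Sample markup taken from the question.
$fragment = '
<div class="quote" user_name="david_sa" post_id="223423">
This is Quoted Text
</div>
<div class="quote" user_name="richard12" post_id="254555">
This is Quoted Text
</div>
<div class="original">
This is the Comment Text
</div>';

// Wrap the fragment so the parser sees exactly one root element.
$xml = simplexml_load_string('<root>' . $fragment . '</root>');

$rows = [];
if ($xml !== false) {
    foreach ($xml->div as $div) {
        $rows[] = [
            'class'     => (string) $div['class'],
            // Attributes are read with array syntax; the "original" div has none of these.
            'user_name' => isset($div['user_name']) ? (string) $div['user_name'] : null,
            'post_id'   => isset($div['post_id']) ? (string) $div['post_id'] : null,
            'text'      => trim((string) $div),
        ];
    }
}

print_r($rows);
```

If the comments can contain arbitrary HTML, DOMDocument::loadHTML is the more forgiving choice.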

Do not even try regex to parse html. I would recommend simple html dom. Get it here: php html parser

Related

how to set create regex for this string [duplicate]

This question already has answers here:
RegEx match open tags except XHTML self-contained tags
(35 answers)
Closed 6 years ago.
<div id="plugin-description">
<p itemprop="description" class="shortdesc">
BuddyPress helps you build any type of community website using WordPress, with member profiles, activity streams, user groups, messaging, and more. </p>
<div class="description-right">
<p class="button">
<a itemprop="downloadUrl" href="https://downloads.wordpress.org/plugin/buddypress.2.6.1.1.zip">Download Version 2.6.1.1</a>
I need the description, using just this pattern:
<p itemprop="description" class="shortdesc">[a-z]</p>
I need the download link:
<a itemprop="downloadUrl" href="[A-Z]"></a>

There are better tools for parsing HTML than regular expressions. That said, there are times when parsing HTML with regular expressions works safely and consistently, so don't be bullied out of trying it. These cases are usually for small, known sets of HTML markup.
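As an illustration of such a small, known case, here is a sketch in PHP. It assumes the attributes always appear in exactly this order and with double quotes, which is precisely the kind of assumption that makes regex fragile on markup you do not control.

```php
<?php
// Markup shortened from the question.
$html = <<<HTML
<p itemprop="description" class="shortdesc">
BuddyPress helps you build any type of community website using WordPress. </p>
<a itemprop="downloadUrl" href="https://downloads.wordpress.org/plugin/buddypress.2.6.1.1.zip">Download Version 2.6.1.1</a>
HTML;

// The s modifier lets . match newlines, since the description spans lines.
$description = null;
if (preg_match('~<p itemprop="description"[^>]*>(.*?)</p>~s', $html, $m)) {
    $description = trim($m[1]);
}

// [^"]* captures everything up to the closing quote of href.
$link = null;
if (preg_match('~<a itemprop="downloadUrl" href="([^"]*)"~', $html, $m)) {
    $link = $m[1];
}

echo $description, "\n", $link, "\n";
```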
For this particular case, it seems that using an HTML parser would be effective and leave you with more legible code. To illustrate this, I'll use a command line tool like pup, which will help you retrieve your content pretty simply. Let's pretend that the markup is stored at /tmp/input on your computer.
To grab the downloadUrl...
pup < /tmp/input 'a[itemprop="downloadUrl"] attr{href}'
To grab the description...
pup < /tmp/input 'p[itemprop="description"] text{}'
This I think illustrates the simplicity and benefits of using an HTML parser to grab what you're after.

And once again:
<?php
$data = <<<DATA
<div id="plugin-description">
<p itemprop="description" class="shortdesc">
BuddyPress helps you build any type of community website using WordPress.
</p>
<div class="description-right">
<p class="button">
<a itemprop="downloadUrl" href=".zip">Download Version 2.6.1.1</a>
</p>
</div>
</div>
DATA;
$dom = new DOMDocument();
$dom->loadHTML($data);
$xpath = new DOMXPath($dom);
$containers = $xpath->query("//div[@id='plugin-description']");
foreach ($containers as $container) {
    $description = trim($xpath->query(".//p[@itemprop='description']", $container)->item(0)->nodeValue);
    $link = $xpath->query(".//a[@itemprop='downloadUrl']/@href", $container)->item(0)->nodeValue;
    echo $description . $link;
}
?>
See a demo on ideone.com.

Getting data using Regular Expressions - PHP

I have HTML text stored in one of the columns of a database. The column name is mailbody and the table name is inbox_master.
Some of the cells of the mailbody column have divs like the ones below:
<div id="uid-g-uid" style="">2802</div>
or
<div id="uid-g-uid">
<p class="MsoNormal">6894</p>
</div>
or
<div id="uid-g-uid" style="display:none;">
6894</div>
What is common here is a div with the id "uid-g-uid". I want to be able to read the HTML of this div. I know this could be done using regular expressions; however, I am not sure how to do it.
Below is the regex that I have tried, but it doesn't work all the time:
/(?<=\<div\ id\=\"uid\-g\-uid\").*?(?=\<\/div\>)/gim

Thanks to @sikfire and @dave, I got the solution using DOM. Below is the working code that helped me:
$doc = new DOMDocument();
// @ suppresses the warnings libxml raises on malformed HTML
@$doc->loadHTML('The HTML Content Goes here');
$d = $doc->getElementById('uid-g-uid');
// textContent is an object property, not an array key
echo 'Value is ' . $d->textContent;
Didn't know this could be this simple! Thanks, guys!
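For reference, a sketch of the same approach run against the three div variants from the question. The helper name extractUid is made up, and it swaps the @ operator for libxml_use_internal_errors() to keep parser warnings quiet.

```php
<?php
// Hypothetical helper: returns the trimmed text of the div with id "uid-g-uid",
// or null when the HTML contains no such element.
function extractUid(string $html): ?string
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // silence parser warnings
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $node = $doc->getElementById('uid-g-uid');
    // textContent also includes the text of nested elements such as the <p>.
    return $node === null ? null : trim($node->textContent);
}

// The three variants shown in the question.
$samples = [
    '<div id="uid-g-uid" style="">2802</div>',
    "<div id=\"uid-g-uid\">\n<p class=\"MsoNormal\">6894</p>\n</div>",
    "<div id=\"uid-g-uid\" style=\"display:none;\">\n6894</div>",
];

foreach ($samples as $sample) {
    echo extractUid($sample), "\n";
}
```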

you can also look at the project here. PHP DomParser
This might help!

substrings result in incomplete html tags

I am Expanding/Condensing a Blog post using substrings, where the second substring is within a div tag that activates when a button is pressed (hence concatenating both substrings)
The code looks like as below:
<?php echo substr($f2, 0, 50);?>
<div id="<?php echo $f4; ?>" class = "hidden">
<?php echo substr($f2, 0, 5000);?></div>
My problem, however, is that if the blog post contains HTML tags (e.g. </li>, </p>) and the initial substring ends before the closing tag of that set, it obviously causes major formatting problems.
Is there a way around this using my current method, or am I going to need to use something like an XML stylesheet (in which case please guide me through it)
EDIT:
I have semi-completed my request using DOMDocument.
$second = substr($f2, 50, 5000);
$dom= new DOMDocument();
$dom->loadHTML($second);
$xpath = new DOMXPath($dom);
$body = $xpath->query('/html/body');
$secondoutput = ($dom->saveXml($body->item(0)));
$first = substr($f2, 0, 50);
$dom= new DOMDocument();
$dom->loadHTML($first);
$xpath = new DOMXPath($dom);
$body = $xpath->query('/html/body');
$firstoutput = ($dom->saveXml($body->item(0)));
This works, except that when the second substring is displayed it no longer has the previous formatting, as it has been purified.
Is there any way to reattach the previous HTML tag when the second substring is displayed?

There are different solutions to this problem, but substr is not particularly suitable (as you mentioned).
You could use regular expressions, or an HTML parser.
Go ahead and copy solutions from this question.

You may want to use Tidy to fix the truncated HTML.

You may want to parse whole HTML code with DOMDocument or SimpleHTMLDOM and then remove last elements until the post is short enough.
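A rough sketch of that idea with DOMDocument. The function name truncateHtml and the wrapper div are my own, the limit is counted in bytes of visible text, and the input is assumed to be ASCII or otherwise safe for loadHTML's default encoding handling.

```php
<?php
// Hypothetical helper: drops trailing top-level nodes until the visible text
// of the post is at most $maxChars bytes long, so no tag is ever cut in half.
function truncateHtml(string $html, int $maxChars): string
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    // The wrapper div gives one known container to trim and serialise.
    $doc->loadHTML('<div id="wrap">' . $html . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $wrap = $doc->getElementById('wrap');
    while ($wrap->lastChild !== null && strlen(trim($wrap->textContent)) > $maxChars) {
        $wrap->removeChild($wrap->lastChild);
    }

    // Serialise only the remaining children, without the wrapper itself.
    $out = '';
    foreach ($wrap->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

$post = '<p>First paragraph.</p><ul><li>One</li><li>Two</li></ul><p>Last paragraph.</p>';
$short = truncateHtml($post, 30);
echo $short, "\n";
```

This trims whole top-level elements only, so a single very long first paragraph would be removed entirely; descending into the last element instead of removing it would give finer control.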

Select first DOM Element of type text using phpQuery

Let's say I have this block of code:
<div id="id1">
This is some text
<div class="class1"><p>lala</p> Some markup</div>
</div>
What I want is only the text "This is some text", without the contents of the child element .class1. I can do it in jQuery using $('#id1').contents().eq(0).text(); how can I do this in phpQuery?
Thanks.

My bad, I was doing
pq('#id1.contents().eq(0).text()')
instead of
pq('#id1')->contents()->eq(0)->text()

If compatibility is what you are after, and you want to traverse/manipulate elements as DOM objects, then perhaps the PHP DOM XML library is what you are after: http://www.php.net/manual/en/book.domxml.php
Your code would look something like this:
$xml = xmldoc('<div id="id1">This is some text<div class="class1"><p>lala</p> Some markup</div></div>');
$node = $xml->get_element_by_id("id1");
$content = $node->get_content();
I'm sorry, I don't have time to run a test of this right now, but hopefully it sets you in the right direction, and forms the basis for a decent revision... There is a good list of DOM traversal functions in the PHP documentation though :)
References: http://www.php.net/manual/en/book.domxml.php, http://www.php.net/manual/en/function.domdocument-get-element-by-id.php, http://www.php.net/manual/en/function.domnode-get-content.php
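The DOM XML functions used in this answer belong to the PHP 4 extension. As an untested-in-production sketch of the same idea with the DOM extension that ships with PHP 5 and later, the first text node can be picked out with XPath:

```php
<?php
// Markup taken from the question.
$html = '<div id="id1">
This is some text
<div class="class1"><p>lala</p> Some markup</div>
</div>';

$doc = new DOMDocument();
libxml_use_internal_errors(true); // keep parser warnings quiet
$doc->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($doc);

// text()[1] selects only the first text-node child of #id1,
// so the nested .class1 content is not included.
$text = trim($xpath->evaluate("string(//div[@id='id1']/text()[1])"));

echo $text, "\n";
```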

Extract data from website via PHP

I am trying to create a simple alert app for some friends.
Basically I want to be able to extract the "price" and "stock availability" data from a webpage like the following two:
http://www.sparkfun.com/commerce/product_info.php?products_id=5
http://www.sparkfun.com/commerce/product_info.php?products_id=9279
I have built the e-mail and SMS alert part, but now I want to be able to get the quantity and price out of the webpages (those 2 or any other ones) so that I can compare the price and quantity available and alert us to make an order if a product is between some thresholds.
I have tried some regexes (found in some tutorials, but I am way too much of a n00b for this) and haven't managed to get this working. Any good tips or examples?

$content = file_get_contents('http://www.sparkfun.com/commerce/product_info.php?products_id=9279');
preg_match('#<tr><th>(.*)</th> <td><b>price</b></td></tr>#', $content, $match);
$price = $match[1];
preg_match('#<input type="hidden" name="quantity_on_hand" value="(.*?)">#', $content, $match);
$in_stock = $match[1];
echo "Price: $price - Availability: $in_stock\n";

It's called screen scraping, in case you need to google for it.
I would suggest that you use a dom parser and xpath expressions instead. Feed the HTML through HtmlTidy first, to ensure that it's valid markup.
For example:
$html = file_get_contents("http://www.example.com");
$html = tidy_repair_string($html);
$doc = new DomDocument();
$doc->loadHtml($html);
$xpath = new DomXPath($doc);
// Now query the document:
foreach ($xpath->query('//table[@class="pricing"]//th') as $node) {
    // a DOMElement cannot be echoed directly, so print its text content
    echo $node->nodeValue, "\n";
}

What ever you do: Don't use regular expressions to parse HTML or bad things will happen. Use a parser instead.
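A sketch of what the parser version could look like for the price and stock values. The markup below is hypothetical, pieced together from the patterns in the regex answer above (a "pricing" table and a hidden quantity_on_hand input); the real page would need to be inspected first.

```php
<?php
// Hypothetical markup standing in for the downloaded page.
$html = '<table class="pricing">
<tr><th>$19.95</th> <td><b>price</b></td></tr>
</table>
<form><input type="hidden" name="quantity_on_hand" value="42"></form>';

$doc = new DOMDocument();
libxml_use_internal_errors(true); // real-world pages rarely parse without warnings
$doc->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($doc);

// string() returns the text of the first matching node, or '' if nothing matches.
$price = trim($xpath->evaluate('string(//table[@class="pricing"]//th[1])'));
$inStock = $xpath->evaluate('string(//input[@name="quantity_on_hand"]/@value)');

echo "Price: $price - Availability: $inStock\n";
```

An empty string for either value is a useful signal that the page layout has changed.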

First, this question goes too far into details. Second, extracting data from a website might not be legitimate. However, I have some hints:
Use Firebug or the Chrome/Safari Inspector to explore the HTML content and find the pattern of the information you are interested in
Test your regex to see if it matches. You may need to do it many times (multi-pass parsing/extraction)
Write a client via cURL or, even simpler, use file_get_contents (note that some hosts disable loading URLs with file_get_contents)
Personally, I'd rather use Tidy to convert to valid XHTML and then use XPath to extract the data, instead of regex. Why? Because XHTML is not regular and XPath is very flexible. You can learn XSLT to transform it.
Good luck!

You are probably best off loading the HTML code into a DOM parser like this one and searching for the "pricing" table. However, any kind of scraping you do can break whenever they change their page layout, and is probably illegal without their consent.
The best way, though, would be to talk to the people who run the site, and see whether they have alternative, more reliable forms of data delivery (Web services, RSS, or database exports come to mind).

The simplest method to extract data from a website. I found that all my data sits within <h3> tags only, so I prepared this:
<?php
include('simple_html_dom.php');
// Create DOM from URL, paste your destined web url in $page
$page = 'http://facebook4free.com/category/facebookstatus/amazing-facebook-status/';
$html = new simple_html_dom();
//Within $html your webpage will be loaded for further operation
$html->load_file($page);
// Find all links
$links = array();
//Within find() function, I have written h3 so it will simply fetch the content from <h3> tag only. Change as per your requirement.
foreach($html->find('h3') as $element)
{
    $links[] = $element;
}
reset($links);
//$out will be having each of HTML element content you searching for, within that web page
foreach ($links as $out)
{
    echo $out;
}
?>
