Alright, I have some code that finds a <code></code> tag set and escapes any code inside it so that it displays as text instead of being treated as markup. Everything works, but my problem is this: how can I find one or more tag sets inside, say, $content, clean the code, and still keep ALL of the other content? Here is my code. It checks for matches and cleans each one it finds, but after cleaning it has no way to put the result back into its original position in $content. ($content comes from a form.)
<?php
preg_match_all("'<code>(.*?)</code>'si", $content, $match);
if ($match) {
foreach ($match[1] as $snippet) {
$fixedCode = htmlspecialchars($snippet, ENT_QUOTES);
}
}
?>
What do I do with $fixedCode, now that it is clean?
Using regex to parse HTML is fragile, because nested or malformed tags easily break the pattern. I'd suggest getting familiar with a DOM parser, such as PHP's DOM module.
The DOM extension allows you to operate on XML documents through the DOM API with PHP 5.
Using the DOM module, in order to get the HTML/data from <code> tags in the document, you'd want to do something like this:
<?php
//So many variables!
$html = "<div> Testing some <code>code</code></div><div>Nother div, nother <code>Code</code> tag</div>";
$dom_doc = new DOMDocument;
$dom_doc->loadHTML($html);
$code = $dom_doc->getElementsByTagName('code');
foreach ($code as $scrap) {
echo htmlspecialchars($scrap->nodeValue, ENT_QUOTES), "<br />";
}
?>
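If the regex from the question is kept anyway, one way to put each cleaned snippet back in its original position is preg_replace_callback(), which replaces every match in place and returns the whole string. This is only a minimal sketch and not part of the DOM approach above; the sample value of $content is made up for illustration.

```php
<?php
// Hypothetical form input; $content stands in for the submitted text.
$content = 'Intro <code><b>bold</b></code> and <code>a && b</code> outro';

// Escape only what sits between <code> and </code>; everything else is kept as-is.
$cleaned = preg_replace_callback(
    '~<code>(.*?)</code>~si',
    function (array $m): string {
        // $m[1] is the text captured between the tags.
        return '<code>' . htmlspecialchars($m[1], ENT_QUOTES) . '</code>';
    },
    $content
);

echo $cleaned, "\n";
```

Because the callback returns the replacement for each match, there is no need to track positions by hand. Note that this sketch always writes the tags back in lowercase.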
Related
I am trying to create a custom CMS, every page has a unique ID and on every page is a string (<--UNIQUEID-->) at the place where the CMS text has to come.
I am trying to replace that string with the text that is saved in a database for that page, but I can't get that to work. I am trying this with DOM documents.
I have this at the moment:
This is before the <html> tag:
ob_start();
And after the </html> tag:
if ((($html = ob_get_clean()) !== false) && (ob_start() === true))
{
$dom = new DOMDocument();
$dom->loadHTML($html); // load the output HTML
/* your specific search and replace logic goes here */
$StringToReplace = '<--754764-->';
$ReplacementString = 'test';
str_replace($StringToReplace, $ReplacementString, $html);
echo $dom->saveHTML(); // output the replaced HTML
}
It is showing the page, but it's not showing the replacement string text.
You're trying to do two things and getting confused in the process.
When you load your buffered HTML output into a DOMDocument object (via DOMDocument::loadHTML), the object holds its own parsed copy of the HTML. You then call str_replace on $html, and then output the HTML from the DOMDocument.
By the time you reach the str_replace call, the internal state of the DOMDocument is independent of $html, so the replacement has no effect on what saveHTML() outputs.
If you're certain that the placeholder will be of exactly that form, you can assign the result of str_replace back to $html and then echo $html; instead of calling saveHTML(). This also saves you from having to worry about your output being compliant and parsing properly (DOMDocument is stricter than most browsers when it comes to that).
The code you posted doesn't use the DOMDocument object to do any transformation of the document. It just parses the HTML and then generates another document that is functionally identical to the original.
You simply don't need the DOMDocument object.
The str_replace() call does perform the expected transformation, but the value it returns is completely ignored. You have to echo that return value to get the desired result.
The following code is enough:
if (($html = ob_get_clean()) !== false) {
/* your specific search and replace logic goes here */
$StringToReplace = '<--754764-->';
$ReplacementString = 'test';
echo str_replace($StringToReplace, $ReplacementString, $html);
}
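If a page ends up containing more than one placeholder, the same idea extends to a map of placeholder => text passed to strtr(). The sketch below is only an assumption about how the CMS data might look: the second ID and both texts are made up, and in practice the array would be filled from the database.

```php
<?php
// Hypothetical map of placeholder => text, e.g. built from a database query.
$replacements = [
    '<--754764-->' => 'test',
    '<--754765-->' => 'another text', // made-up second placeholder
];

// Stand-in for the buffered page output returned by ob_get_clean().
$html = '<p><--754764--></p><p><--754765--></p>';

// strtr() returns a new string; $html itself is not modified.
$output = strtr($html, $replacements);

echo $output, "\n";
```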
I don't know what to research or where to start here.
What I'm trying to do is use PHP to read an HTML page and pull out the raw text contained inside a div.
the div is this
<div class="thingy">
test
</div>
When the php is executed, I want it to echo
Test
Is there an easy snippet for this, or can someone post a small script?
Edit: the html page with the Div is on another webpage.
What you're looking to do is parse HTML. Use the DOM module that comes with PHP to do this: http://php.net/manual/en/book.dom.php
You do NOT want to try to do this with regular expressions.
If you want to remove ALL the HTML tags from a document, use the PHP strip_tags() function: http://us3.php.net/strip_tags
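A minimal sketch of the DOM approach, assuming the page has already been fetched into a string (for a remote page, something like file_get_contents() could supply it). The inline $html here is a stand-in for that page, and the XPath class test is one common way to match a class name, not the only one.

```php
<?php
// Stand-in for the fetched page; in practice this would come from the remote URL.
$html = '<html><body><div class="thingy">
test
</div></body></html>';

$doc = new DOMDocument();
libxml_use_internal_errors(true); // keep warnings about sloppy markup quiet
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
// Match any div whose class attribute contains the word "thingy".
$nodes = $xpath->query('//div[contains(concat(" ", normalize-space(@class), " "), " thingy ")]');

// textContent gives the raw text; trim() drops the surrounding line breaks.
$text = $nodes->length > 0 ? trim($nodes->item(0)->textContent) : '';

echo $text, "\n";
```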
While this could possibly be done using regex, I would recommend using a DOM parser. My recommendation is Simple HTML DOM Parser. Using it, here's how you would do what you want:
$string = "<div class=\"thingy\">test</div>";
$html = str_get_html($string); // create the DOM object
$div = $html->find('div[class=thingy]', 0); // find the first div with a class of 'thingy'
echo $div->plaintext; // echo the text contents (plaintext is a property, not a method)
If you only need the text without the tags, you can use strip_tags() like this:
<?php
$str = '<div class="thingy">test</div>';
echo strip_tags($str);//OUTPUT : test
?>
Since your HTML is on another web page, start output buffering, include that file in your main PHP script, and then do all the manipulation on the captured content.
I'm using PHP's "simplexml_load_file" to get some data from Flickr.
My goal is to get the photo url.
I'm able to get the following value (assigned to PHP variable):
<p>codewrecker posted a photo:</p>
<p><img src="http://farm3.static.flickr.com/2298/2302759205_4fb109f367_m.jpg" width="180" height="240" alt="Santa Monica Pier" /></p>
How can I extract just this part of it?
http://farm3.static.flickr.com/2298/2302759205_4fb109f367_m.jpg
Just in case it helps, here's the code I'm working with:
<?php
$xml = simplexml_load_file("http://api.flickr.com/services/feeds/photos_public.gne?id=19725893@N00&lang=en-us&format=xml&tags=carousel");
foreach($xml->entry as $child) {
$flickr_content = $child->content; // gets html including img url
// how can I get the img url from "$flickr_content"???
}
?>
You can probably get away with using a regular expression for this, assuming that the way the HTML is formed is pretty much going to stay the same, e.g.:
if (preg_match('/<img src="([^"]+)"/i', $string, $matches)) {
$imageUrl = $matches[1];
}
This is fairly fragile, and if the HTML is likely to change (e.g. the order of attributes in the <img> tag, or the risk of malformed HTML), you would be better off using an HTML parser.
It doesn't solve your problem (and is probably total overkill), but it's worth mentioning because I've used the library on 2 projects and it's well written.
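A minimal sketch of the parser route, assuming the entry's content has already been read into a string as in the question. The sample string is copied from the question; picking the first <img> is an assumption about what is wanted.

```php
<?php
// HTML fragment as shown in the question (the value of $flickr_content).
$flickr_content = '<p>codewrecker posted a photo:</p>
<p><img src="http://farm3.static.flickr.com/2298/2302759205_4fb109f367_m.jpg" width="180" height="240" alt="Santa Monica Pier" /></p>';

$doc = new DOMDocument();
libxml_use_internal_errors(true); // tolerate fragment-style markup
$doc->loadHTML($flickr_content);
libxml_clear_errors();

// Take the src attribute of the first <img>, if there is one.
$imageUrl = null;
foreach ($doc->getElementsByTagName('img') as $img) {
    $imageUrl = $img->getAttribute('src');
    break;
}

echo $imageUrl, "\n";
```

Unlike the regular expression, this does not depend on src being the first attribute of the tag.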
phpFlickr - http://phpflickr.com/
Easy way: Combination of substr and strpos to extract first the tag and then the src='...' value, and finally the target string.
Slightly more difficult way (BUT MUCH MORE ROBUST): Use an XML parsing library such as SimpleXML.
I hope this is helpful. I enjoy using xpath to cut through the XML I get back from SimpleXML:
<?php
$xml = new SimpleXMLElement("http://api.flickr.com/services/feeds/photos_public.gne?id=19725893@N00&lang=en-us&format=xml&tags=carousel", NULL, True);
$images = $xml->xpath('//img'); //use xpath on the XML to find the img tags
foreach($images as $image){
echo $image['src'] ; //here is the image URL
}
?>
I've done this before but can't find my code snippet.
I'd like to parse an html file and pull everything into my browser that sits between some span tags. There are other span tags in the html that I do not want so I figured I would limit the parsing to just the span tags that have the same css class. Can someone please give me an example of how to do this? Thanks.
$tags = $doc->getElementsByTagName('span');
This is a single row of the html I am trying to parse
<span class='close'>test row</span>
First attempt (untested):
$elts = $doc->getElementsByTagName('span');
foreach ($elts as $elt)
{
$className = $elt->getAttribute('class');
if (in_array('close', explode(' ', $className), true)) // array_search() returns index 0 (falsy) when 'close' is the first class
{
// Do things with $elt since it matches.
}
}
In my opinion and experience, it can be done with getElementsByTagName(). Just use some ajax-y function to call for it and you have your DOM element :)
I am trying to create a simple alert app for some friends.
Basically I want to be able to extract the "price" and "stock availability" data from a webpage like the following two:
http://www.sparkfun.com/commerce/product_info.php?products_id=5
http://www.sparkfun.com/commerce/product_info.php?products_id=9279
I have made the e-mail and SMS alert part, but now I want to be able to get the quantity and price out of the webpages (those 2 or any other ones) so that I can compare the price and quantity available and alert us to make an order if a product is between some thresholds.
I have tried some regex (found in some tutorials, but I am way too n00b for this) but haven't managed to get it working. Any good tips or examples?
$content = file_get_contents('http://www.sparkfun.com/commerce/product_info.php?products_id=9279');
preg_match('#<tr><th>(.*)</th> <td><b>price</b></td></tr>#', $content, $match);
$price = $match[1];
preg_match('#<input type="hidden" name="quantity_on_hand" value="(.*?)">#', $content, $match);
$in_stock = $match[1];
echo "Price: $price - Availability: $in_stock\n";
It's called screen scraping, in case you need to google for it.
I would suggest that you use a DOM parser and XPath expressions instead. Feed the HTML through HTML Tidy first, to ensure that it's valid markup.
For example:
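$html = file_get_contents("http://www.example.com");
$html = tidy_repair_string($html); // requires the tidy extension
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
// Now query the document (// before th, since th cells normally sit inside tr rows):
foreach ($xpath->query('//table[@class="pricing"]//th') as $node) {
    echo $node->nodeValue, "\n"; // a DOMNode cannot be echoed directly
}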
Whatever you do: Don't use regular expressions to parse HTML or bad things will happen. Use a parser instead.
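As a rough sketch of how the stock check might look with a parser: the hidden input name quantity_on_hand comes from the regex in the question, while the inline HTML, the value 42, and the threshold of 50 are made up for illustration. The real page would be fetched first and may be structured differently.

```php
<?php
// Hypothetical fragment standing in for the fetched product page.
$html = '<form><input type="hidden" name="quantity_on_hand" value="42"></form>';

$doc = new DOMDocument();
libxml_use_internal_errors(true); // real-world pages are rarely valid markup
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
// string(...) yields '' when the input is missing, so no undefined-index errors.
$raw = $xpath->evaluate('string(//input[@name="quantity_on_hand"]/@value)');
$inStock = $raw === '' ? null : (int) $raw;

// Made-up threshold for the alert.
$threshold = 50;
$shouldAlert = $inStock !== null && $inStock < $threshold;

echo 'Availability: ', var_export($inStock, true), ', alert: ', var_export($shouldAlert, true), "\n";
```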
First, answering this question fully would go into too much detail. Second, extracting data from a website might not be legitimate. However, I have some hints:
Use Firebug or the Chrome/Safari Inspector to explore the HTML content and find the pattern around the information you are interested in.
Test your regexes to see if they match. You may need to do this many times (multi-pass parsing/extraction).
Write a client with cURL or, even simpler, use file_get_contents (NOTE that some hosts disable loading URLs with file_get_contents).
Personally, I'd rather use Tidy to convert the page to valid XHTML and then use XPath to extract the data, instead of regex. Why? Because HTML is not a regular language, and XPath is very flexible. You can also learn XSLT to transform the result.
Good luck!
You are probably best off loading the HTML code into a DOM parser like this one and searching for the "pricing" table. However, any kind of scraping you do can break whenever they change their page layout, and is probably illegal without their consent.
The best way, though, would be to talk to the people who run the site, and see whether they have alternative, more reliable forms of data delivery (Web services, RSS, or database exports come to mind).
Here is the simplest method I know to extract data from a website. I found that all of my data sits inside <h3> tags, so I prepared this script.
<?php
include('simple_html_dom.php');
// Create DOM from URL, paste your destined web url in $page
$page = 'http://facebook4free.com/category/facebookstatus/amazing-facebook-status/';
$html = new simple_html_dom();
//Within $html your webpage will be loaded for further operation
$html->load_file($page);
// Collect the matching elements (named $links here, but they are <h3> elements)
$links = array();
//Within find() function, I have written h3 so it will simply fetch the content from <h3> tag only. Change as per your requirement.
foreach($html->find('h3') as $element)
{
$links[] = $element;
}
reset($links);
//$out will be having each of HTML element content you searching for, within that web page
foreach ($links as $out)
{
echo $out;
}
?>