I am really new to PHP, so I'm still getting to grips with it.
I am using this bit of code to pull in a world market feed.
<?php
$homepage = file_get_contents('http://www.news4trader.com/cgi-bin/google_finance.cgi?widget=worldmarkets');
echo $homepage;
?>
I just wanted to know how I can strip the Google links out of it so that the market titles are just static text.
All help is very much appreciated.
You can use the PHP function strip_tags() like this:
<?php
$homepage = file_get_contents('http://www.news4trader.com/cgi-bin/google_finance.cgi?widget=worldmarkets');
echo strip_tags($homepage, "<style><div><table><tr><td>");
?>
Just include all the tags you want to allow in the second argument.
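As a rough illustration of what that does to a link, here is a minimal sketch. The sample markup is made up for the example and is not the real feed output. strip_tags() removes the <a> tags but keeps the text inside them:

```php
<?php
// Hypothetical sample standing in for the markup returned by file_get_contents().
$html = '<table><tr><td><a href="http://www.google.com/finance?q=abc">Dow Jones</a></td><td>12,345.67</td></tr></table>';

// Allow only the table markup; <a> is not in the list, so it is removed.
// The text between the removed tags is kept.
$clean = strip_tags($html, '<table><tr><td>');

echo $clean;
```

Note that strip_tags() does not look at attributes, so it removes every link, not only the Google ones.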
You can use preg_replace() with a regex pattern to filter it out. This is simple, but not very flexible if you want to do more with your loaded data. PHP provides a nice library called DOMDocument (http://php.net/manual/de/class.domdocument.php), which lets you work with your document very flexibly.
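A minimal sketch of the DOMDocument route, assuming a made-up snippet in place of the real feed: every <a> element is replaced by a text node holding its text, so the titles become static text.

```php
<?php
// Hypothetical sample; in the question this would come from file_get_contents().
$html = '<div><a href="http://www.google.com/finance?q=abc">FTSE 100</a> <span>5,900.10</span></div>';

$doc = new DOMDocument();
// Collect parser warnings internally instead of printing them (real pages are rarely valid).
libxml_use_internal_errors(true);
// These flags ask libxml not to add <html>/<body> wrappers or a doctype to the fragment.
$doc->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
libxml_clear_errors();

// Copy the live node list into an array first, because replacing nodes changes the list.
foreach (iterator_to_array($doc->getElementsByTagName('a')) as $link) {
    // Swap the <a> element for a plain text node with the same text.
    $link->parentNode->replaceChild($doc->createTextNode($link->textContent), $link);
}

$result = trim($doc->saveHTML());
echo $result;
```

To unwrap only the Google links, you could check $link->getAttribute('href') inside the loop before replacing.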
You could use the DOMDocument class; it's used for exactly that.
http://php.net/manual/en/class.domdocument.php
You should have a basic idea of OOP.
If you struggle with it, you could use strpos, substr and the like, but that would be hard.
strpos: http://php.net/manual/en/function.strpos.php
substr: http://php.net/manual/en/function.substr.php
You can use a regex, something like this:
/<a (.+google.+)>.+<\/a>/
This matches a link that has any attribute or value with the word google in it.
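One caveat: .+ is greedy, so on a page with several links the pattern above can swallow everything from the first <a to the last </a>. A tightened variant might look like the sketch below. The sample markup is invented, and the pattern is only one possible way to write it:

```php
<?php
// Hypothetical sample with one Google link and one unrelated link.
$html = '<td><a href="http://www.google.com/finance?q=x">Nikkei 225</a></td>'
      . '<td><a href="http://example.com/">other</a></td>';

// [^>]* keeps the "google" test inside a single opening tag;
// (.*?) is lazy, so it stops at the first closing </a>.
$pattern = '/<a\s[^>]*google[^>]*>(.*?)<\/a>/is';

// Replace the whole link with just its text (capture group 1).
$result = preg_replace($pattern, '$1', $html);

echo $result;
```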
I'm grabbing data from a published google spreadsheet, and all I want is the information inside of the content div (<div id="content">...</div>)
I know that the content starts off as <div id="content"> and ends as </div><div id="footer">
What's the best / most efficient way to grab the part of the DOM that is inside there? I was thinking of a regular expression (see my example below), but it is not working and I'm not sure if it is that efficient...
header('Content-type: text/plain');
$foo = file_get_contents('https://docs.google.com/spreadsheet/pub?key=0Ahuij-1M3dgvdG8waTB0UWJDT3NsUEdqNVJTWXJNaFE&single=true&gid=0&output=html&ndplr=1');
$start = '<div id="content">';
$end = '<div id="footer">';
$foo = preg_replace("#$start(.*?)$end#",'$1',$foo);
echo $foo;
UPDATE
I guess my other question is whether it is simpler and easier to use a regex with start and end points than to parse a DOM that might have errors and then extract the piece I need. It seems like regex would be the way to go, but I would love to hear your opinions.
Try changing your regex to $foo = preg_replace("#$start(.*?)$end#s",'$1',$foo); , the s modifier changes the . to include new lines. As it is, your regex would need all the content between the tags to be on the same line to match.
If your HTML page is any more complex than that, then regex probably won't cut it and you'd need to look into a parser like DOMDocument or Simple HTML DOM.
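Note that preg_replace() only removes the two markers and leaves the rest of the page in place. If you want just the piece in between, preg_match() with a capture group may be closer to what you need. A minimal sketch, using a made-up page and the end marker described in the question:

```php
<?php
// Hypothetical page standing in for the spreadsheet HTML.
$page = "<div id=\"header\">h</div>\n<div id=\"content\">\n<table><tr><td>1</td></tr></table>\n</div><div id=\"footer\">f</div>";

$start = '<div id="content">';
$end   = '</div><div id="footer">';

// preg_quote() escapes regex metacharacters in the markers, including the # delimiter.
// The s modifier lets . match newlines.
$pattern = '#' . preg_quote($start, '#') . '(.*?)' . preg_quote($end, '#') . '#s';

// $m[1] holds only the text between the markers; null if they were not found.
$content = preg_match($pattern, $page, $m) ? $m[1] : null;

echo $content;
```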
If you have a lot to do, I would recommend you take a look at http://simplehtmldom.sourceforge.net
It's really good for this sort of thing.
Do not use regex, it can fail.
Use PHP's built-in DOM parser:
http://php.net/manual/en/class.domdocument.php
You can easily traverse and parse the relevant content.
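A minimal sketch of that approach, assuming a small made-up page in place of the spreadsheet: look up the element by its id and serialise its children.

```php
<?php
// Hypothetical page standing in for the spreadsheet HTML.
$page = '<html><body><div id="header">h</div>'
      . '<div id="content"><p>Row 1</p><p>Row 2</p></div>'
      . '<div id="footer">f</div></body></html>';

$doc = new DOMDocument();
// Collect parser warnings internally; real pages often contain invalid markup.
libxml_use_internal_errors(true);
$doc->loadHTML($page);
libxml_clear_errors();

$node  = $doc->getElementById('content');
$inner = '';
if ($node !== null) {
    // saveHTML($child) serialises one node, so this builds the inner HTML of the div.
    foreach ($node->childNodes as $child) {
        $inner .= $doc->saveHTML($child);
    }
}

echo $inner;
```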
I want to get the content between the code tags in an HTML document.
I tried doing it with preg_match...
Could anybody help me?
If you want to use preg_match, do:
preg_match("/<code>(.+?)<\/code>/is", $content, $matches);
Then access it with
$matches[1]
In general, though, an HTML parser will be more useful and more reliable, and it is preferred over regular expressions for this kind of task.
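If the document contains more than one code tag, preg_match_all() with the same pattern collects all of them. A small sketch on an invented input:

```php
<?php
// Hypothetical input with two code blocks, one upper-case and spanning lines.
$content = "<p>Intro</p><code>echo 1;</code><p>More</p><CODE>\necho 2;\n</CODE>";

// i: case-insensitive tag names; s: let . match newlines; +? keeps each match short.
preg_match_all('/<code>(.+?)<\/code>/is', $content, $matches);

// $matches[1] holds the captured contents, one entry per code tag.
$snippets = array_map('trim', $matches[1]);

print_r($snippets);
```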
It's easier if you use phpQuery or QueryPath which allow:
print qp($html)->find("code")->text();
// looks for the <code> tag and prints the text content
If you want to try regular expressions for this, check out some of the tools listed in https://stackoverflow.com/questions/89718/is-there-anything-like-regexbuddy-in-the-open-source-world for help.
$text = file_get_contents('http://www.example.com/file.php?id=name');
echo preg_replace('#<a.*?>.*?</a>#i', '', $text)
the link contains this content:
text text text. <br><a href='http://www.example.com' target='_blank' title='title' style='text-decoration:none;'>name</a>
What is the problem with this script?
You can't parse HTML with regular expressions. Use an XML/HTML parser.
Tempted to flag your question, but there's no option for "Report user for summoning Cthulhu"
I'd recommend reading: http://www.codinghorror.com/blog/2009/11/parsing-html-the-cthulhu-way.html
RegEx is very poor and not at all intended to parse HTML. That's why there are HTML parsing libraries. Find and use one for PHP. :)
Use <a[^>]+>[^<]*</a> (works fine as long as there's just text and no tags inside the a element).
Use strip_tags() this way:
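Applied to the sample line from the question, that pattern could be used like this (a sketch only; the variable names are mine):

```php
<?php
// The sample content quoted in the question.
$text = "text text text. <br><a href='http://www.example.com' target='_blank' title='title' style='text-decoration:none;'>name</a>";

// [^>]+ stays inside the opening tag; [^<]* allows only plain text inside the link.
$result = preg_replace('#<a[^>]+>[^<]*</a>#i', '', $text);

echo $result;
```

Here the link and its text are removed entirely, leaving only what came before it.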
$t = 'http://yoururl.com/test1.php';
$t1 = file_get_contents($t);
$text = strip_tags($t1);
This should get rid of all the links in the page you are reading (along with every other tag, since no allowed tags are given). Have a look at the reference anyway, as it may not work for complicated elements: http://php.net/manual/en/function.strip-tags.php
What's wrong with my code?
I wish to get all dates from
but my array is empty.
<?php
$url = "http://weather.yahoo.com/";
$page_all = file_get_contents($url);
preg_match_all('#<div id="myLocContainer">(.*)</div>#', $page_all, $div_array);
echo "<pre>";
print_r($div_array);
echo "</pre>";
?>
Thanks
You want to match content that spans multiple lines, but your pattern does not use the s (dot-all) modifier, so . does not match newlines.
Try using this:
preg_match_all('#<div id="myLocContainer">(.*?)</div>#sim', $page_all, $div_array);
Please note that regular expressions are not well suited to parsing HTML content because of the hierarchical nature of HTML documents.
Try adding the "m" and "s" modifiers; there might be new lines in the div you need. Like this:
preg_match_all('#<div id="myLocContainer">(.*)</div>#ms', $page_all, $div_array);
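To see which modifier actually makes the difference, here is a small sketch on a made-up snippet (not the real Yahoo markup). The m modifier only changes how ^ and $ behave; it is s that lets . match newlines:

```php
<?php
// Hypothetical markup where the div content spans several lines.
$page = "<div id=\"myLocContainer\">\n  <span>Mon</span>\n</div>";

// No modifiers: . cannot cross the newline, so nothing matches.
$plain = preg_match('#<div id="myLocContainer">(.*?)</div>#', $page);

// m alone changes only ^ and $, so this still does not match.
$onlyM = preg_match('#<div id="myLocContainer">(.*?)</div>#m', $page);

// s makes . match newlines, so the content is captured.
$withS = preg_match('#<div id="myLocContainer">(.*?)</div>#s', $page, $m);

var_dump($plain, $onlyM, $withS);
```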
Before messing around with regex, try HTML scraping. This "HTML Scraping in PHP" article might give you some ideas on how to do it in a more elegant and (possibly) faster way.
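$doc = new DOMDocument;
// loadHTMLFile() parses HTML; load() expects well-formed XML.
libxml_use_internal_errors(true); // collect warnings from invalid markup instead of printing them
$doc->loadHTMLFile('http://weather.yahoo.com/');
$container = $doc->getElementById('myLocContainer');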
You need to escape special characters in your regular expression, like the following:
~\<div id\=\"myLocContainer\"\>(.*)\<\/div\>~
Also check whether there is a newline problem, as mentioned by @eyazici and @kgb.
Test your response before running the regex search. Then you'll know which part isn't working.
I'm trying to write a simple PHP script that finds the src attributes of all images in an HTML text and then replaces each src it finds with some other text after making some conditional changes.
Something like this:
#preg_match_all('/<img\s src="([a-zA-Z0-9\.;:\/\?&=_|\r|\n]{1,})"/isxmU', $body, $images);
Now I have all the srcs in the $images variable, and I do:
foreach ($images as $img) {
..my changes here..
}
And now... how can I put the changed srcs back into the $body variable?
Many thanks in advance,
You should look into preg_replace_callback(), which will allow you to postprocess each match however you like, using a callback function. (You would use it instead of your preg_match_all(), not in addition to it.)
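A minimal sketch of how that could look. The rewrite rule (prefixing relative URLs with a base URL) and the sample markup are invented for the example, and the pattern is simplified compared to the one in the question:

```php
<?php
// Hypothetical body with one relative and one absolute image URL.
$body = '<p><img src="/images/a.png" alt="A"> and <img src="http://cdn.example.com/b.png"></p>';

// Group 1: everything up to and including src=" ; group 2: the URL; group 3: the closing quote.
$pattern = '/(<img\s[^>]*?src=")([^"]*)(")/i';

$body = preg_replace_callback($pattern, function (array $m) {
    $src = $m[2];
    // Made-up condition: only relative URLs get a base URL prepended.
    if (strpos($src, 'http') !== 0) {
        $src = 'http://www.example.com' . $src;
    }
    // Return the full replacement for this match, with the new src in place.
    return $m[1] . $src . $m[3];
}, $body);

echo $body;
```

Because the callback returns the replacement for each match, there is no separate step for writing the changed values back into $body.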
Use an HTML DOM parser instead; it is much easier to use and maintain: http://simplehtmldom.sourceforge.net/
I asked a question yesterday about a good interface for modifying and traversing HTML files. You may be interested in this:
jQuery port to PHP
This may be a good alternative if you are already familiar with jQuery's API.
A non-validating parser may be even better if you need to work with badly formed HTML.
http://pear.php.net/package/XML_HTMLSax3
I think the easiest answer you're looking for is to do a str_replace.
foreach ($images as $img) {
..my changes here..
$body = str_replace($original_string, $modified_string, $body);
}
Isn't preg_replace what you want? With the e modifier the replacement text is eval'd, so you can have a function do to the text being replaced the same thing you would have done in your foreach loop.
EDIT: preg_replace_callback is cleaner than using the e modifier with preg_replace (the e modifier was deprecated in PHP 5.5 and removed in PHP 7.0). I didn't think of that while writing my answer, so chaos's answer is better.