Regex PHP to get data from a website

I want to scrape the following data (the pink part in the image) from http://www.kitco.com/market/
I was able to scrape data from the "The World Spot Price - Asia/Europe/NY markets" HTML table using the code below, but I am not able to get the London Fix data from the table below it. What changes should I make to the regular expression? I have tried many combinations, but none of them work.
My code looks like this:
$html= get_url_contents("http://www.kitco.com/market/");
//echo $html;
preg_match_all('!Gold\s+([0-9.]+)\s+([0-9.]+)!i',$html,$matches);
$patt = "/<td[^>]*width=['\"]68['\"][^>]*>([0-9\.]+)<\/td>\s*<td[^>]*width=['\"]68['\"][^>]*>([0-9\.]+)<\/td>/i";

Please do not parse HTML with regular expressions (you can see why in this mandatory post).
That being said, you can use an HTML parser, such as the Simple HTML DOM Parser, to process the table. Take a look at this previous SO post to get started in the right direction.
EDIT: As per your comment, you could try something like this: <td bgcolor=".+?">\s*<p>\s*(.+?)\s*</p>\s*</td>. I do, however, advise against this approach.
This will match and put the values into regex groups, which you can access later.
NOTE: Also as per your comment, the regex you propose is susceptible to style changes as well, so if they change the width of the columns, your regex will most likely fail.
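As a rough sketch of the parser route, the snippet below uses PHP's bundled DOMDocument and DOMXPath instead of the Simple HTML DOM Parser named above. The sample markup, the bgcolor value and the numbers are made up for illustration; the real table on kitco.com will look different, so the XPath query would need adjusting after inspecting the page.

```php
<?php
// Made-up sample markup; the real kitco.com table layout is not known here.
$html = '<table>
  <tr><th>Metal</th><th>AM</th><th>PM</th></tr>
  <tr><td bgcolor="#f3f3e4"><p>Gold</p></td><td><p>1650.50</p></td><td><p>1651.50</p></td></tr>
  <tr><td bgcolor="#f3f3e4"><p>Silver</p></td><td><p>31.20</p></td><td><p>31.35</p></td></tr>
</table>';

// Real-world HTML is rarely valid; collect parser warnings quietly.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Collect every row that has <td> cells, keyed by the text of its first cell.
$rows = [];
foreach ($xpath->query('//tr[td]') as $tr) {
    $cells = [];
    foreach ($xpath->query('td', $tr) as $td) {
        $cells[] = trim($td->textContent);
    }
    $rows[$cells[0]] = array_slice($cells, 1);
}

print_r($rows);
```

Because the query looks at the row and cell structure rather than at width or bgcolor attributes, a cosmetic change to the page is less likely to break it.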

Related

How to display text in the same format as added in mysql text field using php echo

I want to display a large amount of text using a PHP echo command. The data is stored in a text field in a MySQL database table. What I want to achieve is that the data is displayed in the same way in which I stored it in the text field.
for example:
As entered in the MySQL table through its interface:
One reason people lie is to achieve personal power.
Achieving personal power is helpful for someone who pretends to be more confident than he really is. For example, one of my friends threw a party at his house last month. He asked me to come to his party and bring a date.
Although this lie helped me at the time, since then it has made me look down on myself.
Should be displayed exactly as above, rather than:
One reason people lie is to achieve personal power. Achieving personal power is helpful for someone who pretends to be more confident than he really is. For example, one of my friends threw a party at his house last month. He asked me to come to his party and bring a date. Although this lie helped me at the time, since then it has made me look down on myself.
Any ideas/tips on how this can be achieved?
I know that I can manually insert HTML tags into the text for formatting, but I don't want to do that by hand. Is there any way around it?

nl2br($foo); will automatically add a <br> tag wherever there is a line break in $foo. You can echo nl2br($foo);.
As an alternative, try the <pre> tag: <pre><?php echo $foo; ?></pre>. You may need more styling, but it will preserve whitespace such as your line breaks.
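A minimal sketch of the nl2br() approach follows. Here $foo is a stand-in for the value fetched from the MySQL text field, and the htmlspecialchars() call is an addition not mentioned above: it is there so that any markup in the stored text is shown literally rather than interpreted by the browser.

```php
<?php
// $foo stands in for the text column fetched from MySQL.
$foo = "One reason people lie is to achieve personal power.\n"
     . "Achieving personal power is helpful for someone who pretends to be more confident.";

// Escape first, then turn each line break into "<br />" followed by the original newline.
$out = nl2br(htmlspecialchars($foo));

echo $out;
```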

My solution is:
I'm using the GWT TextArea widget (textAreaWidget).
Before inserting the TextArea string into the MySQL table, I replace all newline and tab characters:
-new line
String toInsert=textAreaWidget.getText().replaceAll(Character.toString((char) 10), "\n\r");
-tab
String toInsert=textAreaWidget.getText().replaceAll(Character.toString((char) 9), "\t");
Example:
http://www.tutorialspoint.com/gwt/gwt_textarea_widget.htm

Using regex to return occurrences of multiple tags

I am trying to use regex to capture all the <div> and <span> tags into a PHP array.
My code for getting a single tag is:
[#<div>(.*?)</div>#i]
A single tag is not a problem, but I'm stuck trying to select two tags at once. My attempt is as follows:
[#<div>?<span>?(.*?)</div>?</span>?#i]
Any help will be appreciated.

Would a regex like this work for your purposes?
[#<(div|span)>(.*?)</(div|span)>#i]
The first and third capture groups would tell you the type, and the 2nd would contain the info you want to capture. Not sure what you mean by "select two tags at once" however... maybe nested?
http://rubular.com/r/ojYJjXFMZt
Using a proper parser is probably the way to go, however, as Ranhiru Cooray suggested.
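One possible variation on the regex idea, sketched with made-up input: a backreference (\1) forces the closing tag to be the same as the opening tag, so a <div> cannot be closed by a </span>. This only handles tags without attributes and without nesting of the same tag, which is one of the reasons a parser remains the safer choice.

```php
<?php
// Made-up input for illustration.
$html = '<div>first</div><p>skip</p><span>second</span><DIV>third</DIV>';

// Group 1 captures the tag name; \1 requires the closing tag to match it.
// The "s" flag lets "." span line breaks, "i" ignores case.
preg_match_all('#<(div|span)>(.*?)</\1>#is', $html, $matches, PREG_SET_ORDER);

$found = [];
foreach ($matches as $m) {
    $found[] = [strtolower($m[1]), $m[2]];   // [tag name, inner text]
}

print_r($found);
```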

How to filter data after using get contents

I want to know how to find a number on a remote website and make it a variable.
For example, if I want to find the stock quote for "AMZN", I would use curl or get contents on the page "http://stock-quotes.com/AMZN" and store the result in a string variable called $contents.
Now that I have $contents, how would I find that AMZN quote? I was thinking of using a regular expression to narrow down the line, like finding "AMZN=35 points", and then perform another function to delete the "AMZN=" and " points" at the start and end of the string so that "35" is all that's left.
Is that how people do it?

1.) DOM Element
2.) Simple XML
3.) preg_match
4.) strpos
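As a small sketch of option 3, a single preg_match call with a capture group can find the line and strip the surrounding text in one step, so no second function is needed. The "AMZN=35 points" format comes from the question; the rest of the sample string is invented.

```php
<?php
// Hypothetical page content; only the "AMZN=35 points" format comes from the question.
$contents = '<p>Quotes today: MSFT=27 points, AMZN=35 points, AAPL=90 points</p>';

$quote = null;
// The parentheses capture just the number, with an optional decimal part.
if (preg_match('/AMZN=(\d+(?:\.\d+)?) points/', $contents, $m)) {
    $quote = (float) $m[1];
}

var_dump($quote);
```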

What I've always done (say in spidering, etc.) is to use the simple_html_dom library in PHP, then inspect the markup for the site.
The downside, as mentioned before, is that if the markup changes, you'll need to modify your code, but usually it's fairly easy, and if you use a source that has informative markup (consistent class names on the elements you need, etc.), then it's even easier.
Library link: http://simplehtmldom.sourceforge.net/

Improve a regex statement in order to be as efficient as it can be

I have a PHP program that, at some point, needs to analyze a large amount of HTML+JavaScript text to extract information.
The parsing needs to happen in two steps:
Separate all "HTML groups" to parse.
Parse each HTML group to get the needed information.
In the first parse it needs to find:
<div id="myHome"
And start capturing after that tag. Then stop capturing before
<span id="nReaders"
And capture the number that comes after this tag and stop.
In the 2nd parse use the capture nº 1 (0 has the whole thing and 2 has the number) from the parse made before and then find
.
I already have code to do that and it works. Is there a way to improve this, make it easier for the machine to parse?
preg_match_all('%<div id="myHome"[^>]>(.*?)<span id="nReaders[^>]>([0-9]+)<"%msi', $data, $results, PREG_SET_ORDER);
foreach($results AS $result){
    preg_match_all('%<div class="myplacement".*?[.]php[?]((?:next|before))=([0-9]+).*?<tbody.*?<td[^>]>.*?[0-9]+"%msi', $result[1], $mydata, PREG_SET_ORDER);
    //takes care of the data and finish the program
}
Note: I need this for a freeware program, so it must be as general as possible and, if possible, not use PHP extensions.
ADD:
I omitted some parts here because I didn't expect answers like those.
There is also a need to parse text inside one of the tags in the document. It may be the 6th, 7th or 8th tag, but I know it comes after a certain tag. The parser I've checked (thx profitphp) does work to find the script tag. What now?
There is more than one tag with the same class. I want them all, but only those that also have one of a list of classes.
Where can I find instructions, demos and limitations of DOM parsers (like the one at http://simplehtmldom.sourceforge.net/)? I need something that will work on, at least, a large number of free servers.
Another thing. How do I parse this part:
"php?=([0-9]+)"
with those HTML parsers?

If you're concerned about efficiency (and indeed accuracy), don't attempt to parse HTML using regex.
You should use a parser, such as PHP's DOM.
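A rough sketch of how PHP's DOM could cover the cases from the question is shown below. The ids, the class name and the php?next= pattern are taken from the question; the sample markup, the link and the numbers are assumptions made for illustration. The idea is to let the parser locate the elements and to apply a small regex only to an attribute value, not to the whole document.

```php
<?php
// Made-up markup using the ids and the class name from the question.
$data = '<div id="myHome" class="x">
  <div class="myplacement"><a href="list.php?next=42">more</a></div>
  <span id="nReaders" title="r">128</span>
</div>';

libxml_use_internal_errors(true);   // tolerate sloppy markup without warnings
$doc = new DOMDocument();
$doc->loadHTML($data);
$xpath = new DOMXPath($doc);

// The number inside <span id="nReaders">.
$readers = null;
$span = $xpath->query('//span[@id="nReaders"]')->item(0);
if ($span !== null) {
    $readers = (int) trim($span->textContent);
}

// Every link inside an element whose class list contains "myplacement";
// the short regex runs on the href value only.
$links = [];
$query = '//div[contains(concat(" ", normalize-space(@class), " "), " myplacement ")]//a[@href]';
foreach ($xpath->query($query) as $a) {
    if (preg_match('/[.]php[?](next|before)=([0-9]+)/', $a->getAttribute('href'), $m)) {
        $links[] = [$m[1], (int) $m[2]];
    }
}

var_dump($readers, $links);
```

The DOM extension ships with PHP and is enabled by default in most builds, though a given free host may have it disabled, so that is worth checking first.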

As noted above, regex is not a good fit for this. You'll be better off using something like this:
Robust and Mature HTML Parser for PHP

Efficiency doesn't matter if your results are incorrect. Parsing HTML with regexes will lead to incorrect results down the road. Use a parser.

I found a way to create efficient searches.
If you want to search for "A huge string in a whole text" you can do it this way:
(?:(?:[^A]*A)+? huge string in a whole text)
It always works. It only creates a backtracking point at every 'A' character and not at every single character. Because of that it is efficient in both memory and processing power. If there are two options, it also works without a problem:
(?:(?:[^AB]*AB)+?(?: huge string in a whole text|e the huge string in a whole text))
Up until now it has never failed.
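To see what the first pattern actually matches, here is a small check on an invented sample text. Note that the match begins at the start of the text, because the [^A]*A part consumes everything up to the last 'A' before the phrase. For a fixed literal string like this one, strpos() finds the same phrase without any regex, which is included for comparison.

```php
<?php
// Invented sample text with a few earlier 'A' characters.
$text = 'An Aside first. A huge string in a whole text, and more.';
$phrase = 'A huge string in a whole text';

$pattern = '/(?:(?:[^A]*A)+? huge string in a whole text)/';
$found = preg_match($pattern, $text, $m, PREG_OFFSET_CAPTURE) === 1;

// $m[0][0] is the matched text, $m[0][1] is its byte offset in $text.
$end = $m[0][1] + strlen($m[0][0]);

// Plain substring search for the same literal phrase.
$pos = strpos($text, $phrase);

var_dump($found, $m[0][1], $end, $pos);
```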

PHP preg_replace - Don't match within h1 tags

I am using preg_replace to add a link to keywords if they are found within a long HTML string. I don't want to add a link if the keyword is found within h1 tags or strong tags.
The below regex nearly works and basically says (I think): If the keyword is not immediately wrapped by either a h1 tag or a strong tag then replace with the keyword that was matched, as a bolded link to google.
$result = preg_replace('%(?!<h1>)(?!<strong>)\b(bobs widgets)\b(?!<\/strong>)(?!<\/h1>)%i','<strong>$1</strong>', $result, -1);
(The reason I don't want to match inside strong tags is that I am looping through a lot of keywords, so I don't want to link an already linked keyword on subsequent passes.)
The above works fine and won't match:
<h1>bobs widgets</h1>
It will however match the keyword in the following text, because the h1 tag isn't immediately either side of the keyword:
<h1>Here are bobs widgets for sale</h1>
I need to make the spaces either side optional and have tried adding \s* but that doesn't get me anywhere. I'd be very grateful for a push in the right direction here.

Regular expressions are the wrong tool for this job. This has been discussed many times on Stack Overflow (such as the most famous thread on the site).
What you need is an HTML parser, such as the Simple HTML DOM Parser. Do yourself a favour and use something like this from the start. Imagine what's going to happen when you run into an <h1> where someone has added an attribute, or perhaps someone has improperly closed the tags, so you have a mixed up order on a </strong> and a </h1>. Getting things like that to work with a regular expression is not worth the trouble, and sometimes isn't even possible.
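As a sketch of the parser-based approach, the function below uses PHP's bundled DOMDocument rather than the Simple HTML DOM Parser. The name emphasize_keyword is hypothetical, and the replacement is the plain <strong> wrapper from the question's code. It only touches text nodes that have no <h1> or <strong> ancestor, so keywords wrapped on an earlier pass are skipped on later ones. It assumes plain ASCII input; other encodings would need extra handling.

```php
<?php
// Hypothetical helper: wrap $keyword in <strong> unless it sits inside <h1> or <strong>.
function emphasize_keyword(string $html, string $keyword): string
{
    libxml_use_internal_errors(true);   // tolerate sloppy markup
    $doc = new DOMDocument();
    // A wrapper div makes it easy to serialize only our fragment afterwards.
    $doc->loadHTML('<div id="wrap">' . $html . '</div>');
    $xpath = new DOMXPath($doc);
    $wrap = $xpath->query('//div[@id="wrap"]')->item(0);

    // Text nodes that are not inside an <h1> or a <strong>.
    $nodes = $xpath->query('.//text()[not(ancestor::h1) and not(ancestor::strong)]', $wrap);
    $pattern = '/\b(' . preg_quote($keyword, '/') . ')\b/i';

    foreach (iterator_to_array($nodes) as $node) {
        // Odd-numbered parts are the keyword matches, even-numbered parts the text between them.
        $parts = preg_split($pattern, $node->nodeValue, -1, PREG_SPLIT_DELIM_CAPTURE);
        if (count($parts) < 2) {
            continue;   // keyword not present in this text node
        }
        $fragment = $doc->createDocumentFragment();
        foreach ($parts as $i => $part) {
            if ($i % 2 === 1) {
                $strong = $doc->createElement('strong');
                $strong->appendChild($doc->createTextNode($part));
                $fragment->appendChild($strong);
            } elseif ($part !== '') {
                $fragment->appendChild($doc->createTextNode($part));
            }
        }
        $node->parentNode->replaceChild($fragment, $node);
    }

    // Serialize the children of the wrapper, leaving out the wrapper itself.
    $out = '';
    foreach ($wrap->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

$result = emphasize_keyword(
    '<h1>Here are bobs widgets for sale</h1><p>Buy bobs widgets today</p>',
    'bobs widgets'
);
echo $result;
```

Running the function a second time on its own output should leave it unchanged, since the keyword is by then inside a <strong> element.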

... just remember that eventually this approach will lead to sadness, and you'll need to start looking for a better approach. One way is to use 'tidy' to fix up your html into parseable xml, and then php offers a few xml manipulation APIs to work with the data.
Here's an answer anyway.
You can add some wildcards instead of the word boundaries. Something like this should do the trick:
([^<>]*)(bobs widgets)([^<>]*)
Then, add some more replacement markers to keep the remainder of your text in the output:
'$1<strong>$2</strong>$3'
Now hit save and hide behind the sofa ;)
