Scraping Google Search Results in PHP

Scraping Google Search Results in PHP - php

I would like to get the links from the search results. Can someone please help with with the regular expression to do this? I've got this, and it doesn't work:
preg_match_all("/<h3(.*)><a href=\"(.*)\"(.*)<\/h3>/", $result, $matches);

Your patterns are likely having the biggest issues because of the greedy vs lazy nature of it. Changing it to the following should solve that issue...
preg_match_all('#<h3.*?><a href="(.*?)".*?</h3>#', $result, $matches);
print_r($matches[1]);
There are possibly a few rare URLs that could mess the pattern up, but chances are you won't run into one. I will point out that stillstanding has a good point though using the API would be a better option.
As for people that blanket answer with "You can't parse HTML with Regex, use a DOM"... Whilst you cannot create a generic HTML parser (and should be using DOM for that task), you can match patterns in a set of text you know follows a certain structure, the fact that structure is HTML is irrelevant. Yes, if Google change their layout it will probably break, but this is also probably true of a DOM Parser. (P.S. I'm well aware this will probably get down-voted by the sheeple).

Related

What is a good strategy for getting the title of a page?

... server side and using PHP.
I read this SO article on when to use regexes and it basically states that you can use regexes to parse HTML in certain cases.
<title></title>
should be easy to match.
I see no problem with this. I think the popular answer is voted so much for not b.c. of correctness but b.c. of entrainment value.
Is this O.K?

Yes, it is
/<title[^>]*>(.*?)<\/title>/is
Different people have different opinions, though. And you should only use regex if you know what you're doing.
This might me a very interesting read: When you should NOT use Regular Expressions?

Your best bet is to use an HTML parsing library (like this one), not regex. You may get away with using regex in this case, but it's like using a hammer to pound in a screw.
If you are looking for anything non-trivial in the HTML, regex is going to be very confusing and hard to read, and in many cases, regex cannot do the job without making many assumptions about the content of the HTML.

XML parser vs regex

What should I use?
I am going to fetch links, images, text, etc and use it for using it building seo statistics and analysis of the page.
What do you recommend to be used? XML Parser or regex
I have been using regex and never have had any problems with it however, I have been hearing from people that it can not do some things and blah blah blah...but to be honest I don't know why but I am afraid to use XML parser and prefer regex (and it works and serves the purpose pretty well)
So, if everything is working well with regex why am I here to ask you what to use? Well, I think that even though everything has been fine so far doesn't mean it will be in the future as well, so I just wanted to know what are the benifits of using a XML parser over regex? Are there any improvements in performances, less error prone, better support, other shine features, etc?
If you do suggest to use XML parser then which is recommended one to be used with PHP
I would most definitely like to know why would you pick one over the other?

What should I use?
You should use an XML Parser.
If you do suggest to use XML parser then which is recommended one to be used with PHP
See: Robust and Mature HTML Parser for PHP .

If you're processing real world (X)HTML then you'll need an HTML parser not an XML parser, because XML parsers are required to stop parsing as soon as they hit a well-formedness error, which will be almost immediately with most HTML.
The point against regex for processing HTML is that it isn't reliable. For any regex, there will be HTML pages that it will fail on. HTML parsers are just as easy to use as regex, and process HTML just like a browser does, so are very much more reliable and there's rarely any reason not to use one.
One possible exception is sampling for statistics purposes. Suppose you're going to scan 100,000 web pages for a fairly simple pattern, for example, the presence of a particular attribute, and return the percentage of matching pages that you get. While even a well designed regex will likely produce both false positives and false negatives, they are unlikely to affect the overall percentage score by very much. You may be able to accept those false matches for the benefit that a regex scan is likely to run more quickly than a full parse of each page. You can then reduce the number of false positives by running a parse only on the pages which return a regex match.
To see the kind of problems that will cause difficulties for regexes see: Can you provide some examples of why it is hard to parse XML and HTML with a regex?

It sounds to me as if you are doing screen-scraping. This is inevitably a somewhat heuristic process - you're looking for patterns that commonly occur in the web pages of interest, and you're inevitably going to miss a few of them, and you don't really mind. For example, you don't really care that your search for img tags will also find an img tag that happens to be commented out. If that characterizes your application, then the usual strictures against using regular expressions for processing HTML or XML might not apply to your case.

Regex (or better suggestion) on html with correct nesting

I've had a look and there don't seem to be any old questions that directly address this. I also haven't found a clear solution anywhere else.
I need a way to match a tag, open to close, and return everything enclosed by the tag. The regexes I've tried have problems when tags are nested. For example, the regex <tag\b[^>]*>(.*?)</tag> will cause trouble with <tag>Some text <tag>that is nested</tag> in tags</tag>. It will match <tag>Some text <tag>that is nested</tag>.
I'm looking a solution to this. Ideally an efficient one. I've seen solutions that involve matching on start and end tags separately and keeping track of their index in the content to work out which tags go together but that seems wildly inefficient to me (if it's the only possible way then c'est la vie).
The solution must be PHP only as this is the language I have to work with. I'm parsing html snippets (think body sections from a wordpress blog and you're not too far off). If there is a better than regex solution, I'm all ears!
UPDATE:
Just to make it clear, I'm aware regexes are a poor solution but I have to do it somehow which is why the title specifically mentions better solutions.
FURTHER UPDATE:
I'm parsing snippets. Solutions should take this into account. If the parser only works on a full document or is going to add <head> etc... when I get the html back out, it's not an acceptable solution.

As always, you simply cannot parse HTML with regex because it is not a regular language. You either need to write a real HTML parser, or use a real HTML parser (that someone's already written). For reasons that should be obvious, I recommend the latter option.
Relevant questions
Robust and Mature HTML Parser for PHP
How do you parse and process HTML/XML in PHP?

Why not just use DOMDocument::loadHTML? It uses libxml under the hood which is fast and robust.

Regular Expressions - Where Angels Fear to Tread

I've just started studying regular expressions in PHP, but I'm having a terrible time following some of the tutorials on the WWW and cannot seem to find anything addressing my current needs. Perhaps I'm trying to learn too much too fast. This aspect of PHP is entirely new to me.
What I'm trying to create is a regular expression to replace all HTML code in between the nth occurrence of <TAG> and </TAG> with any code I choose.
My ultimate goal is to make an Internet filter in PHP through which I can view a web page stripped of certain content (or replaced with sanitized content) between any specified set of tags <TAG>...</TAG> within the page, where <TAG>...</TAG> represents any valid paired HTML tags, such as <B>...</B> or <SPAN>...</SPAN> or <DIV>...</DIV>, etc, etc.
For example, if the page has a porn ad contained in the 5th <DIV>...</DIV> block within the page, what regular expression could be invoked to target and replace that code with something else, like xxxxxxx, but only the 5th <DIV> block within the page and nothing else?
The entire web page is contained within a single text string and the filtered result should also be a single string of text.
I'm not sure, but I think the code to do this could have a format similar to:
$FilteredPage = preg_replace("REG EXPRESSION", "xxxxxxxx", $OriginalPage);
The "REG EXPRESSION" to invoke is what I need to know and the "xxxxxxxx" represents the text to replace the code between the tags targeted by "REG EXPRESSION".
Regular expressions are obviously the work of Satan!
Any general suggestions or perhaps a couple of working examples which I could study and experiment with would be greatly appreciated.
Thanks, Jay

Firstly, are you using the right tool for the job? Regex is a text matching engine, not a fully blown parser - perhaps a dedicated HTML parser will give better results.
Secondly, when approaching any programming problem, try to simplify your problem and build it brick by brick rather than just jumping straight to a final solution. For example, you could:
Start with a simple block of normal english text, and try to match and replace (for example) every occurrence of the word "and".
When that works, wrap it in a loop of PHP that can count up to 5 and only replace the 5th occurrence. Why use regex to count when PHP is so much better at that task?
Then modify your regex to match your 5th HTML tag (which is a bit harder because <> are special characters and need escaping)
By approaching the problem in steps, you will be able to get each part working in turn and build a solid solution that you understand.

This has been done to death, but please, don't use a regex to parse HTML. Just stop, give up... It is not worth the kittens god will kill for you doing it. use a real HTML or XML parser
On a more constructive note, look at xpath as a technology better suited to describing html nodes you might want to replace... or phpQuery and QueryPath
The reason god kills kittens when you parse HTML with a regex:
Html is not a regular language, thus a regex can only ever parse very limited html. HTML is a context free language, and as such can only be properly parsed with a context free parser.
Edit: thank you #Andrew Grimm, this is said much better than i could, as evidenced by the first answer with well over four thousand upvotes!
RegEx match open tags except XHTML self-contained tags

ok, few ground rules.
Dont post a question like that, pre-ing all the question, will only keep people away
Regular expressions are awsome!
If you want to consider options, look on how to read html as an xml document and parse it using xpath
#tobyodavies is pretty much correct, I'll include the answer in case you want to do it anyways
Now, to your problem. With this one:
$regex = "#<div>(.+?)</div>#si";
You should be ok using that expression and counting the occurences, much like this:
preg_match_all($regex, $htmlcontent, $matches, PREG_SET_ORDER );
Suppose you only need the 5th one. Matches[$i][0] is the whole string of the $i-eth match
if (count($matches) > 5 )
{
$myMatch = $matches[5][0];
$matchedText = $matches[5][1];
}
Good luck in your efforts...

Preg_replace regex, newlines, connection resets

I have mixed html, custom code, and regular text I need to examine and change frequently on several, long wiki pages. I'm working with a proprietary wiki-like application and have no control over how the application functions or validates user input. The layout of pages that users add must follow a very specific standard layout and always include very specific text in only certain places - a standard which frequently changes. If users add pages that are so far out of the standard, they will be deleted.
I do not have the resources to manually proof-read and correct all these pages, so automation is the only solution. The fact that all this is obviously a complete waste of time when alternative platforms to do exactly what's needed here exist is already understood.
I've built a PHP based API to automate this post-validation and frequent restandardization process for me. I've been able set up regex patterns to handle all this mixed text, and they all work fine for handling single lines. The problem I have is this: Poorly formed regex against long text with line breaks can lead to unexpected results, such as connection resets. I have no access to server-side logs to troubleshoot. How do I overcome this?
This is just one example of what I currently have: {column} and {section} tags I'm searching for below can have any number of attributes, and wrap any text. {section} may or may not exist and may or may not be one or more lines under {column}, but it has to be wrapped inside {column}. {column} itself may or may not exist, and if it doesn't, I don't care as I then have some default text inserted later on down the script. I want to grab the inner section contents and wrap it in an html div tag instead. I can't recall the exact pattern I'm using offhand at the moment, but it's close enough...
$pattern = "/\{column:id=summary([|]?([a-zA-Z0-9-_ ]+[:][a-zA-Z0-9-_ ]+[ ]?))\}(.*)({section([|]([a-zA-Z0-9-_ ]+[:][a-zA-Z0-9-_ ]+[ ]?))\}(.*)\{section\}(.*))?{column\}/s";
$replacement = "{html}<div id='summary'>$7</div>{html}";
$text = preg_replace($pattern, $replacement, $subject);
Handling the {column} and {section} attributes and passing only valid HTML parameters to the new html div or a subtext of it is itself a challenge, but my main focus above right now is getting that (.*) value within {section} above without causing a connection reset. Any pointers?

This probably isn't what you're looking for, but: don't use a regex! You're trying to parse some very structured, very complex text, and to do so, you should really use a parser. I don't know what's available for PHP (you can Google just as well as I can, and I'm in no position to make any particular recommendation) but I'm sure something exists.
As for what's causing a connection reset, my only guess is that, since you mention problems with "long text", you're having a memory allocation issue. I don't think your regex will have unexpectedly huge performance, though it might in the non-matching case. But your best option, if you can, is probably to scrap the regex technique and switch to a real parser.

I found the likely source of the crashing issue: catastrophic backtracking (http://www.regular-expressions.info/catastrophic.html). So if refining patterns to handle that doesn't work (and if anyone has any patterns to suggest, please do share), switching to some other text parser solution would be best.

The only real problem I can see is all those (.*)s. In /s mode, each (.*) initially slurps up the whole page, only to have to backtrack most of the way. Change them all to (.*?) (i.e., switch to reluctant quantifiers) and it should work much faster.

We Keep Coding

PHP, A popular general-purpose scripting language that is especially suited to web development.

Scraping Google Search Results in PHP - php

I would like to get the links from the search results. Can someone please help with with the regular expression to do this? I've got this, and it doesn't work: preg_match_all("/<h3(.)><a href=\"(.)\"(.*)<\/h3>/", $result, $matches);

Related

What is a good strategy for getting the title of a page?

XML parser vs regex

Regex (or better suggestion) on html with correct nesting

Regular Expressions - Where Angels Fear to Tread

Preg_replace regex, newlines, connection resets

Categories

Resources

We Keep Coding

PHP, A popular general-purpose scripting language that is especially suited to web development.

Scraping Google Search Results in PHP - php

I would like to get the links from the search results. Can someone please help with with the regular expression to do this? I've got this, and it doesn't work: preg_match_all("/<h3(.*)><a href=\"(.*)\"(.*)<\/h3>/", $result, $matches);

Related

What is a good strategy for getting the title of a page?

XML parser vs regex

Regex (or better suggestion) on html with correct nesting

Regular Expressions - Where Angels Fear to Tread

Preg_replace regex, newlines, connection resets

Categories

Resources

I would like to get the links from the search results. Can someone please help with with the regular expression to do this? I've got this, and it doesn't work: preg_match_all("/<h3(.)><a href=\"(.)\"(.*)<\/h3>/", $result, $matches);