Need to Fix Scrape PHP Script - php

We have a PHP script that scrapes search engine results pages and outputs clients' website positions into a bespoke report suite for their domains.
Google changed something in the first week of February which prevented our script from detecting the domain on the page and I haven't currently got the original developer in the office nor can any of our other staff resolve this.
I'm pretty sure I know where the issue lies in the script; it's just that, as I'm not a developer, I'm unsure what each line is actually doing. Our script uses the relevant classes from the search results to determine where what we're looking for is actually situated.
The script itself still runs and outputs the HTML fine. It's purely just the part of the script that says look for 'domain' on page that isn't being detected.
I appreciate that you're probably going to need a lot more information from me in order to advise what the issue is and I am happy to provide the file/coding as necessary. I would be prepared to pay for a fix on this too if necessary.
Below is where I feel the issue is occurring:-
// Note our use of ===. Simply == would not work as expected
// because the position of 'a' was the 0th (first) character.
if ($pos4 === false) {
$mystring5 = $val[0];
$findme5 = $prevlink;
$pos5 = @strpos($mystring5, $findme5);
// Note our use of ===. Simply == would not work as expected
// because the position of 'a' was the 0th (first) character.
if ($pos5 === false) {
$serp = $serp + 1;
echo '<b>'.$serp.'.</b> '.$val[0].'<br /><br />';
$link = get_string_between($val[1], 'href="', '" onmousedown');
$link = str_replace('https://','',$link);
$link = str_replace('http://','',$link);
$link = str_replace('www.','',$link);
$link;
$prevlink = $link;
$prevlink = str_replace(strstr($prevlink, '/'), "", $prevlink);
$sitelen = strlen($row_site_check['website_name']);
$sitefrom_link = substr($link, 0, $sitelen);
if ($sitefrom_link == $row_site_check['website_name']) {
$site_found = 1;
$rank_postion = $serp;
$site_link = $link;
$con = mysql_connect("localhost","dbname","dbpass");
if (!$con)
{
die('Could not connect: ' . mysql_error());
}
Any help would be greatly appreciated.
Thanks.

Check out the Google rank scraper (php, opensource)
I have been using software based on it daily since it was released, and as far as I can tell there was no change to Google's layout in February that broke anything.
I'm not sure if you'll like the answer but the reason is likely that the Rank Scraper I pasted uses DOM to parse the HTML of google while you seem to rely on regular expressions and string operations.
I've personally tried to make a scraper based on such methods in the past and found that it requires a lot of maintenance work to keep it running. Sometimes real ugly workarounds.
When using DOM small changes usually don't even damage anything and otherwise adapting the code might be easier.
In the past few years the DOM code of that parser was working without major interruption, only two times a small change had to be made. And Google did change a lot on their site in that time, it just didn't cause ill effects.
The DOM functions of the above linked checker can be found in the functions.php file
function process_raw($htmdata,$page)
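To illustrate why a DOM-based approach tends to survive small markup changes, here is a minimal sketch. The sample HTML, the find_domain_rank helper and the assumption that every result is an <a> inside an <h3> are hypothetical; this is neither Google's actual markup nor the code of the linked scraper.

```php
<?php
// Hypothetical helper: returns the 1-based position of the first result
// whose host matches $domain, or null if the domain is not on the page.
function find_domain_rank($html, $domain)
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid; keep libxml warnings out of the output.
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $rank = 0;
    // Assumption: every organic result is a link inside an <h3>.
    foreach ($xpath->query('//h3/a[@href]') as $a) {
        $rank++;
        $host = parse_url($a->getAttribute('href'), PHP_URL_HOST);
        if (!is_string($host)) {
            continue; // relative or malformed link
        }
        // Compare host names without a leading "www."
        $host = preg_replace('/^www\./i', '', $host);
        if (strcasecmp($host, $domain) === 0) {
            return $rank;
        }
    }
    return null;
}

$sample = '<html><body>'
    . '<h3><a href="https://www.example.org/page" onmousedown="x()">One</a></h3>'
    . '<h3><a href="http://example.com/">Two</a></h3>'
    . '</body></html>';

var_dump(find_domain_rank($sample, 'example.com')); // int(2)
```

Because the link is read from the href attribute rather than cut out between 'href="' and '" onmousedown', an added or removed attribute on the tag would not affect the lookup.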


I've got a JSON file that I'm trying to extract values from with PHP

I'm trying to pull each of the viewer names from a JSON file in PHP.
I've looked around the Internet extensively for a working example that will offer me the result I desire without much success.
I'm really struggling to find an example that fits my needs on the Internet to help me with what is likely a very simple thing to accomplish.
I've got a JSON file that spits out several values on the Internet and I'm looking to extract every single line from one particular section.
Seeing a working example will likely help me understand what I am doing.
The JSON file that I am using for example is:
https://tmi.twitch.tv/group/user/dansgaming/chatters
I am trying to extract each single line from the "viewers" section in this file.
I've captured the data using the following PHP:
$testviewers = json_decode(@file_get_contents('https://tmi.twitch.tv/group/user/' . $streamName . '/chatters'), true);
var_dump($testviewers['chatters']['viewers']);
It turns out this isn't having the desired result for me.
I simply want each line in the "viewers" section to be echoed out with line breaks.
What am I doing wrong? I've tried about two hundred different approaches to this one and have to admit this is my first real time working with JSON.
I've tried to search the Internet for answers and found many tutorials but none have made any sense to me and I know that seeing how to accomplish the result will help me learn exactly what should be going on.
In an ideal world, it will simply output each "viewer" on a separate line that I can work with. If I could echo each of them and then concatenate with a page break or the word "viewer:" before each one this would be a huge help and I'll be able to take it further and likely learn a great deal in the process.
This is how I echo values from JSON:
$json = json_decode($response, true);
// 'viewers' is a plain list of names inside 'chatters'
$viewers = isset($json['chatters']['viewers']) ? $json['chatters']['viewers'] : array();
foreach ($viewers as $viewer)
{
    // collapse repeated whitespace in the name
    $viewer = trim(preg_replace('/\s\s+/', ' ', $viewer));
    echo 'VIEWER = ' . $viewer . '<br />';
}
Just make sure the foreach actually has something to loop over; maybe that helps.
Turns out the issue here was a PHP error that wasn't displaying. The code was timing out because of how large the JSON file was and a low limit on my machine.
I think this is what you want....
$array = json_decode($your_json,true);
foreach($array['chatters']['viewers'] as $r) echo $r.'<br>';
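Since the real cause turned out to be a hidden PHP error, it may help to check every step explicitly. The following is a sketch only: extract_viewers is a hypothetical helper, and the inline sample string stands in for the remote response, assuming the chatters/viewers layout described in the question.

```php
<?php
// Hypothetical helper: decode a chatters payload and return the viewer names.
function extract_viewers($rawJson)
{
    if ($rawJson === false || $rawJson === '') {
        // file_get_contents() returns false when the request fails
        throw new RuntimeException('No response received');
    }
    $data = json_decode($rawJson, true);
    if (json_last_error() !== JSON_ERROR_NONE) {
        throw new RuntimeException('Invalid JSON: ' . json_last_error_msg());
    }
    if (!isset($data['chatters']['viewers']) || !is_array($data['chatters']['viewers'])) {
        return array(); // section missing: nothing to list
    }
    return $data['chatters']['viewers'];
}

// Sample payload standing in for the remote response.
$sample = '{"chatter_count":3,"chatters":{"moderators":["mod1"],"viewers":["alice","bob"]}}';

foreach (extract_viewers($sample) as $name) {
    echo 'viewer: ' . htmlspecialchars($name) . "<br />\n";
}
```

While debugging, dropping the @ in front of file_get_contents and calling error_reporting(E_ALL) together with ini_set('display_errors', '1') should make errors such as a timeout visible.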

PHP Advanced Regex Splitting

I'm facing a slight issue with an idea.
I use a chat feature within an online forum on all my computing devices. I also use it on mobile, which causes slight issues with formatting, input, etc. I've had the idea to relay all the chat from a relay account to my own mobile-friendly site.
I haven't started on sending messages yet, although I know how to read messages. How to output them is the issue.
I sniffed outgoing packets on my computer as the chat uses ajax. I was then able to find the following url: http://server05.ips-chat-service.com/get.php?room=xxxx&user=xxxx&access_key=xxxx
The page outputs something similar to this: ~~||~~1419344231,1,kondaxdesign,Could somebody send a quick message for me__C__ please?,,10248~~||~~1419344237,1,tom.bridges,its a iso and a vm what more do we need to know?,,10880~~||~~
That string would output this in chat: http://i.stack.imgur.com/j7CM6.png
I unfortunately don't have much knowledge on regex, or any other function that would split this. Would anybody be able to assist me on getting the 1). Name, 2). Chat Data and 3). Timestamp?
As you can see, the string is something like this: ~~||~~[timestamp],1,[name],[data],,[some integer]~~||~~
Cheers.
After reading through the string output, when somebody leaves chat, this is sent: ~~||~~1419344521,2,wegface,TIMEOUT,2_10828,0~~||~~
The beginning of the log starts with 1,224442 before the first ~~||~~.
You would first explode each record, then use str_getcsv to read the string and parse it as you want. Here is a script that does that, without any formatting on output, and I've named the variables as named in the OP that describes what they are.
I wouldn't use a regular expression to parse the string, as better functionality is available (linked above)
$string = "~~||~~1419344231,1,kondaxdesign,Could somebody send a quick message for me__C__ please?,,10248~~||~~1419344237,1,tom.bridges,its a iso and a vm what more do we need to know?,,10880~~||~~";
//Split so we have each chat record to loop around
foreach( explode("~~||~~", $string) as $segments) {
//Read the CSV properly
$chat = str_getcsv($segments);
if( count($chat) <> 6 ) { continue; } //Skip any that don't have all the data
$timestamp = $chat[0];
$name = $chat[2];
$data = $chat[3];
$some_integer = $chat[5];
echo $name .' said - '. $data .'<br />';
}
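The same approach can be extended to the "left the chat" records and the log prefix mentioned in the question. This is a rough sketch under assumptions: parse_chat_log is a hypothetical helper, the second field is taken to be a record type (1 = message, 2 = user left), and __C__ is taken to be an encoded comma, which is only a guess based on the sample.

```php
<?php
// Hypothetical helper: split the raw chat log into structured records.
function parse_chat_log($raw)
{
    $records = array();
    foreach (explode('~~||~~', $raw) as $segment) {
        // Delimiter, enclosure and escape are passed explicitly.
        $fields = str_getcsv($segment, ',', '"', '\\');
        if (count($fields) !== 6) {
            continue; // skips the "1,224442" prefix and empty segments
        }
        $records[] = array(
            'time' => (int) $fields[0],
            // Assumption: 1 = chat message, 2 = user left the room
            'type' => (int) $fields[1],
            'name' => $fields[2],
            // Assumption: __C__ stands for a comma inside the message
            'data' => str_replace('__C__', ',', $fields[3]),
        );
    }
    return $records;
}

$raw = '1,224442~~||~~1419344231,1,kondaxdesign,Hello__C__ anyone there?,,10248'
     . '~~||~~1419344521,2,wegface,TIMEOUT,2_10828,0~~||~~';

foreach (parse_chat_log($raw) as $r) {
    $when = gmdate('H:i:s', $r['time']);
    if ($r['type'] === 2) {
        echo "[$when] {$r['name']} left the chat\n";
    } else {
        echo "[$when] {$r['name']}: {$r['data']}\n";
    }
}
```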

Alternative to php preg_match to pull data from an external website?

I want to extract the content of a specific div in an external webpage. The div looks like this:
<dt>Win rate</dt><dd><div>50%</div></dd>
My target is the "50%". I'm actually using this php code to extract the content:
function getvalue($parameter,$content){
preg_match($parameter, $content, $match);
return $match[1];
};
$parameter = '#<dt>Score</dt><dd><div>(.*)</div></dd>#';
$content = file_get_contents('https://somewebpage.com');
Everything works fine; the problem is that this method takes too much time, especially if I have to use it several times with different $content values.
I would like to know if there's a better (faster, simpler, etc.) way to accomplish the same thing. Thanks!
You may use DOMDocument::loadHTML and navigate your way to the given node.
$content = file_get_contents('https://somewebpage.com');
$doc = new DOMDocument();
$doc->loadHTML($content);
Now to get to the desired node, you may use method DOMDocument::getElementsByTagName, e.g.
$dds = $doc->getElementsByTagName('dd');
foreach($dds as $dd) {
// process each <dd> element here, extract inner div and its inner html...
}
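A fuller sketch of that loop could look like the following. get_definition_value is a hypothetical helper, and the inline HTML is the snippet from the question rather than a real page.

```php
<?php
// Hypothetical helper: return the text of the <dd> that follows a given <dt> label.
function get_definition_value($html, $label)
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // tolerate sloppy real-world markup
    $doc->loadHTML($html);
    libxml_clear_errors();

    foreach ($doc->getElementsByTagName('dt') as $dt) {
        if (trim($dt->textContent) !== $label) {
            continue;
        }
        // Walk forward to the next element node, skipping whitespace text nodes.
        $node = $dt->nextSibling;
        while ($node !== null && $node->nodeType !== XML_ELEMENT_NODE) {
            $node = $node->nextSibling;
        }
        if ($node !== null && $node->nodeName === 'dd') {
            return trim($node->textContent);
        }
    }
    return null; // label not found
}

$html = '<dl><dt>Win rate</dt><dd><div>50%</div></dd></dl>';
echo get_definition_value($html, 'Win rate'); // 50%
```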
Edit: I see the point @pebbl has made about DOMDocument being slower. Indeed it is; however, parsing HTML with preg_match is asking for trouble. In that case, I'd also recommend looking at an event-driven SAX XML parser. It is much more lightweight, faster and less memory intensive, as it does not build a tree. You may take a look at XML_HTMLSax for such a parser.
There are basically three main things you can do to improve the speed of your code:
Off load the external page load to another time (i.e. use cron)
On a Linux-based server I would know what to suggest, but seeing as you use Windows I'm not sure what the equivalent would be. Cron on Linux allows you to fire off scripts at scheduled times, in the background, so without using a browser. Basically I would recommend that you create a script whose sole purpose is to go and fetch the website pages at a particular interval (depending on how frequently you need to update your data) and then write those webpages to files on your local system.
$listOfSites = array(
'http://www.something.com/page.htm',
'http://www.something-else.co.uk/index.php',
);
$dirToContainSites = getcwd() . '/sites';
foreach ( $listOfSites as $site ) {
$content = file_get_contents( $site );
/// i've just simply converted the URL into a filename here, there are
/// better ways of handling this, but this at least keeps things simple.
/// the following just converts any non letter or non number into an
/// underscore... so, http___www_something_com_page_htm
$file_name = preg_replace('/[^a-z0-9]/i','_', $site);
file_put_contents( $dirToContainSites . '/' . $file_name, $content );
}
Once you've created this script, you then need to set the server up to execute it as regularly as you need. Then you can modify your front-end script that displays the stats to read from local files, this would give a significant speed increase.
You can find out how to read files from a directory here:
http://uk.php.net/manual/en/function.dir.php
Or the simpler method (but prone to possible problems) is just to re-step your array of sites, convert the URLs to file names using the preg_replace above, and then check for the file's existence in the folder.
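The simpler method just described could be sketched roughly as follows. site_cache_path and load_cached_sites are hypothetical helper names; the file-name rule is the same preg_replace used in the fetch script.

```php
<?php
// Same URL-to-file-name rule as the fetch script.
function site_cache_path($dir, $site)
{
    return $dir . '/' . preg_replace('/[^a-z0-9]/i', '_', $site);
}

// Hypothetical helper: return the stored HTML for each site that has been fetched.
function load_cached_sites($dir, array $sites)
{
    $pages = array();
    foreach ($sites as $site) {
        $path = site_cache_path($dir, $site);
        if (is_file($path) && is_readable($path)) {
            $pages[$site] = file_get_contents($path);
        }
        // Sites that have not been fetched yet are simply skipped.
    }
    return $pages;
}

$listOfSites = array(
    'http://www.something.com/page.htm',
    'http://www.something-else.co.uk/index.php',
);
$pages = load_cached_sites(getcwd() . '/sites', $listOfSites);
echo count($pages) . " page(s) available locally\n";
```

One of the "possible problems" with this shortcut is that two different URLs can map to the same file name, since every non-alphanumeric character becomes an underscore.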
Cache the result of calculating your statistics
It's quite likely this being a stats page that you'll want to visit it quite frequently (not as frequent as a public page, but still). If the same page is visited more often than the cron-based script is executed then there is no reason to do all the calculation again. So basically all you have to do to cache your output is do something similar to the following:
$cachedVersion = getcwd() . '/cached/stats.html';
/// check to see if there is a cached version of this page
if ( file_exists($cachedVersion) ) {
/// if so, load it and echo it to the browser
echo file_get_contents($cachedVersion);
}
else {
/// start output buffering so we can catch what we send to the browser
ob_start();
/// DO YOUR STATS CALCULATION HERE AND ECHO IT TO THE BROWSER LIKE NORMAL
/// end output buffering and grab the contents so we now have a string
/// of the page we've just generated
$content = ob_get_contents(); ob_end_clean();
/// write the content to the cached file for next time
file_put_contents($cachedVersion, $content);
echo $content;
}
Once you start caching things you need to be aware of when you should delete or clear your cache - otherwise if you don't your stats output will never change. With regards to this situation, the best time to clear your cache is at the point you go and fetch the external web pages again. So you should add this line to the bottom of your "cron" script.
$cachedVersion = getcwd() . '/cached/stats.html';
if ( file_exists($cachedVersion) ) { unlink( $cachedVersion ); } /// deletes the file if it is there
There are other speed improvements you could make to the caching system (you could even record the modified times of the external webpages and load only when they have been updated) but I've tried to keep things easy to explain.
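One simple variation, shown here only as a sketch, is to let the cache expire by age instead of (or in addition to) deleting it from the cron script. This is not the modified-time check of the external pages mentioned above; cache_is_fresh and the one-hour limit are assumptions made for illustration.

```php
<?php
// Hypothetical helper: true when the cache file exists and is younger than $maxAge seconds.
function cache_is_fresh($file, $maxAge)
{
    if (!is_file($file)) {
        return false;
    }
    return (time() - filemtime($file)) < $maxAge;
}

$cachedVersion = getcwd() . '/cached/stats.html';

if (cache_is_fresh($cachedVersion, 3600)) {
    echo file_get_contents($cachedVersion);
} else {
    // Regenerate the stats here, then store them with file_put_contents()
    // in the same way as the caching snippet above.
    echo "cache missing or older than one hour\n";
}
```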
Don't use a HTML Parser for this situation
Scanning a HTML file for one particular unique value does not require the use of a fully-blown or even lightweight HTML Parser. Using RegExp incorrectly seems to be one of those traps that lots of beginner programmers fall into, and it is a question that is always asked. This has led to lots of knee-jerk reactions from more experienced coders, who automatically adhere to the following logic:
if ( $askedAboutUsingRegExpForHTML ) {
$automatically->orderTheSillyPersonToUse( $HTMLParser );
} else {
$soundAdvice = $think->about( $theSituation );
print $soundAdvice;
}
HTMLParsers should be used when the target within the markup is not so unique, or your pattern to match relies on such flimsy rules that it'll break the second an extra tag or character occurs. They should be used to make your code more reliable, not if you want to speed things up. Even parsers that do not build a tree of all the elements will still be using some form of string searching or regular expression notation, so unless the library-code you are using has been compiled in an extremely optimised manner, it will not beat well coded strpos/preg_match logic.
Considering I have not seen the HTML you are hoping to parse, I could be way off, but from what I've seen of your snippet it should be quite easy to find the value using a combination of strpos and preg_match. Obviously if your HTML is more complex and might have random multiple occurrences of <dt>Win rate</dt><dd><div>50%</div></dd> it will cause problems - but even so - a HTMLParser would still have the same problem.
$offset = 0;
/// loop through the occurrences of 'Win rate'
while ( ($p = stripos($html, 'win rate', $offset)) !== FALSE ) {
    /// grab a short snippet of the surrounding HTML to speed up the RegExp;
    /// the third argument of substr() is a length, not an end position
    $snippet = substr($html, $p, 100);
    /// I've extended your RegExp to try and account for 'white space' that could
    /// occur around the elements. The following won't take into account any random
    /// attributes that may appear, so if you find some pages aren't working - echo
    /// out the $snippet var using something like "echo '<xmp>'.$snippet.'</xmp>';"
    /// and that should show you what is appearing that is breaking the RegExp.
    if ( preg_match('#^win\s+rate\s*</dt>\s*<dd>\s*<div>\s*([0-9]+%)\s*<#i', $snippet, $regs) ) {
        /// once you are here your % value will be in $regs[1];
        break; /// exit the while loop as we have found our 'Win rate'
    }
    /// move past this occurrence for the next loop, otherwise stripos()
    /// finds the same position again and the loop never ends
    $offset = $p + 1;
}
Gotchas to be aware of
If you are new to PHP, as you state in a comment above, then the above may seem rather complicated - which it is. What you are trying to do is quite complex, especially if you want to do it optimally and fast. However, if you follow through the code I've given and research any bits that you aren't sure of / haven't heard of (php.net is your friend), it should give you a better understanding of a good way to achieve what you are doing.
Guessing ahead however, here are some of the problems you might face with the above:
File Permission errors - in order to be able to read and write files to and from the local operating system you will need to have the correct permissions to do so. If you find you cannot write files to a particular directory it might be that the host you are using won't allow you to do so. If this is the case you can either contact them to ask about how to get write permission to a folder, or if that isn't possible you can easily change the code above to use a database instead.
I can't see my content - when using output buffering all the echo and print commands do not get sent to the browser, they instead get saved up in memory. PHP should automatically output all the stored content when the script exits, but if you use a command like ob_end_clean() this actually wipes the 'buffer' so all the content is erased. This can lead to confusing situations when you know you are echoing something.. but it just isn't appearing.
(Mini Disclaimer :) I've typed all the above manually so you may find there are PHP errors, if so, and they are baffling, just write them back here and StackOverflow can help you out)
Instead of trying not to use preg_match, why not just trim your document contents down in size? For example, you could dump everything before <body and everything after </body>. Then preg_match will already be searching less content.
Also, you could try to do each one of these processes as a pseudo separate thread, so that way they aren't happening one at a time.
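The trimming idea could be sketched like this. body_only is a hypothetical helper, and the inline page is a made-up stand-in for the external site.

```php
<?php
// Hypothetical helper: return only the part of the page from "<body" up to "</body>".
function body_only($html)
{
    $start = stripos($html, '<body');
    if ($start === false) {
        $start = 0; // no <body tag: keep everything from the beginning
    }
    $end = stripos($html, '</body>', $start);
    if ($end === false) {
        $end = strlen($html); // no closing tag: keep everything to the end
    }
    return substr($html, $start, $end - $start);
}

$page = '<html><head><title>Win rate stats</title></head>'
      . '<body><dl><dt>Win rate</dt><dd><div>50%</div></dd></dl></body></html>';

if (preg_match('#<dt>Win rate</dt><dd><div>(.*?)</div></dd>#', body_only($page), $m)) {
    echo $m[1]; // 50%
}
```

Whether this saves measurable time depends on how large the head section is; the download itself is usually the slower part.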

PHP Summarize any URL

How can I, in PHP, get a summary of any URL? By summary, I mean something similar to the URL descriptions in Google web search results.
Is this possible? Is there already some kind of tool I can plug in to so I don't have to generate my own summaries?
I don't want to use metadata descriptions if possible.
-Dylan
What displays in Google is (generally) the META description tag. If you don't want to use that, you could use the page title instead though.
If you don't want to use metadata descriptions (btw, this is exactly what they are for), you have a lot of research and work to do. Essentially, you have to guess which part of the page is content and which is just navigation/fluff. Indeed, Google has exactly that; note however, that extracting valuable information from useless fluff is their #1 competency and they've been researching and improving that for a decade.
You can, of course, make an educated guess (e.g. "look for an element with ID or class maincontent" and get the first paragraph from it) and maybe it will be OK. The real question is, how good do you want the results to be? (Facebook has something similar for linking to websites, sometimes the summary just insists that an ad is the main content).
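Such an educated guess might be sketched as follows. guess_summary is a hypothetical helper, "maincontent" is just the example name used above, and the sample markup is invented; real pages will often need a different rule.

```php
<?php
// Hypothetical helper: first paragraph of the element whose id or class is "maincontent".
function guess_summary($html, $maxLen = 160)
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // tolerate invalid markup
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $query = '//*[@id="maincontent" or '
           . 'contains(concat(" ", normalize-space(@class), " "), " maincontent ")]//p';
    $paragraphs = $xpath->query($query);
    if ($paragraphs === false || $paragraphs->length === 0) {
        return null; // the guess failed for this page
    }
    // Collapse whitespace and cut to a snippet-sized length.
    $text = trim(preg_replace('/\s+/', ' ', $paragraphs->item(0)->textContent));
    return strlen($text) > $maxLen ? substr($text, 0, $maxLen) . '...' : $text;
}

$html = '<html><body><div id="nav"><p>Home | About</p></div>'
      . '<div class="wide maincontent"><p>PHP is a   scripting language.</p><p>More.</p></div>'
      . '</body></html>';

echo guess_summary($html); // PHP is a scripting language.
```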
The following will allow you to parse the contents of a page's title tag. Note: PHP must be configured to allow file_get_contents to retrieve URLs. Otherwise you'll have to use curl to retrieve the page HTML.
$title_open = '<title>';
$title_close = '</title>';
$page = file_get_contents( 'http://www.domain.com' );
$n = stripos( $page, $title_open ) + strlen( $title_open );
$m = stripos( $page, $title_close);
$title = substr( $page, $n, $m - $n );
While I hate promoting a service, I have found this:
embed.ly
It has an API, that returns a JSON with all the data you need.
But I am still searching for a free/open-source library to do the same thing.

PHP's preg_match_all causing Apache Segfault

I'm using two regular expressions to pull assignments out of MySQL queries and using them to create an audit trail. One of them is the 'picky' one that requires quoted column names/etc., the other one does not.
Both of them are tested and parse the values out correctly. The issue I'm having is that with certain queries the 'picky' regexp is actually just causing Apache to segfault.
I tried a variety of things to determine this was the cause up to leaving the regexp in the code, and just modifying the conditional to ensure it wasn't run (to rule out some sort of compile-time issue or something). No issues. It's only when it runs the regexp against specific queries that it segfaults, and I can't find any obvious pattern to tell me why.
The code in question:
if ($picky)
preg_match_all("/[`'\"]((?:[A-Z]|[a-z]|_|[0-9])+)[`'\"] *= *'((?:[^'\\\\]|\\\\.)*)'/", $sql, $matches);
else
preg_match_all("/[`'\"]?((?:[A-Z]|[a-z]|_|[0-9])+)[`'\"]? *= *[`'\"]?([^`'\" ,]+)[`'\"]?/", $sql, $matches);
The only difference between the two is that the first one removes the question marks on the quotes to make them non-optional and removes the option of using different kinds of quotes on the value - only allows single quotes. Replacing the first regexp with the second (for testing purposes) and using the same data removes the issue - it is definitely something to do with the regexp.
The specific SQL that is causing me grief is available at:
http://stackoverflow.pastebin.com/m75c2a2a0
Interestingly enough, when I remove the highlighted section, it all works fine. Trying to submit the highlighted section by itself causes no error.
I'm pretty perplexed as to what's going on here. Can anyone offer any suggestions as to further debugging or a fix?
EDIT: Nothing terribly exciting, but for the sake of completeness here's the relevant log entry from Apache (/var/log/apache2/error.log - There's nothing in the site's error.log. Not even a mention of the request in the access log.)
[Thu Dec 10 10:08:03 2009] [notice] child pid 20835 exit signal Segmentation fault (11)
One of these for each request containing that query.
EDIT2: On the suggestion of Kuroki Kaze, I tried gibberish of the same length and got the same segfault. Sat and tried a bunch of different lengths and found the limit. 6035 characters works fine. 6036 segfaults.
EDIT3: Changing the values of pcre.backtrack_limit and pcre.recursion_limit in php.ini mitigated the problem somewhat. Apache no longer segfaults, but my regexp no longer matches all of the matches in the string. Apparently this is a long-known (from 2007) bug in PHP/PCRE:
http://bugs.php.net/bug.php?id=40909
EDIT4: I posted the code in the answers below that I used to replace this specific regular expression as the workarounds weren't acceptable for my purpose (product for sale, can't guarantee php.ini changes and the regexp only partially working removed functionality we require). Code I posted is released into the public domain with no warranty or support of any kind. I hope it can help someone else. :)
Thank you everyone for the help!
Adam
I have been hit with a similar preg_match-related issue, same Apache segfault. Only the preg_match that causes it is built-into the CMS I'm using (WordPress).
The "workaround" that was offered was to change these settings in php.ini:
[Pcre]
;PCRE library backtracking limit.
;pcre.backtrack_limit=100000
pcre.recursion_limit=200000000
pcre.backtrack_limit=100000000
The trade-off is for rendering larger pages, (in my case, > 200 rows; when one of the columns is limited to a 1500-character text description), you'll get pretty high CPU utilization, and I'm still seeing the segfaults. Just not as frequently.
My site's close to end-of-life, so I don't really have much need (or budget) to look for a real solution. But maybe this can mitigate the issue you're seeing.
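When php.ini cannot be edited, the limits can also be changed at runtime with ini_set, and preg_last_error can be used to notice when a limit was hit. The sketch below is only an illustration: checked_match_all is a hypothetical wrapper, the limit value is arbitrary, and the pattern is a simplified form of the 'picky' expression from the question.

```php
<?php
// Hypothetical wrapper: run preg_match_all() and report PCRE failures
// instead of silently continuing with an empty result.
function checked_match_all($pattern, $subject)
{
    $count = preg_match_all($pattern, $subject, $matches);
    if ($count === false) {
        $names = array(
            PREG_INTERNAL_ERROR        => 'internal error',
            PREG_BACKTRACK_LIMIT_ERROR => 'backtrack limit exhausted',
            PREG_RECURSION_LIMIT_ERROR => 'recursion limit exhausted',
            PREG_BAD_UTF8_ERROR        => 'malformed UTF-8',
        );
        $code = preg_last_error();
        $why = isset($names[$code]) ? $names[$code] : 'error code ' . $code;
        throw new RuntimeException('preg_match_all failed: ' . $why);
    }
    return $matches;
}

// Arbitrary example value; the limit can be set per script.
ini_set('pcre.backtrack_limit', '1000000');

$sql = "UPDATE t SET `name` = 'Adam', `city` = 'Perth' WHERE id = 1";
$m = checked_match_all("/`(\\w+)` *= *'((?:[^'\\\\]|\\\\.)*)'/", $sql);
print_r(array_combine($m[1], $m[2]));
```

Raising pcre.recursion_limit very high is risky in its own right, because deep recursion can exhaust the process stack, which is the kind of crash described in the question.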
Interestingly enough, when I remove the highlighted section, it all works fine. Trying to submit the highlighted section by itself causes no error.
What about size of the submission? If you pass gibberish of equal length, what will happen?
EDIT: splitting and merging will look something like this:
$strings = explode("\n", $sql);
$matches = array(array(), array(), array());
foreach ($strings AS $string) {
preg_match_all("/[`'\"]?((?:[A-Z]|[a-z]|_|[0-9])+)[`'\"]? *= *[`'\"]?([^`'\" ,]+)[`'\"]?/", $string, $matches_temp);
$matches[0] = array_merge($matches[0], $matches_temp[0]);
$matches[1] = array_merge($matches[1], $matches_temp[1]);
$matches[2] = array_merge($matches[2], $matches_temp[2]);
}
Given that this only needs to match against the queries when saving pages or performing other not very often-executed operations, I felt the performance hit of the following code was acceptable. It parses the SQL query ($sql) and places name=>value pairs into $data. Seems to be working well and handles large queries fine.
$quoted = '';
$escaped = false;
$key = '';
$value = '';
$target = 'key';
for ($i=0; $i<strlen($sql); $i++)
{
if ($escaped)
{
$$target .= $sql[$i];
$escaped = false;
}
else if ($quoted!='')
{
if ($sql[$i]=='\\')
$escaped = true;
else if ($sql[$i]==$quoted)
$quoted = '';
else
$$target .= $sql[$i];
}
else
{
if ($sql[$i]=='\'' || $sql[$i]=='`')
{
$quoted = $sql[$i];
$$target = '';
}
else if ($sql[$i]=='=')
$target = 'value';
else if ($sql[$i]==',')
{
$target = 'key';
$data[$key] = $value;
$key = '';
$value = '';
}
}
}
if ($value!='')
$data[$key] = $value;
Thank you everyone for the help and direction!
