BACKGROUND
I own a website that indexes all psychologists of Denmark.
My site provides contact information for all the clinics as well as user ratings.
I'm currently listing 12,000 psychologists, of which about 6,000 have a website. About 1,000 of the psychologists have visited my website and filled out their profile with additional "descriptive" info (such as opening hours, prices, etc.)
I'm attempting to automatically scrape (with PHP and RegEx) the sites of those who haven't provided details to my community, for informative reasons.
I went through a good random 150 of the websites and concluded that more than 85% of them have valuable text following the word 'Velkommen' (= 'welcome' in Danish). PRECIOUS!
THE QUESTIONS
#1
How do I specify in my script that I'd only like to grab approx. 360 characters, and nothing more? This text should follow (and include) the word 'Velkommen'. Also, the match shouldn't be case-sensitive (though 'Velkommen' is usually spelled with a capital V, it can pop up in another sentence).
Also, it should be the last occurring 'velkommen' on the whole front page, since the word sometimes occurs as a menu/navigation option, which would suck, since I'd then grab the navigation options.
#2
Currently my script saves the info in arrays and then in the database.
I'm not sure how I should even go about this. What would be optimal for SEO?
1. Save the scraped text in MySQL and display that every time.
2. Render the same 360-character text [that follows 'Velkommen'] every time.
3. Render a random 360-character text from the site each time someone views a specific psychologist on my site.
An example site:
$web = "http://www.psykologdorthelau.dk/";
$website = file_get_contents ($web);
preg_match_all("/velkommen.+?/sim", $website, $information);
// THIS SHOULD MATCH THE VERY LAST 'VELKOMMEN' - it doesn't, I know :(

for ($i = 0; $i < count($information[0]); $i++) {
    preg_match_all("/Velkommen (.+?)\"/sim", $information[0][$i], $text, PREG_SET_ORDER);
    $psychologist[$i]['text'] = mysql_real_escape_string($text[0][1]);
}
Thank you to anyone who can solve this puzzle, from the wonderful country of Denmark.
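For what it's worth, the last-occurrence grab asked for in question #1 can be sketched without a regex at all. This is a rough sketch, not a definitive answer: `grabWelcomeText` is a hypothetical helper name, and the 360-character cutoff is the figure from the question.

```php
<?php
// Hypothetical sketch: grab ~360 characters starting at the LAST
// case-insensitive occurrence of 'velkommen'. $html stands in for the
// page source fetched with file_get_contents().
function grabWelcomeText($html, $length = 360)
{
    // strripos() finds the position of the LAST occurrence, ignoring
    // case -- this skips earlier menu/navigation hits.
    $pos = strripos($html, 'velkommen');
    if ($pos === false) {
        return null; // the word does not appear at all
    }
    // Take the keyword plus everything after it, strip the markup,
    // then cut the plain text down to the requested length.
    $snippet = strip_tags(substr($html, $pos));
    return substr($snippet, 0, $length);
}
```

A usage example would be `grabWelcomeText(file_get_contents($web))`; note that stripping tags after cutting into the raw HTML is a simplification and may behave oddly if 'velkommen' appears inside a tag attribute.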
When you want to fetch only a certain amount of data, you can use a file stream.
It would look something like this:
$handle = fopen("http://www.example.com/", "r"); // open a file stream

// Fetch, for example, only 10 bytes on each iteration
$chunkSize = 10;
$contents = "";

while (!feof($handle) && strlen($contents) < 360) {
    $buffer = fread($handle, $chunkSize);
    $contents .= $buffer;
}

$status = fclose($handle);
//your data is stored in $contents
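As a simpler alternative, file_get_contents() itself accepts an offset and a maximum length as its fourth and fifth arguments, so the loop can be collapsed into a single call. A sketch, using a local temp file as a stand-in for the page (the same call works on an http:// URL when allow_url_fopen is enabled):

```php
<?php
// Create a 1000-byte sample file to stand in for a remote page.
$file = tempnam(sys_get_temp_dir(), 'page');
file_put_contents($file, str_repeat('x', 1000));

// Read at most 360 bytes: offset 0, maximum length 360.
$contents = file_get_contents($file, false, null, 0, 360);
```

One caveat: for HTTP streams the offset must stay 0 (they are not seekable), but the length cap works fine.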
"the scraped data should be preceeding the word 'velkommen'":
preg_replace_callback('/velkommen(.*){360}/i',
function($matched) {
// Use $matched[1] to perform further testing
},
$contents
);
It's hacky, but it will get you started. Requires PHP 5.4 I believe.
Related
I've created a form which users input numbers into; this data is then written to a text file using fwrite.
Now my question is: is there a way to read the file, but only a certain post, and then count up how many of those have occurred? For example...
$data = sprintf("((%s,%s,%s))%s",
    $_POST['shapeType'],
    $_POST['circleRadius'],
    $_POST['circleColour'],
    PHP_EOL); // automatically use the OS-appropriate newline sequence
fwrite($handle, $data);
}
fclose($handle);
?>
Above is the fwrite; 'shapeType' is 'Circle' on this specific write. Is there a way to locate all the shapeType posts (other shapes like square, etc.), therefore producing:
There are x amount of Shapes stored within the site.
x obviously replacing the counted amount. Any ideas? I'm quite new to this, so it may be impossible altogether.
Update - this is what the text file looks like:
((Circle,120,Red))((Triangle,190,120,90,Blue))
((Circle,90,Blue))((Circle,20,Red))
You can read each line of the file into an array and then loop through that array. While you're looking at each line, just check to see if it matches 'circle' or whatever and if so, increment a counter.
<?php
$shape_type = $_POST['shapeType'];
$shape_counter = 0;
$lines = file($file_name);

foreach ($lines as $line) {
    // preg_quote() guards against regex metacharacters in the input
    if (preg_match('/' . preg_quote($shape_type, '/') . '/', $line)) {
        $shape_counter++;
    }
}

print "There are ".$shape_counter." amount of ".$shape_type."s stored within the site.";
EDIT:
If you're just wanting to get the total number of lines, it's even easier. Just do this:
<?php
$lines = file($file_name);
print "There are ".count($lines)." amount of shapes stored within the site.";
I am reading from log files which can be anything from a small log file up to 8-10 MB of logs. The typical size would probably be 1 MB. Now the key thing is that the keyword I'm looking for is normally near the end of the document, in probably 95% of the cases. I then extract 1000 characters after the keyword.
If i use this approach:
$lines = explode("\n", $body);
$reversed = array_reverse($lines);

foreach ($reversed as $line) {
    // Search for my keyword
}
Would it be more efficient than using:
$pos = stripos($body,$keyword);
$snippet_pre = substr($body, $pos, 1000);
What I am not sure about is whether stripos just starts searching through the document one character at a time. In theory, if there are 10,000 characters after the keyword, I won't have to read those into memory, whereas the first option would have to read everything into memory even though it probably only needs the last 100 lines. Could I alter it to read 100 lines into memory, then search lines 101-200 if the first 100 were not successful? Or is the query so light that it doesn't really matter?
I have a second question, and this assumes the array_reverse approach is the best one: how would I extract the next 1000 characters after I have found the keyword? Here is my woeful attempt:
$body = $this_is_the_log_content;
$lines = explode("\n", $body);
$reversed = array_reverse($lines);

foreach ($reversed as $line) {
    $pos = stripos($line, $keyword);
    $snippet_pre = substr($line, $pos, 1000);
}
Why I don't think that will work is that each $line might only be a few hundred characters, so would the better solution be to split the body every, say, 2,000 characters, and also keep the previous chunk as a backup variable? Something like this:
$body = $this_is_the_log_content;
$lines = str_split($body, 2000);
$reversed = array_reverse($lines);
$previous_line = $line;

foreach ($reversed as $line) {
    $pos = stripos($line, $keyword);
    if ($pos) {
        $line = $previous_line . ' ' . $line;
        $pos1 = stripos($line, $keyword);
        $snippet_pre = substr($line, $pos, 1000);
    }
}
I'm probably massively over-complicating this?
I would strongly consider using a tool like grep for this. You can call this command line tool from PHP and use it to search the file for the word you are looking for and do things like give you the byte offset of the matching line, give you a matching line plus trailing context lines, etc.
Here is a link to the grep manual: http://unixhelp.ed.ac.uk/CGI/man-cgi?grep
Play with the command a bit on the command line to get it the way you want it, then call it from PHP using exec(), passthru(), or similar depending on how you need to capture/display the content.
Alternatively, you can simply fopen() the file, position the pointer near the end, and move it backward through the file using fseek(), searching for the string as you go. Once you find your needle, you can then read the file from that offset until you reach the end of the file or the desired number of log entries.
Either of these might be preferable to reading the entire log file into memory and then trying to work with it.
The other thing to consider is whether 1000 characters is meaningful. Typically log files would have lines that vary in length. To me it would seem that you should be more concerned about getting the next X lines from the log file, not the next Y characters. What if a line has 2000 characters, are you saying you only want to get half of it? That may not be meaningful at all.
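The seek-from-the-end idea above can be sketched roughly as below; tailSearch is a hypothetical name, and the chunk size is an arbitrary starting point. It reads fixed-size chunks backwards from the end of the file, so in the 95% case where the keyword sits near the end, only a small fraction of the file is ever read:

```php
<?php
// Hypothetical sketch: search a file from the end, one chunk at a
// time, and return up to $take characters starting at the keyword.
function tailSearch($file, $keyword, $chunkSize = 4096, $take = 1000)
{
    $size = filesize($file);
    $handle = fopen($file, 'r');
    $buffer = '';
    for ($pos = $size; $pos > 0; ) {
        $read = min($chunkSize, $pos);
        $pos -= $read;
        fseek($handle, $pos);
        // Prepend the new chunk so matches that span a chunk
        // boundary are still found.
        $buffer = fread($handle, $read) . $buffer;
        // strripos() picks the occurrence closest to the end of file.
        $hit = strripos($buffer, $keyword);
        if ($hit !== false) {
            fclose($handle);
            return substr($buffer, $hit, $take);
        }
    }
    fclose($handle);
    return false; // keyword not present anywhere in the file
}
```

In the worst case (keyword absent or near the start) the buffer grows to the whole file, so this only pays off because the keyword is expected near the end.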
This has been bugging me for ages now, but I can't figure it out.
Basically, I'm using a hit counter which stores unique IP addresses in a file. But what I'm trying to do is make it count how many hits each IP address has made.
So instead of the file reading:
222.111.111.111
222.111.111.112
222.111.111.113
I want it to read:
222.111.111.111 - 5
222.111.111.112 - 9
222.111.111.113 - 41
This is the code i'm using:
$file = "stats.php";
$ip_list = file($file);
$visitors = count($ip_list);
if (!in_array($_SERVER['REMOTE_ADDR'] . "\n", $ip_list))
{
$fp = fopen($file,"a");
fwrite($fp, $_SERVER['REMOTE_ADDR'] . "\n");
fclose($fp);
$visitors++;
}
What i was trying to do is change it to:
if (!in_array($_SERVER['REMOTE_ADDR'] . " - [ANY NUMBER] \n", $ip_list)) {
    $fp = fopen($file, "a");
    fwrite($fp, $_SERVER['REMOTE_ADDR'] . " - 1 \n");
    fclose($fp);
    $visitors++;
}
else if (in_array($_SERVER['REMOTE_ADDR'] . " - [ANY NUMBER] \n", $ip_list)) {
    CHANGE [ANY NUMBER] TO [ANY NUMBER]+1
}
I think I can figure out the last adding part, but how do I represent the [ANY NUMBER] part so that it finds the IP whatever the following number is?
I realise I'm probably going about this all wrong, but if someone could give me a clue I'd really appreciate it.
Thanks.
This is a bad idea; don't do it this way.
It's normal to store website statistics in the file system, but not with pre-aggregation applied to them.
If you're going to use the file system, then do post-aggregation on the data; otherwise use a database.
What you are doing is a very bad idea.
But let's first answer the actual question you are asking.
To do that, you will first have to process the file into some kind of data structure that allows it. I'd personally recommend an array in the form of IP => AMOUNT.
For example (untested code):
$fd = file($file);
$ip_list = array();

foreach ($fd as $line) {
    list($ip, $amount) = explode("-", $line);
    $ip_list[$ip] = $amount;
}
Note that the code is not perfect: it leaves a space at the end of $ip and another in front of $amount, due to the format of your original data. But it works well enough to point you in the right direction. A more accurate solution would involve regular expressions or changing the original data source to a more convenient format.
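Putting that parsing together with the increment-and-rewrite step, a rough sketch could look like this; recordHit and the exact "IP - count" line format are illustrative choices, and every line is assumed to be well-formed:

```php
<?php
// Rough sketch: load "IP - count" lines into an array, bump the count
// for one visitor, and write the whole file back out.
function recordHit($file, $visitorIp)
{
    $counts = array();
    if (file_exists($file)) {
        foreach (file($file, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES) as $line) {
            // Assumes every line looks like "ip - amount".
            list($ip, $amount) = explode(' - ', $line);
            $counts[trim($ip)] = (int) trim($amount);
        }
    }
    if (!isset($counts[$visitorIp])) {
        $counts[$visitorIp] = 0; // first visit: start at zero, bump below
    }
    $counts[$visitorIp]++;

    $out = '';
    foreach ($counts as $ip => $amount) {
        $out .= $ip . ' - ' . $amount . "\n";
    }
    file_put_contents($file, $out);

    return $counts[$visitorIp]; // this visitor's new hit count
}
```

As the rest of this answer points out, rewriting the whole file on every request will not scale; with a database the same increment is a single query.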
Now the real answer to your actual problem
Your process will quickly become a performance bottleneck, as you would have to open that file, process it, and write it all back again afterwards (I'm not sure you can do in-place editing of an open file) for every request.
As you are trying to do some kind of per-IP hit count, there are a lot of better solutions to your problem:
Use an existing solution for it (like piwik)
Use an actual database for your data
Keep your file simple, with just a list of IPs, and post-process it offline periodically into the format you want
You can avoid writing that file altogether if you have access to your web server's logs (and they are set up to log every request with the originating IP); you can post-process that file instead
in_array() simply does a basic string match. It will NOT look for substrings. Ignoring how bad an idea it is to use a flat file for data storage, what you want is preg_grep(), which lets you use regexes:
$ip_list = file('ips.txt');
$matches = preg_grep('/^\d+\.\d+\.\d+\.\d+ - \d+$/', $ip_list);
Of course, this is a very basic and very broken IP address match, and it will not by itself CHANGE the values in $ip_list; you would still have to parse each matched line, increment its count, and write the file back.
I've been having a major headache lately with parsing metadata from video files, and found part of the problem is a disregard of various standards (or at least differences in interpretation) by video-production software vendors (and other reasons).
As a result I need to be able to scan through very large video (and image) files, of various formats, containers and codecs, and dig out the metadata. I've already got FFmpeg, ExifTool, Imagick, and Exiv2, each handling different types of metadata in various file types, and have been through various other options to fill some other gaps (please don't suggest libraries or other tools, I've tried them all :)).
Now I'm down to scanning the large files (up to 2 GB each) for an XMP block (which is commonly written to movie files by the Adobe suite and some other software). I've written a function to do it, but I'm concerned it could be improved.
function extractBlockReverse($file, $searchStart, $searchEnd)
{
    $handle = fopen($file, "r");
    if (!$handle) {
        throw new Exception('could not open file');
    }

    $startLen = strlen($searchStart);
    $endLen = strlen($searchEnd);

    for ($pos = 0, $output = '', $length = 0, $finished = false, $target = '';
         $length < 10000 &&
         !$finished &&
         fseek($handle, $pos, SEEK_END) !== -1;
         $pos--) {
        $currChar = fgetc($handle);
        if (!empty($output)) {
            $output = $currChar . $output;
            $length++;
            $target = $currChar . substr($target, 0, $startLen - 1);
            $finished = ($target == $searchStart);
        } else {
            $target = $currChar . substr($target, 0, $endLen - 1);
            if ($target == $searchEnd) {
                $output = $target;
                $length = $length + $endLen;
                $target = '';
            }
        }
    }

    fclose($handle);
    return $output;
}

echo extractBlockReverse("very_large_video_file.mov",
    '<x:xmpmeta',
    '</x:xmpmeta>');
At the moment it's 'ok', but I'd really like to get the most out of PHP here without crippling my server, so I'm wondering if there is a better way to do this (or tweaks to the code that would improve it), as this approach seems a bit over the top for something as simple as finding a couple of strings and pulling out anything between them.
You can use one of the fast string-searching algorithms, such as Knuth-Morris-Pratt or Boyer-Moore, to find the positions of the start and end tags, and then read all the data between them.
You should measure their performance, though: with such short search patterns, the constant factor of the chosen algorithm might not be good enough to make it worth it.
With files this big, I think the most important optimization is to NOT search the string everywhere. I don't believe a video or image will ever have an XML block smack in the middle; or if it does, it is likely garbage.
Okay, it IS possible: TIFF can do this, and JPEG too, and PNG; so why not video formats? But in real-world applications, loose-format metadata such as XMP is usually stored last. More rarely it is stored near the beginning of the file, but that's less common.
Also, I think that most XMP blocks will not have sizes too great (even if Adobe routinely pads them in order to be able to "almost always" quickly update them in-place).
So my first attempt would be to extract the first, say, 100 KB and the last 100 KB of the file. Then scan these two blocks for '<x:xmpmeta'.
If the search does not succeed, you will still be able to run the exhaustive search, but if it succeeds it will return in one ten-thousandth of the time. Conversely, even if this trick only succeeded one time in a thousand, it would still be worthwhile.
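A minimal sketch of that two-window strategy, under stated assumptions: the 100 KB window size, the function name, and the tail-before-head search order (XMP is usually last) are all choices to tune:

```php
<?php
// Hypothetical sketch: look for a block only in the first and last
// $window bytes of a file before falling back to a full scan.
function findBlockInWindows($file, $start, $end, $window = 102400)
{
    $size = filesize($file);
    // file_get_contents() takes an offset and a maximum length.
    $head = file_get_contents($file, false, null, 0, min($window, $size));
    $tail = file_get_contents($file, false, null, max(0, $size - $window));

    foreach (array($tail, $head) as $chunk) { // tail first: XMP is usually last
        $from = strpos($chunk, $start);
        if ($from !== false) {
            $to = strpos($chunk, $end, $from);
            if ($to !== false) {
                return substr($chunk, $from, $to + strlen($end) - $from);
            }
        }
    }
    return false; // caller can fall back to an exhaustive scan here
}
```

One thing this does not handle is a block straddling a window boundary; a real implementation would want to overlap the windows by at least the maximum expected block size.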
I'm working with a function taken from Corrupt (a web-based piece of software used to get "glitchy" effects with JPEG images). This function can be found in the corrupt.php file on line 23. At the moment it's not making the files glitchy enough. I made this image to show how I want the images to look; it was made by opening the JPEG in a text editor, cutting certain lines, and pasting them in other places.
I want this function to do a similar thing, but at the moment it doesn't. Any ideas? Is there a better way of doing this, maybe?
function scramble($content, $size) {
    $sStart = 10;
    $sEnd = $size - 1;
    $nReplacements = rand(1, 30);

    for ($i = 0; $i < $nReplacements; $i++) {
        $PosA = rand($sStart, $sEnd);
        $PosB = rand($sStart, $sEnd);
        $tmp = $content[$PosA];
        $content[$PosA] = $content[$PosB];
        $content[$PosB] = $tmp;
    }

    return $content;
}
It randomly swaps information around in the data loaded from your image. This causes a valid image to come out with invalid image information in some sectors. Also, image files sometimes contain additional information at the front/end of the file; this function does not look like it takes that into account and could corrupt that information as well.
To increase the amount of swapping, increase the number of replacements. The bit of code you are particularly interested in is rand(1, 30); I would suggest raising the minimum amount of scrambling first, and then the upper range if you still do not get the desired effect.
The function does random swaps between the elements of the array. The number of swaps is a randomly generated number from 1 to 30.
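If single-byte swaps stay too subtle, a hedged variation is to relocate whole runs of bytes, which is closer to the cut-a-line-and-paste-it-elsewhere effect described in the question. This is only a sketch: scrambleChunks is an illustrative name, and the chunk sizes and pass count are guesses to tune by eye:

```php
<?php
// Variation on scramble(): relocate random multi-byte chunks, which
// mimics cutting a run of data and pasting it somewhere else.
function scrambleChunks($content, $minChunk = 100, $maxChunk = 1000, $passes = 10)
{
    $sStart = 10; // leave the first bytes (file header) alone, as in the original

    for ($i = 0; $i < $passes; $i++) {
        // Cut a random chunk out of the data...
        $len = rand($minChunk, $maxChunk);
        $from = rand($sStart, strlen($content) - $len - 1);
        $chunk = substr($content, $from, $len);
        $content = substr($content, 0, $from) . substr($content, $from + $len);
        // ...and paste it back in at another random position.
        $to = rand($sStart, strlen($content) - 1);
        $content = substr($content, 0, $to) . $chunk . substr($content, $to);
    }
    return $content;
}
```

The overall length and the protected header bytes are preserved, so the file should still open in most viewers while the payload is heavily rearranged.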