PHP website data mining Preg_Match Undefined Offset

PHP website data mining Preg_Match Undefined Offset - php

I'm working on a PHP project for school. The task is to build a website to grab and analyze data from another website. I have the framework set up, and I am able to grab certain data from the desired site, but I can't seem to get the syntax right for other data that I need to obtain.
For example, the site that I am currently analyzing is a page for a specific item returned from a search of Amazon.com (e.g. search amazon.com for "iPad" and pick the first result). I am able to grab the title of the product's page, but I need to grab the review count and the price, and therein lies the issue. I'm using preg_match to get the title (works fine), but I'm not able to get the reviews nor the price. I continue to get the Undefined Offset error, which I've discovered means that there is nothing being returned that matches the given criterion. Simply checking to see whether something has been returned will not help me, since I need to obtain these data for my analysis. The 's that I'm trying to mine are unique on the page, so there is only one instance of each.
The Page Source for my product page contains the following snippits of HTML that I need to grab. (The website can, and needs to be able to handle, anything, but for this example, I searched "iPad").
<span id="priceblock_ourprice" class="a-size-medium a-color-price">$397.74</span>
I need the 397.74.
<span id="acrCustomerReviewText" class="a-size-base">1,752 customer reviews</span>
I need the 1,752.
I've tried all combinations of escape characters, wildcards, etc., but I can't seem to get beyond the Undefined Offset error. An example of my code is as follows where $link is the URL, and $f is an empty array in which I want to store the result (Note: There is NOT a space after the '<' in "< span..." It just erased everything up to the "...(.*)..." when I typed it as "< span..." without the space):
preg_match("#\< span id\=\"priceblock\_ourprice\" class\=\"a\-size\-medium a\-color\-price\"\>(.*)\<\/span\>#", file_get_contents($link), $f);
$price=$f[1]; //Offset error occurs on this line
echo $price;
Please help. I've been beating my head against this for the past two days now. I'm hoping I'm just doing something stupid. This is my first experience with preg_match and data mining. Thank you much in advanced for your time and assistance.

Code
As stated by #cabellicar123, you shouldn't use regex with html.
I believe what you are looking for is strpos() and substr(). It should look something like this:
function get_content($string, $begintag, $endtag) {
if (strpos($string, $begintag) !== False) {
$location = strpos($string, $begintag) + strlen($begintag);
$leftover = substr($string, $location);
$contents = substr($leftover, 0, strpos($leftover, $endtag));
return $contents;
}
}
// Usage (Change the variables):
$str = file_get_contents('http://www.amazon.com/OLB3-Official-League-Recreational-Ball/dp/B004KOBRMC/');
$beg = '<b class="priceLarge">$';
$end = '</b>';
get_content($str, $beg, $end);
I've provided a working example which would return the price of the object on the page, in this case, the price of a rawlings baseball.
Explanation
I'll go through the code, line by line, and explain every piece.
function get_content($string, $begintag, $endtag)
$string is the string being searched through (in this case an amazon page), $begintag is the opening tag of the element being searched for, and $closetag is the closing tag of that element. NOTE: This will only use the first instance of the opening tag, more than that will be ignored.
if (strpos($string, $begintag) !== False)
Checks if the beginning tag actually exists. Note the !== False; that's because strpos can return 0, which evaluates to False.
$location = strpos($string, $begintag) + strlen($begintag);
strpos() will return the first instance of $begintag in $string, therefore the length of the $begintag must be added to the strpos() to get the location of the end of $begintag.
$leftover = substr($string, $location);
Now that we have the $location of the opening tag, we need to narrow the $string down by setting $leftover to the part of the $string after $location.
$contents = substr($leftover, 0, strpos($leftover, $endtag));
This gets the position of the $endtag in $leftover, and stores everything before that $endtag in $contents.
As for the last few lines of code, they are specific to this example and just need to be changed to fit the circumstances.

Related

I am getting a error using php (str_replace)

I am getting a error while using php str_replace function.
I am reading out a string in a different file a JSON
and if I remove the str_replace part it works without the error but I want to make the ** go to bold if there are any other ways you know you can also just tell that.
<?php
$data = json_decode($readjson, true);
echo "<br/><br<br/>";
foreach ($data as $emp) {
echo str_replace("**","<strong>","$data"), $emp['message']."<br/>";
}
?>
And the output is
Notice: Array to string conversion in C:\Users\k-ver\Dropbox\Other\website stuff or smth\r3mind3r\changelog.php on line 16
Array - Weekley Update - Another great week at our side! We have made enournous advances in synching with the raspberry pi (the computer we are going to host from) and are closer than ever to our promised release We have also been fixing on the mute commands and are very close to making it work, aswell with unmute command.language feature is closing up on complete and about 70% of the bot has the language system working. We also made a new system that should be easier to use for bouth us devs and the translators. All thats left for the release atm is: -finishing synching -fixing the mute command and unmute command -make a functioning permissonlevel system -adding those last 30% of the bot that does not have the translationsystem in place. and the bot will have its huge release! (about time if you asked me)
(it is for a dev log)
and the part I don't understand is the notice and I also don't understand how to fix it
It would be awesome if you guys would like to help me.

This variable should be string value e.g $emp['message'] not the multi-dimensional array $data.
// see this line with $emp['message'] not $data array
str_replace("**","<strong>",$emp['message']);
EDIT: As per comment
<?php
$string = '{
"188762891137056769": {
"message": "\n**- Weekley Update -**\nAnother great week at our side! \nWe have made enournous advances in synching with the raspberry pi *(the computer we are going to host from)* and are closer than ever to our promised release\n\nWe have also been fixing on the mute commands and are very close to making it work, aswell with unmute command.language feature is closing \nup on complete and about 70% of the bot has the language system working. We also made a new system that should be easier to use for bouth \nus devs and the translators.\n\n**All thats left for the release atm is:**\n-finishing synching \n-fixing the mute command and unmute command \n-make a functioning permissonlevel system\n-adding those last 30% of the bot that does not have the translationsystem in place. \n\nand the bot will have its huge release! *(about time if you asked me)*"
}
}';
$array = json_decode($string,1);
$message = $array['188762891137056769']['message'];
$re = '/\*\*(.*?)\*\*/m';
$subst = '<strong>$1</strong>';
echo preg_replace($re, $subst, $message);
?>
DEMO: https://3v4l.org/ovhGq

You used array inside str_replace("**","","$data") this is wrong, how you can fix it just replace $data with $emp

Your code is:
foreach ($data as $emp) {
echo str_replace("**","<strong>","$data"), $emp['message']."<br/>";
}
You see $data is an array, an $emp is the current element within the foreach loop.
So, you should do: str_replace("", "", $emp)
By the way, I see this: $emp['message'], which means $emp is an array too?
Maybe you should post the $readjson variable, so we'll know what type of data is.

If you want to enclose the text between ** with <strong></strong>, you need to use a regex. Here is a little code that does what you want :
function boldify($text) {
return preg_replace('/\*\*((.|\n|\r)*)\*\*/imU', '<strong>$1</strong>', $text);
}
Basically, it uses the function preg_replace to replace according to the regex (the first parameter).
How does this regex work :
1) You have \*\* at the beginning because that's your "opening tag". (* is a special regex character, so it needs escaping.)
2) You have ((.|\n|\r)*).
The inner part : .|\n|\r says "Catch me any character (the .) or (the |) a line feed (the \n) or a carriage return (the \r).".
Then you have the inner part enclosed with (inner part)*. This says "Match the inner part any number of time.".
Finally, you have the "middle part" enclosed with (middle part). This says "Remember what you just caught inside the parentheses, we will need it for the replace.
3) You have \*\* again.
4) All this is enclosed by /regex/imU.
The / are just there to say where the regex actually is.
The imU are flags: i is ignore case, m multiline, U ungreedy.
i are m are pretty straightforward, but U says "catch the smallest group possible".
As the second parameter you have '<strong>$1</strong>'. $1 is the group we remember from 2).
The third parameter is the subject.
I hope it was clear.
You can just use it like this :
echo boldify($emp['message']);

PHP variables look the same but are not equal (I'm confused)

OK, so I shave my head, but if I had hair I wouldn't need a razor because I'd have torn it all out tonight. It's gone 3am and what looked like a simple solution at 00:30 has become far from it.
Please see the code extract below..
$psusername = substr($list[$count],16);
if ($psusername == $psu_value){
$answer = "YES";
}
else {
$answer = "NO";
}
$psusername holds the value "normann" which is taken from a URL in a text based file (url.db)
$psu_value also holds the value "normann" which is retrieved from a cookie set on the user's computer (or a parameter in the browser address bar - URL).
However, and I'm sure you can guess my problem, the variable $answer contains "NO" from the test above.
All the PHP I know I've picked up from Google searches and you guys here, so I'm no expert, which is perhaps evident.
Maybe this is a schoolboy error, but I cannot figure out what I'm doing wrong. My assumption is that the data types differ. Ultimately, I want to compare the two variables and have a TRUE result when they contain the same information (i.e normann = normann).
So if you very clever fellows can point out why two variables echo what appears to be the same information but are in fact different, it'd be a very useful lesson for me and make my users very happy.

Do they echo the same thing when you do:
echo gettype($psusername) . '\n' . gettype($psu_value);

Since i can't see what data is stored in the array $list (and the index $count), I cannot suggest a full solution to yuor problem.
But i can suggest you to insert this code right before the if statement:
var_dump($psusername);
var_dump($psu_value);
and see why the two variables are not identical.
The var_dump function will output the content stored in the variable and the type (string, integer, array ec..), so you will figure out why the if statement is returning false

Since it looks like you have non-printable characters in your string, you can strip them out before the comparison. This will remove whatever is not printable in your character set:
$psusername = preg_replace("/[[:^print:]]/", "", $psusername);

0D 0A is a new line. The first is the carriage return (CR) character and the second is the new line (NL) character. They are also known as \r and \n.
You can just trim it off using trim().
$psusername = trim($psusername);
Or if it only occurs at the end of the string then rtrim() would do the job:
$psusername = rtrim($psusername);
If you are getting the values from the file using file() then you can pass FILE_IGNORE_NEW_LINES as the second argument, and that will remove the new line:
$contents = file('url.db', FILE_IGNORE_NEW_LINES);

I just want to thank all who responded. I realised after viewing my logfile the outputs in HEX format that it was the carriage return values causing the variables to mismatch and a I mentioned was able to resolve (trim) with the following code..
$psusername = preg_replace("/[^[:alnum:]]/u", '', $psusername);
I also know that the system within which the profiles and usernames are created allow both upper and lower case values to match, so I took the precaution of building that functionality into my code as an added measure of completeness.
And I'm happy to say, the code functions perfectly now.
Once again, thanks for your responses and suggestions.

PHP preg_replace inside for loop

I'm currently trying out this PHP preg_replace function and I've run into a small problem. I want to replace all the tags with a div with an ID, unique for every div, so I thought I would add it into a for loop. But in some strange way, it only do the first line and gives it an ID of 49, which is the last ID they can get. Here's my code:
$res = mysqli_query($mysqli, "SELECT * FROM song WHERE id = 1");
$row = mysqli_fetch_assoc($res);
mysqli_set_charset("utf8");
$lyric = $row['lyric'];
$lyricHTML = nl2br($lyric);
$lines_arr = preg_split('[<br />]',$lyricHTML);
$lines = count($lines_arr);
for($i = 0; $i < $lines; $i++) {
$string = preg_replace(']<br />]', '</h4><h4 id="no'.$i.'">', $lyricHTML, 1);
echo $i;
}
echo '<h4>';
echo $string;
echo '</h4>';
How it works is that I have a large amount of text in my database, and when I add it into the lyric variable, it's just plain text. But when I nl2br it, it gets after every line, which I use here. I get the number of by using the little "lines_arr" method as you can see, and then basically iterate in a for loop.
The only problem is that it only outputs on the first line and gives that an ID of 49. When I move it outside the for loop and removes the limit, it works and all lines gets an <h4> around them, but then I don't get the unique ID I need.
This is some text I pulled out from the database
Mama called about the paper turns out they wrote about me
Now my broken heart´s the only thing that's broke about me
So many people should have seen what we got going on
I only wanna put my heart and my life in songs
Writing about the pain I felt with my daddy gone
About the emptiness I felt when I sat alone
About the happiness I feel when I sing it loud
He should have heard the noise we made with the happy crowd
Did my Gran Daddy know he taught me what a poem was
How you can use a sentence or just a simple pause
What will I say when my kids ask me who my daddy was
I thought about it for a while and I'm at a loss
Knowing that I´m gonna live my whole life without him
I found out a lot of things I never knew about him
All I know is that I´ll never really be alone
Cause we gotta lot of love and a happy home
And my goal is to give every line an <h4 id="no1">TEXT</h4> for example, and the number after no, like no1 or no4 should be incremented every iteration, that's why I chose a for-loop.

Looks like you need to escape your regexp
preg_replace('/\[<br \/\]/', ...);
Really though, this is a classic XY Problem. Instead of asking us how to fix your solution, you should ask us how to solve your problem.
Show us some example text in the database and then show us how you would like it to be formatted. It's very likely there's a better way.
I would use array_walk for this. ideone demo here
$lines = preg_split("/[\r\n]+/", $row['lyric']);
array_walk($lines, function(&$line, $idx) {
$line = sprintf("<h4 id='no%d'>%s</h4>", $idx+1, $line);
});
echo implode("\n", $lines);
Output
<h4 id="no1">Mama called about the paper turns out they wrote about me</h4>
<h4 id="no2">Now my broken heart's the only thing that's broke about me</h4>
<h4 id="no3">So many people should have seen what we got going on</h4>
...
<h4 id="no16">Cause we gotta lot of love and a happy home</h4>
Explanation of solution
nl2br doesn't really help us here. It converts \n to <br /> but then we'd just end up splitting the string on the br. We might as well split using \n to start with. I'm going to use /[\r\n]+/ because it splits one or more \r, \n, and \r\n.
$lines = preg_split("/[\r\n]+/", $row['lyric']);
Now we have an array of strings, each containing one line of lyrics. But we want to wrap each string in an <h4 id="noX">...</h4> where X is the number of the line.
Ordinarily we would use array_map for this, but the array_map callback does not receive an index argument. Instead we will use array_walk which does receive the index.
One more note about this line, is the use of &$line as the callback parameter. This allows us to alter the contents of the $line and have it "saved" in our original $lyrics array. (See the Example #1 in the PHP docs to compare the difference).
array_walk($lines, function(&$line, $idx) {
Here's where the h4 comes in. I use sprintf for formatting HTML strings because I think they are more readable. And it allows you to control how the arguments are output without adding a bunch of view logic in the "template".
Here's the world's tiniest template: '<h4 id="no%d">%s</h4>'. It has two inputs, %d and %s. The first will be output as a number (our line number), and the second will be output as a string (our lyrics).
$line = sprintf('<h4 id="no%d">%s</h4>', $idx+1, $line);
Close the array_walk callback function
});
Now $lines is an array of our newly-formatted lyrics. Let's output the lyrics by separating each line with a \n.
echo implode("\n", $lines);
Done!

If your text in db is in every line why just not explode it with \n character?
Always try to find a solution without using preg set of functions, because they are heavy memory consumers:
I would go lke this:
$lyric = $row['lyric'];
$lyrics =explode("\n",$lyrics);
$lyricsHtml=null;
$i=0;
foreach($lyrics as $val){
$i++;
$lyricsHtml[] = '<h4 id="no'.$i.'">'.$val.'</h4>';
}
$lyricsHtml = implode("\n",$lyricsHtml);

An other way with preg_replace_callback:
$id = 0;
$lyric = preg_replace_callback('~(^)|$~m',
function ($m) use (&$id) {
return (isset($m[1])) ? '<h4 id="no' . ++$id . '">' : '</h4>'; },
$lyric);

preg_replace limit issue, handling array values

I've been working with the Sphider search engine for an internal website, we need to be able to quickly search for contact details in exported .htm(l) files.
$fulltxt = ereg_replace("[_A-Za-z0-9-]+(\.[_A-Za-z0-9-]+)*#[A-Za-z0-9-]+(\.[A-Za-z0-9-]+)*(\.[A-Za-z]{2,3})", "\\0", $fulltxt);
I am replacing e-mail addresses with a convenient mailto: link so users can open Outlook straight from the search results.
However,
while (preg_match("/[^\>](".$change.")[^\<]/i", " ".$fulltxt." ", $regs)) {
$fulltxt = preg_replace("/".$regs[1]."/i", "<b>".$regs[1]."</b>", $fulltxt);
}
It replaces all matches in the search results with bold tags, which resuts into the tags been included in Outlook's 'To...' field. It looks something like this in HTML (thanks Yuriy):
<b>name</b>.surname#domain
I have tried adding a value to the 'limit' parameter:
while (preg_match("/[^\>](".$change.")[^\<]/i", " ".$fulltxt." ", $regs)) {
$fulltxt = preg_replace("/".$regs[1]."/i", "<b>".$regs[1]."</b>", $fulltxt, 1);
}
Supposingly this should be the solution to my problem by simply replacing only the first occurrence (being the name as the pattern is name-phone num-email and we always search by name), instead it only makes it incredibly slow to the point i get a timeout message from the server. I've been trying various solutions but have been out of luck.
Any ideas? Am i doing something wrong?
Thanks.
(*Original heavily edited).

Did I understand you right that something like this happens?
<b>email#domain</b>
Why don't you put tags into search results first, and only then apply "mailto:" anchors to emails? Added 's would be easy to filter out in the patter on that second step.

A maximum character limit on the preg functions?

On my site I use output buffering to grab all the output and then run it through a process function before sending it out to the browser (I don't replace anything, just break it into more manageable pieces). In this particular case, there is a massive amount of output because it is listing out a label for every country in the database (around 240 countries). The problem is that in full, my preg_match functions seems to get skipped over, it does absolutely nothing and returns no matches. However, if I remove parts of the labels (no particular part, just random pieces to reduce characters) then the preg_match functions works again. It doesn't seem to matter what I remove from the label, it just seems to be that as long as I remove so many characters. Is there some sort of cap on what the preg functions can handle or will it time out if there is too much data to be scanned over?
Edit: Here is the function that it is run through.
public function boom($data) {
$number = preg_match_all("/(<!-- ([\w]+):start -->)\n?(.*?)\n?(<!-- \\2:stop -->)/s", $data, $matches, PREG_SET_ORDER);
if ($number == 0) $data = array("content" => $data);
else unset($data);
foreach ($matches as $item):
//$item[3] = preg_replace("/\n/", "", $item[3], 1);
if (preg_match("/<!-- breaker -->/s", $item[3])) $data[$item[2]] = explode("<!-- breaker -->", $item[3]);
else $data[$item[2]] = $item[3];
endforeach;
//die(var_dump($data));
return $data;
}
And here is the unprocessed output that is sent to the page. I have determined that it is the preg_match_all at the beginning that is returning 0 matches in the variable, so the function is simply throwing the entire string it received into $data['content'] and skipping everything else.
http://pastebin.com/iGfM6gxx
I've tried putting the labels on new lines, collapsing them together, nothing seems to work. But as explained above, if I remove random pieces of it, then it goes through fine. The function works perfectly fine with every other page of normal length.

it's hard to say without seeing your regex and data, but try to change pcre.backtrack_limit / pcre.recursion_limit ( http://www.php.net/manual/en/pcre.configuration.php)

We Keep Coding

PHP, A popular general-purpose scripting language that is especially suited to web development.

PHP website data mining Preg_Match Undefined Offset - php

Related

I am getting a error using php (str_replace)

PHP variables look the same but are not equal (I'm confused)

PHP preg_replace inside for loop

preg_replace limit issue, handling array values

A maximum character limit on the preg functions?

Categories

Resources