regular expression for page scraping

I'm trying to write a page-scraping script that pulls a currency rate from a site. I need some help writing the regular expression.
Here is what I have so far.
<?php
function converter(){
    // Fetch the page source
    $html = file_get_contents("http://www.bloomberg.com/personal-finance/calculators/currency-converter/");
    // Find currencies. (using h1 to test)
    preg_match('/<h1>(.*)<\/h1>/i', $html, $title);
    // Return the match so the caller can echo it
    return $title[1];
}
$foo = converter();
echo $foo;
?>
Here is where the currencies are kept on the Bloomberg site.
site: http://www.bloomberg.com/personal-finance/calculators/currency-converter/
//<![CDATA[
var test_obj = new Object();
var price = new Object();
price['ADP:CUR'] = 125.376;
What would the expression look like to get that rate?
Any help would be great!!

This works for me. Does it need to be more flexible? And does it need to accept varying whitespace around the equals sign, or is it always exactly one space?
"/price\['ADP:CUR'\] = (\d+\.\d+)/"
Usage:
if(preg_match("/price\['ADP:CUR'\] = (\d+\.\d+)/", $YOUR_HTML, $m)) {
//Result is in $m[1]
} else {
//Not found
}
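If the whitespace around the equals sign can vary, the pattern can be loosened with \s*. The following is only a sketch under that assumption: the helper name extractRate and the sample $html string are made up for illustration and are not taken from the Bloomberg page.

```php
<?php
// Hypothetical helper: extract one rate from the page's inline JavaScript.
function extractRate(string $html, string $code): ?float
{
    // preg_quote() escapes regex metacharacters in the currency code;
    // \s* tolerates any amount of whitespace around the equals sign.
    $pattern = "/price\['" . preg_quote($code, '/') . "'\]\s*=\s*(\d+(?:\.\d+)?)/";
    return preg_match($pattern, $html, $m) ? (float) $m[1] : null;
}

// Made-up sample mirroring the snippet from the question.
$html = "var price = new Object();\nprice['ADP:CUR']   =125.376;";
var_dump(extractRate($html, 'ADP:CUR')); // float(125.376)
var_dump(extractRate($html, 'XXX:CUR')); // NULL
```

Returning null for a missing code keeps the "not found" case distinct from a rate of 0.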

There you go:
/ADP:CUR[^=]*=\s*(.*?);/i

This returns an associative array identical to the JavaScript object on the Bloomberg site.
<?php
$data = file_get_contents('http://www.bloomberg.com/personal-finance/calculators/currency-converter/');
$expression = '/price\\[\'(.*?)\'\\]\\s+=\\s+([+-]?\\d*\\.\\d+)(?![-+0-9\\.]);/';
preg_match_all($expression, $data, $matches);
$array = array_combine($matches[1], $matches[2]);
print_r($array);
echo $array['ADP:CUR'];// string(7) "125.376"
?>

Related

Get URL from <a> tag with php

Hi. I have a string that looks like this:
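<a href="https://website.com/c4ca4238a0b923820dcc509a6f75849b/2020/11/55650-vaikospinta-54vnt-lape.pdf">55650-vaikospinta-54vnt-lape.pdf</a>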
I'm trying to pull URL out with PHP, I want result like this:
https://website.com/c4ca4238a0b923820dcc509a6f75849b/2020/11/55650-vaikospinta-54vnt-lape.pdf
Things I've tried:
From another StackOverflow question, I tried this:
$a = new SimpleXMLElement($FileURL);
$file = 'SimpleXMLElement.txt';
file_put_contents($file, $a);
But the result I get is just the string between <a> and </a>, this:
55650-vaikospinta-54vnt-lape.pdf
Also from another StackOverflow question, I tried using preg_match, like this:
$file = 'preg_match.txt';
preg_match_all('/<a[^>]+href=([\'"])(?<href>.+?)\1[^>]*>/i', $FileURL, $result);
if (!empty($result)) {
# Found a link.
file_put_contents($file, $result);
}
I have no idea how regex works (assuming that's regex), but the result I get is just...:
ArrayArrayArrayArray
Thanks for any help!

You can use DOMDocument with loadHTML and getElementsByTagName, as below:
$str = '<a href="https://website.com/c4ca4238a0b923820dcc509a6f75849b/2020/11/55650-vaikospinta-54vnt-lape.pdf">55650-vaikospinta-54vnt-lape.pdf</a>';
$doc = new DOMDocument();
$doc->loadHTML($str);
$a = $doc->getElementsByTagName('a');
foreach ($a as $vals) {
    $href = $vals->getAttribute('href');
    echo $href, PHP_EOL;
}
If you don't want to use foreach, you can use $href = $a->item(0)->getAttribute('href');
Result will be
https://website.com/c4ca4238a0b923820dcc509a6f75849b/2020/11/55650-vaikospinta-54vnt-lape.pdf

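If you insist on using a regular expression (regex), this works:
<?php
$your_var = '<a href="https://website.com/c4ca4238a0b923820dcc509a6f75849b/2020/11/55650-vaikospinta-54vnt-lape.pdf">55650-vaikospinta-54vnt-lape.pdf</a>';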
preg_match('/<a[^>]+href=([\'"])(?<href>.+?)\1[^>]*>/i', $your_var, $result);
$url = $result[2];
echo "Your URL: $url";
For example, you can validate your regex online: https://regex101.com/
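As for the ArrayArrayArrayArray output in the question: preg_match_all() fills $result with one sub-array per group (here 0, 1, 'href' and 2), and file_put_contents() turns each sub-array into the word "Array". A minimal sketch of writing only the URLs follows; the $FileURL value is assumed to be the anchor tag from the question.

```php
<?php
// Sample input, assumed to match the anchor tag from the question.
$FileURL = '<a href="https://website.com/c4ca4238a0b923820dcc509a6f75849b/2020/11/55650-vaikospinta-54vnt-lape.pdf">55650-vaikospinta-54vnt-lape.pdf</a>';

preg_match_all('/<a[^>]+href=([\'"])(?<href>.+?)\1[^>]*>/i', $FileURL, $result);

// $result has four entries (0, 1, 'href', 2), each itself an array of matches.
// Only the named group holds the URLs, so write that one, one URL per line.
$urls = $result['href'];
file_put_contents('preg_match.txt', implode(PHP_EOL, $urls));

print_r($urls);
```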

XPath way:
$href = (string) simplexml_load_string($html)->xpath('//a/@href')[0];

PHP Regex match string but exclude a certain word

This question has been asked multiple times, but I didn't find a working solution for my needs.
I've created a function to check for the URLs on the output of the Google Ajax API:
https://ajax.googleapis.com/ajax/services/search/web?v=1.0&q=site%3Awww.bierdopje.com%2Fusers%2F+%22Gebruikersprofiel+van+%22+Stevo
I want to exclude the word "profile" from the output. So that if the string contains that word, skip the whole string.
This is the function I've created so far:
function getUrls($data)
{
$regex = '/https?\:\/\/www.bierdopje.com[^\" ]+/i';
preg_match_all($regex, $data, $matches);
return ($matches[0]);
}
$urls = getUrls($data);
$filteredurls = array_unique($urls);
I've created a sample to make clear what I mean exactly:
http://rubular.com/r/1U9YfxdQoU
In the sample you can see 4 strings selected from which I only need the upper 2 strings.
How can I accomplish this?

function getUrls($data)
{
$regex = '#"(https?://www\\.bierdopje\\.com[^"]*+(?<!/profile))"#';
return preg_match_all($regex, $data, $matches) ?
array_unique($matches[1]) : array();
}
$urls = getUrls($data);
Result: http://ideone.com/dblvpA
vs json_decode: http://ideone.com/O8ZixJ
But generally you should use json_decode.
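Note that the lookbehind above only rejects URLs that end in /profile. If the goal is to skip any URL that contains "profile" anywhere, a negative lookahead is one option. This is a sketch only: the function name getUrlsWithoutProfile and the sample $data are made up, and it assumes the URLs appear between double quotes with unescaped slashes.

```php
<?php
// Hypothetical variant of getUrls(): skip any URL containing "profile".
function getUrlsWithoutProfile(string $data): array
{
    // (?![^"]*profile) fails as soon as "profile" occurs before the closing quote.
    $regex = '#"(https?://www\.bierdopje\.com(?![^"]*profile)[^"]*)"#i';
    return preg_match_all($regex, $data, $matches)
        ? array_values(array_unique($matches[1]))
        : [];
}

// Made-up input imitating quoted URLs in the API response.
$data = '"http://www.bierdopje.com/users/Stevo" '
      . '"http://www.bierdopje.com/users/Stevo/profile" '
      . '"http://www.bierdopje.com/users/Stevo/profile/friends" '
      . '"http://www.bierdopje.com/users/Stevo"';

print_r(getUrlsWithoutProfile($data));
```

Only the plain /users/Stevo URL survives, and array_unique() collapses its two occurrences into one.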

Don't use regular expressions to parse JSON data. What you want to do is parse the JSON and loop over it to find the correct matching elements.
Sample code:
$input = file_get_contents('https://ajax.googleapis.com/ajax/services/search/web?v=1.0&q=site%3Awww.bierdopje.com%2Fusers%2F+%22Gebruikersprofiel+van+%22+Stevo');
$parsed = json_decode($input);
$cnt = 0;
foreach($parsed->responseData->results as $response)
{
// Skip strings with 'profile' in there
if(strpos($response->url, 'profile') !== false)
continue;
echo "Result ".++$cnt."\n\n";
echo 'URL: '.$response->url."\n";
echo 'Shown: '.$response->visibleUrl."\n";
echo 'Cache: '.$response->cacheUrl."\n\n\n";
}
Sample on CodePad (since it doesn't support loading external files the string is inlined there)

preg_match, addslashes,mb_substr not working for long strings

I am parsing an html file. I have a big string which is basically a script.
The string looks likes this:
var spConfig = new
Product.Config({"outofstock":["12663"],"instock":["12654","12655","12656","12657","12658","12659","12660","12661","12662","12664","12665"],"attributes":{"698":{"id":"698","code":"aubade_import_colorcode","label":"Colorcode","options":[{"id":"650","label":"BLUSH","price":"0","products":["12654","12655","12656","12657","12658","12659","12660","12661","12662","12663","12664","12665"]}]},"689":{"id":"689","code":"aubade_import_size_width","label":"Size
Width","options":[{"id":"449","label":"85","price":"0","products":["12654","12657","12660","12663"]},{"id":"450","label":"90","price":"0","products":["12655","12658","12661","12664"]},{"id":"451","label":"95","price":"0","products":["12656","12659","12662","12665"]}]},"702":{"id":"702","code":"aubade_import_size_cup","label":"Size
Cup","options":[{"id":"1501","label":"A","price":"0","products":["12654","12655","12656"]},{"id":"1502","label":"B","price":"0","products":["12657","12658","12659"]},{"id":"1503","label":"C","price":"0","products":["12660","12661","12662"]},{"id":"1504","label":"D","price":"0","products":["12663","12664","12665"]}]}},"template":"\u20ac#{price}","basePrice":"57","oldPrice":"57","productId":"12666","chooseText":"Choose
option...","taxConfig":{"includeTax":true,"showIncludeTax":true,"showBothPrices":false,"defaultTax":19.6,"currentTax":19.6,"inclTaxTitle":"Incl.
Tax"}});
var colorarray = new Array();
colorarray["c650"] = 'blush';
Event.observe('attribute698', 'change', function() {
var colorId = $('attribute698').value;
var attribute = 'attribute698';
var label = colorarray["c"+colorId];
if ($('attribute698').value != '') {
setImages(attribute, colorId, label);
}
}); // var currentColorLabel = 'blush'; // var currentSku = '5010-4-n'; // var currentPosition = 'v'; // //
Event.observe(window, 'load', function() { //
setImages('attribute698', null, currentColorLabel); // });
I need to extract the content from the first "(" up to the first ";".
I have tried plain string extraction and preg_match, and both failed.
Kindly suggest a solution to my problem. Below are the approaches I tried and the issues I hit.
$strScript = $tagscript->item(0)->nodeValue;
//this line returns empty string
$str_slashed = addslashes(trim($strScript) );
$pattern = '/\((.*);/';
preg_match($pattern,$str_slashed,$matches);
echo 'matches'."<br />";
var_dump($matches);
//Add slashes works only if I use it before assignment to other string
$matches = array();
$strScript = addslashes ($tagscript->item(0)->nodeValue);//. "<br />";
$pattern = '/\((.*);/';
preg_match($pattern,$strScript,$matches);
echo 'matches'."<br />";
var_dump($matches);
//str extract method
$posBracket = stripos ($strScript,'(');
echo $posBracket."<br />";
$posSemiColon = strpos ($strScript,';');
echo $posSemiColon."<br />";
$temp = mb_substr ($strScript,$posBracket ,($posSemiColon-$posBracket));
echo $temp."<br />";
The above code works for small strings
$strScript = "manisha( [is goo girl] {come(will miss u) \and \"play} ; lets go home;";
but won't work for the long strings.
How can I resolve this issue? Please help me!

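You have to add the s (dot-all) modifier to your regular expression so that . also matches newlines. The m (multiline) modifier only changes how ^ and $ behave, so it will not help here.
Try $pattern = '/\((.*?);/s'; (the lazy .*? stops at the first ; instead of the last one).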

Try using /\(([^;]*)/ as your pattern. [^;] means any character that is not a ;, and unlike . it also matches newlines.
Edit: rogers suggests adding a modifier as well. With a negated character class it is not required, but /\(([^;]*)/s does no harm.
Edit: be aware that this is not really error-proof. For example, a ; inside one of the string values of the JSON object embedded in your script would end the match too early.
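One way around the embedded-semicolon problem is to anchor on the closing "});" of the Product.Config call and hand the captured text to json_decode(). This is a rough sketch: the $strScript below is a shortened, made-up stand-in for the script in the question, and it assumes the sequence "});" does not occur inside the JSON itself.

```php
<?php
// Shortened, made-up stand-in for the script text from the question.
$strScript = <<<'JS'
var spConfig = new
Product.Config({"outofstock":["12663"],
  "template":"price; incl. tax","basePrice":"57"});
var colorarray = new Array();
JS;

// Anchor on the call's closing "});" instead of the first ";", so a ";"
// inside a JSON string does not cut the match short. The s modifier lets
// "." match newlines; the lazy ".*?" stops at the first "});".
if (preg_match('/Product\.Config\((\{.*?\})\);/s', $strScript, $m)) {
    $config = json_decode($m[1], true);
    echo $config['basePrice'], PHP_EOL; // 57
}
```

Decoding the capture also gives structured access to the data, which avoids any further string slicing.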

php regex - Scraping images from javascript object

I'm trying to scrape images from the markup of certain webpages. These webpages all have a slideshow, and the image sources are contained in JavaScript objects on the page. I'm thinking I need to call file_get_contents("http://www.example.com/page/1"); and then have a preg_match_all() call to which I can pass a phrase (e.g. "\"LargeUrl\":\"" or "\"Description\":\"") and get the string of characters up to the next quotation mark.
var photos = {};
photos['photo-391094'] = {"LargeUrl": "http://www.example.org/images/1.png","Description":"blah blah balh"};
photos['photo-391095'] = {"LargeUrl": "http://www.example.org/images/2.png","Description":"blah blah balh"};
photos['photo-391096'] = {"LargeUrl": "http://www.example.org/images/3.png","Description":"blah blah balh"};
I have this function, but it returns the entire line after the input phrase. How can I modify it to return whatever comes after the input phrase, up to the next quotation mark? Or am I doing this all wrong and there's a better way?
$page = file_get_contents("http://www.example.org/page/1");
$word = "\"LargeUrl\":\"";
if(preg_match_all("/(?<=$word)\S+/i", $page, $matches))
{
echo "<pre>";
print_r($matches);
echo "</pre>";
}
Ideally the function would return an array like the following if I input "\"LargeUrl\":\"":
$matches[0] = "http://www.example.org/images/1.png";
$matches[1] = "http://www.example.org/images/2.png";
$matches[2] = "http://www.example.org/images/3.png";

You can use parentheses to capture the parts you're interested in. A simple regex to do it is
$word = '"LargeUrl":';
// \s* allows zero or more spaces between the colon and the opening quote
$pattern = $word . '\s*"([^"]+)"';
preg_match_all("/$pattern/", $page, $matches);
print_r($matches[1]);

There is definitely a regex that will match each image URL, but you could also, if it's easier for you, match the whole object and then json_decode() the matched string.
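A minimal sketch of that second approach, using the sample lines from the question as $page. It assumes each photos[...] assignment holds one flat object literal with no nested braces.

```php
<?php
// Sample markup copied from the question.
$page = <<<'HTML'
var photos = {};
photos['photo-391094'] = {"LargeUrl": "http://www.example.org/images/1.png","Description":"blah blah balh"};
photos['photo-391095'] = {"LargeUrl": "http://www.example.org/images/2.png","Description":"blah blah balh"};
photos['photo-391096'] = {"LargeUrl": "http://www.example.org/images/3.png","Description":"blah blah balh"};
HTML;

// Capture each object literal assigned to photos['...'].
// Assumes one flat object per assignment with no nested braces.
preg_match_all("/photos\['[^']+'\]\s*=\s*(\{[^{}]*\});/", $page, $matches);

$urls = [];
foreach ($matches[1] as $json) {
    $photo = json_decode($json, true);
    if (is_array($photo) && isset($photo['LargeUrl'])) {
        $urls[] = $photo['LargeUrl'];
    }
}
print_r($urls);
```

The same loop can read $photo['Description'] as well, so no second pattern is needed for the other fields.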

I have a solution for you. Use the following code and you will get the result you need.
preg_match_all('/{"LargeUrl":(.*?)"(.*?)"/', $page, $result, PREG_PATTERN_ORDER);
for ($i = 0; $i < count($result[0]); $i++) {
echo "<pre>";
echo $result[2][$i];
echo "</pre>";
}
Thanks......p2c

Screen Scraping

Hi I'm trying to implement a screen scraping scenario on my website and have the following set so far. What I'm ultimately trying to do is replace all links in the $results variable that have "ResultsDetails.aspx?" to "results-scrape-details/" then output again. Can anyone point me in the right direction?
<?php
$url = "http://mysite:90/Testing/label/stuff/ResultsIndex.aspx";
$raw = file_get_contents($url);
$newlines = array("\t","\n","\r","\x20\x20","\0","\x0B");
$content = str_replace($newlines, "", html_entity_decode($raw));
$start = strpos($content,"<div id='pageBack'");
$end = strpos($content,'</body>',$start) + 6;
$results = substr($content,$start,$end-$start);
$pattern = 'ResultsDetails.aspx?';
$replacement = 'results-scrape-details/';
preg_replace($pattern, $replacement, $results);
echo $results;

Use a DOM tool like PHP Simple HTML DOM. With it you can find all the links you're looking for with a jQuery-like syntax.
// Create DOM object from HTML source
$dom = file_get_html('http://www.domain.com/path/to/page');
// Iterate all matching links
foreach ($dom->find('a[href^=ResultsDetails.aspx]') as $node) {
    // Rewrite the start of the href and keep the rest of the link
    $node->href = str_replace('ResultsDetails.aspx?', 'results-scrape-details/', $node->href);
}
// Output modified DOM
echo $dom->outertext;

The ? char has special meaning in regexes - either escape it and use the same code or replace the preg_replace with str_ireplace() (I'd recommend the latter approach as it is also more efficient).
(and should the html_entity_decode call really be there?)
C.
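Both options can be sketched as follows. The $results fragment is made up for illustration. Two further details matter with the original code: preg_replace() needs pattern delimiters, and neither function modifies its input, so the return value has to be assigned.

```php
<?php
// Made-up fragment standing in for $results from the question.
$results = '<a href="ResultsDetails.aspx?id=1">One</a> <a href="ResultsDetails.aspx?id=2">Two</a>';

// Plain string replacement: no delimiters or escaping needed.
// The return value must be assigned; the input string is not modified.
$plain = str_replace('ResultsDetails.aspx?', 'results-scrape-details/', $results);

// Regex equivalent: preg_quote() escapes the "." and "?", and the
// pattern needs delimiters ("/" here).
$pattern = '/' . preg_quote('ResultsDetails.aspx?', '/') . '/';
$viaRegex = preg_replace($pattern, 'results-scrape-details/', $results);

echo $plain, PHP_EOL;
var_dump($plain === $viaRegex); // bool(true)
```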
