Screen Scraping - php

Hi, I'm trying to implement a screen-scraping scenario on my website and have the following so far. What I'm ultimately trying to do is replace "ResultsDetails.aspx?" with "results-scrape-details/" in every link in the $results variable, then output the result. Can anyone point me in the right direction?
<?php
$url = "http://mysite:90/Testing/label/stuff/ResultsIndex.aspx";
$raw = file_get_contents($url);
$newlines = array("\t","\n","\r","\x20\x20","\0","\x0B");
$content = str_replace($newlines, "", html_entity_decode($raw));
$start = strpos($content,"<div id='pageBack'");
$end = strpos($content,'</body>',$start) + 6;
$results = substr($content,$start,$end-$start);
$pattern = 'ResultsDetails.aspx?';
$replacement = 'results-scrape-details/';
preg_replace($pattern, $replacement, $results);
echo $results;

Use a DOM tool like PHP Simple HTML DOM. With it you can find all the links you're looking for using a jQuery-like syntax.
// Create DOM object from HTML source
$dom = file_get_html('http://www.domain.com/path/to/page');
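// Iterate all matching links (href starts with "ResultsDetails.aspx")
foreach ($dom->find('a[href^=ResultsDetails.aspx]') as $node) {
// Swap only the page name so the rest of the href (the query values) is kept
$node->href = str_replace('ResultsDetails.aspx?', 'results-scrape-details/', $node->href);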
}
// Output modified DOM
echo $dom->outertext;

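The ? character has special meaning in regexes. Either escape it (and wrap the pattern in delimiters) and keep the same code, or replace preg_replace with str_replace() or str_ireplace(); I'd recommend the latter approach, as it is also more efficient. Either way, assign the return value back to $results, because these functions return the new string rather than modifying it in place.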
(and should the html_entity_decode call really be there?)
C.
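A minimal sketch of the plain string replacement suggested above. The sample markup is made up for illustration and stands in for the scraped $results; the commented-out line shows the regex variant with delimiters and escaping added.

```php
<?php
// Hypothetical sample markup standing in for the scraped $results.
$results = '<a href="ResultsDetails.aspx?id=1">One</a> <a href="ResultsDetails.aspx?id=2">Two</a>';

// Plain string replacement: "?" and "." have no special meaning here.
// The return value must be assigned; the input string is not changed in place.
$results = str_replace('ResultsDetails.aspx?', 'results-scrape-details/', $results);

// Regex variant: "/" delimiters added, metacharacters escaped by preg_quote().
// $results = preg_replace('/' . preg_quote('ResultsDetails.aspx?', '/') . '/', 'results-scrape-details/', $results);

echo $results, PHP_EOL;
```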

Related

Get URL from <a> tag with php

Hi. I have a string that looks like this:
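<a href="https://website.com/c4ca4238a0b923820dcc509a6f75849b/2020/11/55650-vaikospinta-54vnt-lape.pdf">55650-vaikospinta-54vnt-lape.pdf</a>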
I'm trying to pull URL out with PHP, I want result like this:
https://website.com/c4ca4238a0b923820dcc509a6f75849b/2020/11/55650-vaikospinta-54vnt-lape.pdf
Things I've tried:
From another StackOverflow question, I tried this:
$a = new SimpleXMLElement($FileURL);
$file = 'SimpleXMLElement.txt';
file_put_contents($file, $a);
But the result I get is just the string between <a> and </a>, this:
55650-vaikospinta-54vnt-lape.pdf
Also from another StackOverflow question, I tried using preg_match, like this:
$file = 'preg_match.txt';
preg_match_all('/<a[^>]+href=([\'"])(?<href>.+?)\1[^>]*>/i', $FileURL, $result);
if (!empty($result)) {
# Found a link.
file_put_contents($file, $result);
}
I have no idea how regex works (assuming that's regex), but the result I get is just...:
ArrayArrayArrayArray
Thanks for any help!
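You can use DOMDocument with loadHtml and getElementsByTagName, as below:
$str = '<a href="https://website.com/c4ca4238a0b923820dcc509a6f75849b/2020/11/55650-vaikospinta-54vnt-lape.pdf">55650-vaikospinta-54vnt-lape.pdf</a>';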
$doc = new DOMDocument();
$d=$doc->loadHtml($str);
$a = $doc->getElementsByTagName('a');
foreach ($a as $vals) {
$href = $vals->getAttribute('href');
echo $href, PHP_EOL;
}
If you don't want to use foreach, you can use $href = $a->item(0)->getAttribute('href');
Result will be
https://website.com/c4ca4238a0b923820dcc509a6f75849b/2020/11/55650-vaikospinta-54vnt-lape.pdf
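If you insist on using a regular expression (regex), this works:
<?php
$your_var = '<a href="https://website.com/c4ca4238a0b923820dcc509a6f75849b/2020/11/55650-vaikospinta-54vnt-lape.pdf">55650-vaikospinta-54vnt-lape.pdf</a>';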
preg_match('/<a[^>]+href=([\'"])(?<href>.+?)\1[^>]*>/i', $your_var, $result);
$url = $result[2];
echo "Your URL: $url";
You can test your regex online, for example at https://regex101.com/
XPath way:
// @href selects the attribute node; casting it to string yields its value
$href = (string) simplexml_load_string($html)->xpath('//a/@href')[0];

Regex to get all href tags from text

I have huge text which contains normal text and href tags. I want to retrieve all href tags by using regular expressions.
I tried href="([^"]*)" but it is returning only one href value.
$result[] = $util->execute(self::$queryToGetContentFromPagesEng3); //getting text from database
foreach ($result as $temp) {
if(preg_match("href=\"([^\"]*)\"",$temp)) {
$storeUrl []=$temp;
}
}
I need the result like this:
href=/public/coursecontent/2017-08-03-12-bhnhlwdjzyblelskiard.docx
href=/public/coursecontent/2016-07-07-07-rncsuatxhkkbeomysbmk.docx
My first point would be that regular expressions may well not be the path you want to take in this case.
But continuing with it, you might try preg_match_all instead of preg_match to find multiple occurrences. Inside your foreach, run preg_match_all on each $temp, which stores the matches in an array, and array_merge that array into your $storeUrl array.
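A rough sketch of that preg_match_all approach. The $rows array is a made-up stand-in for the text fetched from the database, and the pattern assumes the href values are always double-quoted.

```php
<?php
// Hypothetical rows standing in for the text fetched from the database.
$rows = [
    'See <a href="/public/coursecontent/a.docx">A</a> and <a href="/public/coursecontent/b.docx">B</a>.',
    'No links here.',
    'Also <a href="/public/coursecontent/c.docx">C</a>.',
];

$storeUrl = [];
foreach ($rows as $row) {
    // preg_match_all returns the number of matches; $m[1] holds capture group 1 of each match.
    if (preg_match_all('/href="([^"]*)"/i', $row, $m)) {
        $storeUrl = array_merge($storeUrl, $m[1]);
    }
}

print_r($storeUrl);
```

Note that the pattern needs delimiters (the slashes here); without them preg_match and preg_match_all emit a warning and match nothing.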
However, I believe a simpler and most likely more reliable approach would be to parse the HTML and work from the DOM. Here is a brief guide, which simplifies to something like this in your case:
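$dom = new DOMDocument();
// loadHTML expects a string, so pass one element of $result (e.g. $temp inside your foreach)
$dom->loadHTML($temp);
$xpath = new DOMXPath($dom);
// "//a[@href]" selects every <a> element that has an href attribute
$hrefs = $xpath->query('//a[@href]');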
for($i = 0; $i < $hrefs->length; $i++){
$href = $hrefs->item($i);
$url = $href->getAttribute('href');
$storeUrl[] = $url;
}
Since the title is js regex...
const myString = '...'
const regex = /href=".+?"/gi;
const regex2 = /(?<=href=").+?(?=")/gi;
//regex2 is without 'href' and "
myString.match(regex);

Unique regex replacement with capture

I have a text (html code) with many images like:
<img src="X" attributes />
I need the src value to be replaced by a unique identification like CID:# where # is this unique value. I don't know if the src values will be all different, maybe some of them can be equal.
Below is the code with the regular expression to match the images. Now, how do I make the replacement?
PS: I need to store in an array the relation between the unique code created and the string that was replaced. For instance, I need to know that the 345 id corresponds to the url "img/xxx.jpg".
preg_match_all('/<img src=[",\']([^>,^\',^"]*)[",\']([^>]*)/', $html, $matches);
$url_image = array();
$attr_image = array();
$cid = array();
foreach ($matches[1] as $i => $img){
$url_image[$i] = $matches[2][$i];
$attr_image[$i] = $matches[3][$i];
//How to replace the src value with the value of $cid?
$cid[$contador] = "CID:".date('YmdHms').'.'.time().$i;
}
It's generally a very bad idea to modify HTML/XML with regular expressions. It's nearly impossible to get right and tends to have unpleasant unintended side effects later.
You'd be much better off using something like the Tidy extension and a DOMDocument to parse the result and perform the attribute replacements you need to do.
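A minimal sketch of the DOMDocument part of that suggestion (Tidy is left out). The input markup, the CID format and the variable names are assumptions made for illustration. Each image gets its own CID even when two src values are equal, and $cidToUrl records which URL each CID replaced.

```php
<?php
// Hypothetical input; wrapped in a single <div> so libxml has one root node.
$html = '<div><img src="img/a.jpg" alt="A"><p>text</p><img src="img/b.jpg"><img src="img/a.jpg"></div>';

$doc = new DOMDocument();
// @ silences warnings about imperfect markup; the flags skip the implied <html>/<body> and doctype.
@$doc->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

$stamp = date('YmdHis');   // "i" is minutes; "m" would be the month
$cidToUrl = [];            // CID => original src
$i = 0;

foreach ($doc->getElementsByTagName('img') as $img) {
    $cid = "CID:$stamp.$i";
    $cidToUrl[$cid] = $img->getAttribute('src');
    $img->setAttribute('src', $cid);
    $i++;
}

echo $doc->saveHTML(), PHP_EOL;
print_r($cidToUrl);
```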
Here is the solution used:
preg_match_all('/<img src=[",\']([^>,^\',^"]*)[",\']([^>]*)/', $html, $matches);
$url_image = array();
$attr_image = array();
$cid = array();
foreach ($matches[1] as $i => $img){
$url_image[$i] = $matches[1][$i];
$attr_image[$i] = $matches[2][$i];
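// "i" is the minutes format character; "m" would repeat the month
$cid[$i] = "CID:".date('YmdHis').'.'.time().$i;
// Escape every regex metacharacter in the URL, not just "/"
$tag_img = preg_quote($img, '/');
//Replace each specific occurrence
$html = preg_replace('/'.$tag_img.'/', $cid[$i], $html, 1);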
}

regular expression for page scraping

I'm trying to write a page-scraping script to pull a currency rate from a site. I need some help writing the regular expression.
Here is what I have so far.
<?php
function converter(){
// Create DOM from URL or file
$html = file_get_contents("http://www.bloomberg.com/personal-finance/calculators/currency-converter/");
// Find currencies. ( using h1 to test)
preg_match('/<h1>(.*)<\/h1>/i', $html, $title);
$title_out = $title[1];
echo $title_out;
}
$foo = converter();
echo $foo;
?>
Here is where the currencies are kept on the Bloomberg site.
site: http://www.bloomberg.com/personal-finance/calculators/currency-converter/
//<![CDATA[
var test_obj = new Object();
var price = new Object();
price['ADP:CUR'] = 125.376;
What would the expression look like to get that rate?
Any help would be great!!
This works for me. Does it need to be more flexible? And does it need to allow varying whitespace around the equals sign, or is it always exactly one space?
"/price\['ADP:CUR'\] = (\d+\.\d+)/"
Usage:
if(preg_match("/price\['ADP:CUR'\] = (\d+\.\d+)/", $YOUR_HTML, $m)) {
//Result is in $m[1]
} else {
//Not found
}
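If the spacing around the equals sign can vary, a variant of the pattern above might look like the sketch below. The $html sample is made up to mimic the page's inline JavaScript; only the ADP:CUR value comes from the question.

```php
<?php
// Hypothetical snippet of the page's inline JavaScript.
$html = "var price = new Object();\nprice['ADP:CUR']=125.376;\nprice['AED:CUR']   =   3.6729;";

$rate = null;
// \s* allows zero or more whitespace characters on either side of "=".
if (preg_match("/price\['ADP:CUR'\]\s*=\s*(\d+\.\d+)/", $html, $m)) {
    $rate = (float) $m[1];
    echo $rate, PHP_EOL;
} else {
    echo "Not found", PHP_EOL;
}
```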
there you go:
/ADP:CUR[^=]*=\s*(.*?);/i
This returns an associative array identical to the JavaScript object on the Bloomberg site.
<?php
$data = file_get_contents('http://www.bloomberg.com/personal-finance/calculators/currency-converter/');
$expression = '/price\\[\'(.*?)\'\\]\\s+=\\s+([+-]?\\d*\\.\\d+)(?![-+0-9\\.]);/';
preg_match_all($expression, $data, $matches);
$array = array_combine($matches[1], $matches[2]);
print_r($array);
echo $array['ADP:CUR'];// string(7) "125.376"
?>

regex help with getting tag content in PHP

so I have the code
function getTagContent($string, $tagname) {
$pattern = "/<$tagname.*?>(.*)<\/$tagname>/";
preg_match($pattern, $string, $matches);
print_r($matches);
}
and then I call
$url = "http://www.freakonomics.com/2008/09/24/wall-street-jokes-please/";
$html = file_get_contents($url);
getTagContent($html,"title");
but then it shows that there are no matches, while if you open the source of the url there clearly exist a title tag....
what did I do wrong?
try DOM
$url = "http://www.freakonomics.com/2008/09/24/wall-street-jokes-please/";
$doc = new DOMDocument();
$dom = $doc->loadHTMLFile($url);
$items = $doc->getElementsByTagName('title');
for ($i = 0; $i < $items->length; $i++)
{
echo $items->item($i)->nodeValue . "\n";
}
The 'title' tag is not on the same line as its closing tag, so your preg_match doesn't find it.
In Perl, you can add a /s switch to make it slurp the whole input as though on one line: I forget whether preg_match will let you do so or not.
But this is just one of the reasons why parsing XML and variants with regexp is a bad idea.
Probably because the title is spread over multiple lines. You need to add the s option so that the dot also matches newlines.
$pattern = "/<$tagname.*?>(.*)<\/$tagname>/s";
Have your php function getTagContent like this:
function getTagContent($string, $tagname) {
$pattern = '/<'.$tagname.'[^>]*>(.*?)<\/'.$tagname.'>/is';
preg_match($pattern, $string, $matches);
print_r($matches);
}
It is important to use the non-greedy .*? to match the text between the start and end tags; equally important are the flags s (DOTALL, so the dot matches newlines as well) and i (case-insensitive comparison).
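To see those flags at work, here is a small sketch that returns the match instead of printing it. The function name firstTagContent and the sample markup are made up for illustration; preg_quote and \b are additions not present in the answer above.

```php
<?php
// Returns the inner text of the first <$tagname> element, or null if none is found.
// (Hypothetical helper, not part of the original answer.)
function firstTagContent(string $string, string $tagname): ?string {
    $tag = preg_quote($tagname, '/');
    // [^>]* skips attributes, (.*?) is non-greedy, "s" lets "." match newlines, "i" ignores case.
    if (preg_match('/<' . $tag . '\b[^>]*>(.*?)<\/' . $tag . '>/is', $string, $matches)) {
        return trim($matches[1]);
    }
    return null;
}

// Hypothetical markup: an upper-case title split across lines, plus a second title later on.
$html = "<html><head><TITLE>\n  Wall Street Jokes, Please\n</TITLE></head><body><title>second</title></body></html>";

var_dump(firstTagContent($html, 'title'));
```

With a greedy (.*) the capture would run on to the last closing tag in the document; the non-greedy form stops at the first one.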
