how to find urls under double quote

how to find urls under double quote - php

let's say we load the source code of this question and we want to find the url alongside "childUrl"
or goto this site source code and search "childUrl".
<?php
$sites_html = file_get_contents("https://stackoverflow.com/questions/46272862/how-to-find-urls-under-double-quote");
$html = new DOMDocument();
#$html->loadHTML($sites_html);
foreach() {
# now i want here to echo the link alongside "childUrl"
}
?>

Try this
<?php
function extract($url){
$sites_html = file_get_contents("$url");
$html = new DOMDocument();
$$html->loadHTML($sites_html);
foreach ($html->loadHTML($sites_html) as $row)
{
if($row=="wanted_url")
{
echo $row;
}
}
}
?>

you can use regex:
try this code
$matches = [[],[]];
preg_match_all('/\"wanted_url\": \"([^\"]*?)\"/', $sites_html, $matches);
foreach($matches[1] as $match) {
echo $match;
}
this will print all urls with wanted_url tag

Related

Optimize remote page retrieving and parsing

I'm retrieving a remote page with PHP, getting a few links from that page and accessing each link and parsing it.
It takes me about 12 seconds which are way too much, and I need to optimize the code somehow.
My code is something like that:
$result = get_web_page('THE_WEB_PAGE');
preg_match_all('/<a data\-a=".*" href="(.*)">/', $result['content'], $matches);
foreach ($matches[2] as $lnk) {
$result = get_web_page($lnk);
preg_match('/<span id="tests">(.*)<\/span>/', $result['content'], $match);
$re[$index]['test'] = $match[1];
preg_match('/<span id="tests2">(.*)<\/span>/', $result['content'], $match);
$re[$index]['test2'] = $match[1];
preg_match('/<span id="tests3">(.*)<\/span>/', $result['content'], $match);
$re[$index]['test3'] = $match[1];
++$index;
}
I have some more preg_match calls inside the loop.
How can I optimize my code?
Edit:
I've changed my code to use xpath instead of regex, and it became much more slower.
Edit2:
That's my full code:
<?php
$begin = microtime(TRUE);
$result = get_web_page('WEB_PAGE');
$dom = new DOMDocument();
$dom->loadHTML($result['content']);
$xpath = new DOMXPath($dom);
// Get the links
$matches = $xpath->evaluate('//li[#class = "lasts"]/a[#class = "lnk"]/#href | //li[#class=""]/a[ #class = "lnk"]/#href');
if ($matches === FALSE) {
echo 'error';
exit();
}
foreach ($matches as $match) {
$links[] = 'WEB_PAGE'.$match->value;
}
$index = 0;
// For each link
foreach ($links as $link) {
echo (string)($index).' loop '.(string)(microtime(TRUE)-$begin).'<br>';
$result = get_web_page($link);
$dom = new DOMDocument();
$dom->loadHTML($result['content']);
$xpath = new DOMXPath($dom);
$match = $xpath->evaluate('concat(//span[#id = "header"]/span[#id = "sub_header"]/text(), //span[#id = "header"]/span[#id = "sub_header"]/following-sibling::text()[1])');
if ($matches === FALSE) {
exit();
}
$data[$index]['name'] = $match;
$matches = $xpath->evaluate('//li[starts-with(#class, "active")]/a/text()');
if ($matches === FALSE) {
exit();
}
foreach ($matches as $match) {
$data[$index]['types'][] = $match->data;
}
$matches = $xpath->evaluate('//span[#title = "this is a title" and #class = "info"]/text()');
if ($matches === FALSE) {
exit();
}
foreach ($matches as $match) {
$data[$index]['info'][] = $match->data;
}
$matches = $xpath->evaluate('//span[#title = "this is another title" and #class = "name"]/text()');
if ($matches === FALSE) {
exit();
}
foreach ($matches as $match) {
$data[$index]['names'][] = $match->data;
}
++$index;
}
?>

As others mentioned, use a parser instead (ie DOMDocument) and combine it with xpath queries. Consider the following example:
<?php
# set up some dummy data
$data = <<<DATA
<div>
<a class='link'>Some link</a>
<a class='link' id='otherid'>Some link 2</a>
</div>
DATA;
$dom = new DOMDocument();
$dom->loadHTML($data);
$xpath = new DOMXPath($dom);
# all links
$links = $xpath->query("//a[#class = 'link']");
print_r($links);
# special id link
$special = $xpath->query("//a[#id = 'otherid']")
# and so on
$textlinks = $xpath->query("//a[startswith(text(), 'Some')]");
?>

Consider using a DOM framework for PHP. This should be way faster.
Use PHP's DOMDocument with xpath queries:
http://php.net/manual/en/class.domdocument.php
See Jan's answer for more explanation.
The following also works but is less preferable, according to the comments.
For example:
http://simplehtmldom.sourceforge.net/
an example to get all a tags on a page:
<?php
include_once('simple_html_dom.php');
$url = "http://your_url/";
$html = new simple_html_dom();
$html->load_file($url);
foreach($html->find("a") as $link)
{
// do something with the link
}
?>

PHP Web Crawler doesn't crawl .php files

This is the simple webcrawler I was trying to build
<?php
$to_crawl = "http://samplewebsite.com/about.php";
function get_links($url)
{
$input = #file_get_contents($url);
$regexp = " <a\s[^>]*href=(\"??)([^\" >]*?)\\1[^>]*>(.*)<\/a> ";
preg_match_all("/$regexp/siU", $input, $matches);
$l = $matches[2];
foreach ($l as $link) {
echo $link."</br>";
}
}
get_links($to_crawl);
?>
When I try to run the script with the $to_crawl variable set to a url ending with a file name, e.g. "facebook.com/about", it works, but for some reason, it just echo's nothing when the link is ending with a '.php' filename. Can someone please help?

To get all links and their inner texts, you can use DOMDocument like this:
$dom = new DOMDocument;
#$dom->loadHTML($input); // Your input (HTML code)
$xp = new DOMXPath($dom);
$links = $xp->query('//a[#href]'); // XPath to get only <a> tags with a href attribute
$result = array();
foreach ($links as $link) {
$result[] = array($link->getAttribute("href"), $link->nodeValue);
}
print_r($result);
See IDEONE demo

Simple HTML DOM Not Finding DIV

I have code trying to extract the Event SKU from the Robot Events Page, here is an example. The code that I am using dosn't find any of the SKU on the page. The SKU is on line 411, with a div of the class "product-sku". My code doesn't event find the Div on the page and just downloads all the events. Here is my code:
<?php
require('simple_html_dom.php');
$html = new simple_html_dom();
if(!$events)
{
echo mysqli_error($con);
}
while($event = mysqli_fetch_row($events))
{
$htmldown = file_get_html($event[4]);
$html->load($htmldown);
echo "Downloaded";
foreach ($html->find('div[class=product-sku]') as $row) {
$sku = $row->plaintext;
echo $sku;
}
}
?>
Can anyone help me fix my code?

This code is used DOMDocument php class. It works successfully for below sample HTML. Please try this code.
// new dom object
$dom = new DOMDocument();
// HTML string
$html_string = '<html>
<body>
<div class="product-sku1" name="div_name">The this the div content product-sku</div>
<div class="product-sku2" name="div_name">The this the div content product-sku</div>
<div class="product-sku" name="div_name">The this the div content product-sku</div>
</body>
</html>';
//load the html
$html = $dom->loadHTML($html_string);
//discard white space
$dom->preserveWhiteSpace = TRUE;
//the table by its tag name
$divs = $dom->getElementsByTagName('div');
// loop over the all DIVs
foreach ($divs as $div) {
if ($div->hasAttributes()) {
foreach ($div->attributes as $attribute){
if($attribute->name === 'class' && $attribute->value == 'product-sku'){
// Peri DIV class name and content
echo 'DIV Class Name: '.$attribute->value.PHP_EOL;
echo 'DIV Content: '.$div->nodeValue.PHP_EOL;
}
}
}
}

I would use a regex (regular expression) to accomplish pulling skus out.
The regex:
preg_match('~<div class="product-sku"><b>Event Code:</b>(.*?)</div>~',$html,$matches);
See php regex docs.
New code:
<?php
if(!$events)
{
echo mysqli_error($con);
}
while($event = mysqli_fetch_row($events))
{
$htmldown = curl_init($event[4]);
curl_setopt($htmldown, CURLOPT_RETURNTRANSFER, true);
$html=curl_exec($htmldown);
curl_close($htmldown)
echo "Downloaded";
preg_match('~<div class="product-sku"><b>Event Code:</b>(.*?)</div>~',$html,$matches);
foreach ($matches as $row) {
echo $row;
}
}
?>
And actually in this case (using that webpage) being that there is only one sku...
instead of:
foreach ($matches as $row) {
echo $row;
}
You could just use: echo $matches[1]; (The reason for array index 1 is because the whole regex pattern plus the sku will be in $matches[0] but just the subgroup containing the sku is in $matches[1].)

try to use
require('simple_html_dom.php');
$html = new simple_html_dom();
if(!$events)
{
echo mysqli_error($con);
}
while($event = mysqli_fetch_row($events))
{
$htmldown = str_get_html($event[4]);
echo "Downloaded";
foreach ($htmldown->find('div[class=product-sku]') as $row) {
$sku = $row->plaintext;
echo $sku;
}
}
and if class "product-sku" is only for div's then you can use
$htmldown->find('.product-sku')

scrapr image url from wikipedia page

I created regex which gives image url from the source code of the page.
<?php
function get_logo($html, $url)
{
//preg_match_all('', $html, $matches);
//preg_match_all('~\b((\w+ps?://)?\S+(png|jpg))\b~im', $html, $matches);
if (preg_match_all('/\bhttps?:\/\/\S+(?:png|jpg)\b/', $html, $matches)) {
echo "First";
return $matches[0][0];
} else {
if (preg_match_all('~\b((\w+ps?://)?\S+(png|jpg))\b~im', $html, $matches)) {
echo "Second";
return url_to_absolute($url, $matches[0][0]);
//return $matches[0][0];
} else
return null;
}
}
But for wikipedia page image url is like this
http://en.wikipedia.org/wiki/File:Nelson_Mandela-2008_(edit).jpg
which always fails in my regex.
How can I get rid of this?

Why try to parse HTML with regex when this can easily be done with the DOMDocument class in PHP.
<?php
$doc = new DOMDocument();
#$doc->loadHTMLfile( "http://www.wikipedia.org/" );
$images = $doc->getElementsByTagName("img");
foreach( $images as $image ) {
echo $image->getAttribute("src");
echo "<br>";
}
?>

trying to scrape all facebook links from a web page

I'm trying to scrape the page for links from Facebook. However, I get a blank page, without any error message.
My code is as follows:
<?php
error_reporting(E_ALL);
function getFacebook($html) {
$matches = array();
if (preg_match('~^https?://(?:www\.)?facebook.com/(.+)/?$~', $html, $matches)) {
print_r($matches);
}
}
$html = file_get_contents('http://curvywriter.info/contact-me/');
getFacebook($html);
What's wrong with it?

A better alternative (and more robust) would be to use DOMDocument and DOMXPath:
<?php
error_reporting(E_ALL);
function getFacebook($html) {
$dom = new DOMDocument;
#$dom->loadHTML($html);
$query = new DOMXPath($dom);
$result = $query->evaluate("(//a|//A)[contains(#href, 'facebook.com')]");
$return = array();
foreach ($result as $element) {
/** #var $element DOMElement */
$return[] = $element->getAttribute('href');
}
return $return;
}
$html = file_get_contents('http://curvywriter.info/contact-me/');
var_dump(getFacebook($html));
For your specific problem, however, I did the following things:
Change preg_match to preg_match_all, in order to not stop after the first finding.
Removed the ^ (start) and $ (end) characters from the pattern. Your links will appear in the middle of the document, not in the beginning or end (definitely not both!)
So the corrected code:
<?php
error_reporting(E_ALL);
function getFacebook($html) {
$matches = array();
if (preg_match_all('~https?://(?:www\.)?facebook.com/(.+)/?~', $html, $matches)) {
print_r($matches);
}
}
$html = file_get_contents('http://curvywriter.info/contact-me/');
getFacebook($html);

We Keep Coding

PHP, A popular general-purpose scripting language that is especially suited to web development.

how to find urls under double quote - php

Try this <?php function extract($url){ $sites_html = file_get_contents("$url"); $html = new DOMDocument(); $$html->loadHTML($sites_html); foreach ($html->loadHTML($sites_html) as $row) { if($row=="wanted_url") { echo $row; } } } ?>

you can use regex: try this code $matches = [[],[]]; preg_match_all('/\"wanted_url\": \"([^\"]*?)\"/', $sites_html, $matches); foreach($matches[1] as $match) { echo $match; } this will print all urls with wanted_url tag

Related

Optimize remote page retrieving and parsing

PHP Web Crawler doesn't crawl .php files

Simple HTML DOM Not Finding DIV

scrapr image url from wikipedia page

trying to scrape all facebook links from a web page

Categories

Resources