I want to find a specific URL in an HTML page and extract parts of it.
The URL is on this page:
http://site1.com/games/arcade/139173-angry-birds-friends-1-7-0.html
and looks like this:
http://download.site2.org/?server=2&apkid=com.rovio.angrybirdsfriends&ver=1.7.0
I want 3 parts of it:
2
com.rovio.angrybirdsfriends
1.7.0
My code:
$html = file_get_contents("http://site1.com/games/name/139173-angry-birds-friends-1-7-0.html");
preg_match("/download(.*)/", $html, $results);
echo $results[0];
Is this what you are looking for?
$url = 'http://download.site2.org/?server=2&apkid=com.rovio.angrybirdsfriends&ver=1.7.0';
$query = parse_url($url, PHP_URL_QUERY);
parse_str($query, $params);
echo $params['server'], PHP_EOL;
echo $params['apkid'], PHP_EOL;
echo $params['ver'], PHP_EOL;
Output:
2
com.rovio.angrybirdsfriends
1.7.0
Update
// Read HTML
$html = file_get_contents(
'http://getandroidapp.org/games/arcade/'
. '139173-angry-birds-friends-1-7-0.html'
);
// Turn HTML into a DOM document
$dom = new DOMDocument();
@$dom->loadHTML($html); // Mute warnings
// Find anchor ...
foreach ($dom->getElementsByTagName('a') as $link) {
$href = $link->getAttribute('href');
// ... having a query part that starts with 'server='
if (preg_match('#\?server=#', $href)) {
$url = $href;
// Parse query string from href
$query = parse_url($url, PHP_URL_QUERY);
parse_str($query, $params);
// Display values
echo $params['server'], PHP_EOL;
echo $params['apkid'], PHP_EOL;
echo $params['ver'], PHP_EOL;
// One is enough
break;
}
}
Output:
2
com.rovio.angrybirdsfriends
1.7.0
It is not totally fool-proof but maybe good enough in your case.
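If suppressing the loadHTML() warnings with @ feels too blunt, a small variation on the snippet above is to let libxml collect them instead:
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
// Inspect or discard the collected warnings
libxml_clear_errors();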
Related
I want to remove URLs of certain sites within a string
I used this:
<?php
$URLContent = '<p><a href="https://www.google.com">Google</a></p><p><a href="https://www.anothersite.com">AnotherSite</a></p>';
$LinksToRemove = array('google.com', 'yahoo.com', 'msn.com');
$LinksToCheck = in_array('google.com' , $LinksToRemove);
if (strpos($URLContent, $LinksToCheck) !== 0) {
$URLContent = preg_replace('#<a.*?>([^>]*)</a>#i', '$1', $URLContent);
}
echo $URLContent;
?>
In this example, I want to remove the URLs of the google.com, yahoo.com and msn.com websites only if any of them is found in the string $URLContent, but keep any other links.
The result of the previous code is:
<p>Google</p><p>AnotherSite</p>
but I want it to be:
<p>Google</p><p><a href="https://www.anothersite.com">AnotherSite</a></p>
One solution would be to explode your $URLContent and compare each part against the values in $LinksToRemove.
It could look like this:
<?php
$URLContent = '<p><a href="https://www.google.com">Google</a></p><p><a href="https://www.anothersite.com">AnotherSite</a></p>';
$urlList = explode('</p>', $URLContent);
$LinksToRemove = array('google.com', 'yahoo.com', 'msn.com');
$urlFormat = [];
foreach ($urlList as $url) {
foreach ($LinksToRemove as $link) {
if (str_contains($url, $link)) {
$url = '<p>' . ucfirst(str_replace('.com', '', $link)); // closing </p> is added back by the implode below
break;
}
}
$urlFormat[] = $url;
}
$result = implode('</p>', $urlFormat);
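Echoing $result with the sample $URLContent above should then give the output you were after:
echo $result;
// <p>Google</p><p><a href="https://www.anothersite.com">AnotherSite</a></p>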
I want to grab the URL with the highest pg value:
$html ='
';
I use this regex to locate the appropriate links:
preg_match_all('/<a.*href="\.\/\?pg=(\d+)".*>(?:.*)<\/a>/U', $html, $preg_matches);
Sometimes, the links include another parameter:
http://example.com/?pg=3&test=1
My question is, how do I adjust my regex so links with the added parameters are included as well?
Use a DOM parser to find the anchors.
Use parse_url to parse the URLs and get the query string.
Use parse_str to get the individual query values.
Example:
$html = '
';
$dom = new DOMDocument;
$dom->loadHTML($html);
$anchors = $dom->getElementsByTagName('a');
foreach ($anchors as $anchor) {
$url = $anchor->getAttribute('href');
$query = parse_url($url, PHP_URL_QUERY);
parse_str($query, $output);
$pg = $output['pg'];
//do something
}
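To actually grab the link with the highest pg value, you can extend that loop to track the maximum, for example:
$maxPg = -1;
$maxUrl = null;
foreach ($anchors as $anchor) {
    $url = $anchor->getAttribute('href');
    parse_str((string) parse_url($url, PHP_URL_QUERY), $output);
    // Remember the link with the largest pg value seen so far
    if (isset($output['pg']) && (int) $output['pg'] > $maxPg) {
        $maxPg = (int) $output['pg'];
        $maxUrl = $url;
    }
}
echo $maxUrl;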
Here's a helpful tutorial on HTML parsing in PHP: http://htmlparsing.com/php.html
Also see this answer on why you should not use regex to parse HTML: https://stackoverflow.com/a/1732454/81785
$html ='
';
preg_match_all('/<a[^>]+href=\"(.*?)\"[^>]*>(.*)?<\/a>/', $html, $out);
$result = [];
foreach ($out[1] as $link){
parse_str(parse_url($link, PHP_URL_QUERY), $atr);
$result[$link] = $atr['pg'];
}
print_r($result);
// "http://example.com/?pg=1" => "1"
// "http://example.com/?pg=2" => "2"
// "http://example.com/?pg=4&test=1" => "4"
I am trying to get all links containing a specific URL part on a given page using PHPQuery. I am using the PHP support syntax of PHPQuery.
include_once 'phpQuery.php';
$url = 'http://www.phonearena.com/phones/manufacturer/';
$doc = phpQuery::newDocumentFile($url);
$urls = $doc['a'];
foreach ($urls as $url) {
echo pq($url)->attr('href') . '<br>';
}
The code above works, but it shows all the links.
I want to show only those containing "/phones/manufacturer/".
I tried this but it shows nothing:
include_once 'phpQuery.php';
$url = 'http://www.phonearena.com/phones/manufacturer/';
$doc = phpQuery::newDocumentFile($url);
$urls = $doc['a'];
foreach ($urls as $url) {
echo pq($url)->attr('href:contains("/phones/manufacturer/")') . '<br>';
}
Use the code below to get all URLs from that site:
$doc = new DOMDocument();
@$doc->loadHTML(file_get_contents('http://www.phonearena.com/phones/manufacturer/'));
$ahreftags = $doc->getElementsByTagName('a');
foreach ($ahreftags as $tag) {
echo "<br/>";
echo $tag->getAttribute('href');
echo "<br/>";
}
exit;
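If you only want the links that contain "/phones/manufacturer/", a simple check inside that loop will do (a minimal variation on the code above):
foreach ($ahreftags as $tag) {
    $href = $tag->getAttribute('href');
    // Keep only the manufacturer links
    if (strpos($href, '/phones/manufacturer/') !== false) {
        echo $href . '<br/>';
    }
}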
Try this; see also this little Italian guide and the jQuery documentation for the attribute-contains selector:
include_once 'phpQuery.php';
$url = 'http://www.phonearena.com/phones/manufacturer/';
$doc = phpQuery::newDocumentFile($url);
$urls = $doc['a[href*="/phones/manufacturer/"]'];
foreach ($urls as $url) {
echo pq($url)->attr('href') . '<br>';
}
I am using Simple HTML DOM, trying to get strings from a website. When I print out $title[0] within the function it shows just one string, but when I save it in the return array and print out the return value, I get never-ending output full of RECURSION.
I don't understand why it would work with the second variable $oTitle.
<?php
include 'scripts/simple_html_dom.php';
function getDetails($id) {
$url = "http://www.something.com";
$html = file_get_html ( $url );
$title = $html->find('span[itemprop=name]');
print_r($title[0] . PHP_EOL); //prints out the correct title
$oTitle = "Something"; //there is also code for this variable but it works as it should
$details = array("Title" => $title[0], "Original Title" => $oTitle);
return $details;
flush ();
}
$values = getDetails($number);
print_r($values); //code breakes here
?>
Take a look at this page: http://simplehtmldom.sourceforge.net/
As far as I can see, you're using this parser.
In order to get HTML content you should use something like this:
// Create DOM from URL or file
$html = file_get_html('http://www.google.com/');
// Find all images
foreach($html->find('img') as $element)
echo $element->src . '<br>';
// Find all links
foreach($html->find('a') as $element)
echo $element->href . '<br>';
In order to dump the contents as plain text, you should use something like this:
// Dump contents (without tags) from HTML
echo file_get_html('http://www.google.com/')->plaintext;
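As for the RECURSION output: print_r is most likely walking the simple_html_dom node object you stored in the returned array (nodes hold references back to their parent and the document). A minimal fix, assuming the code from your question, is to store the node's text instead of the node object itself:
$title = $html->find('span[itemprop=name]');
// Store the plain text of the node, not the node object
$details = array("Title" => $title[0]->plaintext, "Original Title" => $oTitle);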
Try this code:
<?php
include 'simple_html_dom.php';
function getDetails() {
$url = "http://www.godaddy.com";
$html = file_get_html ( $url );
$title = getTitle($url);
echo $title; //prints out the correct title
$oTitle = "Something"; //there is also code for this variable but it works as it should
$details = array("Title" => $title, "Original Title" => $oTitle);
return $details;
flush ();
}
function getTitle($Url){
$str = file_get_contents($Url);
if(strlen($str)>0){
preg_match("/\<title\>(.*)\<\/title\>/",$str,$title);
return $title[1];
}
}
$values = getDetails();
print_r($values); //code breakes here
?>
Thanks for taking the time to read my post... I'm trying to extract some information from my website using Simple HTML Dom...
I have it reading from the HTML source ok, now I'm just trying to extract the information that I need. I have a feeling I'm going about this in the wrong way... Here's my script...
<?php
include_once('simple_html_dom.php');
// create doctype
$dom = new DOMDocument("1.0");
// display document in browser as plain text
// for readability purposes
//header("Content-Type: text/plain");
// create root element
$xmlProducts = $dom->createElement("products");
$dom->appendChild($xmlProducts);
$html = file_get_html('http://myshop.com/small_houses.html');
$html .= file_get_html('http://myshop.com/medium_houses.html');
$html .= file_get_html('http://myshop.com/large_houses.html');
//Define my variable for later
$product['image'] = '';
$product['title'] = '';
$product['description'] = '';
foreach($html->find('img') as $src){
if (strpos($src->src,"http://myshop.com") === false) {
$src->src = "http://myshop.com/$src->src";
}
$product['image'] = $src->src;
}
foreach($html->find('p[class*=imAlign_left]') as $description){
$product['description'] = $description->innertext;
}
foreach($html->find('span[class*=fc3]') as $title){
$product['title'] = $title->innertext;
}
echo $product['img'];
echo $product['description'];
echo $product['title'];
?>
I put echoes on the end for the sake of testing... but I'm not getting anything... Any pointers would be a great HELP!
Thanks
Charles
file_get_html() returns an HTML DOM object, and you cannot concatenate objects. Although the objects have __toString methods, when they are concatenated the result is more than likely corrupt in some way. Try the following:
<?php
include_once('simple_html_dom.php');
// create doctype
$dom = new DOMDocument("1.0");
// display document in browser as plain text
// for readability purposes
//header("Content-Type: text/plain");
// create root element
$xmlProducts = $dom->createElement("products");
$dom->appendChild($xmlProducts);
$pages = array(
'http://myshop.com/small_houses.html',
'http://myshop.com/medium_houses.html',
'http://myshop.com/large_houses.html'
);
foreach($pages as $page)
{
$product = array();
$source = file_get_html($page);
foreach($source->find('img') as $src)
{
// Prefix relative image URLs with the shop domain
if (strpos($src->src,"http://myshop.com") === false)
{
$src->src = "http://myshop.com/$src->src";
}
$product['image'] = $src->src;
}
foreach($source->find('p[class*=imAlign_left]') as $description)
{
$product['description'] = $description->innertext;
}
foreach($source->find('span[class*=fc3]') as $title)
{
$product['title'] = $title->innertext;
}
//debug perposes!
echo "Current Page: " . $page . "\n";
print_r($product);
echo "\n\n\n"; //Clear seperator
}
?>
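If you also want to fill the products XML document that $xmlProducts was created for, each product could be appended inside the page loop and printed at the end. A rough sketch, assuming the $product keys used above:
// Inside the foreach ($pages as $page) loop, after $product has been filled:
$node = $dom->createElement('product');
foreach (array('title', 'image', 'description') as $field) {
    $child = $dom->createElement($field);
    $child->appendChild($dom->createTextNode(isset($product[$field]) ? $product[$field] : ''));
    $node->appendChild($child);
}
$xmlProducts->appendChild($node);
// After the loop:
echo $dom->saveXML();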