Finding URL from <a> using PHP [duplicate] - php

This question already has answers here:
How do you parse and process HTML/XML in PHP?
(31 answers)
Closed 12 days ago.
I need to find every occurrence of a URL in the "href" attribute of an HTML tag using PHP.
As a result, I want to get an array of every URL. I tried the code below, but it only finds the leading "href=" part. I know that my code is very basic, but I don't know how to improve or change it to make it work. Thanks for any help.
<?php
$array = [];
$string = file_get_contents("file.html");
$begin = 0;
$end = 0;
do {
$begin = strpos($string, "<a href=\"", $end + 1);
$end = strpos($string, "\"", $begin + 6);
$array[] = substr($string, ($begin + 6), ($end - $begin - 6));
} while ($begin !== false && $end !== false);

Use DOMDocument for that, not Regex!
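$html = file_get_contents('file.html');
$dom = new DOMDocument();
// Real-world HTML is rarely valid; collect parser warnings instead of printing them.
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($dom);
// Select only anchors that actually carry an href attribute.
$tags = $xpath->query('//a[@href]');
$links = [];
foreach ($tags as $tag) {
    $links[] = $tag->getAttribute('href');
}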
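For completeness, here is a rough sketch of why the string-offset loop in the question only returns "f=": the marker `<a href="` is 9 characters long, not 6, and the loop stores a result even after strpos has returned false. The function name find_hrefs_naive and the sample markup are made up for illustration. It only matches anchors written exactly as `<a href="..."`, so a parser is still the safer choice.

```php
<?php
// Hypothetical helper: collects href values by scanning for the literal '<a href="'.
function find_hrefs_naive(string $html): array
{
    $marker = '<a href="';
    $skip = strlen($marker); // 9 characters, not 6
    $urls = [];
    $offset = 0;
    // Test for false before using the position, so no bogus entry is stored.
    while (($begin = strpos($html, $marker, $offset)) !== false) {
        $start = $begin + $skip;
        $end = strpos($html, '"', $start);
        if ($end === false) {
            break; // unterminated attribute: stop scanning
        }
        $urls[] = substr($html, $start, $end - $start);
        $offset = $end + 1;
    }
    return $urls;
}

// Made-up sample markup.
$sample = '<p><a href="https://example.com/a">A</a> and <a href="/b">B</a></p>';
print_r(find_hrefs_naive($sample));
```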

Related

Does PHP optimize multiple strlen calls on for loop?

I am trying to write a PHP function that replaces every occurrence of text between two delimiter strings with another given string.
I am almost sure I have accomplished what I was looking for, but while programming I ended up with two versions of it. I would like you to explain which one is better.
Version 1:
$html = "{tag param1|param2}<adfasdfsdf>adsfasdf<adsfasdfsd>{tag param1} {tag param2} sdfsdfadsfasdf";
$needle = array('{tag ', '}');
$placeholder = 'TEST';
$lengths = array(strlen($needle[0]), strlen($needle[1]), strlen($placeholder));
$offset = 0;
while(($startpos = strpos($html, $needle[0], $offset)) !== false){
$endpos = strpos($html, $needle[1], $startpos + $lengths[0]);
if($endpos === false) break;
$html = substr_replace($html, $placeholder, $startpos, $endpos - $startpos + 1);
$offset = $startpos + $lengths[2];
}
echo $html;
Version 2:
$html = "{tag param1|param2}<adfasdfsdf>adsfasdf<adsfasdfsd>{tag param1} {tag param2} sdfsdfadsfasdf";
$needle = array('{tag ', '}');
$placeholder = 'TEST';
$offset = 0;
while(($startpos = strpos($html, $needle[0], $offset)) !== false){
$endpos = strpos($html, $needle[1], $startpos + strlen($needle[0]));
if($endpos === false) break;
$html = substr_replace($html, $placeholder, $startpos, $endpos - $startpos + 1);
$offset = $startpos + strlen($placeholder);
}
echo $html;
This code searches for all {tag ...........} occurrences and replaces them with TEST.
I know at this point this might be micro-optimization, but I would like to learn.
Any errors you see or suggestions are welcome.
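A hedged note on the two versions: as far as I know PHP does not hoist a function call out of a loop for you, but strlen() is cheap because a PHP string stores its own length, so the difference between the versions should be very small. The sketch below wraps the logic in a made-up function, replace_between, that computes the lengths once. It also uses the real length of the closing delimiter, where both versions hard-code `+ 1` (Version 1 computes strlen($needle[1]) but never uses it).

```php
<?php
// Hypothetical helper combining both versions: lengths are computed once, up front.
function replace_between(string $html, string $open, string $close, string $placeholder): string
{
    $openLen = strlen($open);
    $closeLen = strlen($close);
    $placeholderLen = strlen($placeholder);
    $offset = 0;
    while (($startpos = strpos($html, $open, $offset)) !== false) {
        $endpos = strpos($html, $close, $startpos + $openLen);
        if ($endpos === false) {
            break; // opening delimiter without a closing one
        }
        // Use the real length of the closing delimiter instead of a hard-coded 1.
        $html = substr_replace($html, $placeholder, $startpos, $endpos - $startpos + $closeLen);
        // Resume right after the inserted placeholder.
        $offset = $startpos + $placeholderLen;
    }
    return $html;
}

$html = "{tag param1|param2}<adfasdfsdf>adsfasdf<adsfasdfsd>{tag param1} {tag param2} sdfsdfadsfasdf";
$replaced = replace_between($html, '{tag ', '}', 'TEST');
echo $replaced, "\n";
```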

PHP - Create Arrays from Strings in another Array

I've spent a long time on this problem, and I cannot for the life of me figure it out. Any help will be much appreciated!
I have an array of strings that I am extracting information from. For example, the month of an event is hidden within the URL (the string). I want to cycle through all the URLs, pull out the month and other data, and put each piece of data into its own array.
Here's the deal, I can do it twice successfully, but the third time I try to create an array, it breaks down.
This code works:
$months = array();
$times = array();
foreach ($ticketsLinks as $ticketsLink){
//CREATE VENUE ARRAY
// Delete all the way up to "-tickets-"
$findMeA = '-tickets-';
$posA = strpos($ticketsLink, $findMeA);
$posA = $posA + 9;
$venue = substr($ticketsLink, $posA);
// Find the first number in the string - delete everything after that.
$lengthA = strlen($venue);
$parts = str_split($venue);
$first_num = -1;
$num_loc = 0;
foreach ($parts AS $a_char) {
if (is_numeric($a_char)) {
$first_num = $num_loc;
break;
}
$num_loc++;
}
$posB = -$lengthA + $num_loc - 1;
$venue = substr($venue, 0, $posB);
// Replace dashes with spaces.
$venue = str_replace("-"," ",$venue);
//Add value to venue's array
$venues[] = $venue;
// CREATE TIME ARRAY
$pos = strrpos($ticketsLink, '-');
$pos = strlen($ticketsLink) - $pos - 1;
$time = substr($ticketsLink, -$pos);
$pos = strpos($time, '/');
$time = substr($time, 0, $pos);
$times[] = $time;
}
This code does not:
$months = array();
$times = array();
$years = array();
foreach ($ticketsLinks as $ticketsLink){
//CREATE VENUE ARRAY
// Delete all the way up to "-tickets-"
$findMeA = '-tickets-';
$posA = strpos($ticketsLink, $findMeA);
$posA = $posA + 9;
$venue = substr($ticketsLink, $posA);
// Find the first number in the string - delete everything after that.
$lengthA = strlen($venue);
$parts = str_split($venue);
$first_num = -1;
$num_loc = 0;
foreach ($parts AS $a_char) {
if (is_numeric($a_char)) {
$first_num = $num_loc;
break;
}
$num_loc++;
}
$posB = -$lengthA + $num_loc - 1;
$venue = substr($venue, 0, $posB);
// Replace dashes with spaces.
$venue = str_replace("-"," ",$venue);
//Add value to venue's array
$venues[] = $venue;
// CREATE TIME ARRAY
$pos = strrpos($ticketsLink, '-');
$pos = strlen($ticketsLink) - $pos - 1;
$time = substr($ticketsLink, -$pos);
$pos = strpos($time, '/');
$time = substr($time, 0, $pos);
$times[] = $time;
// CREATE YEAR ARRAY
$pos = strrpos($ticketsLink, '-');
$pos = strlen($ticketsLink) - $pos - 1;
$year = substr($ticketsLink, -$pos);
$pos = strpos($year, '/');
$year = substr($year, 0, $pos);
$years[] = $year;
}
For the purposes of this example, I kept the code to get the year string and the time string exactly the same to show that that wasn't the problem. I've gone through the above code and tried to debug it - the only thing that's making it not work is when I push the year variable to the years array.
--- UPDATED TO INCLUDE NEW INFORMATION -----
Unfortunately, breaking apart the URL to get the requisite information is the best way to do this - the URLs are coming from a CSV feed.
So, after the first foreach, just as a test, I run a foreach loop over the $years array:
foreach($years as $year){
echo $year;
}
Eventually, the $years array will be passed to a MySQL database, but for now, I just want to make sure I'm processing the URL correctly. The output of the loop should look something like this:
2014
2015
2014
2016
Instead, I get nothing, and all the code after the first foreach (where I'm breaking down the URL) doesn't run. I have an echo at the bottom that prints "This code works!", and it doesn't appear on the screen when I try to push values to the $years array.
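A possible way to narrow this down, offered as a sketch rather than a diagnosis: a script that stops silently often points to an error that is not being displayed, so turning on error reporting is a reasonable first step. The sketch also moves the "text after the last dash, up to the next slash" logic into one small function. The function name last_dash_segment and the URLs are invented here, since the real feed format is not shown in the question.

```php
<?php
// Show errors while debugging, so a fatal error is not silently swallowed.
error_reporting(E_ALL);
ini_set('display_errors', '1');

// Hypothetical helper: text after the last '-' up to the next '/' (or the end).
function last_dash_segment(string $url): ?string
{
    $dash = strrpos($url, '-');
    if ($dash === false) {
        return null; // no dash at all
    }
    $tail = substr($url, $dash + 1);
    $slash = strpos($tail, '/');
    return $slash === false ? $tail : substr($tail, 0, $slash);
}

// Made-up URLs, only for illustration.
$ticketsLinks = [
    'http://example.com/band-tickets-some-venue-05-12-2014/',
    'http://example.com/other-tickets-big-hall-01-03-2015/',
];
$years = [];
foreach ($ticketsLinks as $ticketsLink) {
    $years[] = last_dash_segment($ticketsLink);
}
print_r($years);
```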

Extract data between two strings using php

I want to extract all the results from top to bottom. I have this script:
function extract_unit($string, $start, $end)
{
$pos = stripos($string, $start);
$str = substr($string, $pos);
$str_two = substr($str, strlen($start));
$second_pos = stripos($str_two, $end);
$str_three = substr($str_two, 0, $second_pos);
$unit = trim($str_three); // remove whitespaces
return $unit;
}
$result = file_get_contents('html.txt');
echo extract_unit($result,'<dd class="m_operation_In">','</dd>');
echo extract_unit($result,'<span class="Unread">','</span>');
This code works perfectly, but it only gives me the first match. I want all the matches. There are at least 7-8 results that need to be fetched in a single pass. I'm not sure what to do now. Any kind of help will be appreciated. Thanks.
stripos has a third parameter which you can use to fetch later occurrences of your search string. You can loop through all the matches like so:
function extract_unit($string, $start, $end)
{
    $offset = 0;
    $units = array();
    // Pass $offset to stripos so each search resumes after the previous match.
    while (($pos = stripos($string, $start, $offset)) !== false) {
        // The wanted text begins right after the start marker.
        $from = $pos + strlen($start);
        $second_pos = stripos($string, $end, $from);
        if ($second_pos === false) {
            break; // start marker without an end marker
        }
        $units[] = trim(substr($string, $from, $second_pos - $from)); // remove whitespaces
        $offset = $second_pos + strlen($end);
    }
    return $units;
}
Note that this approach is extremely brittle: if the HTML changes even slightly, this code will break. You'd be better off using an HTML parser to pull out the contents of these elements. The DOMDocument class (http://www.php.net/manual/en/class.domdocument.php) would be a good place to start.
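As a rough sketch of the DOMDocument route, assuming the markup looks like the two snippets in the question: the helper name texts_by_class and the sample HTML below are made up for illustration, and the XPath expression is the usual idiom for matching one class name inside a class attribute.

```php
<?php
// Hypothetical helper: trimmed text of every <$tag> element that has class $class.
function texts_by_class(string $html, string $tag, string $class): array
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true); // tolerate sloppy markup
    $dom->loadHTML($html);
    libxml_clear_errors();
    $xpath = new DOMXPath($dom);
    // Match the class as a whole word within the class attribute.
    $query = sprintf(
        '//%s[contains(concat(" ", normalize-space(@class), " "), " %s ")]',
        $tag,
        $class
    );
    $texts = [];
    foreach ($xpath->query($query) as $node) {
        $texts[] = trim($node->textContent);
    }
    return $texts;
}

// Made-up sample markup.
$sample = '<dl><dd class="m_operation_In"> 10 </dd><dd class="m_operation_In">20</dd></dl>'
    . '<span class="Unread">3</span>';
print_r(texts_by_class($sample, 'dd', 'm_operation_In'));
print_r(texts_by_class($sample, 'span', 'Unread'));
```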
edit
Using simple_html_dom, you'll just use a css-style selector to get elements you're looking for:
// Create DOM from URL or file
$html = file_get_html('https://www.pourtoi.com.au/link.txt');
// Find all m_operation_In
foreach($html->find('dd.m_operation_In') as $element) {
    // plaintext is the element's text content; src only exists on tags such as img
    echo $element->plaintext . '<br>';
}
foreach($html->find('span.Unread') as $span) {
    echo $span->plaintext . '<br/>';
}

Is it possible to query the first 5 images with DOMDocument?

Is it possible to query the first 5 images with DOMDocument?
$dom = new DOMDocument;
$list = $dom->query('img');
With XPath you can fetch all images like this:
$xpath = new DOMXPath($dom);
$list = $xpath->query('//img');
Then you limit the results by only iterating over the first five.
for ($i = 0, $n = min(5, $list->length); $i < $n; ++$i) {
    $node = $list->item($i); // use the loop index, not 0, to visit each image
}
XPath is very versatile thanks to its expression language. However, in this particular case, you may not need all that power and a simple $list = $dom->getElementsByTagName('img') would yield the same result set.
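If you would rather let XPath do the limiting, a positional predicate can cap the result set. This is a small sketch with made-up sample markup, not something taken from the question:

```php
<?php
// Made-up sample: seven images.
$html = str_repeat('<img src="x.png">', 7);
$dom = new DOMDocument();
libxml_use_internal_errors(true); // tolerate sloppy markup
$dom->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($dom);
// The parentheses matter: position() then counts across the whole document
// rather than per parent element.
$firstFive = $xpath->query('(//img)[position() <= 5]');
echo $firstFive->length, "\n";
```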
You can use getElementsByTagName to build an array of images:
$dom = new DOMDocument();
$dom->loadHTML($string);
$images = $dom->getElementsByTagName('img');
$result = array();
for ($i=0; $i<5; $i++){
$node = $images->item($i);
if (is_object( $node)){
$result[] = $node->ownerDocument->saveXML($node);
}
}

Removing everything from string outside specified tags (PHP)

Question has been updated to exclude regex as a possible solution.
I'm trying to build a PHP function that will allow me to strip everything outside of specified tags while preserving the specified tags and their content, and I am not sure how to do this.
For example:
$string = "lorem ipsum <div><p>Some video content</p><object></object></div><p>dolor sit</p> amet <img>";
some_function($string, "<div><img>");
returns: "<div><p>Some video content</p><object></object></div><img>"
Thanks for any help!
Ok, so I think I figured out a way to do this based on a modified version of the explode_tags function I posted a link to above:
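function explode_tags($chr, $str) {
    $res = array();
    $len = strlen($str);
    for ($i = 0, $j = 0; $i < $len; $i++) {
        // Separator character: collapse runs of it and start a new chunk.
        // ($str[$i] replaces the $str{$i} syntax, which was removed in PHP 8.)
        if ($str[$i] == $chr) {
            while ($i + 1 < $len && $str[$i+1] == $chr) $i++;
            $j++;
            continue;
        }
        if ($str[$i] == "<") {
            if (isset($res[$j]) && strlen($res[$j]) > 0) $j++;
            // The tag name ends at the first space or ">", whichever comes first.
            $s = strpos($str, " ", $i);
            $b = strpos($str, ">", $i);
            if ($s !== false && $s < $b) $end = $s;
            else $end = $b;
            $t = substr($str, $i+1, $end-$i-1);
            $tend = strpos($str, ">", $i);
            // Copy up to the matching closing tag if there is one.
            $tclose = strpos($str, "</".$t, $tend);
            if ($tclose !== false) $pos = strpos($str, ">", $tclose);
            else $pos = strpos($str, ">", $i);
            $res[$j] = ($res[$j] ?? '') . substr($str, $i, $pos - $i+1);
            $i += ($pos - $i);
            $j++;
            continue;
        }
        if ((($str[$i] == "\n") || ($str[$i] == "\r")) && (strlen($res[$j] ?? '') == 0)) continue;
        $res[$j] = ($res[$j] ?? '') . $str[$i];
    }
    return $res;
}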
function filter_tags($content, $tags) {
$content = strip_tags($content, $tags);
$tags = substr($tags, 1, -1);
$d = strpos($tags, "><");
if($d===false) $tags = array($tags);
else $tags = explode("><", $tags);
$content = explode_tags("", $content);
$result="";
foreach($content as $c) {
$s = strpos($c, " ");
$b = strpos($c, ">");
if($s !== false && $s<$b) $end = $s; // strpos returns false when the chunk has no space
else $end = $b;
$tag = substr($c, 1, $end-1);
if(in_array($tag, $tags)) $result.=$c;
}
return $result;
}
filter_tags($content, "<img><div><object><embed><iframe><param><script>");
This seems to work perfectly so far, although I have only tried it on a few different pieces of content. I'm not great at this, so if anybody has suggestions please share freely...
Thanks for all of your answers!
Jeff Atwood has a really great blog post arguing against using regex for parsing HTML. http://www.codinghorror.com/blog/2008/06/regular-expressions-now-you-have-two-problems.html
However, in this situation, it might not be a bad idea to use regex to first remove the extraneous ends and then use a DOM parser to pick out the structures you want from the inside.
update based on the comment
You could use css selectors to grab the divs you are looking for, then crawl up the tree to get the outermost element of your selection.
See the zend.dom.query framework.
http://framework.zend.com/manual/en/zend.dom.query.html
Basically, query for "div img" to get the img tags nested inside div tags.
Then crawl up the tree until you reach your target position, and extract and save that node's outerHTML.
This would work in JavaScript, but I don't know about PHP.
The caveats here are that you lose the specificity of your example above. ie: a div containing four images would have matches for all child images... You'd have to do some extra processing to ensure you're really doing what you think you are doing. However, it's a bit safer than blind string replacement.
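For what it's worth, something similar seems doable in PHP with DOMDocument and XPath. The sketch below is only one interpretation of the requirement: it keeps the outermost elements whose tag name is in the allowed list (together with everything inside them) and drops the rest. The function name keep_outermost_tags is made up, and the tag names are assumed to be plain element names.

```php
<?php
// Hypothetical helper: keeps the outermost elements whose tag name is allowed.
function keep_outermost_tags(string $html, array $allowed): string
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true); // tolerate sloppy markup
    $dom->loadHTML($html);
    libxml_clear_errors();
    $xpath = new DOMXPath($dom);
    // Build e.g. //*[self::div or self::img][not(ancestor::div or ancestor::img)]
    $self = implode(' or ', array_map(fn($t) => "self::$t", $allowed));
    $ancestors = implode(' or ', array_map(fn($t) => "ancestor::$t", $allowed));
    $result = '';
    foreach ($xpath->query("//*[$self][not($ancestors)]") as $node) {
        // saveHTML($node) serializes the element together with its children.
        $result .= $dom->saveHTML($node);
    }
    return $result;
}

$string = 'lorem ipsum <div><p>Some video content</p><object></object></div><p>dolor sit</p> amet <img>';
$filtered = keep_outermost_tags($string, ['div', 'img']);
echo $filtered, "\n";
```

The exact serialization depends on the parser, so compare the output against your own content before relying on it.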
