Adjust PHP code Retrieve the DOM from a given URL


Hello,
I'm using the following code to retrieve the DOM from a URL, find all "A" tags and print their HREFs.
Right now my output contains "A" links I don't want; my output is here:
http://trend.remal.com/parsing.php
I need to clean up the output so that it contains only the name after http://twitter.com/namehere,
so that the output is a list of "namehere" values.
include('simple_html_dom.php');
// Retrieve the DOM from a given URL
$html = file_get_html('http://tweepar.com/sa/1/');
$urls = array();
foreach ( $html->find('a') as $e )
{
// If it's a twitter link
if ( strpos($e->href, '://twitter.com/') !== false )
{
// and we don't have it in the array yet
// in_array() takes the needle first, then the haystack
if ( ! in_array($e->href, $urls) )
{
// add it to our array
$urls[] = $e->href;
}
}
}
echo implode('<br>', $urls);
echo $e->href . '<br>';

Instead of simply using $urls[] = $e->href, use a regex to match the username:
preg_match('~twitter.com/(.+)~', $e->href, $matches);
$urls[] = $matches[1];
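A slightly stricter variant of the same idea, as a rough sketch: it anchors the pattern to the host, stops the capture at the first character that is not a letter, digit or underscore, and drops duplicates. The helper name extractTwitterNames and the sample $hrefs array are made up for illustration; in the real script the values would come from $e->href.

```php
<?php
// Hypothetical helper: returns the unique Twitter usernames found in a list of hrefs.
function extractTwitterNames(array $hrefs)
{
    $names = array();
    foreach ($hrefs as $href) {
        // Capture only word characters directly after the host, so query
        // strings and trailing path segments are left out.
        if (preg_match('~^https?://(?:www\.)?twitter\.com/(\w+)~i', $href, $m)) {
            $names[] = $m[1];
        }
    }
    // Remove duplicates and reindex the array from 0.
    return array_values(array_unique($names));
}

// Sample data (assumed); real values would come from $e->href.
$hrefs = array(
    'http://twitter.com/alice',
    'http://tweepar.com/sa/1/',
    'https://twitter.com/bob?lang=en',
    'http://twitter.com/alice',
);

echo implode('<br>', extractTwitterNames($hrefs));
```

With the sample data this should print alice and bob, each once.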

Related

PHP Regex or DOMDocument for Matching & Removing URLs?

I'm trying to extract links from html page using DOM:
$html = file_get_contents('links.html');
$DOM = new DOMDocument();
$DOM->loadHTML($html);
$a = $DOM->getElementsByTagName('a');
foreach($a as $link){
//echo out the href attribute of the <A> tag.
echo $link->getAttribute('href').'<br/>';
}
Output:
http://dontwantthisdomain.com/dont-want-this-domain-name/
http://dontwantthisdomain2.com/also-dont-want-any-pages-from-this-domain/
http://dontwantthisdomain3.com/dont-want-any-pages-from-this-domain/
http://domain1.com/page-X-on-domain-com.html
http://dontwantthisdomain.com/dont-want-link-from-this-domain-name.html
http://dontwantthisdomain2.com/dont-want-any-pages-from-this-domain/
http://domain.com/page-XZ-on-domain-com.html
http://dontwantthisdomain.com/another-page-from-same-domain-that-i-dont-want-to-be-included/
http://dontwantthisdomain2.com/same-as-above/
http://domain3.com/page-XYZ-on-domain3-com.html
I would like to remove all results matching dontwantthisdomain.com, dontwantthisdomain2.com and dontwantthisdomain3.com so that the output looks like this:
http://domain1.com/page-X-on-domain-com.html
http://domain.com/page-XZ-on-domain-com.html
http://domain3.com/page-XYZ-on-domain3-com.html
Some people say I should not use regex for HTML, and others say it's OK. Could somebody point out the best way to remove unwanted URLs from my HTML file? :)

Maybe something like this:
function extract_domains($buffer, $whitelist) {
preg_match_all("#<a\s+.*?href=\"(.+?)\".*?>(.+?)</a>#i", $buffer, $matches);
$result = array();
foreach($matches[1] as $url) {
$url = urldecode($url);
// @ suppresses warnings from malformed URLs
$parts = @parse_url((string) $url);
if (is_array($parts) && isset($parts['host']) && in_array($parts['host'], $whitelist)) {
$result[] = $parts['host'];
}
}
return $result;
}
$domains = extract_domains(file_get_contents("/path/to/html.htm"), array('stackoverflow.com', 'google.com', 'sub.example.com'));
It does a rough match on all the <a> tags with href=, grabs what's between the quotes, then filters it based on your whitelist of domains. Note that it returns the matching host names, not the full URLs.
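For comparison, here is a sketch that stays with DOMDocument, as in the question, and filters on the host reported by parse_url() instead of matching the markup with a regex. The function name filterLinks and the shortened sample markup are assumptions for illustration.

```php
<?php
// Hypothetical helper: returns the hrefs whose host is not in $blocked.
function filterLinks($html, array $blocked)
{
    // Real-world HTML is rarely valid; collect parser warnings instead of printing them.
    libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();

    $kept = array();
    foreach ($dom->getElementsByTagName('a') as $link) {
        $href = $link->getAttribute('href');
        // parse_url() returns null when there is no host and false on a malformed URL.
        $host = parse_url($href, PHP_URL_HOST);
        if (is_string($host) && !in_array(strtolower($host), $blocked, true)) {
            $kept[] = $href;
        }
    }
    return $kept;
}

// Shortened sample markup (assumed).
$html = '<p><a href="http://dontwantthisdomain.com/a/">x</a>'
      . '<a href="http://domain1.com/page-X-on-domain-com.html">y</a>'
      . '<a href="http://dontwantthisdomain2.com/b/">z</a></p>';

$blocked = array('dontwantthisdomain.com', 'dontwantthisdomain2.com', 'dontwantthisdomain3.com');
print_r(filterLinks($html, $blocked));
```

The host comparison is exact, so a subdomain such as www.dontwantthisdomain.com would need its own entry in the list.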

Non-regex solution (without potential errors :-)):
$html='
http://dontwantthisdomain.com/dont-want-this-domain-name/
http://dontwantthisdomain2.com/also-dont-want-any-pages-from-this-domain/
http://dontwantthisdomain3.com/dont-want-any-pages-from-this-domain/
http://domain1.com/page-X-on-domain-com.html
http://dontwantthisdomain.com/dont-want-link-from-this-domain-name.html
http://dontwantthisdomain2.com/dont-want-any-pages-from-this-domain/
http://domain.com/page-XZ-on-domain-com.html
http://dontwantthisdomain.com/another-page-from-same-domain-that-i-dont-want-to-be-included/
http://dontwantthisdomain2.com/same-as-above/
http://domain3.com/page-XYZ-on-domain3-com.html
';
$html=explode("\n", $html);
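$dontWant=array('dontwantthisdomain.com','dontwantthisdomain2.com','dontwantthisdomain3.com');
// initialise the result so print_r() has an array even when nothing passes the filter
$final_result=array();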
foreach ($html as $link) {
$ok=true;
foreach($dontWant as $notWanted) {
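// strpos() returns false when there is no match, and 0 is a valid position
if (strpos($link, $notWanted) !== false) {
$ok=false;
}
// skip empty lines
if (trim($link)=='') $ok=false;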
}
if ($ok) $final_result[]=$link;
}
echo '<pre>';
print_r($final_result);
echo '</pre>';
outputs
Array
(
[0] => http://domain1.com/page-X-on-domain-com.html
[1] => http://domain.com/page-XZ-on-domain-com.html
[2] => http://domain3.com/page-XYZ-on-domain3-com.html
)

Adjust code Retrieve the DOM from a given URL

Hello,
I'm using the following code to retrieve the DOM from a URL, find all "A" tags and print their HREFs.
Right now my output contains "A" links I don't want; my output is here:
http://trend.remal.com/parsing.php
Some elements are duplicated.
I need to clean up the output so that it contains only the "A" tags that link to https://twitter.com/$namehere.
As you can see, I have two kinds of URLs; I need only the Twitter URLs, without duplicates.
Any tips on adjusting the code?
<?php
include('simple_html_dom.php');
$html = file_get_html('http://tweepar.com/sa/1/');
foreach($html->find('a') as $e)
echo $e->href . '<br>';
?>

$urls = array();
foreach ( $html->find('a') as $e )
{
// If it's a twitter link
if ( strpos($e->href, '://twitter.com/') !== false )
{
// and we don't have it in the array yet
if ( ! in_array($e->href, $urls) )
{
// add it to our array
$urls[] = $e->href;
}
}
}
echo implode('<br>', $urls);
Here are some references from the PHP docs:
strpos
in_array
implode

(PHP) Regex for finding specific href tag

I have an HTML document with n "a href" tags with different target URLs and different text between the tags.
For example:
<span ....>lorem ipsum</span>
<span ....>example</span>
example3
<img ...>test</img>
without a d as target url
As you can see, the target URLs switch between "d?, d., d/d?, d/d." and between the "a" tags there can be any kind of HTML that is allowed by the W3C.
I need a regex which gives me all links which have one of these combinations in the target URL:
"d?, d., d/d?, d/d." and have "lorem" or "test" between the "a" tags in any position, including nested HTML tags.
My Regex so far:
href=[\"\']([^>]*?/[d]+[.|\?][^"]*?[\"\'][^>]*[/]?>.*?</a>)
I tried to include the lorem / test as followed:
href=[\"\']([^>]*?/[d]+[.|\?][^"]*?[\"\'][^>]*[/]?>(lorem|test)+</a>)
but this only works if I put a ".*?" before and after the (lorem|test), and that would be too greedy.
If there is an easier way with SimpleXML or any other DOM parser, please let me know. Otherwise I would appreciate any help with the regex.
Thanks!

Here you go:
$html = array
(
'<span ....>lorem ipsum</span>',
'<span ....>example</span>',
'example3',
'<img ...>test</img>',
'without a d as target url',
);
$html = implode("\n", $html);
$result = array();
$anchors = phXML($html, '//a[contains(., "lorem") or contains(., "test")]');
foreach ($anchors as $anchor)
{
if (preg_match('~d[.?]~', strval($anchor['href'])) > 0)
{
$result[] = strval($anchor['href']);
}
}
echo '<pre>';
print_r($result);
echo '</pre>';
Output:
Array
(
[0] => http://www.example.com/d?12345abc
[1] => http://www.example.com/d/d.1234
)
The phXML() function is based on my DOMDocument / SimpleXML wrapper, and goes as follows:
function phXML($xml, $xpath = null)
{
if (extension_loaded('libxml') === true)
{
libxml_use_internal_errors(true);
if ((extension_loaded('dom') === true) && (extension_loaded('SimpleXML') === true))
{
if (is_string($xml) === true)
{
$dom = new DOMDocument();
if (@$dom->loadHTML($xml) === true)
{
return phXML(@simplexml_import_dom($dom), $xpath);
}
}
else if ((is_object($xml) === true) && (strcmp('SimpleXMLElement', get_class($xml)) === 0))
{
if (isset($xpath) === true)
{
$xml = $xml->xpath($xpath);
}
return $xml;
}
}
}
return false;
}
I'm too lazy not to use this function right now, but I'm sure you can get rid of it if you need to.
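For readers who would rather not carry the wrapper around, the same query can be sketched with DOMDocument and DOMXPath directly. The sample markup below, including its hrefs and link texts, is assumed for illustration and is not the original poster's document.

```php
<?php
// Sample markup with assumed hrefs and link texts, for illustration only.
$html = '<a href="http://www.example.com/d?12345abc"><span>lorem ipsum</span></a>'
      . '<a href="http://www.example.com/d.9876"><span>example</span></a>'
      . '<a href="http://www.example.com/d/d.1234"><img src="x.png" alt="">test</a>'
      . '<a href="http://www.example.com/page">lorem, but without a d as target url</a>';

// Collect parser warnings instead of printing them.
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$result = array();

// contains(., "...") looks at the anchor's text including nested elements
// and is case-sensitive.
foreach ($xpath->query('//a[contains(., "lorem") or contains(., "test")]') as $anchor) {
    $href = $anchor->getAttribute('href');
    // Keep only targets that contain "/d" immediately followed by "." or "?".
    if (preg_match('~/d[.?]~', $href)) {
        $result[] = $href;
    }
}

print_r($result);
```

Anchoring the href check on the slash keeps a host such as example.com from matching "d." by accident.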

Here is a Regular Expression which works:
$search = '/<a\s[^>]*href=["\'](?:http:\/\/)?(?:[a-z0-9-]+(?:\.[a-z0-9-]+)*)\/(?:d\/)?d[?.].*?>.*?(?:lorem|test)+.*?<\/a>/i';
$matches = array();
preg_match_all($search, $html, $matches);
The only thing is it relies on there being a new-line character between each <a> tag. Otherwise it will match something like:
example3<img ...>test</img>

Use an HTML parser. There are lots of reasons that Regex is absolutely not the solution for parsing HTML.
There's a good list of them here:
Robust and Mature HTML Parser for PHP

Will print only first and fourth link because two conditions are met.
preg_match_all('#href="(.*?)"(.*?)>(.*?)</a>#is', $string, $matches);
$count = count($matches[0]);
unset($matches[0], $matches[2]);
for($i = 0; $i < $count; $i++){
if(
strpos($matches[1][$i], '/d') !== false
&&
preg_match('#(lorem|test)#is', $matches[3][$i]) == true
)
{
echo $matches[1][$i];
}
}

why do i get query twice

I'm trying to basically extract the ?v= (the query part) of the youtube.com url... it's to automatically embed the video when someone types in a youtube.com URI (i.e. someone will type in http://www.youtube.com/?v=xyz, this program should embed it into the page automatically).
Anyway when I run the following code, I get two QUERY(ies) for the first URI:
<?php
//REGEX CONTROLLER:
//embedding youtube:
function youtubeEmbedd($text)
{
//scan text and find:
// http://www.youtube.com/
// www.youtube.com/
$youtube_pattern = "(http\:\/\/www\.youtube\.com\/(watch)??\?v\=[a-zA-Z0-9]+(\&[a-z]\=[a-zA-Z0-9])*?)"; // the pattern
#"http://www.youtube.com/?v="
echo "<hr/>";
$links = preg_match_all($youtube_pattern, $text, $out, PREG_SET_ORDER); // use preg_replace here
if ($links)
{
for ($i = 0; $i != count($out); $i++)
{
echo "<b><u> URL </b><br/></u> ";
foreach ($out[$i] as $url)
{
// split url[QUERY] here and replaces it with embed code:
$youtube = parse_url($url);
echo "QUERY: " . $youtube["query"] . "<br/>";
#$pos = strpos($url, "?v=");
}
}
}
else
{
echo "no match";
}
}
youtubeEmbedd("tthe quick gorw fox http://www.youtube.com/watch?v=5qm8PH4xAss&x=4dD&k=58J8 and http://www.youtube.com/?v=Dd3df4e ");
?>
Output is:
URL
QUERY: v=5qm8PH4xAss
QUERY: << WHY DOES THIS APPEAR????????????
URL
QUERY: v=Dd3df4e
I would be grateful for any help.

Your regular expression stores a result in $out like this:
(
[0] => Array
(
[0] => http://www.youtube.com/watch?v=5qm8PH4xAss
[1] => watch
)
[1] => Array
(
[0] => http://www.youtube.com/?v=Dd3df4e
)
)
Your regular expression has a subgroup for matching the text watch, and so this ends up as a result in the array.
Since you iterate through all results $out[$i] you're trying to run parse_url on the second result of the first match; this leads to an empty output.
To fix your issue, simply change your iteration to something like:
if($links){
foreach($out as $result){
$youtube = parse_url($result[0]);
echo "<b><u> URL </b><br/></u> QUERY: " . $youtube["query"] . "<br/>";
}
}
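Another way to avoid the stray entry is to make the optional group non-capturing, so each match holds only the full URL. The sketch below also pulls the video ID out of the query string with parse_str(); the function name findYoutubeIds is hypothetical, and the character class for the ID is an assumption.

```php
<?php
// Hypothetical helper: returns the "v" parameter of every YouTube URL found in $text.
function findYoutubeIds($text)
{
    // (?:watch)? is non-capturing, so each match contains only element 0, the full URL.
    $pattern = '~http://www\.youtube\.com/(?:watch)?\?v=[A-Za-z0-9_-]+~';
    $ids = array();
    if (preg_match_all($pattern, $text, $out, PREG_SET_ORDER)) {
        foreach ($out as $match) {
            // Extract the query part, e.g. "v=5qm8PH4xAss".
            $query = parse_url($match[0], PHP_URL_QUERY);
            // Turn the query string into an associative array.
            parse_str((string) $query, $params);
            if (isset($params['v'])) {
                $ids[] = $params['v'];
            }
        }
    }
    return $ids;
}

print_r(findYoutubeIds('tthe quick gorw fox http://www.youtube.com/watch?v=5qm8PH4xAss&x=4dD&k=58J8 and http://www.youtube.com/?v=Dd3df4e '));
```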

Need help with preg_replace interpreting {variables} with parameters

I want to replace
{youtube}Video_ID_Here{/youtube}
with the embed code for a youtube video.
So far I have
preg_replace('/{youtube}(.*){\/youtube}/iU',...)
and it works just fine.
But now I'd like to be able to interpret parameters like height, width, etc. Could I have one regex for this, whether or not the tag has parameters? It should be able to interpret all of the following:
{youtube height="200px" width="150px" color1="#eee" color2="rgba(0,0,0,0.5)"}Video_ID_Here{/youtube}
{youtube height="200px"}Video_ID_Here{/youtube}
{youtube}Video_ID_Here{/youtube}
{youtube width="150px" showborder="1"}Video_ID_Here{/youtube}

Try this:
function createEmbed($videoID, $params)
{
// $videoID contains the videoID between {youtube}...{/youtube}
// $params is an array of key value pairs such as height => 200px
return 'HTML...'; // embed code
}
if (preg_match_all('/\{youtube(.*?)\}(.+?)\{\/youtube\}/', $string, $matches)) {
foreach ($matches[0] as $index => $youtubeTag) {
$params = array();
// break out the attributes
if (preg_match_all('/\s([a-z0-9]+)="([^\s]+?)"/', $matches[1][$index], $rawParams)) {
for ($x = 0; $x < count($rawParams[0]); $x++) {
$params[$rawParams[1][$x]] = $rawParams[2][$x];
}
}
// replace {youtube}...{/youtube} with embed code
$string = str_replace($youtubeTag, createEmbed($matches[2][$index], $params), $string);
}
}
This code matches the {youtube}...{/youtube} tags first and then splits the attributes out into an array, passing both the attributes (as key/value pairs) and the video ID to a function. Just fill in the function definition so that it validates the params you want to support and builds the appropriate HTML code.

You probably want to use preg_replace_callback, as the replacing can get quite convoluted otherwise.
preg_replace_callback('/{youtube(.*)}(.*){\/youtube}/iU',...)
And in your callback, check $match[1] for something like the /(width|showborder|height|color1)="([^"]+)"/i pattern. A simple preg_match_all inside a preg_replace_callback keeps all portions nice & tidy and above all legible.
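A minimal sketch of that callback approach follows. The function names, the list of allowed parameters and the bracketed placeholder output are all assumptions; the real embed markup is left out on purpose.

```php
<?php
// Hypothetical renderer; the returned string is a placeholder, not real embed code.
function renderYoutubeTag(array $match)
{
    // Only these parameter names are accepted (assumed list).
    $allowed = array('height', 'width', 'showborder', 'color1', 'color2');
    $params = array();

    // $match[1] holds everything between "{youtube" and "}".
    if (preg_match_all('/(\w+)="([^"]*)"/', $match[1], $pairs, PREG_SET_ORDER)) {
        foreach ($pairs as $pair) {
            $name = strtolower($pair[1]);
            if (in_array($name, $allowed, true)) {
                $params[$name] = $pair[2];
            }
        }
    }

    // Escape everything that originated from user input.
    $out = '[video id="' . htmlspecialchars($match[2], ENT_QUOTES) . '"';
    foreach ($params as $name => $value) {
        $out .= ' ' . $name . '="' . htmlspecialchars($value, ENT_QUOTES) . '"';
    }
    return $out . ']';
}

function replaceYoutubeTags($text)
{
    // Group 1: optional parameters, group 2: the video ID.
    return preg_replace_callback('/\{youtube([^}]*)\}(.*?)\{\/youtube\}/is', 'renderYoutubeTag', $text);
}

echo replaceYoutubeTags('{youtube height="200px" onclick="x"}Video_ID_Here{/youtube}');
```

Parameters outside the allowed list, such as the onclick in the sample call, are silently dropped, which keeps the replacement from passing arbitrary attributes through to the page.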

I would do it something like this:
preg_match_all("/{youtube(.*?)}(.*?){\/youtube}/is", $content, $matches);
for($i=0;$i<count($matches[0]);$i++)
{
$params = $matches[1][$i];
$youtubeurl = $matches[2][$i];
$paramsout = array();
if(preg_match("/height\s*=\s*('|\")([0-9]+px)('|\")/i", $params, $match))
{
$paramsout[] = "height=\"{$match[2]}\"";
}
//process others
//setup new code
$tagcode = "<object ..." . implode(" ", $paramsout) ."... >"; //I don't know what the code is to display a youtube video
//replace original tag
$content = str_replace($matches[0][$i], $tagcode, $content);
}
You could just look for params after "{youtube" and before "}", but then you open yourself up to XSS problems. The best way would be to look for a specific set of parameters and verify them. Don't allow things like < and > to be passed inside your tags, as someone could put do_something_nasty(); or something similar in there.

I wouldn't use regex at all, since regexes are notoriously bad at parsing markup.
Since your input format is so close to HTML/XML in the first place, I'd rely on that:
$tests = array(
'{youtube height="200px" width="150px" color1="#eee" color2="rgba(0,0,0,0.5)"}Video_ID_Here{/youtube}'
, '{youtube height="200px"}Video_ID_Here{/youtube}'
, '{youtube}Video_ID_Here{/youtube}'
, '{youtube width="150px" showborder="1"}Video_ID_Here{/youtube}'
, '{YOUTUBE width="150px" showborder="1"}Video_ID_Here{/youtube}' // deliberately invalid
);
echo '<pre>';
foreach ( $tests as $test )
{
try {
$youtube = SimpleXMLYoutubeElement::fromUserInput( $test );
print_r( $youtube );
}
catch ( Exception $e )
{
echo $e->getMessage() . PHP_EOL;
}
}
echo '</pre>';
class SimpleXMLYoutubeElement extends SimpleXMLElement
{
public static function fromUserInput( $code )
{
$xml = @simplexml_load_string(
str_replace( array( '{', '}' ), array( '<', '>' ), strip_tags( $code ) ), __CLASS__
);
if ( !$xml || 'youtube' != $xml->getName() )
{
throw new Exception( 'Invalid youtube element' );
}
return $xml;
}
public function toEmbedCode()
{
// write code to convert this to proper embed code
}
}
