PHP Parser - Find String in HTML - php

I want to find a string on another website. I have been looking at parsers and I do not know the best way to do it. I looked at an HTML DOM parser but I need just a simple one line output. I just want to get the link "url: 'http://s2.example.com/streams/i23374.mp4?k=12f34588cf171f3bbf3d35da4db43b06'" to a variable.
<script>
flowplayer("player", "http://www.example.com/flowplayer-3.2.16.swf", {
canvas: {
backgroundGradient: "none",
backgroundColor: "#000000"
},
clip: {
provider: 'lighttpd',
url: 'http://s1.example.com/streams/i23374.mp4?k=12f34588cf171f3bbf3d35da4db43b06',
scaling: 'fit'
},
plugins: {
lighttpd: {
url: 'http://www.example.com/flowplayer.pseudostreaming-3.2.12.swf'
}
}
});
</script>

Here's a handy function for grabbing the text between two delimiters:
<?php
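// Returns the trimmed text between $start and $end (case-insensitive),
// or an empty string if either delimiter is missing.
function extract_unit($string, $start, $end)
{
    $pos = stripos($string, $start);
    if ($pos === false) {
        return '';
    }
    $rest = substr($string, $pos + strlen($start)); // text after the start delimiter
    $endPos = stripos($rest, $end);
    if ($endPos === false) {
        return '';
    }
    return trim(substr($rest, 0, $endPos)); // remove surrounding whitespace
}
// $webpageSource holds the page HTML, e.g. from file_get_contents().
// The first "url: '" on the page belongs to the clip block, so this prints the stream URL.
echo extract_unit($webpageSource, "url: '", "'");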
?>

I would use DOMDocument.
To get the link from every anchor:
$dd = new DOMDocument;
// $pageUrl is the address of the page that embeds the player; @ silences libxml warnings on sloppy HTML
@$dd->loadHTMLFile($pageUrl);
$links = array();
foreach ($dd->getElementsByTagName('a') as $t) {
    $links[] = $t->getAttribute('href');
}
Now $links is an array of every href, and it is empty if the page has no anchors.
To get the player settings out of a script tag:
$dd = new DOMDocument;
// $pageUrl is the address of the page that embeds the player
@$dd->loadHTMLFile($pageUrl);
$s = $dd->getElementsByTagName('script');
if ($s->length > 0) {
    $c = $dd->saveHTML($s->item(0));
}
Change item(0) to the index of the script tag on their page. Now $c is a string, so:
preg_match_all("/url: '([^']+)'/", $c, $results);
$results[0] holds each full match (url: 'whatever') and $results[1] holds only the captured URLs.
So:
$a = $results[1];
$a is an array of the URLs.
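The two steps above (pull the script text out with DOMDocument, then match the URL with a regular expression) can be combined into one short sketch. The inline $html string stands in for the fetched page, and the helper name findClipUrl is made up for this example. It assumes the wanted url: entry sits inside the clip: { ... } block and that the block contains no nested braces.

```php
<?php
// Hypothetical helper: returns the url found in the clip block of the first
// matching <script>, or null when no script contains one.
function findClipUrl(string $html): ?string
{
    $dom = new DOMDocument;
    // Suppress the warnings libxml emits for real-world, non-valid HTML.
    @$dom->loadHTML($html);

    foreach ($dom->getElementsByTagName('script') as $script) {
        // [^}]*? keeps the match inside the clip: { ... } block.
        if (preg_match('/clip:\s*\{[^}]*?url:\s*\'([^\']+)\'/', $script->textContent, $m)) {
            return $m[1];
        }
    }
    return null;
}

// Stand-in for the downloaded page (shortened copy of the markup in the question).
$html = <<<'HTML'
<html><body><script>
flowplayer("player", "http://www.example.com/flowplayer-3.2.16.swf", {
clip: {
provider: 'lighttpd',
url: 'http://s1.example.com/streams/i23374.mp4?k=12f34588cf171f3bbf3d35da4db43b06',
scaling: 'fit'
},
plugins: { lighttpd: { url: 'http://www.example.com/flowplayer.pseudostreaming-3.2.12.swf' } }
});
</script></body></html>
HTML;

$clipUrl = findClipUrl($html);
echo $clipUrl, "\n";
```

Because the pattern is anchored on clip:, the plugin's url: entry further down is not picked up by mistake.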

A regular expression is usually the simplest way to pull a value out of a string, although it is not recommended for handling JSON.
Here's an example (I base64-encoded the string; it decodes to the same raw HTML as yours):
<?php
$data = base64_decode("PHNjcmlwdD4KICAgICAgICAgICAgICAgIGZsb3dwbGF5ZXIoInBsYXllciIsICJodHRwOi8vd3d3LmV4YW1wbGUuY29tL2Zsb3dwbGF5ZXItMy4yLjE2LnN3ZiIsICB7CiAgICAgICAgICAgICAgICAgICAgY2FudmFzOiB7CiAgICAgICAgICAgICAgICAgICAgICAgIGJhY2tncm91bmRHcmFkaWVudDogIm5vbmUiLAogICAgICAgICAgICAgICAgICAgICAgICBiYWNrZ3JvdW5kQ29sb3I6ICIjMDAwMDAwIgogICAgICAgICAgICAgICAgICAgIH0sCiAgICAgICAgICAgICAgICAgICAgY2xpcDogewogICAgICAgICAgICAgICAgICAgICAgICBwcm92aWRlcjogJ2xpZ2h0dHBkJywKICAgICAgICAgICAgICAgICAgICAgICAgdXJsOiAnaHR0cDovL3MxLmV4YW1wbGUuY29tL3N0cmVhbXMvaTIzMzc0Lm1wND9rPTEyZjM0NTg4Y2YxNzFmM2JiZjNkMzVkYTRkYjQzYjA2JywKICAgICAgICAgICAgICAgICAgICAgICAgc2NhbGluZzogJ2ZpdCcKICAgICAgICAgICAgICAgICAgICB9LAogICAgICAgICAgICAgICAgICAgIHBsdWdpbnM6IHsKICAgICAgICAgICAgICAgICAgICAgICAgbGlnaHR0cGQ6IHsKICAgICAgICAgICAgICAgICAgICAgICAgICAgIHVybDogJ2h0dHA6Ly93d3cuZXhhbXBsZS5jb20vZmxvd3BsYXllci5wc2V1ZG9zdHJlYW1pbmctMy4yLjEyLnN3ZicKICAgICAgICAgICAgICAgICAgICAgICAgfQogICAgICAgICAgICAgICAgICAgIH0KICAgICAgICAgICAgICAgIH0pOwogICAgICAgICAgICA8L3NjcmlwdD4=");
if(preg_match('/clip:\s*\{[\s\S]+url:\s*\'(\S+)\',\s*scaling/', $data, $match) === 1)
echo $match[1];
?>
Although the settings look like JSON, they are a JavaScript object literal, so PHP's json_decode cannot parse them: JSON requires keys and strings to be wrapped in double quotes.

Related

Conditional search and replace using PHP and regex

I need to hide all "p" tags in a HTML file that have an inline style with a "left" offset of 400 or more.
I'm hoping some clever regex will replace "left:XXX" with "display:none" should "xxx" be 400 or more.
For example, this:
<p style="position:absolute;top:98px;left:472px;white-space:nowrap">
...would need to be replaced with this:
<p style="position:absolute;top:98px;display:none;white-space:nowrap">
It seems simple enough logic, but the regex and PHP is mind boggling for me.
Here is what I've been trying to do, but I can only get it to work line-by-line:
$width = preg_match("left:(.*?)px",$contents);
if ($width >399)
{
$contents = preg_replace('/left:(.*?)px/', "display:none", $contents);
}
Any suggestions greatly appreciated! :)
Wonko
Don't believe that regex will solve all the problems in the world:
Use DOMDocument to extract the p tags that have a style attribute, extract the "left" value from the style attribute with a regex pattern, and then do the replacement when the "left" value is greater than or equal to 400 (test this with a simple comparison).
$dom = new DOMDocument;
$dom->loadHTML($html);
$pTags = $dom->getElementsByTagName('p');
foreach($pTags as $pTag) {
if ($pTag->hasAttribute('style')) {
$style = $pTag->getAttribute('style');
$style = preg_replace_callback(
'~(?<=[\s;]|^)left\s*:\s*(\d+)\s*px\s*(?:;|$)~i',
function ($m) {
return ($m[1] > 399) ? 'display:none;' : $m[0];
},
$style
);
$pTag->setAttribute('style', $style);
}
}
$result = $dom->saveHTML();
EDIT: in the worst scenario, the style attribute may contain display:block; or display with a value other than none after the left value. To avoid any problem, it is better to put display:none at the end.
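$style = preg_replace_callback(
    '~(?<=[\s;]|^)left\s*:\s*(\d+)\s*px\s*(?:;(.*)|$)~is',
    function ($m) {
        if ($m[1] < 400) {
            return $m[0];
        }
        // Keep whatever followed the left declaration, then append display:none last.
        $rest = isset($m[2]) ? rtrim(trim($m[2]), ';') : '';
        return ($rest === '' ? '' : $rest . ';') . 'display:none;';
    },
    $style
);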
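Here is a small runnable sketch of the same idea on a two-paragraph sample, in case it helps to see the whole round trip. The helper name hideFarLeft and the sample markup are made up for illustration; the 400px limit comes from the question. It removes the left declaration and appends display:none at the end of the style.

```php
<?php
// Hypothetical helper: rewrites one style string, hiding the element
// when its left offset is $limit px or more.
function hideFarLeft(string $style, int $limit = 400): string
{
    // Only a standalone "left" property counts, not padding-left or margin-left.
    if (!preg_match('~(?:^|[\s;])left\s*:\s*(\d+)\s*px~i', $style, $m) || (int) $m[1] < $limit) {
        return $style; // no left offset, or offset below the limit
    }
    // Drop the left declaration, then append display:none so it comes last.
    $style = preg_replace('~(^|[\s;])left\s*:\s*\d+\s*px\s*;?~i', '$1', $style);
    $style = trim($style, "; \t");
    return $style === '' ? 'display:none' : $style . ';display:none';
}

// Sample markup (an assumption; the real file will have more content).
$html = '<p style="position:absolute;top:98px;left:472px;white-space:nowrap">far</p>'
      . '<p style="position:absolute;top:98px;left:120px">near</p>';

$dom = new DOMDocument;
$dom->loadHTML($html);

$styles = [];
foreach ($dom->getElementsByTagName('p') as $p) {
    if ($p->hasAttribute('style')) {
        $p->setAttribute('style', hideFarLeft($p->getAttribute('style')));
        $styles[] = $p->getAttribute('style');
    }
}
print_r($styles);
```

Only the first paragraph is changed; the second keeps its left:120px because it is below the limit.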
I've tested it and it works correctly:
$string = '<p style="position:absolute;top:98px;left:472px;white-space:nowrap">';
$test = str_replace('left:', 'display:none;[', $string );
$test = str_replace('white-space', ']white-space', $test );
$out = delete_all_between('[', ']', $test);
print($out); // output
function delete_all_between($beginning, $end, $string) {
$beginningPos = strpos($string, $beginning);
$endPos = strpos($string, $end);
if ($beginningPos === false || $endPos === false) {
return $string;
}
$textToDelete = substr($string, $beginningPos, ($endPos + strlen($end)) - $beginningPos);
return str_replace($textToDelete, '', $string);
}
output:
<p style="position:absolute;top:98px;display:none;white-space:nowrap">
enjoy it ... !

Php variable into a XML request string

I have the code below, which extracts the artist name from an XML file using the artist code as a reference.
<?php
$dom = new DOMDocument();
$dom->load('http://www.bookingassist.ro/test.xml');
$xpath = new DOMXPath($dom);
echo $xpath->evaluate('string(//Artist[ArtistCode = "COD Artist"] /ArtistName)');
?>
The code that is pulling the artistcode based on a search
<?php echo $Artist->artistCode ?>
My question:
Can I insert the variable generated by the PHP code into the XML request string?
If so, could you please advise where I should start reading?
Thanks
You mean the XPath expression. Yes you can - it is "just a string".
$expression = 'string(//Artist[ArtistCode = "'.$Artist->artistCode.'"]/ArtistName)';
echo $xpath->evaluate($expression);
But you have to make sure that the result is valid XPath and that your value does not break the string literal. I wrote a function for a library some time ago that prepares a string this way.
The problem in XPath 1.0 is that there is no way to escape a special character. If your string contains the quote character you are using in the XPath expression, it breaks the expression. The function uses whichever quote type does not occur in the string or, if both occur, splits the string and puts the parts into a concat() call.
function quoteXPathLiteral($string) {
$string = str_replace("\x00", '', $string);
$hasSingleQuote = FALSE !== strpos($string, "'");
if ($hasSingleQuote) {
$hasDoubleQuote = FALSE !== strpos($string, '"');
if ($hasDoubleQuote) {
$result = '';
preg_match_all('("[^\']*|[^"]+)', $string, $matches);
foreach ($matches[0] as $part) {
$quoteChar = (substr($part, 0, 1) == '"') ? "'" : '"';
$result .= ", ".$quoteChar.$part.$quoteChar;
}
return 'concat('.substr($result, 2).')';
} else {
return '"'.$string.'"';
}
} else {
return "'".$string."'";
}
}
The function generates the needed XPath.
$expression = 'string(//Artist[ArtistCode = '.quoteXPathLiteral($Artist->artistCode).']/ArtistName)';
echo $xpath->evaluate($expression);
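To see why the quoting matters, here is a self-contained sketch that looks up one plain code and one code containing both quote types. The helper xpathLiteral is a compact stand-in written for this example (not the library function above), and the inline XML is made-up data that only borrows the element names from the question.

```php
<?php
// Hypothetical helper: quote $value as an XPath 1.0 string literal.
function xpathLiteral(string $value): string
{
    if (strpos($value, "'") === false) {
        return "'" . $value . "'";
    }
    if (strpos($value, '"') === false) {
        return '"' . $value . '"';
    }
    // Both quote types occur: split on the single quote and rejoin the
    // pieces inside concat(), inserting a double-quoted "'" between them.
    $parts = array_map(function ($p) {
        return "'" . $p . "'";
    }, explode("'", $value));
    return 'concat(' . implode(', "\'", ', $parts) . ')';
}

// Made-up sample data using the element names from the question.
$xml = <<<'XML'
<Artists>
  <Artist><ArtistCode>A1</ArtistCode><ArtistName>First Artist</ArtistName></Artist>
  <Artist><ArtistCode>O'Neil "7"</ArtistCode><ArtistName>Second Artist</ArtistName></Artist>
</Artists>
XML;

$dom = new DOMDocument();
$dom->loadXML($xml);
$xpath = new DOMXPath($dom);

$names = [];
foreach (['A1', 'O\'Neil "7"'] as $code) {
    $expression = 'string(//Artist[ArtistCode = ' . xpathLiteral($code) . ']/ArtistName)';
    $names[$code] = $xpath->evaluate($expression);
}
print_r($names);
```

With plain string concatenation the second code would produce an invalid expression; with the literal helper both lookups return the matching ArtistName.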

How to get value inside <a tag using preg match all?

I have HTML content and need to extract values from inside hyperlink tags using preg_match_all. I tried the following but I don't get any data. I have included sample input data. Could you help me fix this code so that it prints every value that follows play.asp?ID= (for example, I want to get 12345 from play.asp?ID=12345)?
sample input html data:
<span id="Img_1"></span></TD>
and the code
$regexp = "<A\s[^>]*HREF=\"play.asp(\"??)([^\" >]*?)\\1[^>]*>(.*)<\/A>";
if(preg_match_all("/$regexp/siU", $input, $matches))
{
$url=str_replace('?ID=', '', $matches[2]);
$url2=str_replace('&Selected_ID=&PhaseID=123', '', $url);
print_r($url2);
}
$str = '<span id="Img_1"></span>';
preg_match_all( '/<\s*A[^>]HREF="(.*?)"\s?(.*?)>/i', $str, $match);
print_r( $match );
Try out this.
Don't! Regular expressions are a (bad) way of text processing. This is not text, but HTML sourcecode. The tools to cope with it are called HTML parsers. Although PHP's DOMDocument is also able to loadHTML, it may glitch on some rare cases. A poorly built regexp (and you are wrong to think there's any other) will glitch on almost any changes in the page.
Isn't this enough?
/<a href="(.*?)?"/i
EDIT:
This seems to work:
'/<a href="(.*?)\?/i'
This should achieve the desired result. It combines an HTML parser with a content-extraction function:
function extractContents($string, $start, $end)
{
$pos = stripos($string, $start);
$str = substr($string, $pos);
$str_two = substr($str, strlen($start));
$second_pos = stripos($str_two, $end);
$str_three = substr($str_two, 0, $second_pos);
$extractedContents = trim($str_three);
return $extractedContents;
}
include('simple_html_dom.php');
$html = file_get_html('http://siteyouwantlinksfrom.com');
$links = $html->find('a');
foreach($links as $link)
{
$playIDs[] = extractContents($link->href, 'play.asp?ID=', '&');
}
print_r($playIDs);
you can download simple_html_dom.php from here
You shouldn't use Regular Expression to parse HTML.
This is a solution with DOMDocument :
<?php
$input = '<span id="Img_1"></span>';
// Escape bare "&" characters in the href so loadHTML() does not warn about invalid entities
$cleanInput = str_replace('&','&amp;',$input);
// Load HTML
$domDocument = new DOMDocument();
$domDocument->loadHTML($cleanInput);
// Retrieve <a /> tags
$aTags = $domDocument->getElementsByTagName('a');
foreach($aTags as $aTag)
{
$href = $aTag->getAttribute('href');
$url = parse_url($href);
$vars = array();
parse_str($url['query'], $vars);
var_dump($vars);
}
?>
Output :
array (size=3)
'ID' => string '12345' (length=5)
'Selected_ID' => string '' (length=0)
'PhaseID' => string '123' (length=3)
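The sample input above lost its anchor tag somewhere along the way, so the sketch below uses an assumed link of the form play.asp?ID=12345&Selected_ID=&PhaseID=123, which matches the strings in the question's own code. The helper name extractPlayIds is made up. It follows the same DOMDocument, parse_url and parse_str route and returns only the ID values.

```php
<?php
// Hypothetical helper: collect the ID query parameter of every play.asp link.
function extractPlayIds(string $html): array
{
    $dom = new DOMDocument();
    // @ hides libxml warnings about unescaped "&" in href values.
    @$dom->loadHTML($html);

    $ids = [];
    foreach ($dom->getElementsByTagName('a') as $a) {
        $href = $a->getAttribute('href');
        if (stripos($href, 'play.asp') === false) {
            continue; // not a player link
        }
        $query = parse_url($href, PHP_URL_QUERY);
        if (!is_string($query)) {
            continue; // link has no query string
        }
        parse_str($query, $vars);
        if (isset($vars['ID'])) {
            $ids[] = $vars['ID'];
        }
    }
    return $ids;
}

// Assumed sample markup; the real page may differ.
$input = '<p><a href="play.asp?ID=12345&Selected_ID=&PhaseID=123"><span id="Img_1"></span></a></p>'
       . '<p><a href="other.asp?ID=9">skip</a></p>';

$playIds = extractPlayIds($input);
print_r($playIds);
```

Links that do not point at play.asp are skipped, so unrelated ID parameters elsewhere on the page do not end up in the result.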

get attribute values with php dom

I am trying to get some attribute values, but without success. Below you can see my code and its output. How can I get the duration, file, etc. values?
$url="http://www.some-url.ltd";
$dom = new DOMDocument;
@$dom->loadHTMLFile($url);
$xpath = new DOMXPath($dom);
$the_div = $xpath->query('//div[@id="the_id"]');
foreach ($the_div as $rval) {
$the_value = trim($rval->getAttribute('title'));
echo $the_value;
}
The output below:
{title:'title',
description:'description',
scale:'fit',keywords:'',
file:'http://xxx.ccc.net/ht/2012/05/10/419EE45F98CD63F88F52CE6260B9E85E_c.mp4',
type:'flv',
duration:'24',
screenshot:'http://xxx.ccc.net/video/2012/05/10/419EE45F98CD63F88F52CE6260B9E85E.jpg?v=1336662169',
suggestion_path:'/videoxml/player_xml/61319',
showSuggestions:true,
autoStart:true,
width:412,
height:340,
autoscreenshot:true,
showEmbedCode:true,
category: 1,
showLogo:true
}
How to get duration, file etc.. values?
What about
$parsed = json_decode($the_value, true);
$duration = $parsed['duration'];
EDIT:
Since json_decode() requires proper JSON formatting (key names and values must be enclosed in double quotes), we should fix original formatting into the correct one. So here is the code:
function my_json_decode($s, $associative = false) {
$s = str_replace(array('"', "'", 'http://'), array('\"', '"', 'http//'), $s);
$s = preg_replace('/(\w+):/i', '"\1":', $s);
$s = str_replace('http//', 'http://', $s);
return json_decode($s, $associative);
}
$parsed = my_json_decode($the_value, true);
$duration = $parsed['duration'];
Function my_json_decode is taken from this answer, slightly modified.
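A quick way to check the conversion is to run it on a shortened copy of the title attribute from the question. The function below is a renamed copy of the one above (relaxedJsonDecode is a made-up name), and it rests on the same assumptions: values contain no apostrophes, and the only colon inside a value is the one in http://.

```php
<?php
// Hypothetical variant of the function above: quote the keys, swap the
// quote style, then hand the result to json_decode().
function relaxedJsonDecode(string $s, bool $associative = false)
{
    // Escape existing double quotes, turn single quotes into double quotes,
    // and hide "http://" so its colon is not mistaken for a key separator.
    $s = str_replace(['"', "'", 'http://'], ['\"', '"', 'http//'], $s);
    // Wrap every bare key (word characters followed by a colon) in double quotes.
    $s = preg_replace('/(\w+):/', '"$1":', $s);
    // Restore the URLs.
    $s = str_replace('http//', 'http://', $s);
    return json_decode($s, $associative);
}

// Shortened copy of the attribute value shown in the question.
$the_value = "{title:'title',\n"
    . "file:'http://xxx.ccc.net/ht/2012/05/10/419EE45F98CD63F88F52CE6260B9E85E_c.mp4',\n"
    . "duration:'24',\n"
    . "autoStart:true,\n"
    . "width:412,\n"
    . "category: 1\n"
    . "}";

$parsed = relaxedJsonDecode($the_value, true);
echo $parsed['duration'], "\n";
echo $parsed['file'], "\n";
```

Quoted values such as duration come back as strings, while unquoted ones such as width and autoStart come back as an integer and a boolean.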

I want to modify the withdrawal of an array of strings where the start and end are found

I have code that extracts an array of strings found between a start and an end delimiter, and I want to modify it.
<?php
$file = ('http://gdata.youtube.com/feeds/base/users/BBCArabicNews/uploads?alt=rss&v=2&orderby=published&client=ytapi-youtube-profile');
$string=file_get_contents($file);
function findinside($start, $end, $string) {
preg_match_all('/' . preg_quote($start,'/') . '(.+?)'. preg_quote($end, '/').'/si', $string, $m);
return $m[1];
}
$start = ':video:';
$end = '</guid>';
$out = findinside($start, $end, $string);
foreach($out as $string){
echo $string;
echo "<p></td>\n";
}
?>
Results
Q80QSzgPDD8
ozei4GysBN8
ak3bbs_UxP0
rUs-r3ilTG4
p4BO6FI5sPY
j5lclrPzeVU
dK5VWTYsJaM
mERug-d536k
h0zqd3bC0-E
ije5kuSfLKY
H9XXMPvEpHM
EK5UoQqYl4U
This correctly extracts the array of strings. I also want to add a second pair of delimiters:
$start = '</pubDate><atom:updated>';
$end = '</atom:updated>';
I want to show both arrays of strings together.
Example
xSD0XJLkLQid
2011-11-08T17:36:14.000Z
bFU066NwVnD
2011-12-08T17:36:14.000Z
Can I do this with this code?
Greetings
You can use PHP's DOMDocument parser like this:
$objDOM = new DOMDocument();
$objDOM->load($file); // the long one from youtube
$dates = $objDOM->getElementsByTagName("pubDate");
foreach ($dates as $node)
{
echo $node->nodeValue;
}
Use a DOM parser first, then apply a regex to individual elements of the DOM (located with methods such as getElementById()). This is more robust than running a regex over the whole document.
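As a sketch of that advice applied to the question, the code below pairs each video id with its atom:updated value. The inline RSS is a made-up sample modeled on the element names in the question (guid, pubDate, atom:updated); the exact guid format and the real feed structure are assumptions and may differ.

```php
<?php
// Made-up sample feed; only the element names come from the question.
$rss = <<<'XML'
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <item>
      <guid isPermaLink="false">tag:youtube.com,2008:video:Q80QSzgPDD8</guid>
      <pubDate>Tue, 08 Nov 2011 17:36:14 +0000</pubDate>
      <atom:updated>2011-11-08T17:36:14.000Z</atom:updated>
    </item>
    <item>
      <guid isPermaLink="false">tag:youtube.com,2008:video:ozei4GysBN8</guid>
      <pubDate>Thu, 08 Dec 2011 17:36:14 +0000</pubDate>
      <atom:updated>2011-12-08T17:36:14.000Z</atom:updated>
    </item>
  </channel>
</rss>
XML;

$dom = new DOMDocument();
$dom->loadXML($rss);
$xpath = new DOMXPath($dom);
// The atom: prefix must be registered before it can be used in a query.
$xpath->registerNamespace('atom', 'http://www.w3.org/2005/Atom');

$rows = [];
foreach ($xpath->query('//item') as $item) {
    $guid = $xpath->evaluate('string(guid)', $item);
    // Keep only the part after ":video:", as the original $start delimiter did.
    $pos = strrpos($guid, ':video:');
    $id = ($pos === false) ? $guid : substr($guid, $pos + strlen(':video:'));
    $rows[] = [$id, $xpath->evaluate('string(atom:updated)', $item)];
}

foreach ($rows as [$id, $updated]) {
    echo $id, "\n", $updated, "\n";
}
```

Reading both values from the same item node keeps each id next to its own date, which two separate regex passes over the whole feed cannot guarantee.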
