get attribute values with php dom - php

I am trying to get some attribute values, but without success. Below you can see my code and its output. How do I get the duration, file, etc. values?
$url="http://www.some-url.ltd";
$dom = new DOMDocument;
@$dom->loadHTMLFile($url);
$xpath = new DOMXPath($dom);
$the_div = $xpath->query('//div[@id="the_id"]');
foreach ($the_div as $rval) {
$the_value = trim($rval->getAttribute('title'));
echo $the_value;
}
The output below:
{title:'title',
description:'description',
scale:'fit',keywords:'',
file:'http://xxx.ccc.net/ht/2012/05/10/419EE45F98CD63F88F52CE6260B9E85E_c.mp4',
type:'flv',
duration:'24',
screenshot:'http://xxx.ccc.net/video/2012/05/10/419EE45F98CD63F88F52CE6260B9E85E.jpg?v=1336662169',
suggestion_path:'/videoxml/player_xml/61319',
showSuggestions:true,
autoStart:true,
width:412,
height:340,
autoscreenshot:true,
showEmbedCode:true,
category: 1,
showLogo:true
}
How do I get the duration, file, etc. values?

What about
$parsed = json_decode($the_value, true);
$duration = $parsed['duration'];
EDIT:
Since json_decode() requires proper JSON formatting (key names and string values must be enclosed in double quotes), we first have to convert the original formatting into valid JSON. So here is the code:
function my_json_decode($s, $associative = false) {
$s = str_replace(array('"', "'", 'http://'), array('\"', '"', 'http//'), $s);
$s = preg_replace('/(\w+):/i', '"\1":', $s);
$s = str_replace('http//', 'http://', $s);
return json_decode($s, $associative);
}
$parsed = my_json_decode($the_value, true);
Function my_json_decode is taken from this answer, slightly modified.
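The helper above rewrites every word that is followed by a colon, which is why it has to hide http:// first, and it would still trip over https:// or any other colon inside a value. Here is a minimal sketch of a more targeted variant, assuming the attribute always holds a flat object literal with single-quoted strings as in the question. The name js_object_to_array and the sample string are made up for illustration; neither is part of PHP or of the original page.

```php
<?php
// Hypothetical helper: turns a JavaScript-style object literal into a PHP array.
// Assumes single-quoted strings without escaped quotes and no ", word:" inside values.
function js_object_to_array($literal)
{
    // 1. Re-quote single-quoted strings; json_encode() escapes embedded double quotes.
    $json = preg_replace_callback(
        "/'([^']*)'/",
        function ($m) {
            return json_encode($m[1], JSON_UNESCAPED_SLASHES);
        },
        $literal
    );
    // 2. Quote bare keys: only identifiers that directly follow "{" or ",".
    $json = preg_replace('/([{,]\s*)([A-Za-z_]\w*)\s*:/', '$1"$2":', $json);

    $data = json_decode($json, true);
    return is_array($data) ? $data : null; // null when the result is not valid JSON
}

// Made-up sample in the same shape as the title attribute from the question.
$the_value = "{title:'title',\nfile:'https://xxx.ccc.net/ht/clip_c.mp4',\nduration:'24',\nwidth:412,\nshowLogo:true\n}";
$parsed = js_object_to_array($the_value);
echo $parsed['duration'], "\n";
echo $parsed['file'], "\n";
```

Because only identifiers after "{" or "," are treated as keys, the colon in the URL scheme is left alone and no placeholder swap is needed.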

Add space between textContent data scraped from website using PHP DOM

I am trying to add a comma and whitespace to some data I am scraping from a website. The data is scraped successfully, but the items are run together, and the comma and space I am trying to add only get added after the last item. Here is the code I currently have:
$html = curl_exec($ch);
$dom = new DOMDocument();
@$dom->loadHTML($html);
$finder = new DomXPath($dom);
$class_ops = 'ipc-inline-list ';
$class_opp = 'ipc-inline ';
$node = $finder->query("//div[@class='$class_ops']//ul[@class='$class_opp']");
foreach ($node as $index => $t) {
if ($index == 3) {
$la = $t->textContent.", ";
}
}
echo $la;
Current Result
DoyleBrainDavid,
Expected Result
Doyle, Brain, David
I am using this code
$c1 = curl_init('https://stackoverflow.com/');
curl_setopt($c1, CURLOPT_RETURNTRANSFER, true);
$html = curl_exec($c1);
if (curl_error($c1))
die(curl_error($c1));
// Get the status code
$status = curl_getinfo($c1, CURLINFO_HTTP_CODE);
curl_close($c1);
preg_match_all('/<span(.*?)<\/span>/s', $html, $matches1);
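$match1 = array();
foreach($matches1[0] as $k=>$v){
// Fall back to UTF-8 when the encoding cannot be detected
$enc = mb_detect_encoding($v) ?: "UTF-8";
// mb_convert_encoding() takes the target encoding first, then the source encoding
$v = mb_convert_encoding($v, "UTF-8", $enc);
$match1[$k] = strip_tags ($v);
//$match1[$k] = preg_replace('/^[^A-Za-z0-9]+/', '', $match1[$k]);
}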
var_dump($match1);
In your case you can replace the pattern like this:
preg_match_all('/<div class="ipc-inline-list">(.*?)<\/div>/s', $html, $matches1);
This returns an array of matches.
I hope this is helpful to you.
You want each li, not the ul as one block. Try:
$node = $finder->query("//div[@class='$class_ops']//ul[@class='$class_opp']/li");
Demo: https://3v4l.org/Mvfud
If that doesn't work the actual HTML content should be added to the question.
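To get the expected output, collect the text of each li and join the pieces with implode(). Below is a minimal sketch. It assumes markup along the lines implied by the question (the $html string is invented for the demo) and matches the classes by token, so a trailing space in the attribute does not matter.

```php
<?php
// Invented sample markup; the real page may differ.
$html = '<div class="ipc-inline-list "><ul class="ipc-inline ">'
      . '<li>Doyle</li><li>Brain</li><li>David</li></ul></div>';

libxml_use_internal_errors(true); // collect markup warnings instead of printing them
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$finder = new DOMXPath($dom);
// Build an XPath test that matches one class token regardless of extra whitespace.
$hasClass = function ($name) {
    return "contains(concat(' ', normalize-space(@class), ' '), ' $name ')";
};
$query = '//div[' . $hasClass('ipc-inline-list') . ']//ul[' . $hasClass('ipc-inline') . ']/li';

$names = array();
foreach ($finder->query($query) as $li) {
    $names[] = trim($li->textContent); // one entry per list item
}
$la = implode(', ', $names);
echo $la; // Doyle, Brain, David
```

Joining with implode() avoids the trailing ", " that you get when appending the separator inside the loop.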

How do I include this regular expression when creating an object?

After my $plates = explode(';', $plates); I'd like to apply another regular expression when creating each new $dossier (in my foreach):
$minus = preg_replace('~[-]~', '', $license_plates);
How would I do that?
This is my code:
public function addLicensePlates(Request $request)
{
$product_id = $request->input('product_id');
$license_plates = $request->input('license_plates');
$plates = preg_replace('~\s+|[.,:;*/_]~', ';', $license_plates); // \s+|[.,:;*/_]
$plates = explode(';', $plates);
foreach($plates as &$plate) {
$dossier = new Dossier;
$dossier->license_plate = trim($plate);
$dossier->product_id = $product_id;
$dossier->save();
}
}
PS: I don't want to add the - to the $plates expression, but after the explode.
I suggest editing the existing function as follows:
Call trim() on the input string before preg_replace, so that leading or trailing whitespace does not produce empty values later.
Use str_replace to remove the hyphens later, inside the foreach block.
See the PHP demo:
$license_plates = " WORD-HERE 36-LXD-5";
$plates = preg_replace('~\s+|[.,:;*/_]~', ';', trim($license_plates)); // trim the incoming value
$plates = explode(';', $plates);
foreach($plates as &$plate) {
$license_plate = str_replace('-', '', $plate); // Remove hyphens
echo $license_plate. "\n";
}
Output: WORDHERE and 36LXD5.
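The same idea can be folded into one small helper that the foreach in addLicensePlates() could call. This is a minimal sketch, assuming a hypothetical function name normalize_plates. It swaps preg_replace plus explode for preg_split with PREG_SPLIT_NO_EMPTY, so empty entries are dropped without needing trim().

```php
<?php
// Hypothetical helper, not part of the original code.
function normalize_plates($input)
{
    // Split on whitespace and the separator characters; skip empty pieces.
    $plates = preg_split('~\s+|[.,:;*/_]~', $input, -1, PREG_SPLIT_NO_EMPTY);
    // Remove hyphens from each plate after the split, as requested.
    return array_map(function ($plate) {
        return str_replace('-', '', $plate);
    }, $plates);
}

print_r(normalize_plates(" WORD-HERE 36-LXD-5"));
```

Inside the loop you would then assign each returned value to $dossier->license_plate.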

PHP Parser - Find String in HTML

I want to find a string on another website. I have been looking at parsers and I do not know the best way to do it. I looked at an HTML DOM parser but I need just a simple one line output. I just want to get the link "url: 'http://s2.example.com/streams/i23374.mp4?k=12f34588cf171f3bbf3d35da4db43b06'" to a variable.
<script>
flowplayer("player", "http://www.example.com/flowplayer-3.2.16.swf", {
canvas: {
backgroundGradient: "none",
backgroundColor: "#000000"
},
clip: {
provider: 'lighttpd',
url: 'http://s1.example.com/streams/i23374.mp4?k=12f34588cf171f3bbf3d35da4db43b06',
scaling: 'fit'
},
plugins: {
lighttpd: {
url: 'http://www.example.com/flowplayer.pseudostreaming-3.2.12.swf'
}
}
});
</script>
Here's a handy function for grabbing the text from between two delimiters;
<?php
function extract_unit($string, $start, $end)
{
$pos = stripos($string, $start);
$str = substr($string, $pos);
$str_two = substr($str, strlen($start));
$second_pos = stripos($str_two, $end);
$str_three = substr($str_two, 0, $second_pos);
$unit = trim($str_three); // remove whitespaces
return $unit;
}
echo extract_unit($webpageSource, 'flowplayer("player", "', '", {');
?>
I would use DOMDocument:
For getting a link off of an anchor, it's:
$dd = new DOMDocument;
@$dd->loadHTMLFile('http://s2.example.com/streams/i23374.mp4?k=12f34588cf171f3bbf3d35da4db43b06');
if($a = $dd->getElementsByTagName('a')){
foreach($a as $t){
$links[] = $t->getAttribute('href');
}
}
Now $links is an Array with each href, or if(!isset($links)) there are no results.
To get JSON from a script tag:
$dd = new DOMDocument;
@$dd->loadHTMLFile('http://s2.example.com/streams/i23374.mp4?k=12f34588cf171f3bbf3d35da4db43b06');
if($s = $dd->getElementsByTagName('script')){
$c = $dd->saveHTML($s->item(0));
}
Change item(0) to the level where the script tag is on their page. Now $c is a String. So:
preg_match_all("/url: '.+'/", $c, $results);
$results is an Array; $results[0] should contain each url: 'whatever' match.
So:
foreach($results[0] as $v){
$a[] = preg_replace('/url: /', '', $v);
}
$a is an Array of results.
In most cases a regular expression is the best way to pull a value out of a string like this, although it is not recommended for handling real JSON.
Here's an example (I encoded the string; it's the same as your raw HTML):
<?php
$data = base64_decode("PHNjcmlwdD4KICAgICAgICAgICAgICAgIGZsb3dwbGF5ZXIoInBsYXllciIsICJodHRwOi8vd3d3LmV4YW1wbGUuY29tL2Zsb3dwbGF5ZXItMy4yLjE2LnN3ZiIsICB7CiAgICAgICAgICAgICAgICAgICAgY2FudmFzOiB7CiAgICAgICAgICAgICAgICAgICAgICAgIGJhY2tncm91bmRHcmFkaWVudDogIm5vbmUiLAogICAgICAgICAgICAgICAgICAgICAgICBiYWNrZ3JvdW5kQ29sb3I6ICIjMDAwMDAwIgogICAgICAgICAgICAgICAgICAgIH0sCiAgICAgICAgICAgICAgICAgICAgY2xpcDogewogICAgICAgICAgICAgICAgICAgICAgICBwcm92aWRlcjogJ2xpZ2h0dHBkJywKICAgICAgICAgICAgICAgICAgICAgICAgdXJsOiAnaHR0cDovL3MxLmV4YW1wbGUuY29tL3N0cmVhbXMvaTIzMzc0Lm1wND9rPTEyZjM0NTg4Y2YxNzFmM2JiZjNkMzVkYTRkYjQzYjA2JywKICAgICAgICAgICAgICAgICAgICAgICAgc2NhbGluZzogJ2ZpdCcKICAgICAgICAgICAgICAgICAgICB9LAogICAgICAgICAgICAgICAgICAgIHBsdWdpbnM6IHsKICAgICAgICAgICAgICAgICAgICAgICAgbGlnaHR0cGQ6IHsKICAgICAgICAgICAgICAgICAgICAgICAgICAgIHVybDogJ2h0dHA6Ly93d3cuZXhhbXBsZS5jb20vZmxvd3BsYXllci5wc2V1ZG9zdHJlYW1pbmctMy4yLjEyLnN3ZicKICAgICAgICAgICAgICAgICAgICAgICAgfQogICAgICAgICAgICAgICAgICAgIH0KICAgICAgICAgICAgICAgIH0pOwogICAgICAgICAgICA8L3NjcmlwdD4=");
if(preg_match('/clip:\s*\{[\s\S]+url:\s*\'(\S+)\',\s*scaling/', $data, $match) === 1)
echo $match[1];
?>
Although the player configuration looks like JSON, it is a JavaScript object literal, so PHP's json_decode cannot parse it: JSON is strict and requires keys and strings to be wrapped in double quotes.
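For readers who want to try this without decoding the base64 string, here is a minimal sketch of the same regex idea on a shortened copy of the script from the question. The function name extract_clip_url is hypothetical, and the pattern assumes the clip block contains no nested braces before its url entry.

```php
<?php
// Shortened copy of the script block from the question (canvas section omitted).
$source = <<<'HTML'
<script>
flowplayer("player", "http://www.example.com/flowplayer-3.2.16.swf", {
    clip: {
        provider: 'lighttpd',
        url: 'http://s1.example.com/streams/i23374.mp4?k=12f34588cf171f3bbf3d35da4db43b06',
        scaling: 'fit'
    },
    plugins: {
        lighttpd: {
            url: 'http://www.example.com/flowplayer.pseudostreaming-3.2.12.swf'
        }
    }
});
</script>
HTML;

// Hypothetical helper: returns the clip URL, or null when it is not found.
function extract_clip_url($source)
{
    // [^}]*? keeps the match inside the clip block, so the plugin URL is ignored.
    if (preg_match('/clip:\s*\{[^}]*?url:\s*\'([^\']+)\'/', $source, $m) === 1) {
        return $m[1];
    }
    return null;
}

$url = extract_clip_url($source);
echo $url;
```

Unlike the pattern that anchors on the following scaling entry, this one does not depend on the order of the other keys after url.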

Trying to extract keywords from a website PHP (OOP)

Haha, I still have the keywords problem, but this is the code I am writing.
It is poor code, but it is my own creation:
<?php
$url = 'http://es.wikipedia.org/wiki/Animalia';
Keys($url);
function Keys($url) {
$listanegra = array("a", "ante", "bajo", "con", "contra", "de", "desde", "mediante", "durante", "hasta", "hacia", "para", "por", "que", "qué", "cuán", "cuan", "los", "las", "una", "unos", "unas", "donde", "dónde", "como", "cómo", "cuando", "porque", "por", "para", "según", "sin", "tras", "con", "mas", "más", "pero", "del");
$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTMLFile($url);
$webhtml = $doc->getElementsByTagName('p');
$webhtml = $webhtml ->item(0)->nodeValue;
$webhtml = strip_tags($webhtml);
$webhtml = explode(" ", $webhtml);
foreach($listanegra as $key=> $ln) {
$webhtml = str_replace($ln, " ", $webhtml);
}
$palabras = str_word_count ("$webhtml", 1 );
$frq = array_count_values ($palabras);
$frq = asort($frq);
$ffrq = count($frq);
$i=1;
while ($i < $ffrq) {
print $frqq[$i];
print '<br />';
$i++;
}
}
?>
The code tries to extract keywords from a website. It extracts the first paragraph of a web page and deletes the words listed in the variable "$listanegra". Next, it counts the repeated words and saves all the words in an array. Afterwards I loop over the array, which should show me the words.
The problem is... the code doesn't work =(.
When I run the code, it shows a blank page.
Could you help me finish my code? I was advised to use "tf-idf", but I will use that later.
I do believe this is what you were trying to do:
$url = 'http://es.wikipedia.org/wiki/Animalia';
$words = Keys($url);
/// do your database stuff with $words
function Keys($url)
{
$listanegra = array('a', 'ante', 'bajo', 'con', 'contra', 'de', 'desde', 'mediante', 'durante', 'hasta', 'hacia', 'para', 'por', 'que', 'qué', 'cuán', 'cuan', 'los', 'las', 'una', 'unos', 'unas', 'donde', 'dónde', 'como', 'cómo', 'cuando', 'porque', 'por', 'para', 'según', 'sin', 'tras', 'con', 'mas', 'más', 'pero', 'del');
$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTMLFile($url);
$webhtml = $doc->getElementsByTagName('p');
$webhtml = $webhtml->item(0)->nodeValue;
$webhtml = strip_tags($webhtml);
$webhtml = explode(' ', $webhtml);
$palabras = array();
foreach($webhtml as $word)
{
$word = strtolower(trim($word, ' .,!?()')); // remove trailing special chars and spaces
if (!in_array($word, $listanegra))
{
$palabras[] = $word;
}
}
$frq = array_count_values($palabras);
asort($frq);
return implode(' ', array_keys($frq));
}
Your server should show the errors while you are testing.
Add this at the top of your script:
ini_set('display_errors', 1);
ini_set('log_errors', 1);
ini_set('error_log', dirname(__FILE__) . '/error_log.txt');
error_reporting(E_ALL);
That way you will see the error:
Array to string conversion on line 24 (line 19 if you don't add the 5 new lines)
Here are the errors I found. Four functions are not used as they should be: str_replace, str_word_count, asort and array_count_values.
Using str_replace is a little tricky. Trying to find and remove "a" removes every "a" in the text, even inside "animal" (str_replace("a", "", "animal") gives "niml").
This link should be useful: link
asort() sorts the array in place and returns only true or false, so do not assign its result back to $frq. Just call:
asort($frq);
This sorts the values in ascending order and keeps the keys. $frq holds the result of array_count_values: $frq = array($word1 => $word1_count, ...)
The value here is the number of times the word is used, so when you later have:
print $frq[$i]; // you have print $frqq[$i]; in your code
the result will be empty, since the indexes of this array are the words and the values are the number of times each word appears in the text.
Also, you must be really careful with str_word_count: since you are reading Spanish text, and the text can contain numbers, you should use this:
str_word_count($string,1,'áéíóúüñ1234567890');
The code I would suggest:
<?php
header('Content-Type: text/html; charset=UTF-8');
ini_set('display_errors', 1);
ini_set('log_errors', 1);
ini_set('error_log', dirname(__FILE__) . '/error_log.txt');
error_reporting(E_ALL);
$url = 'http://es.wikipedia.org/wiki/Animalia';
Keys($url);
function Keys($url) {
$listanegra = array("a", "ante", "bajo", "con", "contra", "de", "desde", "mediante", "durante", "hasta", "hacia", "para", "por", "que", "qué", "cuán", "cuan", "los", "las", "una", "unos", "unas", "donde", "dónde", "como", "cómo", "cuando", "porque", "por", "para", "según", "sin", "tras", "con", "mas", "más", "pero", "del");
$html=file_get_contents($url);
$doc = new DOMDocument('1.0', 'UTF-8');
$html = mb_convert_encoding($html, "HTML-ENTITIES", "UTF-8");
libxml_use_internal_errors(true);
$doc->loadHTML($html);
$webhtml = $doc->getElementsByTagName('p');
$webhtml = $webhtml ->item(0)->nodeValue;
$webhtml = strip_tags($webhtml);
print_r ($webhtml);
$webhtml = explode(" ", $webhtml);
// $webhtml = str_replace($listanegra, " ", $webhtml); str_replace() accepts array
foreach($listanegra as $key=> $ln) {
$webhtml = preg_replace('/\b'.$ln.'\b/u', ' ', $webhtml);
}
$palabras = str_word_count(implode(" ",$webhtml), 1, 'áéíóúüñ1234567890');
sort($palabras);
$frq = array_count_values ($palabras);
foreach($frq as $index=>$value) {
print "the word <strong>$index</strong> was used <strong>$value</strong> times";
print '<br />';
}
}
?>
It was really painful trying to figure out the special character issues.
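The counting step can also be tried on its own, without fetching a page. Below is a minimal sketch, assuming the mbstring extension is available and a UTF-8 source file. The function name keyword_counts and the sample sentence are made up, and it swaps str_word_count for preg_split with Unicode character classes, so accented letters and digits need no special character list.

```php
<?php
// Hypothetical helper: counts words that are not in the stop list.
function keyword_counts($text, array $stopWords)
{
    // Lowercase first, then split on anything that is not a letter or a digit.
    $words = preg_split('/[^\p{L}\p{N}]+/u', mb_strtolower($text, 'UTF-8'), -1, PREG_SPLIT_NO_EMPTY);
    // Compare whole words against the stop list, so "a" never touches "animal".
    $kept = array_filter($words, function ($word) use ($stopWords) {
        return !in_array($word, $stopWords, true);
    });
    $counts = array_count_values($kept); // word => number of occurrences
    arsort($counts);                     // most frequent first, keys preserved
    return $counts;
}

// Made-up sample text and a shortened stop list.
$text = 'El reino Animalia: los animales, más de un millón de especies de animales.';
$counts = keyword_counts($text, array('el', 'los', 'más', 'de', 'un'));
foreach ($counts as $word => $count) {
    print "$word: $count\n";
}
```

Iterating with foreach over word => count pairs avoids the numeric-index lookup that printed nothing in the original code.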

extracting anchor values hidden in div tags

From an HTML page I need to extract the value of v from all anchor links. Each anchor link is nested inside some 5 div tags.
<a href="/watch?v=value to be retrived&list=blabla&feature=plpp_play_all">
Each v value has 11 characters. For now I am trying to read it character by character, like this:
<?php
$file=fopen("xx.html","r") or exit("Unable to open file!");
$d='v';
$dd='=';
$vd=array();
while (!feof($file))
{
$f=fgetc($file);
if($f==$d)
{
$ff=fgetc($file);
if ($ff==$dd)
{
$idea='';
for($i=0;$i<=10;$i++)
{
$sData = fgetc($file);
$id=$id.$sData;
}
array_push($vd, $id);
That is, I am reading each character of v into the sData variable and appending it to id, so as to get those 11 characters as a string (id).
The problem is that searching for 'v=' through the entire HTML file and, when it is found, reading the 11 characters and pushing them into the vd array is very slow; it takes a considerable amount of time. Please help me find a more sophisticated way of doing this.
<?php
function substring(&$string,$start,$end)
{
$pos = strpos(">".$string,$start);
if(! $pos) return "";
$pos--;
$string = substr($string,$pos+strlen($start));
$posend = strpos($string,$end);
$toret = substr($string,0,$posend);
$string = substr($string,$posend);
return $toret;
}
$contents = @file_get_contents("xx.html");
$old="";
$videosArray=array();
while ($old <> $contents)
{
$old = $contents;
$v = substring($contents,"?v=","&");
if($v) $videosArray[] = $v;
}
//$videosArray is array of v's
?>
I would rather parse the HTML with SimpleXML and XPath:
// Get your page HTML string
$html = file_get_contents('xx.html');
// As per comment by Gordon to suppress invalid markup warnings
libxml_use_internal_errors(true);
// Create SimpleXML object
$doc = new DOMDocument();
$doc->strictErrorChecking = false;
$doc->loadHTML($html);
$xml = simplexml_import_dom($doc);
// Find <a> nodes
$anchors = $xml->xpath('//a[contains(@href, "v=")]');
foreach ($anchors as $a)
{
$href = (string)$a['href'];
$url = parse_url($href);
parse_str($url['query'], $params);
// $params['v'] contains what we need
$vd[] = $params['v']; // push into array
}
// Clear invalid markup error buffer
libxml_clear_errors();
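For comparison, the same extraction can be written with DOMXPath alone, skipping the SimpleXML conversion. This is a minimal sketch, assuming an invented sample document and a hypothetical function name extract_video_ids. It also skips links whose query string has no v parameter.

```php
<?php
// Invented sample markup with two video links and one unrelated link.
$html = '<div><div><a href="/watch?v=abcdefghijk&amp;list=blabla&amp;feature=plpp_play_all">One</a></div>'
      . '<a href="/watch?v=ABCDEFGHIJ1&amp;list=blabla">Two</a><a href="/about">No id</a></div>';

// Hypothetical helper: returns the v parameter of every matching anchor.
function extract_video_ids($html)
{
    $previous = libxml_use_internal_errors(true); // silence invalid-markup warnings
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $ids = array();
    $xpath = new DOMXPath($doc);
    foreach ($xpath->query('//a[contains(@href, "v=")]') as $a) {
        // Take only the query string of the link, then split it into parameters.
        $query = parse_url($a->getAttribute('href'), PHP_URL_QUERY);
        if (!is_string($query)) {
            continue;
        }
        parse_str($query, $params);
        if (isset($params['v'])) {
            $ids[] = $params['v'];
        }
    }
    return $ids;
}

$vd = extract_video_ids($html);
print_r($vd);
```

The parser reads the file once and the XPath query selects only the relevant anchors, so there is no character-by-character scan.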
