What makes this jsessionid show up in this PHP code's result? - php

I want to parse this page: CiteSeerx Result.
I tried this:
<?php
include('simple_html_dom.php');
$url = 'http://citeseerx.ist.psu.edu/search?q=mean&t=doc&sort=rlv&start=0';
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$curl_scraped_page = curl_exec($ch);
$html = new simple_html_dom();
$html->load($curl_scraped_page);
foreach ($html->find('div.result h3') as $title) {
echo $title->plaintext . '<br/>';
}
echo '---<br>';
foreach ($html->find('div.result h3 a') as $link) {
echo 'http://citeseerx.ist.psu.edu' . $link->href . '<br>';
}
echo '---<br>';
foreach ($html->find('div.pubinfo') as $info){
echo $info->innertext. '<br>';
}
echo '---<br>';
foreach ($html->find('div.snippet') as $snippet){
echo $snippet->innertext. '<br>';
}
?>
It works and gives me what I want, except that jsessionid=... shows up in every line of the $link results.
How do I make it disappear? I searched for a solution, but all I found were ways to solve it in Java, not PHP.
Thanks.

<a class="remove doc_details" href="/viewdoc/summary;jsessionid=103B4C6E9ADA3C8B17DD64BD57238F9D?doi=10.1.1.160.3832">
That is because the href in the tag itself already includes the jsessionid part :)
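Since the session id is part of the href itself, one way to get rid of it is to strip it from the string before echoing. A minimal sketch, assuming the id always appears as a ";jsessionid=..." path parameter as in the tag above; stripJsessionId() is a hypothetical helper name, not something simple_html_dom provides:

```php
<?php
// Hypothetical helper (not part of simple_html_dom): removes a
// ";jsessionid=..." path parameter from a URL or href.
function stripJsessionId(string $href): string
{
    // Match ";jsessionid=" plus everything up to the next "?", "#" or ";".
    return preg_replace('/;jsessionid=[^?#;]*/i', '', $href);
}

// The href from the tag above.
$href = '/viewdoc/summary;jsessionid=103B4C6E9ADA3C8B17DD64BD57238F9D?doi=10.1.1.160.3832';
echo 'http://citeseerx.ist.psu.edu' . stripJsessionId($href) . '<br>';
```

In the question's loop this would mean echoing stripJsessionId($link->href) instead of $link->href.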

Related

php multidimensional array to string or table

How can I output this array as an HTML table?
For each row, I would like to output it like this within the foreach:
echo "<td>".$lat."</td><td>".$long."</td>";
The request follows the example at https://developer.here.com/documentation/routing-waypoints/topics/quick-start-simple-car.html
I have tried this code:
$api_url = "https://wse.api.here.com/2/findsequence.json?start=Berlin-Main-Station;52.52282,13.37011&destination1=East-Side-Gallery;52.50341,13.44429&destination2=Olympiastadion;52.51293,13.24021&end=HERE-Berlin-Campus;52.53066,13.38511&mode=fastest;car&app_id=ID&app_code=CODE";
$api_response = file_get_contents($api_url);
$api_response_decoded = json_decode($api_response, true);
foreach($api_response_decoded as $api_response_decoded_row){
print_r($api_response_decoded_row[0][waypoints]);
}
and also tried
print_r($api_response_decoded_row[0][waypoints][id]);
and also tried
echo($api_response_decoded_row[0][waypoints][id]);
and also tried
implode($api_response_decoded_row[0][waypoints][id]);
Here's one way you could do it if the comments didn't already help you enough.
foreach($api_response_decoded as $api_response_decoded_rows){
foreach ($api_response_decoded_rows[0]['waypoints'] as $waypoint) {
$html = '
<td>'.$waypoint['lat'].'</td>
<td>'.$waypoint['lng'].'</td>
';
echo $html;
}
}
Thanks to the commenters and answerers. In case it helps someone else, the full working code is:
$api_url = "https://wse.api.here.com/2/findsequence.json?start=Berlin-Main-Station;52.52282,13.37011&destination1=East-Side-Gallery;52.50341,13.44429&destination2=Olympiastadion;52.51293,13.24021&end=HERE-Berlin-Campus;52.53066,13.38511&mode=fastest;car&app_id=ID&app_code=CODE";
$api_response = file_get_contents($api_url);
$api_response_decoded = json_decode($api_response, true);
echo "<table>";
foreach($api_response_decoded as $api_response_decoded_rows){
foreach ($api_response_decoded_rows[0]['waypoints'] as $waypoint) {
$html = '<tr><td>'.$waypoint['sequence'].'</td><td>'.$waypoint['id'].'</td><td>'.$waypoint['lat'].'</td><td>'.$waypoint['lng'].'</td></tr>';
echo $html;
}
}
echo "</table>";
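The loops above assume that the request succeeded and that the JSON has the expected shape. A rough sketch of a more defensive variant follows. It runs on inline sample data instead of calling the API; the "results" key, the sample values, and the helper name waypointRows() are assumptions for illustration:

```php
<?php
// Sample data in the shape the loops above imply; the "results" key and
// the values are assumptions for illustration, not real API output.
$json = '{"results":[{"waypoints":[
    {"id":"Berlin-Main-Station","lat":52.52282,"lng":13.37011,"sequence":0},
    {"id":"East-Side-Gallery","lat":52.50341,"lng":13.44429,"sequence":1}
]}]}';

// Hypothetical helper: builds one <tr> per waypoint.
function waypointRows(array $decoded): string
{
    $rows = '';
    // Fall back to an empty list if the expected keys are missing.
    foreach ($decoded['results'][0]['waypoints'] ?? [] as $waypoint) {
        $rows .= '<tr><td>' . htmlspecialchars((string) $waypoint['id']) . '</td>'
            . '<td>' . htmlspecialchars((string) $waypoint['lat']) . '</td>'
            . '<td>' . htmlspecialchars((string) $waypoint['lng']) . '</td></tr>';
    }
    return $rows;
}

// json_decode() returns null on invalid JSON, so check before using it.
$decoded = json_decode($json, true);
$rows = is_array($decoded) ? waypointRows($decoded) : '';
echo '<table>' . $rows . '</table>';
```

With missing keys or a failed decode this prints an empty table rather than raising warnings.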

Unable to parse links (href) from a page via PHP

Please see my script below :
<?php
function getContent ()
{
$ch = curl_init();
curl_setopt($ch,CURLOPT_URL, 'http://localhost/test.php/test2.php');
curl_setopt($ch,CURLOPT_RETURNTRANSFER,true);
$output=curl_exec($ch);
curl_close($ch);
return $output;
}
function getHrefFromLinks ($cString){
libxml_use_internal_errors(true);
$dom = new DomDocument();
$dom->loadHTML($cString);
$xpath = new DOMXPath($dom);
$nodes = $xpath->query('//a/@href');
foreach($nodes as $href) {
echo $href->nodeValue; echo "<br />"; // echo current attribute value
$href->nodeValue = 'new value'; // set new attribute value
$href->parentNode->removeAttribute('href'); // remove attribute
}
foreach (libxml_get_errors() as $error) {
}
libxml_clear_errors();
}
echo getHrefFromLinks (getContent());
?>
The output of http://localhost/test.php/test2.php is :
Luck</span> LuckyLuck</span>'s Locki
When echo getHrefFromLinks (getContent()); runs, the output is :
/oncelink/index.html<br />/oncelink-2/lucky<br />
This is wrong, as the output should be :
/oncelink/index.html<br />/oncelink-2/lucky'locki<br />
I understand that the href value generated from the link is somehow incorrect, as it includes an additional apostrophe, but I can't change that because it is pre-generated.
The other question is: how can I get the value of the span tag:
<span class="lsbold">
Thanks in advance!
SOLVED :)
Well. If it's stupid but it works, then it ain't stupid :D
Just added the following code at the end:
$fix = str_replace("href='", 'href="', getContent());
$fix = str_replace("'>", '">', $fix);
echo getHrefFromLinks ($fix);
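The second question (reading the span value) is not covered by the fix above. One possible approach, sketched here with the same DOMDocument/DOMXPath classes the script already uses: the sample markup is an assumption, and only the class name lsbold comes from the question.

```php
<?php
// Sample markup is an assumption; only the span's class comes from the question.
$html = '<a href="/oncelink/index.html"><span class="lsbold">Luck</span></a>';

libxml_use_internal_errors(true);   // keep parser warnings out of the output
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$spanTexts = [];
// Select every <span> whose class attribute is exactly "lsbold".
foreach ($xpath->query('//span[@class="lsbold"]') as $span) {
    $spanTexts[] = $span->textContent;
}
echo implode('<br />', $spanTexts);
```

If the span can carry more than one class, the exact-match test on @class would need to be loosened, for example with contains().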

cURL: variable in foreach loop

Good day. Can anyone help me figure out what is wrong with my code, or whether I coded it the wrong way?
The curl part is OK. My problem is that when I read the fields in a foreach loop, the result is broken images.
I've tried treating it as an array, but nothing happens. I'm new to this, so maybe I'm missing something here.
Here is my code:
<?php
$url = "http://XXXXXXXXXXXXXX"; //Base Url
$parameters = ['mode' => 'contributors']; // riders, current_rounds, contributors, season_entries
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_POST, 1);
curl_setopt($ch, CURLOPT_POSTFIELDS, $parameters);
curl_setopt($ch,CURLOPT_HTTPHEADER, ['x-weplaymedia-authorisation:XXXXXXXXXXXXX']);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$result = curl_exec($ch); // Execute
$arr = json_decode($result,true); // Dump result here.
//print_r($arr);
If you run print_r($arr); it displays an array of fields.
But when I try to access certain fields ([fwcContributors]) in my foreach loop, I get broken images.
Here is the image of array:
Here is the result
What I want is to display their profile picture from [profilePicture] and username from [userName].
$i=0;
foreach ($arr['fwcContributors'] as $val)
{
if($i++ == 5);
echo '<tbody >';
echo '<tr style="transform: skewX(-20deg);">';
echo '<td>';
echo '<img src='.($val['profilePicture']) .' style="transform: skewX(20deg);">' . htmlspecialchars($val['userName']);
echo '</td>';
echo '</tr>';
}
?>
Thank you in advance.
fwcContributors contains nested arrays; the one you probably want to iterate over is ContributorList:
foreach ($arr['fwcContributors']['ContributorList'] as $val)
{
echo '<tbody >';
echo '<tr style="transform: skewX(-20deg);">';
echo '<td>';
echo '<img src="' . htmlspecialchars($val['profilePicture']) . '" style="transform: skewX(20deg);">' . htmlspecialchars($val['userName']);
echo '</td>';
echo '</tr>';
}
(Took the $i statements out, as they don't seem to do anything.)

PHPQuery - get all links that contain a specific URL path

I am trying to get all links on a given page that contain a specific URL path, using PHPQuery. I am using PHPQuery's PHP syntax.
include_once 'phpQuery.php';
$url = 'http://www.phonearena.com/phones/manufacturer/';
$doc = phpQuery::newDocumentFile($url);
$urls = $doc['a'];
foreach ($urls as $url) {
echo pq($url)->attr('href') . '<br>';
}
The code above works, but it shows all the links.
I want to show only those containing "/phones/manufacturer/".
I tried this but it shows nothing:
include_once 'phpQuery.php';
$url = 'http://www.phonearena.com/phones/manufacturer/';
$doc = phpQuery::newDocumentFile($url);
$urls = $doc['a'];
foreach ($urls as $url) {
echo pq($url)->attr('href:contains("/phones/manufacturer/")') . '<br>';
}
Use the code below to get all URLs from that site:
$doc = new DOMDocument();
@$doc->loadHTML(file_get_contents('http://www.phonearena.com/phones/manufacturer/'));
$ahreftags = $doc->getElementsByTagName('a');
foreach ($ahreftags as $tag) {
echo "<br/>";
echo $tag->getAttribute('href');
echo "<br/>";
}
exit;
Try this. It uses the attribute-contains selector ([href*="..."]) described in the jQuery documentation:
include_once 'phpQuery.php';
$url = 'http://www.phonearena.com/phones/manufacturer/';
$doc = phpQuery::newDocumentFile($url);
$urls = $doc['a[href*="/phones/manufacturer/"]'];
foreach ($urls as $url) {
echo pq($url)->attr('href') . '<br>';
}
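The same filter can also be written without phpQuery, using PHP's built-in DOMDocument and an XPath contains() test. This is only a sketch: it runs on inline sample markup (an assumption) rather than fetching the live page.

```php
<?php
// Inline sample markup (an assumption) so the sketch runs without a network request.
$html = '<a href="/phones/manufacturer/acer">Acer</a>'
      . '<a href="/news">News</a>'
      . '<a href="http://www.phonearena.com/phones/manufacturer/apple">Apple</a>';

libxml_use_internal_errors(true);   // suppress warnings about imperfect HTML
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$matches = [];
// Keep only <a> elements whose href contains the wanted path.
foreach ($xpath->query('//a[contains(@href, "/phones/manufacturer/")]') as $a) {
    $matches[] = $a->getAttribute('href');
}
echo implode('<br>', $matches);
```

For the real page, the string passed to loadHTML() would come from file_get_contents() or curl, as in the other answer.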

Scraping with Simple HTML DOM Parser but it stops suddenly

I'm trying to scrape the following page: http://mangafox.me/manga/
I wanted the script to follow each of those links and scrape the details of each manga, and for the most part my code does exactly that. It works, but for some reason the page just stops loading midway (it doesn't even get through the # list).
There is no error message, so I don't know what to look for. I would appreciate some advice on what I'm doing wrong.
Code:
<?php
include('simple_html_dom.php');
set_time_limit(0);
//ini_set('max_execution_time', 300);
//Creates an instance of the simple_html_dom class
$html = new simple_html_dom();
//Loads the page from the URL entered
$html->load_file('http://mangafox.me/manga');
//Finds an element and if there is more than 1 instance the variable becomes an array
$manga_urls = $html->find('.manga_list a');
//Function which retrieves information needed to populate the DB from indiviual manga pages.
function getmanga($value, $url){
$pagehtml = new simple_html_dom();
$pagehtml->load_file($url);
if ($value == 'desc') {
$description = $pagehtml->find('p.summary');
foreach($description as $d){
//return $d->plaintext;
return $desc = $d->plaintext;
}
unset($description);
} else if ($value == 'status') {
$status = $pagehtml->find('div[class=data] span');
foreach ($status as $s) {
$status = explode(",", $s->plaintext);
return $status[0];
}
unset($status);
} else if ($value == 'genre') {
$genre = $pagehtml->find('//*[@id="title"]/table/tbody/tr[2]/td[4]');
foreach ($genre as $g) {
return $g->plaintext;
}
unset($genre);
} else if ($value == 'author') {
$author = $pagehtml->find('//*[@id="title"]/table/tbody/tr[2]/td[2]');
foreach ($author as $a) {
return $a->plaintext;
}
unset($author);
} else if ($value == 'release') {
$release = $pagehtml->find('//*[@id="title"]/table/tbody/tr[2]/td[1]');
foreach ($release as $r) {
return $r->plaintext;
}
unset($release);
} else if ($value == 'image') {
$image = $pagehtml->find('.cover img');
foreach ($image as $i) {
return $i->src;
}
unset($image);
}
$pagehtml->clear();
unset($pagehtml);
}
foreach($manga_urls as $url) {
$href = $url->href;
if (strpos($href, 'http') !== false){
echo 'Title: ' . $url->plaintext . '<br />';
echo 'Link: ' . $href . '<br />';
echo 'Description: ' . getmanga('desc', $href) . '<br />';
echo 'Status: ' . getmanga('status',$href) . '<br />';
echo 'Genre: ' . getmanga('genre', $href) . '<br />';
echo 'Author: ' . getmanga('author', $href) . '<br />';
echo 'Release: ' . getmanga('release', $href) . '<br />';
echo 'Image Link: ' . getmanga('image', $href) . '<br />';
echo '<br /><br />';
}
}
$html->clear();
unset($html);
?>
So, it was not a 'just do this' fix, but I did it ;)
Besides loading each sub page far too often (getmanga() fetches the page again for every field, so six times per manga), the script also had a huge simple_html_dom object to iterate through. The list has about 13307 items, and simple_html_dom is not built for speed or efficiency; it allocates a lot of memory for things you don't need in this case. That is why I replaced the main simple_html_dom with a regular expression.
I think it still takes ages to load fully, and you are better off using another language, but this is a working result :-)
https://gist.github.com/dralletje/ee996ffe4c957cdccd01
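One way to cut down the repeated loading is to parse each manga page once and return every field together. The following is a rough sketch of that idea, not the code from the gist: it uses PHP's built-in DOMDocument instead of simple_html_dom, the helper name extractManga() is hypothetical, the sample markup is made up, and only the p.summary and .cover img selectors come from the question.

```php
<?php
// Hypothetical helper: parses one manga page a single time and returns
// several fields at once. Exact-match class tests are a simplification.
function extractManga(string $pageHtml): array
{
    libxml_use_internal_errors(true);   // ignore warnings from imperfect HTML
    $dom = new DOMDocument();
    $dom->loadHTML($pageHtml);
    libxml_clear_errors();
    $xpath = new DOMXPath($dom);

    // Return the trimmed text of the first match, or '' if nothing matches.
    $first = function (string $query) use ($xpath): string {
        $nodes = $xpath->query($query);
        return $nodes->length > 0 ? trim($nodes->item(0)->textContent) : '';
    };

    return [
        'desc'  => $first('//p[@class="summary"]'),
        'image' => $first('//*[@class="cover"]//img/@src'),
    ];
}

// Made-up sample markup, for illustration only.
$sample = '<div class="cover"><img src="cover.jpg"></div><p class="summary"> A story. </p>';
$manga = extractManga($sample);
echo 'Description: ' . $manga['desc'] . '<br />';
echo 'Image Link: ' . $manga['image'] . '<br />';
```

The remaining fields (status, genre, author, release) could be added to the returned array in the same way, so each page is fetched and parsed only once per manga.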
I faced the same issue: a loop with 20k iterations stopped without any error message. I am posting my solution in case it helps someone.
The issue seems to be performance, as stated before, so I decided to fetch the pages with curl instead of simple html dom. The function below returns the content of a website:
function getContent($url){
$ch = curl_init();
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, FALSE);
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$result = curl_exec($ch);
curl_close($ch);
if($result){
return $result;
}else{
return "";
}
}
To traverse the DOM I still use simple html dom, but the code changes to:
$content = getContent($url);
if($content){
// Create a DOM object
$doc = new simple_html_dom();
// Load HTML from a string
$doc->load($content);
}else{
continue;
}
At the end of each loop iteration, clear and unset the variable:
$doc->clear();
unset($doc);
