Search PDF for strings and get their position on the page

I would like to add named destinations (nameddests) to locations in an existing PDF that are identified by a string (say: put a nameddest on the first occurrence of the string "Chapter 1"). I would then like to jump to those nameddests using JS events.
What I have achieved so far using PHP and FPDF/FPDI: I can load an existing PDF using FPDI and add nameddests to arbitrary positions using a slightly modified version of [1]. I can then embed the PDF in an iframe and navigate to the nameddests using, for example, JS buttons.
However, so far I have to find the positions for the nameddests by hand. How can I search the PDF for strings and get the page numbers and positions of the search results, so that I can add nameddests there?
[1] http://www.fpdf.org/en/script/script99.php

It is impossible to analyse the content of a PDF document with FPDI.
We (Setasign, author of FPDI and PDF_NamedDestinations) offer a commercial (not free) product that can handle this task: the SetaPDF-Extractor component.
A simple proof of concept for your project could look like this:
<?php
// load and register the autoload function
require_once('library/SetaPDF/Autoload.php');
$writer = new SetaPDF_Core_Writer_Http('result.pdf', true);
$document = SetaPDF_Core_Document::loadByFilename('file/with/chapters.pdf', $writer);
$extractor = new SetaPDF_Extractor($document);
// define the word strategy
$strategy = new SetaPDF_Extractor_Strategy_Word();
$extractor->setStrategy($strategy);
// get the pages helper
$pages = $document->getCatalog()->getPages();
// get access to the named destination tree
$names = $document
    ->getCatalog()
    ->getNames()
    ->getTree(SetaPDF_Core_Document_Catalog_Names::DESTS, true);
// holds the last "Chapter" word until the word after it has been checked
$chapter = null;
for ($pageNo = 1; $pageNo <= $pages->count(); $pageNo++) {
    /**
     * @var SetaPDF_Extractor_Result_Word[] $words
     */
    $words = $extractor->getResultByPageNumber($pageNo);
    // iterate over all found words and search for "Chapter" followed by a numeric string...
    foreach ($words as $word) {
        $string = $word->getString();
        if ($string === 'Chapter') {
            $chapter = $word;
            continue;
        }
        if (null === $chapter) {
            continue;
        }
        // is the next word a numeric string
        if (is_numeric($word->getString())) {
            // get the coordinates of the word
            $bounds = $word->getBounds()[0];
            // create a destination
            $destination = SetaPDF_Core_Document_Destination::createByPageNo(
                $document,
                $pageNo,
                SetaPDF_Core_Document_Destination::FIT_MODE_FIT_BH,
                $bounds->getUl()->getY()
            );
            // create a name (shall be unique)
            $name = strtolower($chapter->getString() . $word->getString());
            try {
                // add the named destination to the name tree
                $names->add($name, $destination->getPdfValue());
            } catch (SetaPDF_Core_DataStructure_Tree_KeyAlreadyExistsException $e) {
                // handle this exception
            }
        }
        $chapter = null;
    }
}
// save and finish the resulting document
$document->save()->finish();
You can then access the named destinations through the URL this way (the viewer application and browser plugin need to support this):
http://www.example.com/script.php#chapter1
http://www.example.com/script.php#chapter2
http://www.example.com/script.php#chapter10
...
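The pairing logic of the loop above ("Chapter" followed directly by a numeric word) can be tried in isolation on a plain array of strings, without SetaPDF. This is only a sketch: the function name findChapterNames() and the sample words are made up for illustration and are not part of any library.

```php
<?php
/**
 * Hypothetical helper (not part of SetaPDF): given the words of a page in
 * reading order, return a destination name for every "Chapter <number>" pair.
 *
 * @param string[] $words
 * @return string[]
 */
function findChapterNames(array $words): array
{
    $names = [];
    $previousWasChapter = false;
    foreach ($words as $word) {
        if ($word === 'Chapter') {
            $previousWasChapter = true;
            continue;
        }
        // Only a numeric word directly after "Chapter" counts as a match.
        if ($previousWasChapter && is_numeric($word)) {
            $names[] = strtolower('Chapter' . $word);
        }
        $previousWasChapter = false;
    }
    return $names;
}

// Sample input, invented for this sketch.
$words = ['See', 'Chapter', '1', 'and', 'Chapter', 'two', 'or', 'Chapter', '10'];
print_r(findChapterNames($words));
```

The names produced this way are only unique if each chapter heading occurs once; repeated matches (for example in a table of contents) would need the same duplicate handling as the try/catch in the code above.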

Related

Can't get destination URL of Ad by Adwords API

I want to get the destination URL using the Google AdWords API (v201509).
I'm coding in PHP.
In the following code, I'm trying to get the url by using 'get' method of AdGroupAdService.
As a result, I could get ad->displayUrl properly but couldn't get ad->url and ad->finalUrls (null given).
What am I doing wrong?
adwords.php with the following code -
$adGroupAdService = $user->GetService('AdGroupAdService', ADWORDS_VERSION);
// Create selector.
$selector = new Selector();
$selector->fields = array('Headline', 'Id');
$selector->ordering[] = new OrderBy('Headline', 'ASCENDING');
// Create paging controls.
$selector->paging = new Paging(0, AdWordsConstants::RECOMMENDED_PAGE_SIZE);
do {
    // Make the get request.
    $page = $adGroupAdService->get($selector);
    // Display results.
    if (isset($page->entries)) {
        foreach ($page->entries as $adGroupAd) {
            array_push($googleAccountStructure, $adGroupAd);
            //var_dump($adGroupAd);
        }
    }
    // Advance the paging index.
    $selector->paging->startIndex += AdWordsConstants::RECOMMENDED_PAGE_SIZE;
} while ($page->totalNumEntries > $selector->paging->startIndex);

Update your selector fields so that they include the URL fields:
$selector->fields = array('Headline', 'Id', 'CreativeFinalUrls', 'Url');
According to the AdWords API documentation, if you use Upgraded URLs you need to request the final URLs in the selector fields:
https://developers.google.com/adwords/api/docs/reference/v201509/AdGroupAdService.Ad#finalUrls

Laravel 5 Include blade templates dynamically (replace div)

I have a CMS where users can create and edit their own content on their websites. I also let them include forms and galleries by simply replacing specific divs in their content.
In the past I simply exploded the content on these divs into an array, replaced each whole div with the HTML code needed to show the form or gallery at that exact position (using PHP's include), imploded the array back into an HTML string, and used that on the website.
Now I am trying to achieve the same in Laravel 5:
// example plugins div in HTML
// ******************************
<div class="plugin form"></div>
// PageController.php
// ******************************
$page = Page::where('url', '=', "home")->first();
$page->text = Helpers::getPlugins($page->text);
// Helpers.php (a non default custom class with functions)
// ******************************
public static function getPlugins($page)
{
    $dom = new DOMDocument();
    $dom->loadHTML($page, LIBXML_HTML_NOIMPLIED);
    $x = $dom->getElementsByTagName("div");
    foreach ($x as $node)
    {
        if (strstr($node->getAttribute('class'), "plugin"))
        {
            $plugin = explode(" ",$node->getAttribute('class'));
            $filename = base_path() . "/resources/views/plugins/" . trim($plugin[1]) . ".blade.php";
            if (is_file($filename))
            {
                ob_start();
                include($filename);
                ob_get_contents();
                $node->nodeValue = ob_get_clean();
            }
            else
            {
                $node->nodeValue = "Plugin <strong>".$node->getAttribute('class')."</strong> Not found</div>";
            }
        }
    }
    return $dom->saveHTML();
}
So far so good: the content is returned, but what I get is the raw Blade markup as plain text instead of the HTML generated by Laravel, which is what I want to use.
I think there is a way to make this work, but I can't figure it out.

Try building the template manually with the method BladeCompiler->compile().
Edit: I think the Blade facade (Blade::compile()) gives you access to this method too; just add "use Blade;" at the top of the file.
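A minimal sketch in plain PHP of why the include approach from the question returns raw markup: include only executes what sits inside PHP tags and prints everything else verbatim, so Blade syntax such as {{ $name }} passes through untouched. The temporary file and its contents are invented for this demonstration.

```php
<?php
// Hypothetical template written to a temp file for this demo only.
$template = tempnam(sys_get_temp_dir(), 'tpl');
$source = '<p>Hello {{ $name }}</p><p><?php echo strtoupper("php"); ?></p>';
file_put_contents($template, $source);

$name = 'World';

// Same buffering pattern as in the question: capture what include prints.
ob_start();
include $template;
$output = ob_get_clean();
unlink($template);

// The PHP tag was executed, the Blade expression was not.
echo $output, PHP_EOL;
```

In a Laravel application the template therefore has to go through the view layer before it is inserted, for example with something like view('plugins.form')->render(), assuming the plugin file lives under resources/views/plugins as in the question.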

Paste a PDF-file into another PDF with ZendPDF

Is it possible to "merge" or "paste" a PDF file into another PDF? Or must it be an image instead?
The PDF I want to paste or merge is a simple picture that should appear at the bottom of the finished PDF:
//Generate the "Original" PDF here..
function addReklam($reklamblad) //The PDF that should be merged into the PDF that is created above
{
    //Count how many pages have been created, and add it at the bottom of the PDF:
    if($this->drawed_lines<52)
    {
        $this->active_page = $this->pdf->pages[2];
    }
    elseif($this->drawed_lines<92)
    {
        $this->active_page = $this->pdf->pages[3];
    }
    elseif($this->drawed_lines<132)
    {
        $this->active_page = $this->pdf->pages[4];
    }
    else
    {
        $this->active_page = $this->pdf->pages[5];
    }
    //$this->active_page = $this->pdf->pages[5]; // page 5 is the last
    //Add it here???
}

My recommendation would be to use the Zend_Pdf::load() method to load the "Original" PDF file into a local instance of Zend_Pdf. You can then access the pages through the pages[] array, as in your sample code, and use the standard functions such as drawImage() to make the needed modifications before saving the updated version.

PHP DOMDocument - trouble accessing list index

I am writing some code for an IRC bot written in PHP and running on the Linux CLI. I'm having a little trouble with my code that retrieves a website's title tag and displays it using a DOMDocument NodeList. Basically, on websites with two or more title tags (and you would be surprised how many there actually are...) I want to process only the first title tag. As you can see from the code below (which works fine for processing one or more title tags), there is a foreach block that iterates through each title tag.
public function onReceivedData($data) {
    // loop through each message token
    foreach ($data["message"] as $token) {
        // if the token starts with www, add http file handle
        if (strcmp(substr($token, 0, 4), "www.") == 0) {
            $token = "http://" . $token;
        }
        // validate token as a URL
        if (filter_var($token, FILTER_VALIDATE_URL)) {
            // create timeout stream context
            $theContext['http']['timeout'] = 3;
            $context = stream_context_create($theContext);
            // get contents of url
            if ($file = file_get_contents($token, false, $context)) {
                // instantiate a new DOMDocument object
                $dom = new DOMDocument;
                // load the html into the DOMDocument obj
                @$dom->loadHTML($file);
                // retrieve the title from the DOM node
                // if assignment is valid then...
                if ($title = $dom->getElementsByTagName("title")) {
                    // send a message to the channel
                    foreach ($title as $theTitle) {
                        $this->privmsg($data["target"], $theTitle->nodeValue);
                    }
                }
            } else {
                // notify of failure
                $this->privmsg($data["target"], "Site could not be reached");
            }
        }
    }
}
What I'd prefer is to somehow limit it to processing only the first title tag. I'm aware that I can wrap an if statement around it with a variable so that it only echoes once, but I'd rather use a "for" statement to process a single iteration. However, when I do this, I can't access the title with $title->nodeValue; it says it's undefined, and only when I use foreach $title as $theTitle can I access the values. I've tried $title[0]->nodeValue and $title->nodeValue(0) to retrieve the first title from the list, but to no avail. I'm a bit stumped, and a quick Google search didn't turn up much.
Any help would be greatly appreciated! Cheers, and I'll keep looking too.

You can solve this with XPath:
$dom = new DOMDocument();
@$dom->loadHTML($file);
$xpath = new DOMXPath($dom);
$title = $xpath->query('//title')->item(0)->nodeValue;

Try something like this:
$title->item(0)->nodeValue;
http://www.php.net/manual/en/class.domnodelist.php
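As a self-contained sketch of the item(0) approach: getElementsByTagName() returns a DOMNodeList, and item(0) returns its first node, or null when the list is empty. The sample markup with two title elements is made up for illustration.

```php
<?php
// Sample markup with two <title> elements (invented for this sketch).
$html = '<html><head><title>First</title><title>Second</title></head><body></body></html>';

$dom = new DOMDocument();
// Suppress parser warnings that sloppy real-world HTML would trigger.
@$dom->loadHTML($html);

$titles = $dom->getElementsByTagName('title');

// item(0) returns null for an empty list, so guard before reading nodeValue.
$first = $titles->item(0);
$text = $first !== null ? trim($first->nodeValue) : null;

echo $text, PHP_EOL;
```

The null check matters for a bot that visits arbitrary URLs, because pages without any title element would otherwise cause an error when nodeValue is read.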

I'm trying to scrape a specific div with an id on a page

I want to scrape the contents of a page, well really just a single div from that page, and display it to the user inside of a small div on a webpage. I just need a piece of info from a carfax page that needs user credentials so I can't post the exact code but I tried using google.com and have the same problem so the solution should cross over.
Right now I've tried this:
$webPage = file_get_contents('http://www.google.com');
$doc = new DOMDocument();
$doc->loadHTML($webPage);
$div = $doc->getElementById('lga');//this is the id to the div holding the image above the textbox
//echo $webPage;//this displays www.google.com minus the image. I imagine because of the file path
//var_dump($div);//this display "object(DOMElement)#2 (0) { }" and I'm not sure what that means
//echo $div;//this has a server error
I'm also looking at simple_html_dom.php trying to figure that out.

You can use this:
/**
 * Downloads a web page from $url, selects the element by $id
 * and returns its XML string representation.
 */
function getElementByIdAsString($url, $id, $pretty = true) {
    $doc = new DOMDocument();
    // loadHTMLFile() returns false on failure; @ hides the parser warnings
    if(!@$doc->loadHTMLFile($url)) {
        throw new Exception("Failed to load $url");
    }
    // Obtain the element
    $element = $doc->getElementById($id);
    if(!$element) {
        throw new Exception("An element with id $id was not found");
    }
    if($pretty) {
        $doc->formatOutput = true;
    }
    // Return the string representation of the element
    return $doc->saveXML($element);
}
// call it:
echo getElementByIdAsString('http://www.google.com', 'lga');
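The same idea can be tried offline with an inline HTML string instead of a download. This sketch also shows why echo $div fails in the question: a DOMElement is an object and has to be serialized first, here with saveHTML($node). The sample page is invented for illustration.

```php
<?php
// Inline sample page standing in for the downloaded HTML (made up for this sketch).
$html = '<html><body><div id="lga"><img src="logo.png" alt="Logo"></div><p>Other</p></body></html>';

$doc = new DOMDocument();
// Suppress parser warnings for imperfect markup.
@$doc->loadHTML($html);

$div = $doc->getElementById('lga');

// A DOMElement cannot be echoed directly; serialize just that node instead.
$markup = $div !== null ? $doc->saveHTML($div) : '';

echo $markup, PHP_EOL;
```

Relative paths inside the extracted markup (such as the image source here) still point at the original site, so they may need to be rewritten to absolute URLs before the fragment is shown on another page.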
