How to index document pages that can be retrieved later into elasticsearch - php

I am indexing PDF documents in Elasticsearch without using the official plugin; instead, I am using a PHP library to parse the PDF content into plain text. This library lets me get the document content page by page, so I would like my search page to return highlights similar to:
[Page 1] ... Highlighted text from the search ... [Page 4] ... Highlighted text from page 4 that match with the search ...
The mapping I was given looks like this; I just changed text from a string to an array:
properties: {
highlight:{
text: [ "Page1Content...", "Page2Content...", "Page3Content...", ...],
other_fields: {}
},
other_fields: {}
}
But I cannot find a way to get the array indexes when retrieving the highlighted content; they get lost along the way.
Are nested fields or objects the only way to know the page number when I search? I don't know whether array keys are lost when highlighting too. I was thinking of something like this:
highlight : {
text: {
"Page1" : "Page1Content",
"Page2": "Page2Content",
....
},
other_fields: {}
}
Thanks in advance.
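One possible approach (not confirmed anywhere in this thread) is to store each page as a nested object and ask for the highlight through inner_hits, so that each highlight is returned together with its page. The sketch below only builds the mapping and the search body as PHP arrays; the field names pages, number and text and the helper buildPageSearch() are assumptions made for illustration.

```php
<?php
// Hypothetical field names (pages, number, text); adjust them to the real index.
$mapping = [
    'properties' => [
        'pages' => [
            'type' => 'nested',
            'properties' => [
                'number' => ['type' => 'integer'],
                'text'   => ['type' => 'text'],
            ],
        ],
    ],
];

// Builds a nested query whose inner_hits carry a highlight per matching page.
function buildPageSearch(string $terms): array
{
    return [
        'query' => [
            'nested' => [
                'path'  => 'pages',
                'query' => ['match' => ['pages.text' => $terms]],
                'inner_hits' => [
                    // stdClass encodes to {} (an empty PHP array would encode to []).
                    'highlight' => ['fields' => ['pages.text' => new stdClass()]],
                ],
            ],
        ],
    ];
}

echo json_encode(buildPageSearch('search terms'), JSON_PRETTY_PRINT), PHP_EOL;
```

With a body like this, each matching page should appear as its own inner hit, so the page number stored next to the text can be read alongside the highlighted fragment. Verify the exact response layout against the Elasticsearch version in use.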

Related

How to store efficiently all text edits?

I'm building a PHP Symfony 4 + Vue.js app. One of its parts is a text editor that remembers all edits to the text. I saw how the ACE editor solves the same problem: it saves every letter, but I don't need such precision. It would be enough to store substrings with the type of action. Something like:
{ action: add, from: 0, to: null, text: "New string" }
And result text will be: "New string". Then I changed something:
{ action: add, from: 3, to: 4, text: 'Delicious S' }
And the result became: "New Delicious String". By storing such "commits" with the text, I can reconstruct the state of the text at any given moment by applying the published commits up to that moment.
Is my thinking correct, or is there a better way?
Maybe you know an efficient algorithm or data structure for this task?
What edit coordinates in the text should I use to work with them more efficiently?
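To make the idea of replaying commits concrete, here is a minimal sketch. It assumes a commit shape that differs slightly from the one in the question: each commit replaces the half-open byte range [from, to) with text, so a pure insertion has from equal to to. The function names applyCommit() and replay() are hypothetical.

```php
<?php
// Assumed commit shape: replace the half-open byte range [from, to) with 'text'.
function applyCommit(string $text, array $commit): string
{
    $length = $commit['to'] - $commit['from'];
    return substr_replace($text, $commit['text'], $commit['from'], $length);
}

// Rebuilds the text as it was after the first $upTo commits.
function replay(array $commits, int $upTo): string
{
    $text = '';
    foreach (array_slice($commits, 0, $upTo) as $commit) {
        $text = applyCommit($text, $commit);
    }
    return $text;
}

$commits = [
    ['from' => 0, 'to' => 0, 'text' => 'New string'],  // insertion into empty text
    ['from' => 4, 'to' => 5, 'text' => 'Delicious S'], // replaces the lowercase "s"
];

echo replay($commits, 2), PHP_EOL; // New Delicious String
```

substr_replace() works on bytes, so multibyte text would need offsets counted in characters (for example with the mbstring functions) instead.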

Algolia highlighting in Laravel 5.3

I am using Laravel 5.3 and Algolia.
I want to highlight the search results. I read the documentation, but I still don't know how to do it.
https://www.algolia.com/doc/api-client/php/parameters#attributestohighlight
Any ideas?
When search results are returned from Algolia, they will wrap the "highlighted part" with <em> </em> tags by default. This happens right out of the box, so all you really need to do is use CSS to customize the look of <em>s within your search results div to get the effect you want.
Of course if you prefer that they wrap highlighted text in something other than <em> then you can customize it with anything you wish (such as maybe a span tag with a "highlighted-search" class or something). You customize this when initializing the search in your PHP.
$index = $client->initIndex('contacts');
$result = $index->search('search query', [
    'attributesToRetrieve' => 'firstname,lastname',
    'hitsPerPage' => 50,
    'highlightPreTag' => '<span class="highlighted-search">',
    'highlightPostTag' => '</span>',
]);
Now let's say you search 'John D' and submit that search query. Algolia will return a string to the effect of:
<span class="highlighted-search">John D</span>oe
Now with your CSS you customize it like
span.highlighted-search {
background-color:yellow;
}
and now it will highlight the search query with yellow.
Of course this is all assuming you want the static (PHP Library) server side search results. I highly recommend that you use the autocomplete.js library so you can get live search results as you type. This requires using the js libraries to return results client-side while the user types. It is a much better experience.
In each item returned by the search engine, there is an extra "_highlightResult" attribute that contains some metadata and the value of the searchable attributes modified with search terms surrounded with <em>.
For instance, for the search term "toux", the returned JSON will look like:
{
"medicament" : "VICKS TOUX SECHE 7,33 mg ADULTES MIEL, pastille",
"_highlightResult" : {
"medicament" : {
"value" : "VICKS <em>TOUX</em> SECHE 7,33 mg ADULTES MIEL, pastille",
"matchedWords" : ["toux"]
....
...
}
To highlight the search results, with the search terms, you simply have to display the attribute value under "_highlightResult" instead of the raw one.
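As a rough illustration in PHP, the snippet below prints the highlighted value for each hit. The $hits array is a hand-written stand-in for a decoded search response, reusing the sample hit from the JSON above, and the helper highlighted() is a hypothetical name.

```php
<?php
// Stand-in for the decoded 'hits' of a search response (sample data from the JSON above).
$hits = [
    [
        'medicament' => 'VICKS TOUX SECHE 7,33 mg ADULTES MIEL, pastille',
        '_highlightResult' => [
            'medicament' => [
                'value' => 'VICKS <em>TOUX</em> SECHE 7,33 mg ADULTES MIEL, pastille',
                'matchedWords' => ['toux'],
            ],
        ],
    ],
];

// Returns the highlighted value when present, otherwise the escaped raw attribute.
function highlighted(array $hit, string $attribute): string
{
    return $hit['_highlightResult'][$attribute]['value']
        ?? htmlspecialchars($hit[$attribute] ?? '');
}

foreach ($hits as $hit) {
    echo '<li>', highlighted($hit, 'medicament'), '</li>', PHP_EOL;
}
```

The highlighted value is printed unescaped on purpose, since it carries the highlight tags; only the fallback to the raw attribute is escaped.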
If you are not using it already, I would recommend you to use Instantsearch.js.
Highlighting the typed keywords in the search results is handled within the hits widget in the same way, through its templates.item parameter.
You can find live code examples of this feature here: https://community.algolia.com/instantsearch.js/examples/

How to combine the text node of 2 pieces of extracted data using Goutte/Domcrawler

I've been trying to figure out how to combine two pieces of extracted text into a single result (array). In this case, the title and subtitle of a variety of books.
<td class="item_info">
<span class="item_title">Carrots Like Peas</span>
<em class="item_subtitle">- And Other Fun Facts</em>
</td>
The closest I've been able to get is:
$holds = $crawler->filter('span.item_title,em.item_subtitle');
Which I've managed to output with the following:
$holds->each(function ($node) {
echo '<pre>';
print $node->text();
echo '</pre>';
});
And results in
<pre>Carrots Like Peas</pre>
<pre>- And Other Fun Facts</pre>
Another problem is that not all the books have subtitles, so I need to avoid combining two titles together.
How would I go about combining those two into a single result (or array)?
In my case, I took a roundabout way to get where I wanted to be. I stepped back one level in the DOM to the td tag and grabbed everything and dumped it into the array.
I realized that DomCrawler's documentation had the example code to place the text nodes into an array.
$items_out = $crawler->filter('td.item_info')->each(function (Crawler $node, $i) {
return $node->text();
});
I'd tried to avoid capturing the td because authors were also included in those cells. After even more digging, I was able to strip the authors from the array with the following:
foreach ($items_out as &$items) {
    // strstr() returns false when ' - by' is absent, so fall back to the full text
    $items = strstr($items, ' - by', true) ?: $items;
}
unset($items);
Just took me five days to get it all sorted out. Now onto the next problem!
As per the Goutte documentation, Goutte uses the Symfony DomCrawler component. Information on adding content to a DomCrawler object can be found at Symfony DomCrawler - Adding Content
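An alternative to trimming the cell text afterwards is to read the title and the optional subtitle separately inside each cell. The sketch below uses PHP's built-in DOMDocument and DOMXPath rather than Goutte, so that it runs without extra packages; the same per-cell idea should carry over to calling filter() on each td node in DomCrawler. The second cell in the sample markup is made up to show a book without a subtitle.

```php
<?php
// Sample markup modeled on the question; the second cell (no subtitle) is invented.
$html = <<<HTML
<table><tr>
<td class="item_info"><span class="item_title">Carrots Like Peas</span>
<em class="item_subtitle">- And Other Fun Facts</em></td>
<td class="item_info"><span class="item_title">Plain Title</span></td>
</tr></table>
HTML;

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$books = [];
foreach ($xpath->query('//td[@class="item_info"]') as $cell) {
    // Query relative to the current cell so data from different books never mixes.
    $title    = $xpath->query('.//span[@class="item_title"]', $cell)->item(0);
    $subtitle = $xpath->query('.//em[@class="item_subtitle"]', $cell)->item(0);

    $books[] = [
        'title'    => $title ? trim($title->textContent) : '',
        'subtitle' => $subtitle ? trim($subtitle->textContent) : null,
    ];
}

print_r($books);
```

Because each book keeps its own title and subtitle keys, a missing subtitle stays null instead of being merged with the next title.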

mySql retrieving data between square brackets

I have strings of data in a field named content, one record may look something like:
loads of text ... [attr1] some text [attr2] more text [attr3] more text etc...
What I'm looking to do is get all the text within the square brackets; so that I can put it into a PHP array. Is this even possible with mySql?
I've seen the following post: Looking to extract data between parentheses in a string via MYSQL, but they are looking to extract only one value from between their parentheses, whereas I have an unknown number of them. After reading that post I've thought of doing something like the following:
SELECT substr(content,instr(content,"["), instr(content,"]")) as attrList from myTable
Which would grab me the following:
[attr1] some text [attr2] some more text [attr3]
and I can use PHP to strip the rest of the text out and then explode the string into an array, but is there a better way to do this just using mySql where I can retrieve something like:
[attr1][attr2][attr3]
I was thinking perhaps regex, but I see that it just returns true or false, which doesn't help me a lot.
After even more research, I'm not sure it's possible in mySql, and I might need the results in string or array form depending on where I'm using them in my app.
So I've created a new method to return the list after I've got the data from the database (with a little help from this post: PHP: Capturing text between square brackets):
public function attrList($array=false)
{
preg_match_all("/\[.*?\]/",$this->content,$matches);
$params = str_replace(array('[',']'),'',$matches[0]);
return ($array===false) ? implode(', ',$params) : $params;
}
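For trying the same idea outside a class, here is a small standalone variant. It is a sketch only: the function name extractAttrs() is hypothetical, and it uses a capture group so the brackets do not need to be stripped afterwards.

```php
<?php
// Returns the text found inside each pair of square brackets, in order of appearance.
function extractAttrs(string $content): array
{
    // [^\]]* matches everything up to the next closing bracket.
    preg_match_all('/\[([^\]]*)\]/', $content, $matches);
    return $matches[1];
}

$content = 'loads of text ... [attr1] some text [attr2] more text [attr3] more text';
echo implode(', ', extractAttrs($content)), PHP_EOL; // attr1, attr2, attr3
```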

jQuery Autocomplete Remote

I'm using http://jqueryui.com/demos/autocomplete/#multiple-remote on my website. I can see that this code queries search.php. Does anyone know what format search.php should return, and what the file should look like?
Thanks,
The page you linked shows the expected data format (pasted below). At its simplest, you can have a print statement in the search.php file that simply echoes back some hard-coded contents. A more elaborate solution is to have your search.php pull data from a database in real time and then format it as expected.
Expected data format
The data from local data, a url or a callback can come in two variants:
An Array of Strings:
[ "Choice1", "Choice2" ]
An Array of Objects with label and value properties:
[ { label: "Choice1", value: "value1" }, ... ]
So, just to get off the ground and see it working, use this line for search.php and build from there by customizing your choices or connecting to a db etc.
print '[ "Choice1", "Choice2" ]';
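As a next step, a slightly fuller search.php might filter a list by the typed text and return it as JSON. This is a minimal sketch: it assumes the widget sends the typed text in a term query parameter, as the remote demos do, and the choices array and the helper findChoices() are made up for illustration; a real version would query a database instead.

```php
<?php
// Returns the choices containing $term (case-insensitive) as a plain list.
function findChoices(string $term, array $choices): array
{
    $matches = array_filter(
        $choices,
        fn (string $choice): bool => $term !== '' && stripos($choice, $term) !== false
    );
    // Reindex so json_encode() emits a JSON array rather than an object.
    return array_values($matches);
}

// Hypothetical hard-coded data standing in for a database lookup.
$choices = ['Choice1', 'Choice2', 'Other'];
$term = $_GET['term'] ?? '';

header('Content-Type: application/json');
echo json_encode(findChoices($term, $choices));
```

Requesting search.php?term=choice would then return the first format shown above, an array of strings.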
