Extract specification list from text (unknown format)

How can I extract a specification list from a product description whose markup format is unknown (sometimes it is an unordered list, sometimes lines separated by br elements, etc.) but which ALWAYS looks the same on the front end?
The visual layout is like:
Some description text; sometimes it is one sentence, sometimes more.
== sometimes there is an empty line here, sometimes not ==
spec item1
spec item2
Is there a way to extract that "by its visual appearance" in PHP?
Example:
<h2> desc <br>
<br>
> <strong> T Shirt</strong><br>
> Breathable mesh fabric<br>
> Reflective detail<br>
> Flat lock seams <br>

You could try and filter your entries. I've managed to get your example into an array. It would then be a case of a little wrangling with the result:
<?php
$html =<<<HTML
<h2> desc </h2>
<br>
&gt; <strong> T Shirt</strong><br>
&gt; Breathable mesh fabric<br>
&gt; Reflective detail<br>
&gt; Flat lock seams <br>
HTML;
$no_html = strip_tags($html);
$no_entities = preg_replace('/&#?[a-z0-9]+;/i', '', $no_html);
$parts = preg_split('/\R/', $no_entities);
$trimmed_parts = array_map('trim', $parts);
var_export($trimmed_parts);
Output:
array (
0 => 'desc',
1 => '',
2 => 'T Shirt',
3 => 'Breathable mesh fabric',
4 => 'Reflective detail',
5 => 'Flat lock seams',
)

This can be done with file_get_contents() and some regex processing. Make sure allow_url_fopen (the URL-aware fopen wrappers) is enabled in php.ini.
Refer:
http://php.net/manual/en/filesystem.configuration.php
Sample code:
<?php
$page = file_get_contents('Provide your url here');
preg_match("/regex pattern here/", $page, $matches);
// display the matches
print_r($matches);
Personal suggestion: using Python would simplify the process. Many packages are already available for this purpose, e.g. bs4.
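The Python route could look roughly like the following minimal sketch. It uses the standard-library html.parser instead of bs4 (a substitution made here to avoid a dependency), and it rests on two assumptions that are not in the question: the set of tags treated as visual line breaks (LINE_BREAK_TAGS) is a guess, and the first visual line is taken to be the description. The names VisualLines, visual_lines and split_description_and_specs are hypothetical.

```python
from html.parser import HTMLParser

# Tags assumed to start a new visual line; this set is an illustrative guess.
LINE_BREAK_TAGS = {"br", "p", "div", "li", "ul", "ol", "h1", "h2", "h3", "tr"}


class VisualLines(HTMLParser):
    """Collects the text of an HTML fragment as the lines a browser would show."""

    def __init__(self):
        super().__init__(convert_charrefs=True)  # turns &gt; into > for us
        self.lines = [""]

    def _maybe_break(self, tag):
        if tag in LINE_BREAK_TAGS:
            self.lines.append("")

    def handle_starttag(self, tag, attrs):
        self._maybe_break(tag)

    def handle_endtag(self, tag):
        self._maybe_break(tag)

    def handle_data(self, data):
        self.lines[-1] += data


def visual_lines(html):
    """Return the non-empty visual lines, whitespace-collapsed, without '>' markers."""
    parser = VisualLines()
    parser.feed(html)
    parser.close()
    cleaned = []
    for raw in parser.lines:
        # Collapse whitespace as HTML rendering does, then drop a leading "> ".
        text = " ".join(raw.split()).lstrip("> ")
        if text:
            cleaned.append(text)
    return cleaned


def split_description_and_specs(html):
    """Assumption: the first visual line is the description, the rest are specs."""
    lines = visual_lines(html)
    if not lines:
        return "", []
    return lines[0], lines[1:]
```

Because the markup is reduced to visual lines first, a br-separated fragment and a ul/li fragment with the same visible text give the same result. A description that spans several visual lines would need a different splitting rule.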

Related

How to parse URL when using the pager plugin with AJAX

I am trying to use the tablesorter pager plugin with AJAX but run into some problems (or limitations) when trying to handle the AJAX request in my PHP backend.
If, e.g., the table is set up with a default sorting of
sortList: [ [0,1], [1,0] ]
I will get a URL like this on my AJAX request:
page=0&size=50&filter=fcol[6]=batteri&sort=col[0]=1&col[1]=0
In my php back end I do a
$cur_sort = $_GET['sort']
and get
col[0]=1
So the last part is missing, I guess because it contains an & character.
How do I get the entire sort string?
That said, how is the string col[0]=1&col[1]=0 best parsed? I need to extract the info that col 0 is to be sorted DESC and col 1 ASC.

You can try this:
parse_str($_SERVER['QUERY_STRING'], $data);
It will parse the query string into an array.
Also, you should use empty [] instead of [1] and [0].
See more here: parse_str()
Example:
$str = "page=0&size=50&filter=fcol[6]=batteri&sort=col[0]=1&col[1]=0";
parse_str($str, $output);
echo $output['page']; // echoes 0
And to answer your question: the result is correct. $_GET['sort'] contains only col[0]=1 because the query string is split on &, as you can see here:
&sort=col[0]=1 & col[1]=0;
A piece of advice: give the parameters their own names instead.
You could use
&sort[]=1&sort[]=0;
UPDATE:
To access the last one, simply do:
$_GET['col'][1];
If you want to access the last number in
$_GET['sort'];
you can do this:
$explode = explode('=', $_GET['sort']);
$end = end($explode);
echo $end; // it will output 1
If you print the parsed query string, you get this:
Array
(
[page] => 0
[size] => 50
[filter] => fcol[6]=batteri
[sort] => col[0]=1
[col] => Array
(
[1] => 0
)
)

I'm not sure how the ajaxUrl option is being used, but the output shared in the question doesn't look right.
I really have no idea how the string in the question is showing this format:
&sort=col[0]=1&col[1]=0 (where did sort= come from?)
&filter=fcol[6]=batteri (where did filter= come from?)
If you look at how you can manipulate the ajaxUrl option, you will see this example:
ajaxUrl: "http://mydatabase.com?page={page}&size={size}&{sortList:col}&{filterList:fcol}"
So say you have the following settings:
page = 2
size = 10
sortList is set to [[0,1],[3,0]] (1st column descending sort, 4th column ascending sort)
filters is set as ['','','fred']
The resulting url passed to the server will look like this:
http://mydatabase.com?page=2&size=10&col[0]=1&col[3]=0&fcol[2]=fred
The col part of the {sortList:col} placeholder sets the sorted column name passed to the URL & the fcol part of {filterList:fcol} placeholder sets the filter for the set column. So those are not fixed names.
If the above method for using the ajaxUrl string doesn't suit your needs, you can leave those settings out of the ajaxUrl and instead use the customAjaxUrl option to modify the URL as desired. Here is a simple example (I know this is not a conventional method):
ajaxUrl: "http://mydatabase.com?page={page}&size={size}",
// modify the url after all processing has been applied
customAjaxUrl: function(table, url) {
var config = table.config,
// convert [[0,1],[3,0]] into "0-1-3-0"
sort = [].concat.apply( [], config.sortList ).join('-'),
// convert [ '', '', 'fred' ] into "--fred"
filter = config.lastSearch.join('-');
// send the server the current page
return url += '&sort=' + sort + '&filter=' + filter
}
With the same settings, the resulting URL will now look like this:
http://mydatabase.com?page=2&size=10&sort=0-1-3-0&filter=--fred

This is my own best solution so far, but it's not really elegant:
if (preg_match_all("/[^f]col\[\d+]=\d+/", $_SERVER['QUERY_STRING'], $matches)) {
foreach($matches[0] AS $sortinfo) {
if (preg_match_all("/\d+/", $sortinfo, $matches)) {
if(count($matches[0]) == 2) {
echo "Col: ".$matches[0][0]."<br/>";
echo "Order: ".$matches[0][1]."<br/>";
}
}
}
}
It gives me the info I need
Col: 0
Order: 1
Col: 1
Order: 0
But it is clumsy. Is there a better way?

xpath is failing me

I have an xml structure :
<Articles>
<Article ID="111">
<author>Peter Paul</author>
<pubDate>01/01/2015</pubDate>
<Translations>
<lang1>English</lang1>
<lang2>French</lang2>
<lang3>Arab</lang3>
<lang3>Chinese</lang3>
</Translations>
</Article>
<Article ID="222">
<author>Monkey Rice</author>
<pubDate>01/01/2016</pubDate>
<Translations>
<lang1>English</lang1>
</Translations>
</Article>
<Article ID="333">
<author>John Silas</author>
<pubDate>01/01/2017</pubDate>
<Translations>
<lang1>English</lang1>
<lang2>French</lang2>
<lang3>Arab</lang3>
<lang3>Chinese</lang3>
</Translations>
</Article>
</Articles>
I created a method AddRecordByInfoMatch() that attempts to
add new nodes to the article with a given ID, as long
as a match exists:
function AddRecordByInfoMatch($ParentID, $Info_1, $Info_2, $Info_3, array $Record){
$xml = new SimpleXMLElement('blabla.xml', 0, true);
$result = $xml->xpath("//*[@ID='$ParentID']"); //get the article ID
if(!empty($result)){
foreach($result[0] as $key => $value){
$noofChild = count($value);
//the three info matches are likely to be within 3 sub-nodes
if($noofChild >= 3){
$query = "//*[node()[contains(text(), \"$Info_1\")] and node()[contains(text(), \"$Info_2\")] and node()[contains(text(), \"$Info_3\")]]";
$data = $xml->xpath($query);
if(!empty($data[0])){
foreach ($Record as $nodname => $val){
$data[0]->addChild($nodname, $val);
}
}
}
}
}
}
With ID = 333 in mind, I test-ran it like this:
AddRecordByInfoMatch(333, "English", "French", "Chinese", array(
"syntax" => "irrelevant",
"adjectives" => "None",
"verbs" => 2,
"prepositions" => 5
));
Unfortunately, when I displayed the output, the new record had been added to the Article
with ID 111, giving:
...
<Article ID="111">
<author>Peter Paul</author>
<pubDate>01/01/2015</pubDate>
<Translations>
<lang1>English</lang1>
<lang2>French</lang2>
<lang3>Arab</lang3>
<lang3>Chinese</lang3>
<syntax>irrelevant</syntax>
<adjectives>None</adjectives>
<verbs>2</verbs>
<prepositions>5</prepositions>
</Translations>
</Article>
...
I expected this to be within the Article node with ID 333, which
I specified in the function call. What am I doing wrong in my XPath expression? Or how
can I achieve this? Any help will be highly regarded. Happy new year, all.

What am I doing wrong in my XPath expression?
One mistake I could spot (a common one when users ask about XPath here on Stack Overflow under the PHP tag) is that you're not aware of the XPath injection that is possible in your script.
Therefore the PHP example I'll give is made safe against that; the function used is taken from Mitigating XPath Injection Attacks in PHP, which also has more information on the topic.
Next to that (common) mistake, what springs into view directly is that you do many things here, while you could just express it all with a single XPath expression.
You want the first element which has an ID attribute with a certain value and which contains a child element that has at least three child elements, among which each of the three texts must be found.
For an exemplary ID of 333 and the three exemplary texts "English", "French" and "Chinese", the XPath query looks like:
(
//*[@ID=333]
/*[ count(*) > 2
and (
*[contains(., 'English')]
and *[contains(., 'French')]
and *[contains(., 'Chinese')]
)
]
/..
)[1]
As you can see, there is not much point in wrapping more PHP code around it.
Next to these most visible points, it should be noted that the infos are better passed as one array with three values than as three numbered variables ($infos = ["English", "French", "Chinese"];).
Example:
$expr = sprintf("
(
//*[@ID=%d]
/*[ count(*) > 2
and (
*[contains(., %s)]
and *[contains(., %s)]
and *[contains(., %s)]
)
]
/..
)[1]",
$parentId, xpath_string($infos[0]), xpath_string($infos[1]), xpath_string($infos[2])
);
list($element) = $xml->xpath($expr) + [NULL];
if (empty($element)) {
// element not found
return;
}
// extend element
foreach ($record as $nodname => $val) {
$element->addChild($nodname, $val);
}
This gives the expected result:
<Article ID="333">
<author>John Silas</author>
<pubDate>01/01/2017</pubDate>
<Translations>
<lang1>English</lang1>
<lang2>French</lang2>
<lang3>Arab</lang3>
<lang3>Chinese</lang3>
</Translations>
<syntax>irrelevant</syntax><adjectives>None</adjectives><verbs>2</verbs><prepositions>5</prepositions></Article>

XML parsing with XPath and PHP handle empty values

while parsing through a XML tree like this:
<vco:ItemDetail>
<cac:Item>
<cbc:Description>ROLLENKERNSATZ 20X12 MM 10,5 GR.</cbc:Description>
<cac:SellersItemIdentification>
<cac:ID>78392636</cac:ID>
</cac:SellersItemIdentification>
<cac:ManufacturersItemIdentification>
<cac:ID>RMS100400370</cac:ID>
<cac:IssuerParty>
<cac:PartyName>
<cbc:Name></cbc:Name>
</cac:PartyName>
</cac:IssuerParty>
</cac:ManufacturersItemIdentification>
</cac:Item>
<vco:Availability>
<vco:Code>available</vco:Code>
</vco:Availability>
</vco:ItemDetail>
I always get a blank space which breaks my CSV structure if cbc:Name is empty, which looks like this:
"ROLLENKERNSATZ 20X12 MM 10,5 GR.";78392636;;RMS100400370;"";available
The "available" string ends up on a new line, so my CSV is no longer structured.
My XPath array looks like this:
$columns = array('Description' => 'string(cac:Item/cbc:Description)',
'SellersItemIdentification' => 'string(cac:Item/cac:SellersItemIdentification/cac:ID)',
'StandardItemIdentification' => 'string(cac:Item/cac:StandardItemIdentification/cac:ID)',
'Availability' => 'string(vco:Availability/vco:Code)',
'Producer' => 'string(cac:Item/cac:ManufacturersItemIdentification/cac:IssuerParty/cac:PartyName/cbc:Name)');
Is there any exception handling or replacement I can use, like replacing the empty node value with "no producer" or something similar?
Thank you

If the value to use as the 'default' can somehow be made to exist somewhere in your input document, the problem can be solved with a quasi-coalesce like this:
e.g.
<vco:ItemDetail xmlns:vco="x1" xmlns:cac="x2" xmlns:cbc="x3">
<ValueIfNoProducer>No Producer</ValueIfNoProducer>
<cac:Item>
...
Then this Xpath 1.0 will apply a default if the Name element is empty or whitespace:
(cac:Item/cac:ManufacturersItemIdentification/cac:IssuerParty
/cac:PartyName/cbc:Name[normalize-space(.)] | ValueIfNoProducer)[1]
In XPath 2.0 the default can be written directly as a string literal, but not with the union operator |, which accepts only nodes; the sequence (comma) operator does the job:
(cac:Item/cac:ManufacturersItemIdentification/cac:IssuerParty
/cac:PartyName/cbc:Name[normalize-space(.)], 'No Producer')[1]

PHP Web scraping of Javascript generated contents [duplicate]

I am stuck with a scraping task in my project.
I want to grab the data from the page loaded into $html: all the table content in the tr and td elements. Here I am trying to grab the links, but it only shows javascript: self.close()
<?php
include("simple_html_dom.php");
$html = file_get_html('http://www.areacodelocations.info/allcities.php?ac=201');
foreach($html->find('a') as $element)
echo $element->href . '<br>';
?>

Usually, pages of this kind load a bunch of JavaScript (jQuery, etc.), which then builds the interface and retrieves the data to be displayed from a data source.
So what you need to do is open that page in Firefox or similar, with a tool such as Firebug in order to see what requests are actually being done. If you're lucky, you will find it directly in the list of XHR requests. As in this case:
http://www.govliquidation.com/json/buyer_ux/salescalendar.js
Notice that this course of action may infringe on some license or terms of use. Clear this with the webmaster/data source/copyright owner before proceeding: detecting and forbidding this kind of scraping is very easy, and identifying you is probably only slightly less so.
Anyway, if you issue the same call in PHP, you can directly scrape the data (provided there is no session/authentication issue, as seems the case here) with very simple code:
<?php
$url = "http://www.govliquidation.com/json/buyer_ux/salescalendar.js";
$json = file_get_contents($url);
$data = json_decode($json);
?>
This yields a data object that you can inspect and convert to CSV by simple looping.
stdClass Object
(
[result] => stdClass Object
(
[events] => Array
(
[0] => stdClass Object
(
[yahoo_dur] => 11300
[closing_today] => 0
[language_code] => en
[mixed_id] => 9297
[event_id] => 9297
[close_meridian] => PM
[commercial_sale_flag] => 0
[close_time] => 01/06/2014
[award_time_unixtime] => 1389070800
[category] => Tires, Parts & Components
[open_time_unixtime] => 1388638800
[yahoo_date] => 20140102T000000Z
[open_time] => 01/02/2014
[event_close_time] => 2014-01-06 17:00:00
[display_event_id] => 9297
[type_code] => X3
[title] => Truck Drive Axles @ Killeen, TX
[special_flag] => 1
[demil_flag] => 0
[google_close] => 20140106
[event_open_time] => 2014-01-02 00:00:00
[google_open] => 20140102
[third_party_url] =>
[bid_package_flag] => 0
[is_open] => 1
[fda_count] => 0
[close_time_unixtime] => 1389045600
You retrieve $data->result->events, use fputcsv() on its items converted to array form, and Bob's your uncle.

In the case of the second site, you have a table with several TR elements, and you want to catch the first two TD children of each TR.
By inspecting the source code you see something like this:
<tr>
<td> Allendale</td>
<td> Eastern Time
</td>
</tr>
<tr>
<td> Alpine</td>
<td> Eastern Time
</td>
So you just grab all the TR's
<?php
include("simple_html_dom.php");
$html = file_get_html('http://www.areacodelocations.info/allcities.php?ac=201');
$fp = fopen('output.csv', 'w');
if (!$fp) die("Cannot open output CSV - permission problems maybe?");
foreach($html->find('tr') as $tr) {
$csv = array(); // Start empty. A new CSV row for each TR.
// Now find the TD children of $tr. They will make up a row.
foreach($tr->find('td') as $td) {
// Get TD's innertext, but
$csv[] = $td->innertext;
}
fputcsv($fp, $csv);
}
fclose($fp);
?>
You will notice that the CSV text is "dirty". That is because the actual text is:
<td> Alpine</td>
<td> Eastern Time[CARRIAGE RETURN HERE]
</td>
So to have "Alpine" and "Eastern Time", you have to replace
$csv[] = $td->innertext;
with something like
$csv[] = trim(
html_entity_decode (
$td->innertext,
ENT_COMPAT | ENT_HTML401,
'UTF-8'
)
);
Check out the PHP man page for html_entity_decode() about character set encoding and entity handling. The above ought to work -- and an ought and fifty cents will get you a cup of coffee :-)

Synonym finder algorithm

I think an example will be much better than a loooong description :)
Let's assume we have an array of arrays:
("Server1", "Server_1", "Main Server", "192.168.0.3")
("Server_1", "VIP Server", "Main Server")
("Server_2", "192.168.0.4")
("192.168.0.3", "192.168.0.5")
("Server_2", "Backup")
Each line contains strings which are synonyms. As a result of processing this array I want to get this:
("Server1", "Server_1", "Main Server", "192.168.0.3", "VIP Server", "192.168.0.5")
("Server_2", "192.168.0.4", "Backup")
So I think I need some kind of recursive algorithm. The programming language doesn't actually matter; I only need a little help with the general idea. I'm going to use PHP or Python.
Thank you!

This problem can be reduced to a problem in graph theory where you find all groups of connected nodes in a graph.
An efficient way to solve this problem is a "flood fill" algorithm, which is essentially a breadth-first search started from each not-yet-visited node. The Wikipedia entry on flood fill describes the algorithm and how it applies to finding the connected regions of a graph.
To see how the original question can be made into a question on graphs: make each entry (e.g. "Server1", "Server_1", etc.) a node on a graph. Connect nodes with edges if and only if they are synonyms. A matrix data structure is particularly appropriate for keeping track of the edges, provided you have enough memory. Otherwise a sparse data structure like a map will work, especially since the number of synonyms will likely be limited.
Server1 is Node #0
Server_1 is Node #1
Server_2 is Node #2
Then edge[0][1] = edge[1][0] = 1, indicating that there is an edge between nodes #0 and #1 (which means that they are synonyms), while edge[0][2] = edge[2][0] = 0, indicating that Server1 and Server_2 are not synonyms.
Complexity Analysis
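Creating this data structure is pretty efficient because a single linear pass with a lookup of the mapping of strings to node numbers is enough to create it. If you store the mapping of strings to node numbers in a tree-based dictionary, this is an O(n log n) step; a hash map brings it down to O(n) on average.
The flood fill itself visits each node only once, so with adjacency lists it is linear in the number of nodes and edges (scanning a full edge matrix costs O(n^2) instead). With a tree-based dictionary and a sparse edge structure, the algorithm as a whole is therefore O(n log n).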
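The graph approach can be sketched in Python as follows. This is only an illustration under a few assumptions: it uses adjacency sets (the sparse alternative mentioned above) rather than an edge matrix, it links every entry of a row to that row's first entry instead of creating all pairwise edges (which still connects the whole row), and the function name synonym_groups is hypothetical.

```python
from collections import defaultdict, deque


def synonym_groups(rows):
    """Group strings connected through shared rows, using a BFS flood fill."""
    # Sparse adjacency sets: each entry of a row is linked to the row's first
    # entry, which is enough to put the whole row in one connected component.
    neighbours = defaultdict(set)
    for row in rows:
        if not row:
            continue
        for name in row:
            neighbours[name].add(row[0])
            neighbours[row[0]].add(name)

    seen = set()
    groups = []
    for start in neighbours:  # dict order = first appearance in the input
        if start in seen:
            continue
        # Flood fill from this unvisited node to collect its whole component.
        seen.add(start)
        queue = deque([start])
        group = []
        while queue:
            node = queue.popleft()
            group.append(node)
            for nxt in neighbours[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        groups.append(group)
    return groups
```

Each node is enqueued at most once and each edge is examined twice, so the traversal is linear in nodes plus edges. The order of names inside a group depends on set iteration order, so compare groups as sets if the order matters.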

Introduce an integer marking that identifies the synonym group. At the start, mark all words with different marks from 1 to N.
Then search through your collection, and whenever you find that two words with marks i and j are synonyms, re-mark all words carrying mark i or j with the lesser of the two numbers. After N iterations you get all groups of synonyms.
It is a somewhat dirty and not thoroughly efficient solution; I believe one can get more performance with union-find structures.
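For reference, the union-find idea could be sketched like this in Python. It is a minimal illustration, not a tuned implementation: it uses path compression only (no union by rank), and the names merge_synonyms, find and union are hypothetical.

```python
def merge_synonyms(rows):
    """Merge rows of synonyms into groups with a union-find (disjoint-set) structure."""
    parent = {}  # maps each string to its parent; a root is its own parent

    def find(x):
        parent.setdefault(x, x)
        # Walk up to the root of x's group.
        root = x
        while parent[root] != root:
            root = parent[root]
        # Path compression: point every node on the path straight at the root.
        while parent[x] != root:
            parent[x], x = root, parent[x]
        return root

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # All entries of a row are synonyms, so join each one with the row's first entry.
    for row in rows:
        for name in row:
            union(row[0], name)

    # Collect the members of each group under their root.
    groups = {}
    for name in parent:
        groups.setdefault(find(name), []).append(name)
    return list(groups.values())
```

With the data from the question, merge_synonyms returns the two expected groups. A row that bridges two groups formed earlier (for example ("b", "c") arriving after ("a", "b") and ("c", "d")) merges them into one, because union joins whole groups rather than single words.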

Edit: This is probably NOT the most efficient way of solving your problem. If you are interested in maximum performance (e.g., if you have millions of values), you might want to write a more complex algorithm.
PHP, seems to be working (at least with data from given example):
$data = array(
array("Server1", "Server_1", "Main Server", "192.168.0.3"),
array("Server_1", "VIP Server", "Main Server"),
array("Server_2", "192.168.0.4"),
array("192.168.0.3", "192.168.0.5"),
array("Server_2", "Backup"),
);
do {
$foundSynonyms = false;
foreach ( $data as $firstKey => $firstValue ) {
foreach ( $data as $secondKey => $secondValue ) {
if ( $firstKey === $secondKey ) {
continue;
}
if ( array_intersect($firstValue, $secondValue) ) {
$data[$firstKey] = array_unique(array_merge($firstValue, $secondValue));
unset($data[$secondKey]);
$foundSynonyms = true;
break 2; // outer foreach
}
}
}
} while ( $foundSynonyms );
print_r($data);
Output:
Array
(
[0] => Array
(
[0] => Server1
[1] => Server_1
[2] => Main Server
[3] => 192.168.0.3
[4] => VIP Server
[6] => 192.168.0.5
)
[2] => Array
(
[0] => Server_2
[1] => 192.168.0.4
[3] => Backup
)
)

This would yield lower complexity than the PHP example (Python 3):
a = [set(("Server1", "Server_1", "Main Server", "192.168.0.3")),
     set(("Server_1", "VIP Server", "Main Server")),
     set(("Server_2", "192.168.0.4")),
     set(("192.168.0.3", "192.168.0.5")),
     set(("Server_2", "Backup"))]
b = {}
c = set()
for s in a:
    full_s = s.copy()
    for d in s:
        if b.get(d):
            full_s.update(b[d])
    for d in full_s:
        b[d] = full_s
    c.add(frozenset(full_s))
for k, v in b.items():
    fsv = frozenset(v)
    if fsv in c:
        print(list(fsv))
        c.remove(fsv)

I was looking for a solution in Python, so I came up with this one. If you are willing to use Python data structures like sets,
you can use this solution too. "It's so simple a cave man can use it."
Simply put, this is the logic behind it:
foreach set_of_values in value_collection:
    alreadyInSynonymSet = false
    foreach synonym_set in synonym_collection:
        if set_of_values in synonym_set:
            alreadyInSynonymSet = true
            synonym_set = synonym_set.union(set_of_values)
    if not alreadyInSynonymSet:
        synonym_collection.append(set(set_of_values))
vals = (
    ("Server1", "Server_1", "Main Server", "192.168.0.3"),
    ("Server_1", "VIP Server", "Main Server"),
    ("Server_2", "192.168.0.4"),
    ("192.168.0.3", "192.168.0.5"),
    ("Server_2", "Backup"),
)
value_sets = (set(value_tup) for value_tup in vals)
synonym_collection = []
for value_set in value_sets:
    isConnected = False  # If connected to a term in the graph
    print(f'\nCurrent Value Set: {value_set}')
    for synonyms in synonym_collection:
        # If two sets are disjoint, they don't have common elements
        if not set(synonyms).isdisjoint(value_set):
            isConnected = True
            synonyms |= value_set  # Add elements of the new value_set to the synonym set
            break
    # If it's not related to any other term, create a new set
    if not isConnected:
        print('Value set not in graph, adding to graph...')
        synonym_collection.append(value_set)
print('\nDone, Completed Graphing Synonyms')
print(synonym_collection)
This will have a result of
Current Value Set: {'Server1', 'Main Server', '192.168.0.3', 'Server_1'}
Value set not in graph, adding to graph...
Current Value Set: {'VIP Server', 'Main Server', 'Server_1'}
Current Value Set: {'192.168.0.4', 'Server_2'}
Value set not in graph, adding to graph...
Current Value Set: {'192.168.0.3', '192.168.0.5'}
Current Value Set: {'Server_2', 'Backup'}
Done, Completed Graphing Synonyms
[{'VIP Server', 'Main Server', '192.168.0.3', '192.168.0.5', 'Server1', 'Server_1'}, {'192.168.0.4', 'Server_2', 'Backup'}]
