hidden space in the middle of scraped text - php

I'm doing some data scraping: I fetch a web page using curl, extract the data, and check whether it already exists in my database.
I was looking for Beijing Guoan (Chn) in the page's source code and couldn't find it, even though it was there and I could see it in the browser.
$result = phpQuery::newDocument( file_get_contents('www.site.com/page'), 'text/html');
foreach($result->find('td.table-participant-teams') as $t )
{
list( $host , $guest ) = explode( ' - ' , pq($t)->text());
echo $host.' == Beijing Guoan (Chn) ==> ';
echo $host == 'Beijing Guoan (Chn)' ? ' found it ' : ' false ';
}
result :
Beijing Guoan (Chn) == Beijing Guoan (Chn) ==> false
I did a strlen($host) and found that $host is 20 characters long while Beijing Guoan (Chn) has 19, so there is a hidden character in $host.
So I added
for($i = 0 ; $i < strlen($host) ; $i++)
{
echo $i.' - '.$host[$i];
echo '<br />';
}
and I got
0 - B
1 - e
2 - i
3 - j
4 - i
5 - n
6 - g
7 -
8 - G
9 - u
10 - o
11 - a
12 - n
13 -
14 -
15 - (
16 - C
17 - h
18 - n
19 - )
As you can see, at positions 13 and 14 I get 2 spaces, but when I print out $host I only see 1! That's what is causing all the trouble.
So why is there an extra space in my $host that doesn't show when I print it on the screen, and how can I get rid of it?
Please note that I don't want to just remove that extra space from this specific string; there might be other cases with different lengths. I want a solution that works for all of them.

HTML renders multiple consecutive spaces as one. If you view the page source you will see the actual data.
To collapse multiple consecutive spaces you can use the following:
echo preg_replace('/ +/', ' ', 'he   llo    test');
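If the extra byte turns out not to be an ordinary space, a non-breaking space (&nbsp;, U+00A0) is a common culprit in scraped HTML: in UTF-8 it takes two bytes, which would also explain a strlen() of 20 for a 19-character name. The following is only a minimal sketch, assuming the page is UTF-8; normalize_spaces() is a hypothetical helper name, not part of phpQuery.

```php
<?php
// Hypothetical helper: collapse every run of whitespace, including
// non-breaking spaces (U+00A0), into one ordinary space and trim the ends.
// Assumes $text is valid UTF-8 (the /u modifier requires it).
function normalize_spaces(string $text): string
{
    $collapsed = preg_replace('/[\s\x{00A0}]+/u', ' ', $text);
    // preg_replace() returns null on failure (e.g. invalid UTF-8);
    // in that case the input is returned unchanged.
    return $collapsed === null ? $text : trim($collapsed);
}

$host = "Beijing Guoan\u{00A0}(Chn)";   // looks like one space, but is 2 bytes
var_dump(strlen($host));                                      // int(20)
var_dump(normalize_spaces($host) === 'Beijing Guoan (Chn)');  // bool(true)
```

Running both the scraped value and the database value through the same helper before comparing keeps the check independent of how many spaces, or which kind, the page happens to use.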

Related

extract the last 8 letters of string with substr

I'm trying to get the last 9 characters of $span.
$span = "";
foreach($html->find('span') as $element1){
if (strpos($element1->outertext, 'kcal') !== false){
$span .= $element1->outertext.'<br>';
}
}
echo substr($span,-9);
It just shows me a white page. Any suggestions?
Edit:
When I debug with var_dump($span) it shows exactly the following:
string(761) " 1 Porsiyon (Orta) AnçuezSardalya Salatası 319 kcal 1 Su Bardağı Ayran (Yağsız) 41 kcal 1 Su Bardağı Anne Sütü 138 kcal 1 Porsiyon (Orta) Amasya Yöresine Özgü Keşkek 728 kcal 1 Porsiyon (Orta) Anne Kurabiyesi 504 kcal "
Use trim() to remove the surrounding whitespace,
so you can write:
echo substr(trim($span),-9);
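Note that $span is built from outertext, so it presumably still contains the span markup and the appended <br> (the var_dump() length of 761 is far longer than the visible text). If so, the last nine bytes are mostly tag characters, which a browser renders as nothing. A hedged sketch that strips the markup first; last_chars() is a hypothetical helper, and mb_substr() assumes the mbstring extension and UTF-8 input.

```php
<?php
// Hypothetical helper: return the last $n visible characters of an HTML fragment.
function last_chars(string $html, int $n): string
{
    $text = trim(strip_tags($html));              // drop tags, then outer whitespace
    return mb_substr($text, -$n, null, 'UTF-8');  // counts characters, not bytes
}

$span = '<span> 1 Su Bardağı Ayran (Yağsız) 41 kcal </span><br>';
echo last_chars($span, 7), "\n";   // 41 kcal
```

mb_substr() is used instead of substr() because the Turkish letters in the data are multi-byte in UTF-8, and a byte-based cut could split one of them.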

Embed URL to coordinates, kml or geojson

I've got a complex directions URL and embed URL that I would like to get polylines for. Once I can get them into polylines or something similar I can convert to final format: GeoJSON.
Direction Link
-or-
Embed Link
I have looked at the APIs and I can't find anything that accepts or would decode the PB (what is this? it's not a protocol buffer). So far this is as far as I've got:
//php
$pb_array = explode('!', $pb);
foreach($pb_array as $key => $value){
echo "$key - $value<br/>";
}
===
1 - 1m73
2 - 1m12
3 - 1m3
4 - 1d1472548.9575794793
5 - 2d-72.8191002664707
6 - 3d43.87505426780168
7 - 2m3
8 - 1f0
9 - 2f0
10 - 3f0
11 - 3m2
12 - 1i1024
13 - 2i768
14 - 4f13.1
15 - 4m58
16 - 3e0
17 - 4m5
18 - 1s0x0%3A0xa58b3d6041ba69f8
19 - 2sGuilford+Welcome+Center
20 - 3m2
21 - 1d42.8120069
22 - 2d-72.56614689999999
23 - 4m3
24 - 3m2
25 - 1d43.3893165
26 - 2d-72.40772249999999
27 - 4m5
28 - 1s0x4cb52e78df455c83%3A0xb6946ec850907db8
29 - 2s130+Lower+Michigan+Road%2C+Pittsfield%2C+VT+05762
30 - 3m2
31 - 1d43.76898
32 - 2d-72.815214
33 - 4m4
34 - 1s0x0%3A0xea2de48bba82cc86
35 - 3m2
36 - 1d44.042544
37 - 2d-72.6046997
38 - 4m5
39 - 1s0x0%3A0x6bb602ed58bf4413
40 - 2sJay+Peak+Resort
41 - 3m2
42 - 1d44.9379515
43 - 2d-72.5045433
44 - 4m5
45 - 1s0x4cb392aaa4333a07%3A0x160aef1559868340
46 - 2sDolly+Copp+Campground+Rd%2C+Gorham%2C+NH+03581
47 - 3m2
48 - 1d44.335842199999995
49 - 2d-71.21837339999999
50 - 4m5
51 - 1s0x4cb392684201a94d%3A0xfa4a6f490a05429d
52 - 2sMt+Washington+Auto+Road%2C+1+Mount+Washington+Auto+Road%2C+Gorham%2C+NH+03581
53 - 3m2
54 - 1d44.288384099999995
55 - 2d-71.22459599999999
56 - 4m5
57 - 1s0x4cb38e798f42c3d9%3A0xc3b88e4dac01db12
58 - 2sMt+Washington
59 - 3m2
60 - 1d44.270585399999995
61 - 2d-71.3032723
62 - 4m5
63 - 1s0x89e2a7fa444124d5%3A0xe3ed24b6f864eba0
64 - 2sWells%2C+ME
65 - 3m2
66 - 1d43.322232899999996
67 - 2d-70.5805209
68 - 4m5
69 - 1s0x89e2ba813e828c71%3A0x8cdf74380f6a933d
70 - 2sLibby's+Oceanside+Camp%2C+York+Street%2C+York%2C+ME
71 - 3m2
72 - 1d43.147162
73 - 2d-70.626173
74 - 5e1
75 - 3m2
76 - 1sen
77 - 2sus
78 - 4v1472497940601
The closest hints I could find are from this thread. I will keep looking but I'm stuck.
I'm trying to create an API-based solution that takes one of these URLs as input and returns GeoJSON.
"would decode the PB (what is this? it's not a protocol buffer)"
For the record, because this Stack Overflow question keeps popping up in Google results: it is a protocol buffer. PB literally stands for protocol buffer.
It's just a different ASCII encoding: a compact URL encoding reminiscent of the binary encoding, not the usual JSON-like text encoding (when you squint at it, it's not that different from the structure encoding used in torrent files). Google doesn't provide us the .proto file.
For each field:
the leading number is the id (identifies the field according to the corresponding .proto file); it is a single digit in every field here
the next character is the type of the field:
  m is for message
  s is for string
  i, j, u, v are for various types of ints
  f, d are for floating points
  e is for enum
the rest is the payload
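The field layout just described can be sketched in a few lines of PHP. This is only an illustration of that id/type/payload reading, not an official decoder; parse_pb_field() is a hypothetical name, and treating the id as one or more digits is an assumption.

```php
<?php
// Hypothetical helper: split one "!"-separated pb token into id, type, payload.
// Assumes the layout described above: numeric id, one type letter, then payload.
function parse_pb_field(string $token): ?array
{
    if (!preg_match('/^(\d+)([a-z])(.*)$/s', $token, $m)) {
        return null;                       // not a recognisable field
    }
    $payload = $m[3];
    if ($m[2] === 's') {
        $payload = urldecode($payload);    // strings are URL-encoded ("+" = space)
    }
    return ['id' => (int) $m[1], 'type' => $m[2], 'payload' => $payload];
}

print_r(parse_pb_field('2sGuilford+Welcome+Center'));
// id 2, type "s", payload "Guilford Welcome Center"
print_r(parse_pb_field('1d42.8120069'));
// id 1, type "d", payload "42.8120069"
```

Payloads are kept as strings on purpose, so that long doubles such as 2d-72.56614689999999 are not rounded before they are needed.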
So to unpack the fields you're seeing (even if we don't have the .proto file):
1m73 message of type 1, containing 73 elements (the whole message set)
1m12 submessage of type 1, contains 12 elements (probably information about the view box in the map box)
1m3 sub-sub-message type 1, contains 3 elements (probably map coordinates)
1d1472548.9575794793 first double field (probably zoom level)
2d-72.8191002664707 second double field (probably longitude)
3d43.87505426780168 third double field (probably latitude)
2m3 second sub-sub-message (no idea, given that it's not filled in. Maybe a starting point if you code a route instead of a single point?)
1f0, 2f0, 3f0 the three members, currently just zero
3m2 third block (looks like a screen resolution)
1i1024, 2i768: 1024x768? (the omitted field 3 would probably have been the color depth, if present?)
4f13.1 no idea, but it's a float
4m58 next message with 58 elements (to me it looks like a bunch of POIs that we need to display in the box)
3e0 an enum, set to zero (this one would be completely impossible to interpret without a proto or without experimenting, as you need the list of enums)
4m5 five more elements, probably a map POI
- 1s0x0%3A0xa58b3d6041ba69f8 string '0x0:0xa58b3d6041ba69f8'; note the use of URL-encoded characters. It looks like a pair of hex numbers, maybe a GUID?
- 2sGuilford+Welcome+Center string, the name with plus instead of blank (like most URLs)
- 3m2 two elements to come
- 1d42.8120069 and 2d-72.56614689999999 doubles probably map coordinates
4m3 again a message of type 4 in this level, so probably another poi
- 3m2, 1d43.3893165, 2d-72.40772249999999 but this one only specifies coordinates, and nothing else
4m5 another poi
- 1s0x4cb52e78df455c83%3A0xb6946ec850907db8 different pair of hex, GUID
...you get the idea...
5e1 an enum set to 1, probably a general setting
3m2 a message with two elements (looks like a locale)
1sen, 2sus locale is en_US
4v1472497940601 some other large number...
Note: the original proto, which Google doesn't show us, is probably a single multi-level structure. Thus the sub-sub-message IDs don't always have the same meaning: they aren't global IDs, but IDs within the parent message.
inside sub-message ID 1 (view box?), sub-sub-message 3 seems to be the resolution.
inside sub-message ID 4 (POIs?), sub-ID 3 isn't even a message but an enum
after field ID 5 (parameters?), message 3 is a locale
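To make the nesting explicit, the flat token list can be folded into a tree. The sketch below rests on one assumption: the number after m counts all tokens that follow inside that message, nested ones included (this agrees with the dump, where 1m12 is followed by exactly twelve view-box tokens). build_pb_tree() is a hypothetical helper, not a Google API.

```php
<?php
// Hypothetical helper: turn a flat token list into a nested tree.
// Assumption: the count after "m" covers every token inside the message,
// including the tokens of nested messages.
function build_pb_tree(array $tokens): array
{
    $nodes = [];
    $i = 0;
    $n = count($tokens);
    while ($i < $n) {
        if (!preg_match('/^(\d+)([a-z])(.*)$/s', $tokens[$i], $m)) {
            $i++;                              // skip anything unrecognised
            continue;
        }
        $i++;
        if ($m[2] === 'm') {
            $size = (int) $m[3];
            $children = array_slice($tokens, $i, $size);
            $nodes[] = ['id' => (int) $m[1], 'type' => 'm',
                        'children' => build_pb_tree($children)];
            $i += $size;                       // jump past the whole sub-message
        } else {
            $value = $m[2] === 's' ? urldecode($m[3]) : $m[3];
            $nodes[] = ['id' => (int) $m[1], 'type' => $m[2], 'value' => $value];
        }
    }
    return $nodes;
}

// The view-box part of the pb string from the question.
$pb = '!1m12!1m3!1d1472548.9575794793!2d-72.8191002664707!3d43.87505426780168'
    . '!2m3!1f0!2f0!3f0!3m2!1i1024!2i768!4f13.1';
$tree = build_pb_tree(array_values(array_filter(explode('!', $pb), 'strlen')));
echo count($tree), "\n";                 // 1 (the single 1m12 message)
echo count($tree[0]['children']), "\n";  // 4 (1m3, 2m3, 3m2, 4f13.1)
```

Under that counting assumption, the 73 elements of 1m73 in the full dump end at 5e1, so the trailing 3m2 (locale) and 4v tokens would come out as top-level siblings of 1m73 rather than as its children.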
Well, here's my sloppy but working solution. My needs required GeoJSON, but others can use the google maps service to get the desired output once you have the lat/lng array.
$embed = '<iframe src="https://www.google.com/maps/embed?pb=!1m73!1m12!1m3!1d1472548.9575794793!2d-72.8191002664707!3d43.87505426780168!2m3!1f0!2f0!3f0!3m2!1i1024!2i768!4f13.1!4m58!3e0!4m5!1s0x0%3A0xa58b3d6041ba69f8!2sGuilford+Welcome+Center!3m2!1d42.8120069!2d-72.56614689999999!4m3!3m2!1d43.3893165!2d-72.40772249999999!4m5!1s0x4cb52e78df455c83%3A0xb6946ec850907db8!2s130+Lower+Michigan+Road%2C+Pittsfield%2C+VT+05762!3m2!1d43.76898!2d-72.815214!4m4!1s0x0%3A0xea2de48bba82cc86!3m2!1d44.042544!2d-72.6046997!4m5!1s0x0%3A0x6bb602ed58bf4413!2sJay+Peak+Resort!3m2!1d44.9379515!2d-72.5045433!4m5!1s0x4cb392aaa4333a07%3A0x160aef1559868340!2sDolly+Copp+Campground+Rd%2C+Gorham%2C+NH+03581!3m2!1d44.335842199999995!2d-71.21837339999999!4m5!1s0x4cb392684201a94d%3A0xfa4a6f490a05429d!2sMt+Washington+Auto+Road%2C+1+Mount+Washington+Auto+Road%2C+Gorham%2C+NH+03581!3m2!1d44.288384099999995!2d-71.22459599999999!4m5!1s0x4cb38e798f42c3d9%3A0xc3b88e4dac01db12!2sMt+Washington!3m2!1d44.270585399999995!2d-71.3032723!4m5!1s0x89e2a7fa444124d5%3A0xe3ed24b6f864eba0!2sWells%2C+ME!3m2!1d43.322232899999996!2d-70.5805209!4m5!1s0x89e2ba813e828c71%3A0x8cdf74380f6a933d!2sLibby's+Oceanside+Camp%2C+York+Street%2C+York%2C+ME!3m2!1d43.147162!2d-70.626173!5e1!3m2!1sen!2sus!4v1472497940601" width="600" height="450" frameborder="0" style="border:0" allowfullscreen></iframe>';
$array = array();
preg_match( '/src="([^"]*)"/i', $embed, $array ) ;
list($pre, $pb) = explode("pb=", $array[1], 2); //split() was removed in PHP 7; explode() does the same job here
if($pb == "" || strpos($pb, "!") === false)
die(json_encode(array("success"=>false)));
//echo "PB Extracted:<br>";
//echo $pb;
///echo "<br><br>Decode:<br/>";
$pb_array = explode('!', $pb);
$coords = array();
$address = null; $addressHex = null; //initialise so the first push uses defined values
$results = array();
foreach($pb_array as $key => $value){
//uncomment to debug output
//echo "$key - $value<br/>";
if($value == "3m2" || $value == "2m2"){
//3m2 seems to be the divider of these 'places'
if(count($coords) != 3) //don't add the center of map data (3 coordinates [height, lng, lat])
array_push($results, array("coords"=>$coords,"address"=>$address,"addressHex"=>$addressHex));
$coords = array(); //reset array
}else{
$type = substr($value, 1, 1);
$stype = substr($value, 0, 2);
$value = substr($value, 2);
//echo "$type - $value<br/>";
if($type == "d"){
//Found Lat,Lng
array_push($coords, $value);
}else if($stype == "2s"){
//Address
$address = $value;
}else if($stype == "1s"){
//Address Encoded in some way
$addressHex = $value;
}
}
}
//echo "<br><br>Google Result<br/>";
//echo json_encode($results);
//echo "<br><br>Mapbox API:<br/>";
$waypoints = array();
for($i=0;$i<count($results);$i++){
if(count($results[$i]["coords"])){
$lat = $results[$i]["coords"][0];
$lng = $results[$i]["coords"][1];
array_push($waypoints, "$lng%2C$lat");
}
}
$waypoints = implode("%3B", $waypoints); //convert to string
$mapbox_api_key = "pk.eyJ1I.....";
$url = "https://api.mapbox.com/directions/v5/mapbox/driving/$waypoints.json?steps=false&alternatives=false&overview=full&geometries=geojson&access_token=$mapbox_api_key";
//echo "<br><br>Mapbox Response:<br/>";
$response = file_get_contents($url);
$json = json_decode($response,true);
//echo "<br><br>Mapbox Geometry:<br/>";
$coordinates = $json["routes"][0]["geometry"]["coordinates"];
$geojson = (array("type"=>"FeatureCollection","features"=>array(array("type"=>"Feature","geometry"=>array("type"=>"LineString","coordinates"=>$coordinates),"properties"=>array()))));
echo json_encode(array("success"=>true, "geojson"=>$geojson));

An odd assignment about adding dashes to strings. PHP

I have a need for a function that will do the following thing:
If I have a string like this "2 1 3 6 5 4 8 7" I have to insert dashes between pairs of numbers following some rules.
The rules are simple.
Put a dash between two numbers if the first one of the pair is smaller than the one that follows it. Do all possible combinations of this, and if a pair already has a dash then the space next to it can't have a dash.
Basically my results for above string would be
2 1-3 6 5 4 8 7
2 1-3 6 5 4-8 7
2 1 3-6 5 4 8 7
2 1 3-6 5 4-8 7
2 1 3 6 5 4-8 7
I did create a function that does this but I am thinking it is pretty sluggish and I don't want to taint your ideas with it. If possible I would like to know how you guys are thinking about this and even some pseudo code or code would be great.
EDIT 1:
here is the code I have so far
$string = "2 1 3 6 5 4 8 7";
function dasher($string){
global $dasherarray;
$lockcodes = explode(' ', $string);
for($i = 0; $i < count($lockcodes) - 1; $i++){
if(strlen($string) > 2){
$left = $lockcodes[$i];
$right = $lockcodes[$i+1];
$x = $left . ' ' . $right;
$y = $left . '-' . $right;
if (strlen($left) == 1 && strlen($right) == 1 && (int)$left < (int)$right) {
$dashercombination = str_replace($x, $y, $string);
$dasherarray[] = $dashercombination;
dasher($dashercombination);
}
}
}
return array_unique($dasherarray);
}
foreach(dasher($string) as $combination) {
echo $combination. '<br>';
}
Perhaps this will be helpful in terms of offering different methods to parse the string.
$str="2 1 3 6 5 4 8 7";
$sar=explode(' ',$str);
for($i=1;$i<count($sar);$i++)
if($sar[$i-1]<$sar[$i])
print substr_replace($str,'-',2*($i-1)+1,1) . "\n";
Note that the code expects only single-digit numbers in the string.
Note that the code expects the string to be formatted as per your example. It would be good to add some sanity checks (collapse multiple spaces, strip/trim blanks at the beginning/end).
We can improve upon this by finding all the spaces in the string and using them to index substrings for comparison, still assuming that only a single space separates adjacent numbers.
<?php
$str="21 11 31 61 51 41 81 71";
$letter=' ';
#This finds the locations of all the spaces in the strings
$spaces = array_keys(array_intersect(str_split($str),array($letter)));
#This function takes a start-space and an end-space and finds the number between them.
#It also takes into account the special cases that we are considering the first or
#last space in the string
function ssubstr($str,$spaces,$start,$end){
if($start<0)
return substr($str,0,$spaces[$end]);
if($end==count($spaces))
return substr($str,$spaces[$start],strlen($str)-$spaces[$start]);
return substr($str,$spaces[$start],$spaces[$end]-$spaces[$start]);
}
#This loops through all the spaces in the string, extracting the numbers on either side for comparison
for($i=0;$i<count($spaces);$i++){
$firstnum=ssubstr($str,$spaces,$i-1,$i);
$secondnum=ssubstr($str,$spaces,$i,$i+1);
if(intval($firstnum)<intval($secondnum))
print substr_replace($str,'-',$spaces[$i],1) . "\n";
}
?>
Note the explicit conversion to integers in order to avoid lexicographic comparison.
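The snippets above print one line per single dash. To also get the combined lines from the question (for example 2 1-3 6 5 4-8 7), one option is a small recursion over the gaps between numbers. This is a sketch of one possible approach, not the asker's method; dash_combinations() is a hypothetical name, and the order of the returned lines is not meant to match the question's listing.

```php
<?php
// Hypothetical helper: list every way to dash ascending neighbours such that
// no two dashes sit in adjacent gaps. It works on the exploded numbers, so
// multi-digit values and repeated spaces are fine.
function dash_combinations(string $input): array
{
    $nums = preg_split('/\s+/', trim($input));
    $results = [];

    // $pos is the gap being decided (between $nums[$pos] and $nums[$pos + 1]);
    // $seps holds the separators chosen for the gaps before it.
    $walk = function (int $pos, array $seps) use (&$walk, &$results, $nums) {
        if ($pos >= count($nums) - 1) {
            if (in_array('-', $seps, true)) {      // skip the dash-free original
                $out = $nums[0];
                foreach ($seps as $k => $sep) {
                    $out .= $sep . $nums[$k + 1];
                }
                $results[] = $out;
            }
            return;
        }
        $prevDashed = $pos > 0 && $seps[$pos - 1] === '-';
        if (!$prevDashed && (int) $nums[$pos] < (int) $nums[$pos + 1]) {
            $walk($pos + 1, array_merge($seps, ['-']));   // dash this gap
        }
        $walk($pos + 1, array_merge($seps, [' ']));       // leave it a space
    };
    $walk(0, []);
    return $results;
}

foreach (dash_combinations('2 1 3 6 5 4 8 7') as $line) {
    echo $line, "\n";
}
```

Each gap is visited once per branch and the string is only assembled at the end, which avoids the repeated str_replace() and array_unique() passes of the original attempt.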

Str replace dont seem to match certain strings

I have a script which takes in some user input, cleans it and tries to replace a value in a string. I found that the str_replace I use can't seem to match e.g. 11 +tum. Why is that? Can I fix it somehow? Does preg_replace manage it, and if so, what would that look like?
Function
The script prepares the user input string for a full-text query. All words are mandatory, so each space is replaced with space+. But some phrases like 11 tum need to be searchable as a unit and are thus put in double quotes. The failing part is that the script can't seem to match some phrases even though echoing the values before comparison shows they are the same, e.g. 11 tum
Code:
//processedQuery e.g. 'laptop 11 tum'
$processedQuery = str_replace(" "," +",$processedQuery);
echo $processedQuery; //parses laptop +11 +tum
foreach($commonQuery as $value){ //$commonQuery = array("11 tum", "13 tum", "15 tum", "17 tum", "asus eee", "asus 1005","asus 1010")
//compile : simulated query format error
$simulatedErrorValue = str_replace(" "," +",$value);
echo $simulatedErrorValue; //parses 11 +tum
$processedQuery = str_replace($simulatedErrorValue,'"'.$value.'"',$processedQuery);
}
echo $processedQuery; //parses laptop +11 +tum
//exchanging 11 tum for asus eee (the other commonQuery entry) makes the last echo of $processedQuery show the correct laptop +"asus eee"
You are confusing the input to your function. I'm getting the desired result with a small modification:
11 +tum
laptop +"11 tum"
asus +eee
laptop +"11 tum"
Your error is this line:
$commonQuery = array("11 tum, asus eee")
This is an array with just 1 member.
You want to change the array to have 2 members:
$commonQuery = array("11 tum" , "asus eee");
Here is my full code:
<?php
$processedQuery = 'laptop 11 tum';
$processedQuery = str_replace(" "," +",$processedQuery);
$commonQuery = array("11 tum" , "asus eee");
foreach ( $commonQuery as $value ) { //$commonQuery = array("11 tum, asus eee")
//compile : simulated query format error
$simulatedErrorValue = str_replace(" "," +",$value);
echo "$simulatedErrorValue\n"; //parses 11 +tum
$processedQuery = str_replace($simulatedErrorValue,'"'.$value.'"',$processedQuery);
echo "$processedQuery\n";
}
?>
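The same loop can be wrapped in a function, which makes it easy to compare the two array forms side by side. A minimal sketch; build_fulltext_query() is a hypothetical name.

```php
<?php
// Hypothetical helper wrapping the loop above: prefix every word after the
// first with "+", then re-join known phrases inside double quotes.
function build_fulltext_query(string $query, array $phrases): string
{
    $processed = str_replace(' ', ' +', $query);
    foreach ($phrases as $phrase) {
        $broken = str_replace(' ', ' +', $phrase);   // e.g. "11 +tum"
        $processed = str_replace($broken, '"' . $phrase . '"', $processed);
    }
    return $processed;
}

echo build_fulltext_query('laptop 11 tum', ['11 tum', 'asus eee']), "\n";
// laptop +"11 tum"
echo build_fulltext_query('laptop 11 tum', ['11 tum, asus eee']), "\n";
// laptop +11 +tum   (a single element that never matches)
```

The second call shows the failure mode: with one comma-containing string the search text becomes 11 +tum, +asus +eee, which does not occur in the query.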

how to check this special strings

I have some special strings to check with a PHP script. This is the format:
XX - XX:XX:XX - Somethings
where:
each XX must be ?? or a pair of digits;
the first XX can be any pair of digits;
the second XX must be from 00 to 10;
the third and fourth XX must be from 00 to 59;
Somethings can be anything, it doesn't matter.
These are some examples:
00 - ??:??:?? - Blablabla // OK
99 - ??:99:?? - Blablabla // NO (99 is too high)
99 - 12:50:40 - Blablabla // NO (12 is too high)
?? - AA:50:40 - Blablabla // NO (AA is not a pair of digits)
99 - 2:50:40 - Blablabla // NO (2 is not a pair of digits; I need 02)
99 -08:49:40 - Blablabla // NO (-08 needs a space)
How can I do it? I think the best way is Regex, but I really don't know how to do it :) Any help is appreciated
You can do it like this
$subj = '00 - 04:38:27 - Hi';
preg_match('/^(\?\?|\d\d) - (\?\?|10|0\d):(\?\?|[0-5]\d):(\?\?|[0-5]\d) - (.*)/', $subj, $matches);
Then you can access the fields in matches:
$matches[1] = 00
$matches[2] = 04
$matches[3] = 38
$matches[4] = 27
$matches[5] = Hi
This seems to do the job (tested at http://www.spaweditor.com/scripts/regex/index.php)
/([0-9\?]{2} - (0[0-9]|10|\?\?):([0-5][0-9]|\?\?):([0-5][0-9]|\?\?) - .*)/
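Note that this second pattern is not anchored, and [0-9\?]{2} also accepts a mixed pair such as 9?. The sketch below therefore builds on the first, anchored pattern and checks it against the examples from the question; is_valid_line() is a hypothetical helper name.

```php
<?php
// Hypothetical helper: true when $line follows "XX - XX:XX:XX - Something".
// Each XX is either "??" or a two-digit number within its allowed range.
// The D modifier makes $ match only at the very end of the string.
function is_valid_line(string $line): bool
{
    $pattern = '/^(\?\?|\d\d) - (\?\?|10|0\d):(\?\?|[0-5]\d):(\?\?|[0-5]\d) - .*$/D';
    return preg_match($pattern, $line) === 1;
}

$samples = [
    '00 - ??:??:?? - Blablabla',   // OK
    '99 - ??:99:?? - Blablabla',   // NO (99 is too high)
    '99 - 12:50:40 - Blablabla',   // NO (12 is too high)
    '?? - AA:50:40 - Blablabla',   // NO (AA is not a pair of digits)
    '99 - 2:50:40 - Blablabla',    // NO (2 is not a pair of digits)
    '99 -08:49:40 - Blablabla',    // NO (-08 needs a space)
];
foreach ($samples as $s) {
    echo $s, ' => ', is_valid_line($s) ? 'OK' : 'NO', "\n";
}
```

Writing each alternative out in full (\?\?, 10, 0\d) keeps the two halves of a pair tied together, so a value is either entirely ?? or entirely digits.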
