I'm working on indexing some news sites, a kind of news clipping.
I'm an amateur and curious. I'm not a programmer so the question may seem silly to anyone in the business. But if anyone can help, thank you.
The pagination on the sites I was parsing was practically the same, so I used this scheme:
$url = $url . '/page/' . $s;
$next_url = $s + 1;
$prev_url = $s - 1;
if ($prev_url <= 0) {
$prev_url = 1;
}
The format was basically this:
http://example.com/politics/page/2
But yesterday I came across something different, and I do not know how to paginate it. I get this link format through preg_match_all:
http://www.example.com/browse-Politics-National-texts-1-date.html
This is the paging part:
-1-
This part is variable:
Politics-National-texts
Any guidance?
If what you are asking for is parsing the url for the pagination and variable parts, you can use preg_match with the following regexp:
if (preg_match('/^http:\/\/www\.example\.com\/browse-([-a-zA-Z]+)-(\d+)-date\.html$/', $url, $matches)) {
var_export($matches);
}
Then you will get the result:
array (
0 => 'http://www.example.com/browse-Politics-National-texts-1-date.html',
1 => 'Politics-National-texts',
2 => '1',
)
The keys in $matches will be:
0: The entire match
1: The first matched group (the variable)
2: The second matched group (the pagination)
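To actually page through this format, the captured pieces can be put back together with a different page number. A minimal sketch, assuming the URL layout stays exactly as in the question; the function name `pageUrls` is made up for illustration:

```php
<?php
// Hypothetical helper: builds previous/next page URLs for the
// "browse-<section>-<page>-date.html" layout from the question.
function pageUrls(string $url): ?array
{
    $pattern = '/^(http:\/\/www\.example\.com\/browse-[-a-zA-Z]+-)(\d+)(-date\.html)$/';
    if (!preg_match($pattern, $url, $m)) {
        return null; // URL does not follow the expected layout
    }
    $page = (int) $m[2];
    $prev = max(1, $page - 1); // never go below page 1
    $next = $page + 1;
    return [
        'page' => $page,
        'prev' => $m[1] . $prev . $m[3],
        'next' => $m[1] . $next . $m[3],
    ];
}

$urls = pageUrls('http://www.example.com/browse-Politics-National-texts-1-date.html');
echo $urls['next'], "\n";
```

This mirrors the `$prev_url`/`$next_url` logic of the older scheme: the section name is kept as-is and only the number between the dashes changes.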
<?php
$url = 'http://www.example.com/browse-Politics-National-texts-1-date.html';
$url_basename = basename($url); // extract `browse-Politics-National-texts-1-date.html`
$url_exploded = explode('-',$url_basename); // make an array delimited by `-`
array_pop($url_exploded);
$url_page_number = array_pop($url_exploded); // get the 2nd element from back
?>
Result:
$url_page_number = 1
PS. Could make it shorter, but it's for educational purposes :-)
I know I can use xpath, but in this case it wouldn't work because of the complexity of the navigation of the site.
I can only use the source code.
I have browsed all over the place and couldn't find a simple php solution that would:
Open the HTML source code page (I already have an exact source code page URL).
Select and extract the text between two codes. Not between a div. But I know the start and end variables.
So, basically, I need to extract the text between
knownhtmlcodestart> Text to extract <knownhtmlcodeend
What I'm trying to achieve in the end is this:
Go to a source code URL.
Extract the text between two codes.
Store the data temporarily (define the time manually for how long) on my web server in a simple text file.
Define the waiting time and then repeat the whole process again.
The website that I'm going to extract data from is changing dynamically. So it would always store new data into the same file.
Then I would use that data (but that's a question for another time).
I would appreciate it if anyone could lead me to a simple solution.
I'm not asking anyone to write the code, but maybe someone has done something similar, and sharing that code here would be helpful.
Thanks
I (shamefully) found the following function useful to extract stuff from HTML. Regexes sometimes are too complex to extract large stuff, e.g. a whole <table>
/*
$str - the string to search in
$start - string marking the start of the sequence you want to extract
$end - string marking the end of it (null means "up to the end of $str")
$offset - starting position in case you need to find multiple occurrences
returns the string between `$start` and `$end`, and the indexes of start and end,
or false if `$start` or `$end` is not found
*/
function strExt($str, $start, $end = null, $offset = 0)
{
    $p1 = mb_strpos($str, $start, $offset);
    if ($p1 === false) return false;
    $p1 += mb_strlen($start);
    // search from $p1 (not $p1 + 1) so that empty content between the markers is still found
    $p2 = $end === null ? mb_strlen($str) : mb_strpos($str, $end, $p1);
    if ($p2 === false) return false;
    return [
        'str' => mb_substr($str, $p1, $p2 - $p1),
        'start' => $p1,
        'end' => $p2,
    ];
}
This would assume the opening and closing tag are on the same line (as in your example). If the tags can be on separate lines, it wouldn't be difficult to adapt this.
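$html = file_get_contents('http://website.com'); // a URL needs its scheme, otherwise PHP looks for a local file
$lines = explode("\n", $html);
$text = null;
foreach ($lines as $line) {
    $t1 = strpos($line, "knownhtmlcodestart");
    if ($t1 === false) continue; // strpos can return 0, so compare against false
    $c1 = $t1 + strlen("knownhtmlcodestart"); // skip past the start marker
    $c2 = strpos($line, "knownhtmlcodeend", $c1);
    if ($c2 !== false) {
        $text = substr($line, $c1, $c2 - $c1); // PHP has substr(), not substring()
        break;
    }
}
echo $text;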
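The other half of the question, keeping the extracted text in a file and refreshing it after a set time, could be sketched as below. This is only an illustration under assumptions: `extractBetween` and `cachedExtract` are made-up names, the cache is a single plain text file, and the refresh happens whenever the script is run (for example from cron) rather than in a loop.

```php
<?php
// Hypothetical helper: text between the first $start and the next $end, or null.
function extractBetween(string $html, string $start, string $end): ?string
{
    $from = strpos($html, $start);
    if ($from === false) {
        return null;
    }
    $from += strlen($start);
    $to = strpos($html, $end, $from);
    return $to === false ? null : substr($html, $from, $to - $from);
}

// Returns the cached text while the cache file is younger than $ttl seconds;
// otherwise re-reads $source (a URL or a file path) and overwrites the cache.
function cachedExtract(string $source, string $start, string $end, string $cacheFile, int $ttl): ?string
{
    clearstatcache(true, $cacheFile); // make sure filemtime() is not a stale value
    if (is_file($cacheFile) && time() - filemtime($cacheFile) < $ttl) {
        return file_get_contents($cacheFile);
    }
    $html = file_get_contents($source);
    if ($html === false) {
        return null; // source could not be read
    }
    $text = extractBetween($html, $start, $end);
    if ($text !== null) {
        file_put_contents($cacheFile, $text, LOCK_EX); // same file every time
    }
    return $text;
}
```

The waiting time is just the `$ttl` argument, so a call such as `cachedExtract($url, 'knownhtmlcodestart>', '<knownhtmlcodeend', 'cache.txt', 600)` would hit the remote page at most once every ten minutes.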
In PHP I am using stristr to search, but I also want to return and display the surrounding characters on each side of the found string, like a concordancer. About 9 or 10 characters should be enough, for example:
'below' is the string I have found, but then I want to display context for my users, like this:
'if I sat below the deck'
My current code is
if (stristr($order, $en[$i])) {
$lit= "".$en[$i]."";
$order = str_replace($en[$i], $lit, $order, $count);
$total = $total + $count;
$sub[$loc[$i]] = $sub[$loc[$i]] + $count;
$carreau[$loc[$i]] = $carreau[$loc[$i]]."".$lit." ";
}
Thank you so much for any help.
You want something like $context = substr($order,from,length)
How to calculate from and length?
Use stripos (and arithmetic) to calculate from. Make sure it isn't less than 0: a negative start makes substr count from the end of the string (see the substr documentation).
Use strlen (of the "needle") and arithmetic to determine the length.
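Put together, the arithmetic could look like the sketch below. The function name `contextAround` and the `$pad` parameter are made up for illustration, and it works on bytes, so multibyte text would need the mb_* equivalents.

```php
<?php
// Hypothetical helper: returns $needle with up to $pad characters of
// context on each side, or null when $needle does not occur in $haystack.
function contextAround(string $haystack, string $needle, int $pad = 10): ?string
{
    $pos = stripos($haystack, $needle); // case-insensitive, like stristr
    if ($pos === false) {
        return null;
    }
    $from = max(0, $pos - $pad);               // clamp so substr does not count from the end
    $length = ($pos - $from) + strlen($needle) + $pad;
    return substr($haystack, $from, $length);  // substr stops at the end of the string
}

echo contextAround('He asked if I sat below the deck that night', 'below', 9), "\n";
```

With a padding of 9 this prints `if I sat below the deck`; when the match is near either end of the string, the context is simply shorter on that side.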
I am trying to match a full UK postcode against a partial postcode.
Take a user's postcode, e.g. g22 1pf, and see if there's a match / partial match in the array / database.
//Sample data
$postcode_to_check= 'g401pf';
//$postcode_to_check= 'g651qr';
//$postcode_to_check= 'g51rq';
//$postcode_to_check= 'g659rs';
//$postcode_to_check= 'g40';
$postcodes = array('g657','g658','g659','g659pf','g40','g5');
$counter=0;
foreach($postcodes as $postcode){
$postcode_data[] = array('id' =>$counter++ , 'postcode' => $postcode, 'charge' => '20.00');
}
I do have some code but that was just comparing the strings with fixed lengths from the database. I need the strings in the array / database to be dynamic in length.
The database may contain "g22", which would be a match. It could also contain more or less of the postcode, e.g. "g221" or "g221p", which would also be a match. It could contain "g221q" or "g221qr"; these would not match.
Help Appreciated, Thank you
Edit:
I was possibly overthinking this. The following code seems to function as expected.
check_delivery('g401pf');
//this would match because g40 is in the database.
check_delivery('g651dt');
// g651dt this would NOT match because no prefix of g651dt is in the database.
check_delivery('g524pq');
//g524pq this would match because g5 is in the database.
check_delivery('g659pf');
//g659pf this would match because g659 is in the database.
check_delivery('g655pf');
//g655pf this would not match, g655 is not in the database
//expected output, 3 matches
function check_delivery($postcode_to_check){
$postcodes = array('g657','g658','g659','g659pf','g40','g5');
$counter=0;
foreach($postcodes as $postcode){
$stripped_postcode = substr($postcode_to_check,0, strlen($postcode));
if($postcode==$stripped_postcode){
echo "Matched<br><br>";
break;
}
}
}
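<?php
$postcode_to_check = 'g401pf';
// split on the digits: 'g401pf' gives ['g', 'pf']
$arr = preg_split("/\d+/", $postcode_to_check, -1, PREG_SPLIT_NO_EMPTY);
// collect the digit runs: '401'
preg_match_all('/\d+/', $postcode_to_check, $postcode);
$out = implode("", $postcode[0]);
// first letter after the digits, if there is one
$first_char = isset($arr[1]) ? mb_substr($arr[1], 0, 1) : '';
$MatchingPostcode = $arr[0] . $out . $first_char;
echo $MatchingPostcode; // g401p
// the query to run; with real user input, bind the value as a parameter instead of interpolating it
$sql = "SELECT * FROM my_table WHERE column_name LIKE '%$MatchingPostcode%'";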
It's a dirty solution, but it will solve your problem. Things like this should be handled in the front end or in the DB, but if you must do it in PHP then this is a solution.
This code will match anything that includes g401p. If you don't want to match at the start or at the end, just remove the % from the side you don't want to match. In the case I gave, it will search for every record whose column contains g401p.
Check the length and strip the one you want to compare it with to the same length.
function check_delivery($postcode_to_check){
$postcodes = array('g657','g658','g659','g659pf','g40','g5');
$counter=0;
foreach($postcodes as $postcode){
$stripped_postcode = substr($postcode_to_check,0, strlen($postcode));
if($postcode==$stripped_postcode){
echo "Matched<br><br>";
break;
}
}
}
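If the matching row (with its charge) is needed rather than an echo, the same prefix comparison could be sketched as follows. The function name `findDeliveryArea` is made up, the rows reuse the sample shape from the question, and preferring the longest stored prefix is an assumption about what is wanted when several entries match.

```php
<?php
// Hypothetical variant: returns the matching row, or null if nothing matches.
// Spaces are removed and case is normalised so "G40 1PF" and "g401pf" compare equal.
function findDeliveryArea(string $postcodeToCheck, array $rows): ?array
{
    $needle = strtolower(str_replace(' ', '', $postcodeToCheck));
    $best = null;
    foreach ($rows as $row) {
        $prefix = strtolower($row['postcode']);
        // A stored entry matches when the user's postcode starts with it.
        if ($prefix !== '' && strncmp($needle, $prefix, strlen($prefix)) === 0) {
            // Assumption: prefer the longest (most specific) stored prefix.
            if ($best === null || strlen($prefix) > strlen($best['postcode'])) {
                $best = $row;
            }
        }
    }
    return $best;
}

// Sample rows in the same shape as $postcode_data in the question.
$rows = [
    ['id' => 0, 'postcode' => 'g659', 'charge' => '20.00'],
    ['id' => 1, 'postcode' => 'g659pf', 'charge' => '20.00'],
    ['id' => 2, 'postcode' => 'g40', 'charge' => '20.00'],
    ['id' => 3, 'postcode' => 'g5', 'charge' => '20.00'],
];
var_export(findDeliveryArea('G65 9PF', $rows));
```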
I am looking for the most efficient algorithm in PHP to check if a string was made from dictionary words only or not.
Example:
thissentencewasmadefromenglishwords
thisonecontainsyxxxyxsomegarbagexaatoo
pure
thisisalsobadxyyyaazzz
Output:
thissentencewasmadefromenglishwords
pure
a.txt
contains the dictionary words
b.txt
contains the strings: one in every line, without spaces made from a..z chars only
Another way to do this is to employ the Aho-Corasick string matching algorithm. The basic idea is to read in your dictionary of words and from that create the Aho-Corasick tree structure. Then, you run each string you want to split into words through the search function.
The beauty of this approach is that creating the tree is a one time cost. You can then use it for all of the strings you're testing. The search function runs in O(n) (n being the length of the string), plus the number of matches found. It's really quite efficient.
Output from the search function will be a list of string matches, telling you which words match at what positions.
The Wikipedia article does not give a great explanation of the Aho-Corasick algorithm. I prefer the original paper, which is quite approachable. See Efficient String Matching: An Aid to Bibliographic Search.
So, for example, given your first string:
thissentencewasmadefromenglishwords
You would get (in part):
this, 0
his, 1
sent, 4
ten, 7
etc.
Now, sort the list of matches by position. It will be almost sorted when you get it from the string matcher, but not quite.
Once the list is sorted by position, the first thing you do is make sure that there is a match at position 0. If there is not, then the string fails the test. If there is (and there might be multiple matches at position 0), you take the length of the matched string and see if there's a string match at that position. Add that match's length and see if there's a match at the next position, etc.
If the strings you're testing aren't very long, then you can use a brute force algorithm like that. It would be more efficient, though, to construct a hash map of the matches, indexed by position. Of course, there could be multiple matches for a particular position, so you have to take that into account. But looking to see if there is a match at a position would be very fast.
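The position-indexed check described above could be sketched like this. It is not an Aho-Corasick implementation: `findMatches` is a naive stand-in that produces the same kind of match list, and both function names are made up for illustration.

```php
<?php
// Stand-in for the Aho-Corasick search: lists [position, length] for every
// dictionary word found in $s. A real implementation would do this in one pass.
function findMatches(string $s, array $words): array
{
    $matches = [];
    foreach ($words as $w) {
        if ($w === '') {
            continue;
        }
        $offset = 0;
        while (($pos = strpos($s, $w, $offset)) !== false) {
            $matches[] = [$pos, strlen($w)];
            $offset = $pos + 1; // allow overlapping matches
        }
    }
    return $matches;
}

// True if the matches can be chained from position 0 to the end of $s.
function coveredByMatches(string $s, array $matches): bool
{
    // Hash map: start position => list of match lengths at that position.
    $byPos = [];
    foreach ($matches as [$pos, $len]) {
        $byPos[$pos][] = $len;
    }
    $n = strlen($s);
    $reachable = [0 => true]; // positions where the next word may start
    for ($pos = 0; $pos < $n; $pos++) {
        if (empty($reachable[$pos]) || empty($byPos[$pos])) {
            continue;
        }
        foreach ($byPos[$pos] as $len) {
            $reachable[$pos + $len] = true; // every match at $pos is followed
        }
    }
    return !empty($reachable[$n]);
}
```

Because every match at a reachable position is followed, not just the first one, multiple matches at the same position are handled without backtracking.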
It's some work, of course, to implement the Aho-Corasick algorithm. A quick Google search shows that there are php implementations available. How well they work, I don't know.
In the average case, this should be very quick. Again, it depends on how long your strings are. But you're helped by there being relatively few matches at any one position. You could probably construct strings that would exhibit pathologically poor runtimes, but you'd probably have to try real hard. And again, even a pathological case isn't going to take too terribly long if the string is short.
This is a problem that can be solved using Dynamic Programming, based on the following recurrence:
f(0) = true
f(i) = OR { f(i-j) AND Dictionary.contains(s.substring(i-j, i)) } for each j=1,...,i
First, load your file into a dictionary, then use the DP solution for the above formula.
Pseudo code is something like this (substring(a, b) covers indices a to b-1):
check(word):
    f = new boolean[word.length() + 1]
    f[0] = true
    for i from 1 to word.length():
        f[i] = false
        for j from 0 to i-1:
            if f[j] AND dictionary.contains(word.substring(j, i)):
                f[i] = true
    return f[word.length()]
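In PHP, the same recurrence could be sketched as below. The function name `madeOfWords` is made up, the dictionary is assumed to be an array used as a set (word => true), and capping the inner loop at the longest word length is an optimisation that is not part of the pseudo code.

```php
<?php
// $dictionary is a set: word => true, so lookups are O(1) on average.
function madeOfWords(string $s, array $dictionary): bool
{
    $n = strlen($s);
    $maxLen = 0;
    foreach ($dictionary as $word => $_) {
        $maxLen = max($maxLen, strlen((string) $word));
    }
    $f = array_fill(0, $n + 1, false);
    $f[0] = true; // the empty prefix is trivially splittable
    for ($i = 1; $i <= $n; $i++) {
        // $j is the length of the last word; no need to go past the longest word
        for ($j = 1; $j <= min($i, $maxLen); $j++) {
            if ($f[$i - $j] && isset($dictionary[substr($s, $i - $j, $j)])) {
                $f[$i] = true;
                break;
            }
        }
    }
    return $f[$n];
}

$dictionary = array_fill_keys(
    ['this', 'sentence', 'was', 'made', 'from', 'english', 'words', 'pure'],
    true
);
var_dump(madeOfWords('thissentencewasmadefromenglishwords', $dictionary));
```

For the files in the question, the set could presumably be built with `array_fill_keys(file('a.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES), true)` and the function called once per line of b.txt.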
I recommend a recursive approach. Something like this:
<?php
$wordsToCheck = array(
'otherword',
'word1andother',
'word1',
'word1word2',
'word1word3',
'word1word2word3'
);
$wordList = array(
'word1',
'word2',
'word3'
);
$results = array();
function onlyListedWords($word, $wordList) {
    if (in_array($word, $wordList)) {
        return true;
    }
    $length = strlen($word);
    // Try every proper prefix; if it is a listed word, recurse on the remainder.
    // Keep looping when the remainder fails, so a longer prefix can still succeed.
    for ($i = 1; $i < $length; $i++) {
        if (in_array(substr($word, 0, $i), $wordList)
            && onlyListedWords(substr($word, $i), $wordList)) {
            return true;
        }
    }
    return false;
}
foreach ($wordsToCheck as $word) {
if (onlyListedWords($word, $wordList))
$results[] = $word;
}
var_dump($results);
?>
Is there a way to search a variable backwards from a given position and find the start position of a string that occurs before that position?
So for example if I initially do $getstart = strpos($contents, 'position', 0);
I then want to do $getprevpos = prevstrpos($contents, 'previous token', $getstart);
Obviously there is no such function as prevstrpos but I hope you get what I mean.
Example text area (terrible example, I know):
Here is an example where I want to find the previous token once I have found the start position of a text string.
You can use strrpos( substr($contents, 0, $getstart), 'previous token')
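Wrapped up as a function, that one-liner could look like the sketch below. The name `lastPosBefore` is made up for illustration. Since the haystack is cut at `$start`, only tokens that end at or before `$start` are found, and the returned position is valid in the original string as well.

```php
<?php
// Hypothetical wrapper: position of the last $token that ends at or
// before $start in $contents, or false if there is none.
function lastPosBefore(string $contents, string $token, int $start)
{
    return strrpos(substr($contents, 0, $start), $token);
}

$contents = 'alpha token beta token gamma position delta token';
$getstart = strpos($contents, 'position', 0);
$getprevpos = lastPosBefore($contents, 'token', $getstart);
var_dump($getprevpos);
```

As with strpos, the result should be compared with `=== false`, because a token found at position 0 is a valid hit.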
Is there something wrong with strrpos()? If 'offset' is negative: "Negative values will stop searching at the specified point prior to the end of the string."
You can try this. I think it should work for all cases, but you should probably test it a bit. The idea: reverse everything and do a strpos on the reversed string.
function prevstrpos($contents, $token, $start)
{
    $revToken = strrev($token);
    $revContents = strrev($contents);
    // offset in the reversed string that corresponds to $start in the original
    $revStart = strlen($contents) - $start;
    $revFoundPos = strpos($revContents, $revToken, $revStart);
    if ($revFoundPos !== false) {
        // convert back to the token's start position in the original string
        $foundPos = strlen($contents) - $revFoundPos - strlen($token);
    } else {
        $foundPos = -1;
    }
    return $foundPos;
}