Simple PHP code for extracting data from the HTML source code - php

I know I can use xpath, but in this case it wouldn't work because of the complexity of the navigation of the site.
I can only use the source code.
I have browsed all over the place and couldn't find a simple php solution that would:
1. Open the HTML source code page (I already have the exact URL of the source code page).
2. Select and extract the text between two known markers. They are not a div, but I know the exact start and end strings.
So, basically, I need to extract the text between
knownhtmlcodestart> Text to extract <knownhtmlcodeend
What I'm trying to achieve in the end is this:
1. Go to a source code URL.
2. Extract the text between two codes.
3. Store the data temporarily (define the time manually for how long) on my web server in a simple text file.
4. Define the waiting time and then repeat the whole process again.
The website that I'm going to extract data from is changing dynamically. So it would always store new data into the same file.
Then I would use that data (but that's a question for another time).
I would appreciate it if anyone could lead me to a simple solution.
I'm not asking anyone to write the code for me, but if someone has done something similar, sharing the code here would be helpful.
Thanks

I (shamefully) found the following function useful to extract stuff from HTML. Regexes sometimes are too complex to extract large stuff, e.g. a whole <table>
/*
$start  - string marking the start of the sequence you want to extract
$end    - string marking the end of it (null means "up to the end of $str")
$offset - starting position in case you need to find multiple occurrences
returns the string between `$start` and `$end`, and the indexes of start and end,
or false if either marker is not found
*/
function strExt($str, $start, $end = null, $offset = 0)
{
    $p1 = mb_strpos($str, $start, $offset);
    if ($p1 === false) return false;
    $p1 += mb_strlen($start);
    // Search for $end from $p1 itself so an empty string between the markers is still found
    $p2 = $end === null ? mb_strlen($str) : mb_strpos($str, $end, $p1);
    if ($p2 === false) return false;
    return [
        'str'   => mb_substr($str, $p1, $p2 - $p1),
        'start' => $p1,
        'end'   => $p2,
    ];
}

This would assume the opening and closing tag are on the same line (as in your example). If the tags can be on separate lines, it wouldn't be difficult to adapt this.
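$html = file_get_contents('http://website.com'); // file_get_contents() needs the full URL, scheme included
$lines = explode("\n", $html);
$startTag = "knownhtmlcodestart>";
$endTag = "<knownhtmlcodeend";
$text = '';
foreach ($lines as $line) {
    $t1 = strpos($line, $startTag);
    // strpos() returns 0 for a match at the start of the line, so compare against false
    if ($t1 === false) continue;
    $c1 = $t1 + strlen($startTag); // first character after the start marker
    $c2 = strpos($line, $endTag, $c1);
    if ($c2 !== false) {
        $text = substr($line, $c1, $c2 - $c1);
        break;
    }
}
echo $text;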
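The full workflow from the question (fetch, extract, store for a while, repeat) could be sketched roughly as follows. This is only a minimal sketch: `extractBetween` and `cachedExtract` are hypothetical helper names, the cache file path and the lifetime in seconds are assumptions you would choose yourself, and it assumes `allow_url_fopen` is enabled so that `file_get_contents()` can read a URL.

```php
<?php
// Hypothetical helper: returns the text between two known markers,
// or null if either marker is missing.
function extractBetween(string $html, string $start, string $end): ?string
{
    $from = strpos($html, $start);
    if ($from === false) {
        return null;
    }
    $from += strlen($start);
    $to = strpos($html, $end, $from);
    if ($to === false) {
        return null;
    }
    return substr($html, $from, $to - $from);
}

// Hypothetical helper: serves the stored text while the cache file is younger
// than $ttl seconds, otherwise fetches $url again and overwrites the file.
function cachedExtract(string $url, string $start, string $end, string $cacheFile, int $ttl): ?string
{
    if (is_file($cacheFile) && time() - filemtime($cacheFile) < $ttl) {
        $cached = file_get_contents($cacheFile);
        if ($cached !== false) {
            return $cached;
        }
    }
    $html = @file_get_contents($url);
    if ($html === false) {
        return null; // fetch failed; the old cache file is left untouched
    }
    $text = extractBetween($html, $start, $end);
    if ($text !== null) {
        file_put_contents($cacheFile, $text, LOCK_EX);
    }
    return $text;
}
```

Calling `cachedExtract()` from a cron job, or simply on every page load, would take care of the "wait and repeat" part without a long-running script.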

Related

I have a word doc. i want to get word count per page of word doc?

I could only find a solution that works per line, but I can't find the page breaks, and I'm quite confused.
For docx I also can't get an exact word count.
function read_doc($filename) {
    $fileHandle = fopen($filename, "r");
    $line = @fread($fileHandle, filesize($filename));
    fclose($fileHandle);
    $lines = explode(chr(0x0D), $line);
    $outtext = "";
    foreach ($lines as $key => $thisline) {
        if ($key > 11) {
            $pos = strpos($thisline, chr(0x00));
            if (($pos !== FALSE) || (strlen($thisline) == 0)) {
                continue;
            } else {
                $text = preg_replace("/[^a-zA-Z0-9\s\,\.\-\n\r\t@\/\_\(\)]/", "", $thisline);
                var_dump($text);
                // Collect the cleaned text so the function actually returns it
                $outtext .= $text . " ";
            }
        }
    }
    return $outtext;
}
Implementing your own code for this doesn't sound like a good idea. I would recommend using an external library such as PHPWord. It should allow you to convert the file to plain text. Then, you can extract the word count from it.
Also, an external library such as that adds support for a number of file formats, not restricting you to Word 97-2003.
Here's a basic piece of VB.NET code that counts words per page, but be aware that it depends on what Word considers to be a word, which is not necessarily what a user considers a word. In my experience you need to properly analyse how Word behaves and what it interprets, and then build your logic to ensure that you get the results you need. It's not PHP, but it does the job and can be a starting point for you.
Structure WordsPerPage
    Public pagenum As String
    Public count As Long
End Structure
Public Sub CountWordsPerPage(doc As Document)
    Dim index As Integer
    Dim pagenum As Integer
    Dim newItem As WordsPerPage
    Dim tmpList As New List(Of WordsPerPage)
    Try
        For Each wrd As Range In doc.Words
            pagenum = wrd.Information(WdInformation.wdActiveEndPageNumber)
            Debug.Print("Word {0} is on page {1}", wrd.Text, pagenum)
            index = tmpList.FindIndex(Function(value As WordsPerPage)
                                          Return value.pagenum = pagenum
                                      End Function)
            If index <> -1 Then
                tmpList(index) = New WordsPerPage With {.pagenum = pagenum, .count = tmpList(index).count + 1}
            Else
                ' Unique (or first)
                newItem.count = 1
                newItem.pagenum = pagenum
                tmpList.Add(newItem)
            End If
        Next
    Catch ex As Exception
        WorkerErrorLog.AddLog(ex, Err.Number & " " & Err.Description)
    Finally
        Dim totalWordCount As Long = 0
        For Each item In tmpList
            totalWordCount = totalWordCount + item.count
            Debug.Print("Page {0} has {1} words", item.pagenum, item.count)
        Next
        Debug.Print("Total word count is {0}", totalWordCount)
    End Try
End Sub
A .docx file is a zip archive (the older binary .doc format is not). When you unzip it, you get a folder; look for the document.xml file in the word subfolder. It contains the whole document as XML. Split that string at the page-break markup, strip the XML tags, and use str_word_count on each part.
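A rough sketch of that approach is below. The helper names are hypothetical. It assumes pages are separated either by manual breaks (`<w:br w:type="page"/>`) or by the `<w:lastRenderedPageBreak/>` markers that Word records when it last laid out the file, so the result is only as good as those markers, and `str_word_count` will not necessarily agree with Word's own count. Reading the archive needs PHP's zip extension.

```php
<?php
// Hypothetical helper: counts words per page in the XML text of word/document.xml.
// Assumption: a page ends at <w:br w:type="page"/> or <w:lastRenderedPageBreak/>.
function countWordsPerPage(string $xml): array
{
    $pages = preg_split(
        '/<w:lastRenderedPageBreak\s*\/>|<w:br\b[^>]*w:type="page"[^>]*\/>/',
        $xml
    );
    $counts = [];
    foreach ($pages as $i => $pageXml) {
        // Put a space at the end of each paragraph so adjacent paragraphs do not merge.
        $pageXml = str_replace('</w:p>', ' </w:p>', $pageXml);
        // Remove every tag (this also drops the XML declaration), then decode entities.
        $text = preg_replace('/<[^>]+>/', '', $pageXml);
        $text = html_entity_decode($text, ENT_QUOTES | ENT_XML1, 'UTF-8');
        $counts[$i + 1] = str_word_count($text); // pages are numbered from 1
    }
    return $counts;
}

// Hypothetical helper: reads word/document.xml out of a .docx file.
function readDocumentXml(string $docxPath): ?string
{
    $zip = new ZipArchive();
    if ($zip->open($docxPath) !== true) {
        return null;
    }
    $xml = $zip->getFromName('word/document.xml');
    $zip->close();
    return $xml === false ? null : $xml;
}
```

Usage would be along the lines of `$xml = readDocumentXml('file.docx'); if ($xml !== null) { print_r(countWordsPerPage($xml)); }`.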
What I figured out is that I will need a Windows server, because this uses a COM object.
Please check this link
https://github.com/lettertoamit/MS-Word-PER-PAGE-WORDCOUNT/blob/master/index.php

What is the fastest way to check amount of specific chars in a string in PHP?

I need to check whether the number of characters from a specific set in a string is higher than some number. What is the fastest way to do that?
For example, I have a long string "some text & some text & some text + a lot more + a lot more ... etc." and I need to check whether it contains more than 3 of the following symbols: [&,.,+]. So when I encounter the 4th occurrence of one of these characters, I just need to return false and stop the loop. I was thinking of writing a simple function for that, but I wonder whether there is a native PHP method for it. I need something that will not waste time parsing the string to the end, because the string may be pretty long. So I think regexp and functions like count_chars are not suited for that kind of job...
Any suggestions?
I don't know about a native method, I think count_chars is probably as close as you're going to get. However, rolling a custom solution would be relatively simple:
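$str = 'your text here';
$chars = ['&', '.', '+'];
$count = []; // occurrences seen so far, per character
$length = strlen($str);
$limit = 3;
for ($i = 0; $i < $length; $i++) {
    if (in_array($str[$i], $chars)) {
        // Start each character's counter at 0 to avoid an undefined-index warning
        $count[$str[$i]] = ($count[$str[$i]] ?? 0) + 1;
        if ($count[$str[$i]] > $limit) {
            break;
        }
    }
}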
Where the data is actually coming from might also make a difference. For example, if it's from a file then you could take advantage of fread's 2nd parameter to only read x number of bytes at a time within a while loop.
Finding the fastest way might be too broad of a question as PHP has a lot of string related functions; other solutions might use strstr, strpos, etc...
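As one example of such an alternative, here is a sketch built on the native `strcspn()`, which jumps straight to the next character from the set instead of inspecting every character in PHP code. It counts all characters of the set together and stops as soon as the limit is exceeded. The function name `exceedsLimit` is made up for this example, and it has not been benchmarked against the other approaches.

```php
<?php
// Hypothetical helper: returns true as soon as more than $limit characters
// from $chars (e.g. '&.+') have been found in $str, without scanning the rest.
function exceedsLimit(string $str, string $chars, int $limit): bool
{
    $count = 0;
    $pos = 0;
    $len = strlen($str);
    while ($pos < $len) {
        // Skip the run of characters that are NOT in the set.
        $pos += strcspn($str, $chars, $pos);
        if ($pos >= $len) {
            break; // reached the end without another hit
        }
        if (++$count > $limit) {
            return true; // early exit on occurrence number $limit + 1
        }
        $pos++; // step past the character just counted
    }
    return false;
}
```

For example, `exceedsLimit($str, '&.+', 3)` stops at the 4th occurrence of any of the three symbols.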
I have not benchmarked the other solutions, but http://php.net/manual/en/function.str-replace.php with an array of search values should be fast. It has an optional parameter that returns the number of replacements. Check that number:
str_replace(['&', '.', '+'], '', $subject, $count);
if ($count > $number) {
    // more than $number occurrences were found
}
Well, all my assumptions were wrong and my expectations were crushed by real tests. RegExp turns out to run 2 to 7 times faster (depending on the string) than a self-made function with a simple character-checking loop.
The code:
// self-made function:
function chk_occurs($str,$chrs,$limit){
    $r=false;
    $count = 0;
    $length = strlen($str);
    for($i=0; $i<$length; $i++){
        if(in_array($str[$i], $chrs)){
            $count++;
            if($count>$limit){
                $r=true;
                break;
            }
        }
    }
    return $r;
}
// RegExp I used for the tests (matches once 3 of the characters have been found):
preg_match('/([&\\.\\+]|[&\\.\\+][^&\\.\\+]+?){3,}?/',$str);
Of course it is faster because it's a single call to a native function, but even the same code wrapped in a function runs 2 to ~4.8 times faster.
// RegExp wrapped into the function ($chrs is a string here, e.g. '&.+'):
function chk_occurs_preg($str,$chrs,$limit){
    // Pass the delimiter so a "/" in $chrs is escaped as well
    $chrs=preg_quote($chrs,'/');
    return preg_match('/(['.$chrs.']|['.$chrs.'][^'.$chrs.']+?){'.$limit.',}?/',$str);
}
P.S. I didn't bother to check CPU time; I only measured wall time via microtime(true) over a 200k-iteration loop, but that's enough for me.

Large text replace array

I'm looking for some help with replacing text while importing an XML file. I want to replace some values during the import so that they match the categories, filter values, etc. on my website.
I'm using this function. I wrote it myself by copy-pasting from the internet (I'm not a coder), but now I need some help/advice.
<?php
// Text replace test function
function my_text_replace($x) {
    for ($y = 0; $y < 2; $y = $y+1) {
        $phrase = $x;
        $old = array("Draaideurkast", "fout1 MRC", "Draaideurkast MRC", "Draaideurkast MRC");
        $new = array("fout1", "fout2", "goed", "fout3");
        $x = str_ireplace($old, $new, $phrase);
        $y = $y+1;
        return $x;
    }
}
?>
Code Fix:
I do not want a partial-match replace; only the complete value of $x should be matched. In the example the output should be 'goed'. It should only replace once when found (but that is fixed with the for loop, I think). The matching should be case insensitive.
Advice question:
Is this a correct way to replace (large amounts of) text during an import? Do you know of other best practices, plugins (WordPress), or tools?
Thanks for any response!
Harm
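One possible way to get a whole-value, case-insensitive, replace-once behaviour is to use a lookup table instead of `str_ireplace`, which always replaces partial matches. The sketch below is only an illustration: `replace_whole_value` is a hypothetical name, it keeps the first entry when the same old value is listed twice (so "Draaideurkast MRC" gives 'goed'), and it uses `strtolower`, which assumes the values are plain ASCII.

```php
<?php
// Hypothetical helper: replaces $x only when the complete value matches one
// of the $old entries (case-insensitively); otherwise returns $x unchanged.
function replace_whole_value(string $x, array $old, array $new): string
{
    $map = [];
    foreach ($old as $i => $value) {
        $key = strtolower(trim($value));
        // Keep the first replacement when an old value is listed more than once.
        if (!isset($map[$key]) && isset($new[$i])) {
            $map[$key] = $new[$i];
        }
    }
    $key = strtolower(trim($x));
    return $map[$key] ?? $x;
}
```

With the arrays from the question, `replace_whole_value("Draaideurkast MRC", $old, $new)` returns 'goed' and never touches a value that merely contains "Draaideurkast". For a large import, building the map once outside the function and reusing it for every value would avoid repeating that work.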

How to split content into equal parts then write to file in PHP

I have been reading/testing examples since last night, but the cows never came home.
I have a file with (for example) approx. 1000 characters in one line and want to split it into 10 equal parts then write back to the file.
Goal:
1. Open the file in question and read its content
2. Count up to 100 characters for example, then put a line break
3. Count 100 again and another line break, and so on till it's done.
4. Write/overwrite the file with the new split content
For example:
I want to turn this => KNMT2zSOMs4j4vXsBlb7uCjrGxgXpr
Into this:
KNMT2zSOMs
4j4vXsBlb7
uCjrGxgXpr
This is what I have so far:
<?php
$MyString = fopen('file.txt', "r");
$MyNewString;
$n = 100; // How many you want before separation
$MyNewString = substr($MyString,0,$n);
$i = $n;
while ($i < strlen($MyString)) {
    $MyNewString .= "\n"; // Separator Character
    $MyNewString .= substr($MyString,$i,$n);
    $i = $i + $n;
}
file_put_contents($MyString, $MyNewString);
fclose($MyString);
?>
But that is not working quite the way I anticipated.
I realize that there are other similar questions like mine, but they did not show how to read a file and then write back to it.
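<?php
$str = "aonoeincoieacaonoeincoieacaonoeincoieacaonoeincoieacaonoeincoieacaon";
$length = 10; // characters per line
// chunk_split() appends "\r\n" by default, so pass "\n" explicitly
$ch = chunk_split($str, $length, "\n");
// chunk_split() also adds the separator after the last chunk, so trim it before splitting
$piece = explode("\n", rtrim($ch, "\n"));
foreach($piece as $line) {
    // write to file
}
?>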
http://php.net/manual/en/function.chunk-split.php
Hold on here. You're not giving a file name/path to file_put_contents();, you're giving a file handle.
Try this:
file_put_contents("newFileWithText.txt", $MyNewString);
You see, when doing $var=fopen();, you're giving $var the value of a handle, which is not meant to be used with file_put_contents();, as it doesn't ask for a handle but for a filename instead. So, it should be: file_put_contents("myfilenamehere.txt", "the data i want in my file here...");
Simple.
Take a look at the documentation for str_split. It will take a string and split it into chunks based on length, storing each chunk at a separate index in an array that it returns. You can then iterate over the array adding a line break after each index.
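Put together, the whole read, split, and write-back cycle might look like the sketch below. The two function names are hypothetical, `$n` is assumed to be at least 1, and a trailing newline in the file is dropped before splitting so that it does not count as a character.

```php
<?php
// Hypothetical helper: inserts a line break after every $n characters.
function splitIntoLines(string $text, int $n): string
{
    if ($text === '') {
        return '';
    }
    // str_split() returns the chunks as an array; implode() joins them with "\n".
    return implode("\n", str_split($text, $n));
}

// Hypothetical helper: reads $path, splits its content, and overwrites the same file.
function rewriteFileInChunks(string $path, int $n): bool
{
    $content = file_get_contents($path); // takes a file name, not a handle
    if ($content === false) {
        return false;
    }
    $content = rtrim($content, "\r\n");
    return file_put_contents($path, splitIntoLines($content, $n)) !== false;
}
```

For the example in the question, `splitIntoLines('KNMT2zSOMs4j4vXsBlb7uCjrGxgXpr', 10)` produces the three 10-character lines, and `rewriteFileInChunks('file.txt', 100)` would do the same for the file with 100 characters per line.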

speed string search in PHP

I have a 1.2GB file that contains a single-line string.
What I need is to search the entire file to find the position of another string (currently I have a list of strings to search for).
The way I'm doing it now is to open the big file and move a pointer through it in 4KB blocks, then move the pointer X positions back in the file and read 4KB more.
My problem is that the longer the search string, the longer it takes to find it.
Can you give me some ideas to optimize the script to get better search times?
This is my implementation:
function busca($inici){
    $limit = 4096;
    $big_one = fopen('big_one.txt','r');
    $options = fopen('options.txt','r');
    while(!feof($options)){
        $search = trim(fgets($options));
        $retro = strlen($search);//maybe setting this position absolute? (like 12 or 15)
        $punter = 0;
        while(!feof($big_one)){
            $ara = fgets($big_one,$limit);
            $pos = strpos($ara,$search);
            $ok_pos = $pos + $punter;
            if($pos !== false){
                echo "$pos - $punter - $search : $ok_pos <br>";
                break;
            }
            $punter += $limit - $retro;
            fseek($big_one,$punter);
        }
        fseek($big_one,0);
    }
}
Thanks in advance!
Why not use exec + grep -b?
exec('grep "new" ext-all-debug.js -b', $result);
// here we have looked for "new" substring entries in the extjs debug src file
var_dump($result);
sample result:
array(1142) {
[0]=> string(97) "3398: * insert new elements. Revisiting the example above, we could utilize templating this time:"
[1]=> string(54) "3910:var tpl = new Ext.DomHelper.createTemplate(html);"
...
}
Each item consists of the byte offset of the line from the start of the file and the line itself, separated by a colon.
So after this you have to find the position inside the particular line and add it to the line offset. I.e.:
[0]=> string(97) "3398: * insert new elements. Revisiting the example above, we could utilize templating this time:"
This means that the "new" occurrence was found at the 3408th byte (3398 is the line offset and 10 is the position of "new" inside this line).
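// Loads the whole file into memory, so memory_limit must allow for its size
$big_one = file_get_contents('big_one.txt');
$options = fopen('options.txt','r');
while(!feof($options))
{
    $option = trim(fgets($options));
    if ($option === '') continue; // skip blank lines
    // strpos() (not substr()) returns the offset; 0 is a valid position, so compare against false
    $position = strpos($big_one,$option);
    if($position !== false)
        break; //exit loop, $position holds the offset
}
fclose($options);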
The size of the file is quite large, though. You might want to consider storing the data in a database instead. Or, if you absolutely can't, use the grep solution posted here.
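If neither a database nor grep is an option, the chunked read from the question can be kept, as long as the tail of each chunk is carried over so that a match spanning two chunks is not missed. The sketch below is one way to do that under some assumptions: `findInFile` is a hypothetical name, the 1 MB default chunk size is an arbitrary choice, and the returned position is a byte offset. It keeps the last `strlen($needle) - 1` bytes of the buffer, which is the longest tail that could still be the beginning of a match.

```php
<?php
// Hypothetical helper: returns the byte offset of the first occurrence of
// $needle in the file at $path, or null if it is not found.
function findInFile(string $path, string $needle, int $chunkSize = 1048576): ?int
{
    if ($needle === '') {
        return null;
    }
    $fh = fopen($path, 'rb');
    if ($fh === false) {
        return null;
    }
    $overlap = strlen($needle) - 1; // bytes to carry over between chunks
    $buffer = '';
    $bufferStart = 0;               // file offset of the first byte in $buffer
    while (!feof($fh)) {
        $chunk = fread($fh, $chunkSize);
        if ($chunk === false || $chunk === '') {
            break;
        }
        $buffer .= $chunk;
        $pos = strpos($buffer, $needle);
        if ($pos !== false) {
            fclose($fh);
            return $bufferStart + $pos;
        }
        // Keep only the tail that could still be the start of a match.
        $keep = min($overlap, strlen($buffer));
        $bufferStart += strlen($buffer) - $keep;
        $buffer = $keep > 0 ? substr($buffer, -$keep) : '';
    }
    fclose($fh);
    return null;
}
```

Because the file is read sequentially and never re-read with `fseek`, the cost per search string depends mainly on the file size rather than on the length of the string being searched for.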
