I parse an HTML page into plain text in order to find and extract a numeric value.
In the whole HTML mess, I need to find a string like this one:
C) Debiti33.197.431,90I - Di finanziamento
I need the number 33.197.431,90 (this number changes on every HTML parsing request).
Is there any regex to achieve this? For example:
STARTS WITH 'C) Debiti' ENDS WITH 'I - Di finanziamento' GETS the middle string that can be whatever.
Whenever I try, I get empty results... I don't know that much about regex.
Can you please help me?
Thank you very much.
You could try the below regex,
^C\) Debiti\K.*?(?=I - Di finanziamento$)
DEMO
PHP code would be,
<?php
$mystring = "C) Debiti33.197.431,90I - Di finanziamento";
$regex = '~^C\) Debiti\K.*?(?=I - Di finanziamento$)~';
if (preg_match($regex, $mystring, $m)) {
$yourmatch = $m[0];
echo $yourmatch; //=> 33.197.431,90
}
?>
This should work. Read section Want to Be Lazy? Think Twice.
(?<=\bC\) Debiti)[\d.,]+(?=I - Di finanziamento\b)
Here is demo
sample code:
$re = "/(?<=\\bC\\) Debiti)[\\d.,]+(?=I - Di finanziamento\\b)/i";
$str = "C) Debiti33.197.431,90I - Di finanziamento";
preg_match($re, $str, $matches);
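Note that the ^ and $ anchors in the first pattern only match when the target string stands alone. Inside a larger page dump, a capture group between the two labels may be easier to work with. A minimal sketch, assuming the page text is already in a string (the surrounding lines in $text are made up for illustration, and the float conversion is an optional extra that assumes Italian number formatting):

```php
<?php
// Hypothetical page text; only the middle line comes from the question.
$text = "Totale attivo\nC) Debiti33.197.431,90I - Di finanziamento\nAltro";

// Capture digits, dots and commas between the two fixed labels.
$pattern = '~C\) Debiti([\d.,]+)I - Di finanziamento~';

$amount = null;
$value = null;
if (preg_match($pattern, $text, $m)) {
    $amount = $m[1]; // the raw number exactly as it appears in the page
    // Italian format: "." groups thousands, "," is the decimal separator.
    $value = (float) str_replace(['.', ','], ['', '.'], $amount);
}

echo $amount, "\n";
```

Using $m[1] instead of $m[0] keeps the labels out of the result without needing \K or lookarounds.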
How to use preg_replace to replace some of a link, but keep the original link as text?
I tried using https://www.phpliveregex.com/#tab-preg-replace, but preg_replace is far too complex for my knowledge.
In short I would like to transform this:
!f:\cases\case\20190813_case.pdf!
To this:
<a href='file://server-files/data/cases/case/20190813_case.pdf'>f:\cases\case\20190813_case.pdf</a>
So that the user sees the network drive as a letter, but the link is actually a link via the server name.
$string = '!f:\cases\case\20190813_case.pdf!'; // single quotes, so \201 is not read as an octal escape
$string = str_ireplace("F:\\", "file://server-files/Data/", $string);
$string = preg_replace("/\!(.*?)\!/", "<a href='$1'>$1</a>", $string);
This gives:
<a href='file://server-files/Data/cases\case\20190813_case.pdf'>file://server-files/Data/cases\case\20190813_case.pdf</a>
It works fine, but I would like to format link text like this
<a href='file://server-files/Data/cases\case\20190813_case.pdf'>f:\cases\case\20190813_case.pdf</a>
Does anyone know if it is possible?
And it might be possible to skip the str_ireplace, and do it all in the preg_replace line?
EDIT
The actual text is like this (I had to anonymize some parts).
Vi har afleveret et skitseprojekt til et nyt domicil for XXXXX
XXXXXXXX.
Mappen kan ses her !F:\A-sager\XXXXXXXX - nyt
domicil\8-Forslag\D-Sendt\fremlagt for bygherren\20190813 domicil.pdf!
Projektet er endnu ikke offentligt.
The text is urlencoded and stored in a XML file.
There is no reason to use regular expressions for simple string replacements. I'm not saying you shouldn't get over that barrier and learn them; they're just not really needed here.
<?php
$str = '!f:\cases\case\20190813_case.pdf!';
$str1 = substr($str, 1, strlen($str) -2);
$str2 = substr($str, 4, strlen($str) -5);
echo "<a href='file://{$str2}'>{$str1}</a>";
//<a href='file://cases\case\20190813_case.pdf'>f:\cases\case\20190813_case.pdf</a>
//if slashes are wrong...
var_dump(str_replace('\\', '/', $str1)) ;//see const DIRECTORY_SEPARATOR
//string(31) "f:/cases/case/20190813_case.pdf"
PHP has a string function for about everything you could ever need.
Update: You stated that there can be multiple links in one "string" (in a question since deleted). You've not provided an example of the format though. Assuming a delimiter of ! and you wanting to use pcre try...
<?php
$str = '!f:\cases\case\20190813_case1.pdf!!f:\cases\case\20190813_case2.pdf!!f:\cases\case\20190813_case3.pdf!';
preg_match_all('#!(.*?)!#', $str, $matches);
var_dump($matches[1]);
There are often many ways to accomplish the same basic string manipulation (strtok, explode, etc).
...Seeing your update: if you read the file with an XML parser and iterate over the text nodes, you should be able to use the examples I've provided, specifically the regular expression, to isolate each path. Watch for false positives if exclamation marks appear elsewhere in the text. Ask if you get stuck on anything else specific, and good luck!
Typically I'd say aim to write the code that is most clear and concise. Readable.
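One way to reduce the false positives mentioned above is to require something path-like right after the opening exclamation mark. A rough sketch, assuming every real link starts with a drive letter, a colon and a backslash (the sample text here is invented for illustration):

```php
<?php
// Hypothetical sample text: one stray pair of "!" and one real path.
$text = 'Hurry! Not a path! See !F:\A-sager\plan.pdf! for details.';

// Require a drive letter, a colon and a backslash right after the opening "!".
// In a single-quoted PHP string, \\\\ becomes \\ in the regex: a literal backslash.
preg_match_all('~!([a-z]:\\\\[^!]*)!~i', $text, $matches);

print_r($matches[1]);
```

With the plain `!(.*?)!` pattern the first match would be the text between "Hurry!" and "path!"; the stricter pattern skips it.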
I suggest:
$str = <<<'EOD'
Vi har afleveret et skitseprojekt til et nyt domicil for XXXXX XXXXXXXX.
Mappen kan ses her !F:\A-sager\XXXXXXXX - nyt domicil\8-Forslag\D-Sendt\fremlagt for bygherren\20190813 domicil.pdf!
Projektet er endnu ikke offentligt.
EOD;
echo preg_replace_callback('~!f:(.*?)!~i', function ($m) {
return '<a href="file://server-files/Data'
. strtr(rawurlencode($m[1]), ['%5C'=> '/'])
. '">f:' . $m[1] . '</a>';
}, $str);
I want to take a specific word from a long text and store it in a variable. I'm using this code:
function hashtag($str){
$regex = "/(#)+[a-zA-Z0-9]+/";
$str = preg_replace($regex, '\\0', $str);
return($str);
}
It turns the word into a hashtag link:
Text: Robert De Niro won the #oscar.
After code: Robert De Niro won the #oscar.
(___e.php?tag=#oscar)
But I want to put "oscar" in a variable and build the link like this:
Robert De Niro won the #oscar.
(___e.php?tag=oscar)
I mean, if I could get it into a variable ($variable), I could use it wherever I want.
If you can help me, I would really appreciate it.
Simple:
<?php
function hashtag($str){
$regex = "/#([a-zA-Z0-9]+)/";
$str = preg_replace($regex, '<a href="___e.php?tag=\\1">\\0</a>', $str); // \\1 is the tag without "#", \\0 the full match
return($str);
}
The only thing to do is to capture the part after the hash tag and use this as your tag variable afterwards.
See a demo on ideone.com.
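If the tag is needed as a standalone variable rather than only inside the replacement string, preg_match_all can collect the captured group. A small sketch, assuming a link target named tag.php (a placeholder, not taken from the question):

```php
<?php
$text = 'Robert De Niro won the #oscar.';

// Collect every tag without its leading "#".
$tags = [];
if (preg_match_all('/#([a-zA-Z0-9]+)/', $text, $m)) {
    $tags = $m[1];
}

// First tag as a plain variable, or null when the text has no hashtag.
$variable = $tags[0] ?? null;

// Same capture used in the link: $1 is the bare tag, $0 the full "#tag".
// tag.php is a hypothetical script name.
$linked = preg_replace('/#([a-zA-Z0-9]+)/', '<a href="tag.php?tag=$1">$0</a>', $text);

echo $variable, "\n", $linked, "\n";
```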
I'm trying to use a regex to find and replace all URLs in a forum system. This works but it also selects anything that is within bbcode. This shouldn't be happening.
My code is as follows:
<?php
function make_links_clickable($text){
return preg_replace('!(([^=](f|ht)tp(s)?://)[-a-zA-Zа-яА-Я()0-9#:%_+.~#?&;//=]+)!i', '<a href="$1">$1</a>', $text);
}
//$text = "https://www.mcgamerzone.com<br>http://www.mcgamerzone.com/help/support<br>Just text<br>http://www.google.com/<br><b>More text</b>";
$text = "#Theareak We know this and [b][url=https://www.mcgamerzone.com/news/67/False-positive-proxy-bans-and-bot-attacks]here[/url] [/b]is an explanation, we are trying to fix this asap! https://www.mcgamerzone.com/news/67/False-positive-proxy-bans-and-bot-attacks aaa";
echo "<b>Unparsed text:</b><br>";
echo $text;
echo "<br><br>";
echo "<b>Parsed text:</b><br>";
echo make_links_clickable($text);
?>
All URLs inside BBCode follow a = character, so I don't want to match anything that is preceded by =.
I basically have that working, but it selects one extra character in front of the string that should be selected.
I'm not very familiar with regex. The final output of my code is this:
<b>Unparsed text:</b><br>
#Theareak We know this and [b][url=https://www.mcgamerzone.com/news/67/False-positive-proxy-bans-and-bot-attacks]here[/url] [/b]is an explanation, we are trying to fix this asap! https://www.mcgamerzone.com/news/67/False-positive-proxy-bans-and-bot-attacks aaa<br>
<br>
<b>Parsed text:</b><br>
#Theareak We know this and [b][url=https://www.mcgamerzone.com/news/67/False-positive-proxy-bans-and-bot-attacks]here[/url] [/b]is an explanation, we are trying to fix this asap! https://www.mcgamerzone.com/news/67/False-positive-proxy-bans-and-bot-attacks aaa
You can match and skip [url=...] like this:
\[url=[^\]]*](*SKIP)(?!)|(((f|ht)tps?://)[-a-zA-Zа-яёЁА-Я()0-9#:%_+.\~#?&;/=]+)
See regex demo
That way, you will only match the URLs outside the [url=...] tag.
IDEONE demo:
function make_links_clickable($text){
return preg_replace('~\[url=[^\]]*](*SKIP)(?!)|(((f|ht)tps?://)[-a-zA-Zа-яёЁА-Я()0-9#:%_+.\~#?&;/=]+)~iu', '<a href="$1">$1</a>', $text);
}
$text = "#Theareak We know this and [b][url=https://www.mcgamerzone.com/news/67/False-positive-proxy-bans-and-bot-attacks]here[/url] [/b]is an explanation, we are trying to fix this asap! https://www.mcgamerzone.com/news/67/False-positive-proxy-bans-and-bot-attacks aaa";
echo "<b>Parsed text:</b><br>";
echo make_links_clickable($text);
You can use a negative lookbehind (?<!=) instead of your negated class. It asserts that the match is not preceded by =, without consuming a character, so the extra leading character no longer ends up in the match.
Example
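A minimal sketch of the lookbehind approach. The function name link_bare_urls, the simplified URL character class and the anchor markup are assumptions made for illustration, not the asker's exact pattern:

```php
<?php
// Hypothetical helper: wraps bare URLs in anchors, skipping URLs preceded by "=".
function link_bare_urls($text) {
    // (?<!=) checks the previous character without making it part of the match.
    // The class [^\s\[\]<>"]+ is a simplified stand-in for the allowed URL characters.
    $pattern = '~(?<!=)\b(?:f|ht)tps?://[^\s\[\]<>"]+~i';
    return preg_replace($pattern, '<a href="$0">$0</a>', $text);
}

$text = '[url=https://example.com/a]here[/url] and https://example.com/b aaa';
$result = link_bare_urls($text);
echo $result, "\n";
```

Because the lookbehind is zero-width, $0 holds only the URL itself, so no leading character has to be trimmed afterwards.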
Let's say I have a page I want to scrape for words with "ice" in them, how can I do this easily? I see a lot of scrapers breaking things down into source code, but I don't need this. I just need something that searches through the plain text on the webpage.
Edit: I basically need something to search for .jpeg and find the entire file name. (it is in plain text on the website, not hidden in a tag)
Anything that matches the following is a word with ice in it:
/(\w*)ice(\w*)/i
(Do note that \w matches 0-9 and _ too. The following, which allows letters only, might give better results: /\b[a-z]*ice[a-z]*\b/i)
UPDATE
To match file names (must not contain whitespace):
/\S+\.jpeg/i
Example:
<?php
$str = 'Picture of me: 238484534.jpeg and someone else img-of-someone.jpeg here';
$cnt = preg_match_all('/\S+\.jpeg/i', $str, $matches);
print_r($matches);
1. Do you want to search the text inside the HTML tags too, such as attribute values?
2. Or only the visible part of the web page?
For #1: the solutions are simple and already given in the other answers.
For #2:
Use PHP's DOMDocument class, then extract and search only the text content.
Documentation here:
http://php.net/manual/en/class.domdocument.php
See this for an example:
PHP DOMDocument stripping HTML tags
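The second option can be sketched roughly as follows, assuming the DOM extension is available. The sample markup is invented, and textContent is only an approximation of "visible" text (it would also include the contents of script or style elements, for instance):

```php
<?php
// Hypothetical markup: one file name in the text, two hidden in attributes.
$html = '<html><body><p title="hidden.jpeg">Photo: holiday-01.jpeg</p>'
      . '<img src="logo.jpeg"></body></html>';

$doc = new DOMDocument();
// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

// textContent concatenates the text nodes only; attribute values are not included.
$visible = $doc->documentElement->textContent;

preg_match_all('/\S+\.jpeg/i', $visible, $matches);
print_r($matches[0]);
```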
Some regex use will be needed for this. Below I use PCRE http://www.php.net/manual/en/ref.pcre.php and the function preg_match_all http://www.php.net/manual/en/function.preg-match-all.php
<?php
$html = <<<EOF
<html>
<head>
<title>Test</title>
</head>
<body>List of files:
<ul>
<li>test1.jpeg</li>
<li>test2.jpeg</li>
</ul>
</body>
</html>
EOF;
$matches = array();
// preg_match_all() returns the number of full matches and stores them in $matches[0].
$count = preg_match_all('/[0-9a-zA-Z_-]+\.jpeg/', $html, $matches);
if ($count > 0) {
    foreach ($matches[0] as $filename) {
        print "Filename: {$filename}\n";
    }
}
?>
try this:
preg_match_all('/\w*ice\w*/', 'abc icecream lice', $matches);
print_r($matches);
<hr>I want to remove this text.<embed src="stuffinhere.html"/>
I tried using regex but nothing works.
Thanks in advance.
P.S. I tried this: $str = preg_replace('#(<hr>).*?(<embed)#', '$1$2', $str)
You'll get a lot of advice to use an HTML parser for this kind of thing. You should do that.
The rest of this answer is for when you've decided that the HTML parser is too slow, doesn't handle ill-formed (i.e. standard in the wild) HTML, or is a pain in the ass to integrate into a system you don't control. I created the following small test script
$str = '<hr>I want to remove this text.<embed src="stuffinhere.html"/>';
$str = preg_replace('#(<hr>).*?(<embed)#', '$1$2', $str);
var_dump($str);
//outputs
string(35) "<hr><embed src="stuffinhere.html"/>"
and it did remove the text, so I'd check your source documents and any other PHP code around your regex. You're not feeding preg_replace the string you think you are. My best guess is that your source document has irregular case, or there's a line break between the <hr> and the <embed>. Try the following regular expression instead.
$str = '<hr>I want to remove
this text.
<EMBED src="stuffinhere.html"/>';
$str = preg_replace('#(<hr>).*?(<embed)#si', '$1$2', $str);
var_dump($str);
//outputs
string(35) "<hr><EMBED src="stuffinhere.html"/>"
The "i" modifier says "make this search case insensitive". The "s" modifier says "the . character should also match line breaks", which it does not do by default.
But use a proper parser if you can. Seriously.
I think the code is self-explanatory and pretty easy to understand since it does not use regex (and it might be faster)...
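$start = '<hr>';
$end = '<embed src="stuff...';
$str = ' html here... ';
// Return the text between $t1 and the next $t2 (case-insensitive),
// or false when either marker is missing.
function between($t1, $t2, $page) {
    $p1 = stripos($page, $t1);
    if ($p1 === false) {
        return false;
    }
    $p2 = stripos($page, $t2, $p1 + strlen($t1));
    if ($p2 === false) {
        return false;
    }
    return substr($page, $p1 + strlen($t1), $p2 - $p1 - strlen($t1));
}
$found = between($start, $end, $str);
// Also stop on an empty result; otherwise an already cleaned pair loops forever.
while ($found !== false && $found !== '') {
    // str_ireplace matches case-insensitively, like stripos above.
    $str = str_ireplace($start . $found . $end, $start . $end, $str);
    $found = between($start, $end, $str);
}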
// do something with $str here...
$text = '<hr>I want to remove this text.<embed src="stuffinhere.html"/>';
$text = preg_replace('#(<hr>).*?(<embed.*?>)#', '$1$2', $text);
echo $text;
If you want to hard code src in embed tag:
$text = '<hr>I want to remove this text.<embed src="stuffinhere.html"/>';
$text = preg_replace('#(<hr>).*?(<embed src="stuffinhere.html"/>)#', '$1$2', $text);
echo $text;