php regex for parsing html [duplicate] - php

This question already has answers here:
How do you parse and process HTML/XML in PHP?
(31 answers)
Closed 3 years ago.
I need some help parsing HTML: I want to extract everything that starts with http:// and contains "abc", up to the first occurrence of ", ', or a blank space.
I have a regex like /http:\/\/abc(.*)\"/ but it's not working well :\
Are there any ideas? :)
P.S. Sorry for my bad English, it's not my native language ;)

StackOverflow tends to prefer an HTML Document Parser over Regular Expressions for parsing HTML.
However, with that said, if you just want URLs from a string that happens to be HTML, I still believe a Regex is fine for the job.
Try preg_match_all:
preg_match_all("/http:\/\/[^\s'\"]*abc[^\s'\"]*/", $string, $matches);
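As a quick check of the pattern above, here is a minimal sketch run against a made-up input. The sample string and its URLs are assumptions for illustration only; the pattern is the same one, written in a single-quoted PHP string with ~ as the delimiter so the slashes need no escaping.

```php
<?php
// Hypothetical sample input; the URLs are invented for illustration.
$string = '<a href="http://abc.example.com/page">one</a> '
        . "<img src='http://example.com/abc.png'> "
        . '<a href="http://example.com/other">two</a> '
        . 'plain http://example.org/xabcx text';

// "http://", then a run of non-delimiter characters that contains "abc".
// A URL ends at the first whitespace, single quote, or double quote.
preg_match_all('~http://[^\s\'"]*abc[^\s\'"]*~', $string, $matches);

// $matches[0] holds the full matches; the URL without "abc" is skipped.
print_r($matches[0]);
```

Note that the result lives in $matches[0], not in $matches itself.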

Use a parser instead of a regex.
RegEx match open tags except XHTML self-contained tags
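For comparison, a parser-based version might look roughly like the sketch below. It assumes the URLs of interest sit in href or src attributes, which the question does not actually state, and the sample markup is invented.

```php
<?php
// Hypothetical sample markup for illustration.
$html = '<html><body>'
      . '<a href="http://abc.example.com/page">one</a>'
      . '<img src="http://example.com/abc.png">'
      . '<a href="http://example.com/other">two</a>'
      . '</body></html>';

libxml_use_internal_errors(true);   // real-world HTML is rarely valid; collect warnings quietly
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$abcUrls = [];
// Assumption: only href and src attributes are of interest.
foreach ($xpath->query('//@href | //@src') as $attr) {
    $url = $attr->value;
    if (strpos($url, 'http://') === 0 && strpos($url, 'abc') !== false) {
        $abcUrls[] = $url;
    }
}
print_r($abcUrls);
```

Unlike the regex, this will not pick up URLs that appear in plain text or in other attributes.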

If all you want to do is extract URLs, regexen are a good choice. You don't need to get into the parser world.
If you have unix-like command tools you could approximate it very simply (assuming one url per line) with two passes:
grep http myfile.html | grep abc
You can use preg_grep() similarly.
preg_match_all('/http:[^"\' ]+/', $html, $urls);
# $urls[0] contains all the URLs from your document
$abc_urls = preg_grep('/abc/', $urls[0]);

Related

Capturing text within HTML tag using PHP and preg_match [duplicate]

This question already has answers here:
PHP parse/syntax errors; and how to solve them
(20 answers)
Closed 5 years ago.
I am hitting a road block with a script I have to check availability on a certain website. I need the text within html tags and I am unsure how to approach it.
My code I have tested ended with this:
<?php
ini_set("allow_url_fopen", 1);
$homepage2 = file_get_contents('https://www.someurlwithavailability.com');
//URL has the following HTML tag: <div id="Availability">
Availability: Special Offer, ships within 10 - 15 business days </div>"
preg_match("/<div id="Availability">(.*?)</div>/si", $homepage2, $avail);
print_r($avail);
echo '<br>', '~Availability is~', '<br>', $avail, '<br>';
$stringavail=implode(" ",$avail);
echo $stringavail;
?>
I get various errors depending on what I put after preg_match(***,$homepage2, $avail); and I am unsure about what syntax I need to enter to retrieve the text.
My code above gives me this:
Parse error: syntax error, unexpected 'Availability' (T_STRING) in /u/o/placeiamrunningthecodefrom.php on line 6
The URL that is requested comes back with a full HTML page that is quite large. This HTML tag is unique and does not repeat.
Anyone able to help me out?
Although this can work just fine with a regex, it's neither recommended nor easier.
I'd suggest giving DOMDocument::getElementById a go. It even has an example right on the page:
$doc = new DomDocument;
// We need to validate our document before referring to the id
$doc->validateOnParse = true;
$doc->Load('book.xml');
echo "The element whose id is 'php-basics' is: " . $doc->getElementById('php-basics')->tagName . "\n";
Now, to get the content instead of the tag name, we can use ->textContent, which is inherited from DOMNode.
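Applied to the availability page, the idea could be sketched as follows. The HTML string is a made-up stand-in for the downloaded page, and the lookup uses a DOMXPath query on the id attribute rather than getElementById, so the sketch does not depend on whether the parser registers id as an ID attribute.

```php
<?php
// Hypothetical stand-in for the page fetched with file_get_contents().
$homepage2 = '<html><body><h1>Product</h1>'
           . '<div id="Availability"> Availability: Special Offer, ships within 10 - 15 business days </div>'
           . '</body></html>';

libxml_use_internal_errors(true);   // tolerate invalid real-world markup
$doc = new DOMDocument();
$doc->loadHTML($homepage2);
libxml_clear_errors();

// Find the div by its id attribute and read its text content.
$xpath = new DOMXPath($doc);
$node = $xpath->query('//div[@id="Availability"]')->item(0);
$availability = $node !== null ? trim($node->textContent) : null;

echo $availability, "\n";
```

textContent returns the text of the element and all its descendants, so nested markup inside the div would not break the extraction.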
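Try using single quotes around that pattern.
Also, make sure you escape the characters that are special inside the regex, such as the / delimiter.
Your (.*?) is lazy, so it stops at the first </div>, but it will still run across any other tags on the way there. You can be more specific with a negated character class: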
'/<div id="Availability">([^<]*)<\/div>/si'
instead of
"/<div id="Availability">(.*?)</div>/si"
Of course, this could still be unreliable if there is HTML inside the <div>.
But, this should get you closer.
Also, try an online regex tool. I like this one.
https://regex101.com/
The problem is that you have double quotes inside your double-quoted string, and didn't escape them:
preg_match("/<div id="Availability">(.*?)</div>/si", $homepage2, $avail);
                     ^            ^
If you used a decent IDE it would have alerted you to this as you were typing.
Simply change the delimiting quotes to single quotes.
Also, since your regexp delimiter / appears in the regular expression, you either need to escape the character where it appears in the regexp, or use a delimiter that isn't in the expression.
preg_match('#<div id="Availability">(.*?)</div>#si', $homepage2, $avail);
However, using regular expressions to parse HTML is generally a bad idea. You should use a DOM parser library like the DOMDocument class.
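Put together, the corrected call could be tried out like this. The page content is a made-up stand-in for the real download; the pattern is the one shown above.

```php
<?php
// Hypothetical stand-in for the page fetched with file_get_contents().
$homepage2 = '<p>intro</p><div id="Availability">' . "\n"
           . 'Availability: Special Offer, ships within 10 - 15 business days </div><div>other</div>';

// Single-quoted PHP string and # as the regex delimiter,
// so neither the double quotes nor the slash need escaping.
$text = null;
if (preg_match('#<div id="Availability">(.*?)</div>#si', $homepage2, $avail)) {
    // $avail[0] is the whole match, $avail[1] the first capture group.
    $text = trim($avail[1]);
    echo $text, "\n";
}
```

Note that $avail is an array, so echoing it directly (as the question's code does) prints "Array" instead of the text; the captured text is in $avail[1].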

Remove div php - part of string [duplicate]

This question already has answers here:
How to remove text between tags in php?
(6 answers)
Strip HTML tags and its contents
(2 answers)
Closed 9 years ago.
I want to remove the part of a string between two HTML tags. I have something like this:
$variable = "This is something that I don't want to delete<blockquote>This is I want to delete </blockquote>";
The problem is that the string inside the blockquote tag changes, and it needs to be deleted no matter what it is. Does anyone know how?
Regexes are not the best tool for parsing HTML strings; you should take a look at Simple HTML DOM Parser or PHP's DOMDocument class.
If you still want to use a regex, in this case it would be, for example:
$variable = preg_replace('/<blockquote>.+<\/blockquote>/siU', '', $variable);
Test it there.
You can use regular expressions, but this is by no means fail-safe and should only be used in trivial cases. A better way is to use a full-fledged HTML parser.
<?php
$str = preg_replace('#<blockquote>.*</blockquote>#siU', '', $str);
?>
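For the parser route, a rough DOMDocument sketch is shown below. Wrapping the fragment in a div is an assumption of this sketch (it gives the fragment one known parent to serialize from); it is not something the answers prescribe.

```php
<?php
$variable = "This is something that I don't want to delete<blockquote>This is I want to delete </blockquote>";

libxml_use_internal_errors(true);   // tolerate fragments and invalid markup
$doc = new DOMDocument();
// Assumption: wrap the fragment in a div so it has a single known parent.
$doc->loadHTML('<div>' . $variable . '</div>');
libxml_clear_errors();

$wrapper = $doc->getElementsByTagName('div')->item(0);

// Copy the live node list to an array before removing nodes from the tree.
foreach (iterator_to_array($doc->getElementsByTagName('blockquote')) as $quote) {
    $quote->parentNode->removeChild($quote);
}

// Serialize whatever is left inside the wrapper.
$result = '';
foreach ($wrapper->childNodes as $child) {
    $result .= $doc->saveHTML($child);
}
echo $result, "\n";
```

This removes every blockquote, including nested ones, which the lazy regex cannot do reliably.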
Please try using a regex this way:
<?php
$variable = "This is something that I don't want to delete<blockquote>This is I want to delete </blockquote>";
$str = preg_replace('#(<blockquote>).*?(</blockquote>)#', '$1$2', $variable);
print($str);
?>

Strip RTF strings with PHP - Regex [duplicate]

This question already has answers here:
Regular Expression for extracting text from an RTF string
(11 answers)
Closed 9 years ago.
A column in the database I work with contains RTF strings. I would like to strip the RTF markup using PHP, leaving just the sentence in between.
It is an MS SQL 2005 database, if I recall correctly.
Here is an example of the kind of string pulled from the database (if you need more, let me know; the rest are all similar):
{\rtf1\ansi\ansicpg1252\deff0\deflang2057{\fonttbl{\f0\fnil\fcharset0 Tahoma;}}
\viewkind4\uc1\pard\lang1033\f0\fs17 ASSEMBLE COMPONENTS AS DETAILED ON DRAWING.\lang2057\fs17\par
}
I would like this to be stripped to only return:
ASSEMBLE COMPONENTS AS DETAILED ON DRAWING.
Now, I have successfully managed to strip the characters in ASP.NET for a previous project, however I would like to do so using PHP. Here is the regular expression I used in ASP.NET, which works flawlessly may I add:
"(\{.*\})|}|(\\\S+)"
However when I try to use the same expression in PHP with a preg_replace it does not strip half of the characters.
Any regex gurus out there?
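Use this code:
$string = preg_replace('/(\{.*\})|\}|(\\\\\S+)/', '', $string);
Note that I added a '/' delimiter at the beginning and at the end of the regex, which preg_replace requires.
The pattern is also single-quoted, with four backslashes before \S. Written as "\\\S+" in a PHP string, it reaches the regex engine as \\S+, which matches a literal backslash followed by the letter S rather than a backslash followed by non-whitespace characters.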
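To see how PHP string escaping interacts with the pattern, here is a small sketch run on the sample string from the question. It writes the pattern in single quotes with the backslash doubled twice, because a PHP string literal consumes one level of backslashes before the regex engine sees the pattern. The trim() call is an addition of this sketch to drop the leftover whitespace.

```php
<?php
// The sample RTF string from the question, in a nowdoc so backslashes stay literal.
$string = <<<'RTF'
{\rtf1\ansi\ansicpg1252\deff0\deflang2057{\fonttbl{\f0\fnil\fcharset0 Tahoma;}}
\viewkind4\uc1\pard\lang1033\f0\fs17 ASSEMBLE COMPONENTS AS DETAILED ON DRAWING.\lang2057\fs17\par
}
RTF;

// Single-quoted pattern: \\\\ in PHP source becomes \\ in the regex,
// i.e. one literal backslash, followed by \S+ (non-whitespace characters).
$pattern = '/(\{.*\})|\}|(\\\\\S+)/';

// Remove groups, stray braces and control words, then trim whitespace.
$plain = trim(preg_replace($pattern, '', $string));
echo $plain, "\n";
```

This is only as general as the original expression: it relies on the header group sitting on its own line, and it is not a full RTF parser.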

PHP Regular expression: Get all urls with question mark

I have this regular expression:
preg_match_all("/<a\s.*?href\s*=\s*['|\"](.*?)(?=#|\"|')/si", $data, $matches);
to find all URLs. It works fine, but how can I modify it to find only URLs with question marks?
Example:
0123
And preg_match_all will return:
http://site.com/index.php?id=1
http://site.com/calc/index.php?id=1&scheme=Venus
preg_match_all("#<a\s*href\s*=[\'\"]([^\'\"]+\?[^\'\"]+)[\'\"]#si", $data, $matches);
Try this.
Don't try to make everything happen in one regex. Use your existing method, and then separately check the URL that you get back to see if it has a question mark in it.
That said, don't use regular expressions to parse HTML. You cannot reliably parse HTML with regular expressions, and you will face sorrow and frustration down the road. As soon as the HTML changes from your expectations, your code will be broken. See http://htmlparsing.com/php for examples of how to properly parse HTML with PHP modules that have already been written, tested and debugged.
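The two-step approach described above could be sketched like this. The sample markup is hypothetical (the question's original example did not survive), and the extraction pattern is the one from the question, only re-quoted for a single-quoted PHP string.

```php
<?php
// Hypothetical sample markup; two of the URLs mirror the ones in the question.
$data = '<a href="http://site.com/">0</a>'
      . '<a href="http://site.com/index.php?id=1">1</a>'
      . '<a href="http://site.com/about.html#team">2</a>'
      . '<a href="http://site.com/calc/index.php?id=1&scheme=Venus">3</a>';

// Step 1: extract every href value with the existing pattern.
preg_match_all('/<a\s.*?href\s*=\s*[\'|"](.*?)(?=#|"|\')/si', $data, $matches);

// Step 2: keep only the URLs that contain a question mark.
$withQuery = array_values(array_filter($matches[1], function ($url) {
    return strpos($url, '?') !== false;
}));
print_r($withQuery);
```

Keeping the filter separate means the extraction step can later be swapped for a DOM parser without touching the question-mark check.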
Andy Lester's answer describes the right thing to do.
Here's your regex though:
<a\s.*?href\s*=\s*['|\"](.*?\?.*?)(?=#|\"|')
as seen here:
http://rubular.com/r/LHi11VMMR9

PHP: Regular Expression to get a URL from a string [duplicate]

This question already has answers here:
Closed 12 years ago.
Possible Duplicates:
Identifying if a URL is present in a string
Php parse links/emails
I'm working on some PHP code which takes input from various sources and needs to find the URLs and save them somewhere. The kind of input that needs to be handled is as follows:
http://www.youtube.com/watch?v=IY2j_GPIqRA
Try google: http://google.com! (note exclamation mark is not part of the URL)
Is http://somesite.com/ down for anyone else?
Output:
http://www.youtube.com/watch?v=IY2j_GPIqRA
http://google.com
http://somesite.com/
I've already borrowed one regular expression from the internet which works, but unfortunately wipes the query string out - not good!
Any help putting together a regular expression, or perhaps another solution to this problem, would be appreciated.
Jan Goyvaerts, Regex Guru, has addressed this issue in his blog. There are quite a few caveats, for example extracting URLs inside parentheses correctly. What you need exactly depends on the "quality" of your input data.
For the examples you provided, \b(?:(?:https?|ftp|file)://|www\.|ftp\.)[-A-Z0-9+&@#/%=~_|$?!:,.]*[A-Z0-9+&@#/%=~_|$] works when used in case-insensitive mode.
So to find all matches in a multiline string, use
preg_match_all('/\b(?:(?:https?|ftp|file):\/\/|www\.|ftp\.)[-A-Z0-9+&@#\/%=~_|$?!:,.]*[A-Z0-9+&@#\/%=~_|$]/i', $subject, $result, PREG_PATTERN_ORDER);
$result = $result[0];
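Here is a short sketch that runs this pattern over the three sample inputs from the question. Joining them into one string with newlines is an assumption about how the input arrives; the character classes here include @ alongside #.

```php
<?php
// The three sample inputs from the question, joined into one multiline string.
$subject = "http://www.youtube.com/watch?v=IY2j_GPIqRA\n"
         . "Try google: http://google.com! (note exclamation mark is not part of the URL)\n"
         . "Is http://somesite.com/ down for anyone else?";

// The last character class has no "!" or ".", so trailing punctuation is left out of the match.
$pattern = '/\b(?:(?:https?|ftp|file):\/\/|www\.|ftp\.)[-A-Z0-9+&@#\/%=~_|$?!:,.]*[A-Z0-9+&@#\/%=~_|$]/i';

preg_match_all($pattern, $subject, $result, PREG_PATTERN_ORDER);
$urls = $result[0];
print_r($urls);
```

The query string of the first URL is kept, and the exclamation mark after google.com is dropped because the final character of a match must come from the second, stricter class.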
Why not try this one? It is the first result of Googling "URL regular expression".
((https?|ftp|gopher|telnet|file|notes|ms-help):((\/\/)|(\\\\))+[\w\d:#@%\/;$()~_?\+-=\\\.&]*)
It's not PHP, but it should work; I just modified it slightly by escaping the forward slashes.
source
