I'm trying to scrap some information just for learning PHP and regex and I would like to extract it from an html.
The html text is an entire webpage but it has some patterns like somehtmltext_andtags_andeverything /ajax/hovercard/user.php?id=THE_ID_I_WANT andmore_text_and_tags.
I can isolate the pattern with TextEdit in Mac, but I want separate it!
how could I make it in PHP?
Thank you in advance!
Rafael.
Sorry, I was very unclear.
I want to separate only de ID, so if you see the image, the only text you would get is 100009799451329 . If the final result is the whole sentence (ajax/hovercard/user.php?id=100009799451329) it doesn't matter, goes fine for me!
try this
$matchArr = NULL;
preg_match_all("/\/ajax\/hovercard\/user\.php\?id=(.*?)\&/", $yourStr, $matchArr);
print_r($matchArr);
You can use the following pattern to find the id:
\/ajax\/hovercard\/user.php\?id=(\d+)
See a demo.
Explanation:
\/ajax\/hovercard\/user.php\?id= will match /ajax/hovercard/user.php?id=
(\d+) captures a sequence of digits, in this case the user id.
Related
I know there are other posts with a similar name but I've looked through them and they haven't helped me resolve this.
I'm trying to get my head around regex and preg_match. I am going through a body of text and each time a link exists I want it to be extracted. I'm currently using the following:
$reg_exUrl = "/(http|https|ftp|ftps)\:\/\/[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(\/\S*)?/";
which works fine until it finds one that has <br after it. Then I get the url plus the <br which means it doesn't work correctly. How can I have it so that it stops at the < without including it?
Also, I have been looking everywhere for a clear explanation of using regex and I'm still confused by it. Has anyone any good guides on it for future reference?
\S* is too broad. In particular, I could inject into your code with a URL like:
http://hax.hax/"><script>alert('HAAAAAAAX!');</script>
You should only allow characters that are allowed in URLs:
[-A-Za-z0-9._~:/?#[]#!$&'()*+,;=]*
Some of these characters are only allowed in specific places (such as ?) so if you want better validation you will need more cleverness
Instead of \S exclude the open tag char from the class:
$reg_exUrl = "/(http|https|ftp|ftps)\:\/\/[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(\/[^<]*)?/";
You might even want to be more restrictive by only allowing characters valid in URLs:
$reg_exUrl = "/(http|https|ftp|ftps)\:\/\/[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(\/[a-zA-Z_\-\.%\?&]*)?/";
(or some more characters)
You could use this one as presented on the:
http://regex101.com/r/zV1uI7
On the bottom of the site you got it explained step by step.
I'm a newbie here. I'm facing a weird problem in using regex in PHP.
$result = "some very long long string with different kind of links";
$regex='/<.*?href.*?="(.*?net.*?)"/'; //this is the regex rule
preg_match_all($regex,$result,$parts);
Here in this code I'm trying to get the links from the result string. But it will provide me only those links which contains .net. But I also want to get those links which have .com. For this I tried this code
$regex='/<.*?href.*?="(.*?net|com.*?)"/';
But it shows nothing.
SOrry for my bad English.
Thanks in advance.
Update 1 :
now i'm using this
$regex='/<.*?href.*?="(.*?)"/';
this rule grab all the links from the string. But this is not perfect. Because it also grabs other substrings like "javascript".
The | character applies to everything within the capturing group, so (.*?net|com.*?) will match either .*?net or com.*?, I think what you want is (.*?(net|com).*?).
If you do not want the extra capturing group, you can use (.*?(?:net|com).*?).
You could also use (.*?net.*?|.*?com.*?), but this is not recommended because of the unnecessary repetition.
Your regex gets interpreted as .*?net or com.*?. You'll want (.*?(net|com).*?).
Try this:
$regex='/<.*?href.*?="(.*?\.(?:net|com)\b.*?)"/i';
or better:
$regex='/<a .*?href\s*+=\s*+"\K.*?\.(?:net|com)\b[^"]*+/i';
<.*?href
is a problem. This will match from the first < on the current line to the first href, regardless of whether they belong to the same tag.
Generally, it's unwise to try and parse HTML with regexes; if you absolutely insist on doing that, at least be a bit more specific (but still not perfect):
$regex='/<[^<>]*href[^<>=]*="(?:[^"]*(net|com)[^"]*)"/';
I'm a bit out of my depth here, but I'm trying to let users write links, not entirely unlike how StackOverflow formats links.
Ideally, when a user enters [text to link here][23] in a text area, they will end up with text to link here when that information is written from the database to the page.
My part I'm failing is how to extract text to link and 23 from the matched pattern
This is the pattern I'm using, but a preg_replace using this will only allow me to replace the entire string, not operate on it. I might be mistaken in my use of preg_replace.
You can operate with preg_replace_callback, but even that is unnecessary:
preg_replace('/\[(.+?)\]\[(\d+)?\]/', '\1', $st);
This seems to be the trick you're looking for: ([[a-zA-Z0-9- ]+])([[0-9]+])
The parenthesis ( and ) are used for grouping...
I'm stuck with php preg_match_all function. Maybe someone wil help me with regexp. Let's assume we have some code:
[a]a[/a]
[s]a[/s]
[b]1[/b]
[b]2[/b]
...
...
[b]n[/b]
[e]a[/e]
[b]8[/b]
[b]9[/b]
...
...
[b]n[/b]
I need to match all that inside [b] tags located between [s] and [e] tags. Any ideas?
if your structure is exactly the same as above I would personally avoid regex (not a good idea with these fort of languages) and just check the second char of each line. Once you see an s go into consume mode and for each line until you see an e find the first ] and read in everything between that and the next [
For simplicity use two preg_match calls.
First to retrieve the list you want to inspect /\[s](.+?)\[e]/s.
And then use that result string and match for the contained /\[b](.+?)\[\/b]/s things.
It looks like you are trying to pattern match something that has a treelike structure, essentially like HTML or XML. Any time you find yourself saying "find X located inside matching Y tags" you are going to have this problem.
Trying to do this sort of work with with regular expressions is a Bad Idea.
Here's some info copy/pasted from a different answer of mine for a similar question:
Some references to similar SO posts which will give you an idea of the difficulty you're getting into:
Regex to match all HTML tags except <p> and </p>
Regex to replace all \n in a String, but no those inside [code] [/code] tag
RegEx match open tags except XHTML self-contained tags - bobince says it much more thoroughly than I do (:
The "Right Thing" to do is to parse your input, maintaining state as you go. This can be as simple as scanning your text and keeping a stack of current tags.
Regular expressions alone aren't sufficient to parse XML, and this appears to be a simplified XML language here.
I need to preg_match for
src="http:// "
where the blank space following // is the rest of the url ending with the ". My adapted doesn't seem to work:
preg_match('#src="(http://[^"]+)#', $data, $match);
And I am also struggling to get text that starts with > and ends with EITHER a full stop . or an exclamation mark ! or a question mark ? I have no idea how to do this one. An example of the text I want to preg_match for is:
blahblahblah>Hello world this is what I want.
I'm hoping a kind preg_match guru can tell me the answer and save me hours of headscratching.
Thanks for reading.
As for the URL:
preg_match('#src="(.*?)"#', $data, $match);
and for the second case, use />(.*?)(\.|!|\?)/
(.*?)" will match any character greedily up until the time it sees the end double quote
It seems that you want to parse a document or string which follows a HTML, DOM, XML or something similiar structure.
Use XPath, and parse to the Tag and let it return the src Attribute, this will save much trouble and you can forget about regular expressions.
Example: CLICK ME