I have the following pattern inside an HTML file, and I would like to parse it in PHP to get a link. So far I don't see a solution, because I am trying to use QueryPath and my case is simply not a common DOM element:
<script>
to.addVariable("site_name","http://www.sitename.com");
</script>
I would just like to return the link part of that pattern in order to print it.
I hope someone can recommend how to do this.
Thank you.
UPDATE: I would like to get http://www.sitename.com as a value from the code above using php, maybe with phpQuery or QueryPath.
Something like this should work:
<?PHP
$text = '
<script>
to.addVariable("site_name","http://www.sitename.com");
</script>
';
// Print the captured URL only if the pattern actually matched
if (preg_match('#to\.addVariable\("site_name","([^"]+)"\);#', $text, $matches)) {
echo $matches[1]; // http://www.sitename.com
}
?>
You can also use preg_match_all if you have more than one to.addVariable(... strings in your <script> section.
Try this regular expression:
$regex = '#to\.addVariable\("(.+?)",\s*"(.+?)"\)#';
Then use preg_match_all to get the matches. If you want to check that the URL is an actual URL, take any regular expression that matches URLs and put it in place of the second .+?. These patterns will match anything between the quotes, so you should check that you have what you need unless you trust the source.
NOTE: The " character is not special in a regular expression, so it does not need to be escaped in the pattern, and inside a single-quoted PHP string it needs no escaping either.
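To make the preg_match_all suggestion concrete, here is a minimal sketch. The function name extractAddVariableUrls and the second addVariable line in the sample input are my own assumptions for illustration, and filter_var() stands in for "any regular expression that matches URLs".

```php
<?php
// Hypothetical helper: collect every to.addVariable("name","value") pair
// and keep only the values that pass PHP's built-in URL validation.
function extractAddVariableUrls(string $html): array
{
    $urls = [];
    // \s* tolerates optional whitespace after the comma.
    $pattern = '#to\.addVariable\("([^"]+)",\s*"([^"]+)"\)#';
    if (preg_match_all($pattern, $html, $matches, PREG_SET_ORDER)) {
        foreach ($matches as $m) {
            // filter_var() returns false when the value is not a valid URL.
            if (filter_var($m[2], FILTER_VALIDATE_URL) !== false) {
                $urls[$m[1]] = $m[2];
            }
        }
    }
    return $urls;
}

// Sample input; the "title" line is invented to show a rejected value.
$html = '<script>
to.addVariable("site_name","http://www.sitename.com");
to.addVariable("title", "not a url");
</script>';

print_r(extractAddVariableUrls($html));
```

The result is keyed by variable name, so you can read $urls['site_name'] directly.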
Hope I can help!
If you don't understand something drop a comment!
Related
I have a little problem with the preg_match function in PHP. I think I will never learn how to use this function. I want to extract the URL of an image from HTML without the name of the image. For example, if I have a link to an image:
"/data/images/2013-10-03/someimage.jpg"
or
"http://something.com//data/images/2013-10-03/someimage.jpg"
How can I use the preg_match function to delete everything to the left of the last forward slash, so that I get only the image name from the URL?
Maybe it's smarter to use a different function, but I don't know which one.
P.S. Can you recommend a good tutorial for the preg_match function?
Maybe I forgot to say: I don't know how long the image name is or what exactly it is. I need a function that extracts only what is to the right of the last forward slash.
$pattern = '/[\w\-]+\.(jpg|png|gif|jpeg)/';
$subject = 'http://something.com//data/images/2013-10-03/someimage.png';
if (preg_match($pattern, $subject, $matches)) {
echo $matches[0]; // someimage.png
}
No need for regex or anything fancy:
$var = "http://something.com/data/images/2013-10-03/someimage.jpg";
$image = basename($var);
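One caveat with plain basename(): if the URL carries a query string, it ends up in the result. A small sketch that runs the URL through parse_url() first; the function name imageNameFromUrl and the ?size=large suffix are assumptions for illustration only.

```php
<?php
// Hypothetical helper: return the file name of an image URL,
// ignoring any query string or fragment.
function imageNameFromUrl(string $url): string
{
    // Keep only the path component of the URL.
    $path = parse_url($url, PHP_URL_PATH);
    // parse_url() gives null when there is no path and false on malformed input.
    if (!is_string($path)) {
        return '';
    }
    // basename() returns everything after the last forward slash.
    return basename($path);
}

echo imageNameFromUrl('http://something.com//data/images/2013-10-03/someimage.jpg?size=large');
```

It also works for the relative form, such as /data/images/2013-10-03/someimage.jpg.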
You need to use preg_replace(), and you can experiment with regular expressions online, which is a fast way to learn regex: http://preg_replace.onlinephpfunctions.com/
For example, replace /\/someimage\.jpg/ with '' (an empty string).
It will return http://something.com//data/images/2013-10-03 from http://something.com//data/images/2013-10-03/someimage.jpg.
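Since the image name is not known in advance, the pattern can target "the last slash and whatever follows it" instead of a literal file name. A minimal sketch; stripLastSegment is a name I made up.

```php
<?php
// Hypothetical helper: drop the last "/" and everything after it.
function stripLastSegment(string $url): string
{
    // [^/]*$ matches the trailing run of non-slash characters;
    // "#" is the delimiter so "/" needs no escaping.
    // Fall back to the input if preg_replace() reports an error (null).
    return preg_replace('#/[^/]*$#', '', $url) ?? $url;
}

echo stripLastSegment('http://something.com//data/images/2013-10-03/someimage.jpg');
```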
You can use Simple HTML DOM Parser to get the href of the a tags.
For example:
foreach($html->find('a[class="your class"]') as $var)
echo $var->href;
hope this helps!
I'm trying to parse a direct link out of a javascript function within a page. I'm able to parse the html info I need, but am stumped on the javascript part. Is this something that is achievable with php and possibly regex?
function videoPoster() {
document.getElementById("html5_vid").innerHTML =
"<video x-webkit-airplay='allow' id='html5_video' style='margin-top:"
+ style_padding
+ "px;' width='400' preload='auto' height='325' controls onerror='cantPlayVideo()' "
+ "<source src='http://video-website.com/videos/videoname.mp4' type='video/mp4'>";
}
What I need to pull out is the link "http://video-website.com/videos/videoname.mp4". Any help or pointers would be greatly appreciated!
#http://.*?\.mp4# will give you all characters from http:// through the first .mp4, inclusive. If you use / as the delimiter instead, the slashes in http:// must be escaped.
If you need the session id, use something like #http://.*?\.mp4\?sessionid=\d+# (note the escaped ?, which would otherwise make the preceding 4 optional).
In general, no. Nothing short of a full javascript parser will always extract urls, and even then you'll have trouble with urls that are computed nontrivially.
In practice, it is often best to use the simplest capturing regexp that works for the code you actually need to parse. In this case:
['"](http://[^'"]*)['"]
If you have to enter that regexp as a string, beware of escaping.
If you ever have unescaped quotation marks in urls, this will fail. That's valid but rare. Whoever is writing the stuff you're parsing is unlikely to use them because they make referring to the urls in javascript a pain.
For your specific case, this should work, provided that none of the characters in the URL are escaped.
preg_match("/src='([^']*)'/", $html, $matches);
$url = $matches[1];
See the preg_match() manual page. You should probably add error handling, ensuring that the function returns 1 (that the regex matched) and possibly performing some additional checks as well (such as ensuring that the URL begins with http:// and contains .mp4).
(As with all Web scraping techniques, the owner or maintainer of the site you are scraping may make a future change that breaks your script, and you should be prepared for that.)
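The error handling described above could look something like this. Treat it as a sketch: extractVideoUrl is a hypothetical name, and the two sanity checks are just the ones mentioned in the answer.

```php
<?php
// Hypothetical helper: return the video URL, or null if nothing usable was found.
function extractVideoUrl(string $html): ?string
{
    // preg_match() returns 1 on a match, 0 on no match, and false on error.
    if (preg_match("/src='([^']*)'/", $html, $matches) !== 1) {
        return null;
    }
    $url = $matches[1];
    // Extra checks: the URL must start with http:// and contain .mp4.
    if (strpos($url, 'http://') !== 0 || strpos($url, '.mp4') === false) {
        return null;
    }
    return $url;
}

$html = "<source src='http://video-website.com/videos/videoname.mp4' type='video/mp4'>";
$url = extractVideoUrl($html);
echo $url === null ? 'No video URL found' : $url;
```

Returning null keeps the "site changed its markup" case explicit for the caller instead of raising an undefined index notice.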
The following captures every http or https URL that appears in a src attribute in your HTML:
$matches=array();
if (preg_match_all('/src=["\'](?P<urls>https?:\/\/[^"\']+)["\']/', $html, $matches)){
print_r($matches['urls']);
}
If you want to do the same in JavaScript, you could use this:
var matches;
if (matches=html.match(/src=["'](https?:\/\/[^"']+)["']/g)){
//gives you all matches, but they are still including the src=" and " parts, so you would
//have to run every match again against the regex without the g modifier
}
I have a string that contains a lot of links and I would like to adjust them before they are printed to screen:
I have something like the following:
replace_this
and would like to end up with something like this
replace this
Normally I would just use something like:
echo str_replace("_"," ",$url);
In this case I can't do that, because the URL contains underscores and it would break my links. My thought was that I could use a regular expression to get around this.
Any ideas?
Here's the regex: <a(.+?)>.+?<\/a>.
What I'm doing is preserving the important dynamic stuff within the anchor tag and replacing the rest with the following function:
preg_replace('/<a(.+?)>.+?<\/a>/i',"<a$1>REPLACE</a>",$url);
This will cover most cases, but I suggest you review to make sure that nothing unexpected was missed or changed.
// Match only underscores that are followed by a < before any >, i.e. outside of tags
$pattern = '/_(?=[^>]*<)/';
echo preg_replace($pattern, " ", $url);
You can use this regular expression
>(.*?)<\s*/
along with preg_replace_callback.
EDIT:
// PHP has no g modifier; preg_replace_callback already replaces every match
$replaced_text = preg_replace_callback('~>(.*?)<\s*/~', 'uscore_replace', $text);
function uscore_replace($matches){
return str_replace('_', ' ', $matches[0]); // $matches[0] is the whole match, from > up to </
}
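Here is a self-contained variant of the callback idea that touches only the text between the opening and closing a tags. The sample link is invented (the original markup in the question did not survive), and underscoresToSpacesInLinkText is a hypothetical name.

```php
<?php
// Hypothetical helper: replace underscores with spaces in link text only,
// leaving the href (and any other attribute) untouched.
function underscoresToSpacesInLinkText(string $html): string
{
    return preg_replace_callback(
        // Group 1: opening tag, group 2: link text, group 3: closing tag.
        '~(<a\b[^>]*>)(.*?)(</a>)~is',
        function (array $m): string {
            return $m[1] . str_replace('_', ' ', $m[2]) . $m[3];
        },
        $html
    );
}

// Assumed input shape: a link whose URL and text both contain underscores.
$url = '<a href="http://example.com/replace_this">replace_this</a>';
echo underscoresToSpacesInLinkText($url);
```

As with any regex over HTML, nested or unusual markup inside the link may need a real parser.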
I think I am right in assuming that regex can do this job; I'm just not sure how I would do it!
Basically I have a number of links on my website that are in the format of:
Example
I need some code that will transform the href value so that it is output in lowercase, but that does not affect the anchor text. E.g.:
Example
Is this possible? And if so, what would be the code to do this?
You can use preg_replace_callback, something like this:
function replace($matches){
// $matches[0] is the whole href="..." attribute
return strtolower($matches[0]);
}
...
$str = preg_replace_callback('/href="[^"]*"/i', 'replace', $str);
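A runnable version of the same idea, for reference. The sample link is my own (the example markup in the question was lost), and lowercaseHrefs is a hypothetical name.

```php
<?php
// Hypothetical helper: lowercase the value of every href attribute,
// leaving the anchor text as it is.
function lowercaseHrefs(string $html): string
{
    return preg_replace_callback(
        '/href="([^"]*)"/i',
        function (array $m): string {
            // $m[1] is the attribute value without the surrounding quotes.
            return 'href="' . strtolower($m[1]) . '"';
        },
        $html
    );
}

echo lowercaseHrefs('<a href="http://www.Example.com/Page">Example</a>');
```

Only double-quoted href values are handled here; single-quoted attributes would need a second alternative in the pattern.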
Using the preg_match and strtolower functions:
if (preg_match('/<a(.*?)>(.*?)<\/a>/i', $cadena, $a)) {
// Lowercase the attributes of the first link only and keep its anchor text
$cadena = str_replace($a[0], '<a' . strtolower($a[1]) . '>' . $a[2] . '</a>', $cadena);
}
echo $cadena;
Regards!
I need to preg_match for
src="http:// "
where the blank space following // is the rest of the URL, ending with the ". My adapted pattern doesn't seem to work:
preg_match('#src="(http://[^"]+)#', $data, $match);
I am also struggling to get text that starts with > and ends with EITHER a full stop (.), an exclamation mark (!), or a question mark (?). I have no idea how to do this one. An example of the text I want to preg_match for is:
blahblahblah>Hello world this is what I want.
I'm hoping a kind preg_match guru can tell me the answer and save me hours of headscratching.
Thanks for reading.
As for the URL:
preg_match('#src="(.*?)"#', $data, $match);
and for the second case, use />(.*?)(\.|!|\?)/
(.*?)" will match any characters lazily (non-greedily), stopping at the first closing double quote it sees.
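Both patterns in use, as a quick sketch. The function names and the img tag in the sample are assumptions; I folded the punctuation into a character class, which matches the same thing as (\.|!|\?).

```php
<?php
// Hypothetical helper: value of the first src="..." attribute, or null.
function firstSrc(string $data): ?string
{
    return preg_match('#src="(.*?)"#', $data, $m) === 1 ? $m[1] : null;
}

// Hypothetical helper: text after the first ">" up to and including
// the first ".", "!" or "?", or null if there is none.
function firstSentenceAfterBracket(string $data): ?string
{
    // The s modifier lets . match newlines in case the sentence spans lines.
    return preg_match('/>(.*?[.!?])/s', $data, $m) === 1 ? $m[1] : null;
}

echo firstSrc('<img src="http://example.com/image.png">'), "\n";
echo firstSentenceAfterBracket('blahblahblah>Hello world this is what I want.'), "\n";
```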
It seems that you want to parse a document or string that follows an HTML, DOM, XML, or similar structure.
Use XPath to navigate to the tag and return its src attribute. This will save you much trouble, and you can forget about regular expressions.
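A minimal sketch of the XPath route using PHP's bundled DOM extension. The function name srcAttributes and the sample markup are assumptions; adjust the XPath expression if you only want a specific tag such as //img/@src.

```php
<?php
// Hypothetical helper: return the value of every src attribute in an HTML string.
function srcAttributes(string $html): array
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely perfect; collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $values = [];
    // //@src selects every src attribute node anywhere in the document.
    foreach ($xpath->query('//@src') as $attr) {
        $values[] = $attr->nodeValue;
    }
    return $values;
}

print_r(srcAttributes('<p><img src="http://example.com/image.png" alt="demo"></p>'));
```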