I want to call file_get_contents on every link in my array, so that I can then apply preg_match and take the first 20 characters of the first p tag found on each page.
My code is below:
$links = array(0 => "http://en.wikipedia.org/wiki/The_Big_Bang_Theory", 1=> "http://en.wikipedia.org/wiki/Fantastic_Four");
print_r($links);
$link = implode(", " , $links);
$html = file_get_contents($link);
preg_match('%(<p[^>]*>.*?</p>)%i', $html, $re);
$res = get_custom_excerpt($re[1]);
echo $res;
You can only pass one URL at a time to file_get_contents. Combining the links with implode will not work; instead, you need to loop over the links and fetch the content of each one.
$links = array(0 => "http://en.wikipedia.org/wiki/The_Big_Bang_Theory", 1=> "http://en.wikipedia.org/wiki/Fantastic_Four");
print_r($links);
foreach($links as $link){
$html = file_get_contents($link);
preg_match('%(<p[^>]*>.*?</p>)%i', $html, $re);
$res = get_custom_excerpt($re[1]);
echo $res;
}
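The question never shows get_custom_excerpt, so the following is only a minimal sketch of what such a helper might look like. The function name first_paragraph_excerpt and the inline sample pages are assumptions made for illustration; the sample strings stand in for the result of file_get_contents so the sketch runs without network access.

```php
<?php
// Hypothetical stand-in for get_custom_excerpt(): returns the first
// $length bytes of the text inside the first <p> element, or '' if none.
function first_paragraph_excerpt(string $html, int $length = 20): string
{
    // "s" lets "." span line breaks, "i" ignores tag case
    if (!preg_match('%<p[^>]*>(.*?)</p>%is', $html, $match)) {
        return '';
    }
    // Drop nested tags and decode entities before cutting the text
    $text = trim(html_entity_decode(strip_tags($match[1])));
    return substr($text, 0, $length);
}

// Sample markup standing in for file_get_contents($link)
$pages = [
    'first'  => '<div><p class="lead">The <b>Big Bang Theory</b> is an American sitcom.</p><p>Second.</p></div>',
    'second' => '<p>Short.</p>',
];

foreach ($pages as $name => $html) {
    echo $name, ': ', first_paragraph_excerpt($html), PHP_EOL;
}
```

Note that substr counts bytes, so for multibyte text mb_substr would probably be the safer choice.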
Why not use their API?
You can still use file_get_contents to retrieve the, well, contents, but you can decide on a format more suitable to your needs.
Their API is documented quite well, see: http://www.mediawiki.org/wiki/API:Main_page
URLs will transform into something
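As a rough sketch of what such an API request might look like: the helper name wikipedia_extract_url is hypothetical, and the parameters prop=extracts, exintro and explaintext are an assumption that the TextExtracts module is enabled on the target wiki. Check the API documentation before relying on them.

```php
<?php
// Hypothetical helper: builds an API URL asking for the plain-text
// intro of one page. Parameter choice is an assumption, see above.
function wikipedia_extract_url(string $title): string
{
    $params = [
        'action'      => 'query',
        'prop'        => 'extracts',
        'exintro'     => 1,
        'explaintext' => 1,
        'titles'      => $title,
        'format'      => 'json',
    ];
    // http_build_query URL-encodes the values for us
    return 'https://en.wikipedia.org/w/api.php?' . http_build_query($params);
}

$apiUrl = wikipedia_extract_url('The Big Bang Theory');
echo $apiUrl, PHP_EOL;
```

The resulting URL can be passed to file_get_contents and the response decoded with json_decode, which avoids matching HTML with a regex altogether.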
Hi. I have a string that looks like this:
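<a href="https://website.com/c4ca4238a0b923820dcc509a6f75849b/2020/11/55650-vaikospinta-54vnt-lape.pdf">55650-vaikospinta-54vnt-lape.pdf</a>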
I'm trying to pull the URL out with PHP. I want a result like this:
https://website.com/c4ca4238a0b923820dcc509a6f75849b/2020/11/55650-vaikospinta-54vnt-lape.pdf
Things I've tried:
From another StackOverflow question, I tried this:
$a = new SimpleXMLElement($FileURL);
$file = 'SimpleXMLElement.txt';
file_put_contents($file, $a);
But the result I get is just the text between <a> and </a>:
55650-vaikospinta-54vnt-lape.pdf
Also from another StackOverflow question, I tried using preg_match, like this:
$file = 'preg_match.txt';
preg_match_all('/<a[^>]+href=([\'"])(?<href>.+?)\1[^>]*>/i', $FileURL, $result);
if (!empty($result)) {
# Found a link.
file_put_contents($file, $result);
}
I have no idea how regex works (assuming that's regex), but the result I get is just...:
ArrayArrayArrayArray
Thanks for any help!
You can use DOMDocument with loadHTML and getElementsByTagName, as below:
$str = '<a href="https://website.com/c4ca4238a0b923820dcc509a6f75849b/2020/11/55650-vaikospinta-54vnt-lape.pdf">55650-vaikospinta-54vnt-lape.pdf</a>';
$doc = new DOMDocument();
$doc->loadHTML($str);
$a = $doc->getElementsByTagName('a');
foreach ($a as $vals) {
    $href = $vals->getAttribute('href');
    echo $href, PHP_EOL;
}
If you don't want to use foreach, you can read the first match directly: $href = $a->item(0)->getAttribute('href');
The result will be:
https://website.com/c4ca4238a0b923820dcc509a6f75849b/2020/11/55650-vaikospinta-54vnt-lape.pdf
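If you insist on using a regular expression (regex), this works:
<?php
$your_var = '<a href="https://website.com/c4ca4238a0b923820dcc509a6f75849b/2020/11/55650-vaikospinta-54vnt-lape.pdf">55650-vaikospinta-54vnt-lape.pdf</a>';
preg_match('/<a[^>]+href=([\'"])(?<href>.+?)\1[^>]*>/i', $your_var, $result);
// The named group "href" holds the URL (it is also available as $result[2])
$url = $result['href'];
echo "Your URL: $url";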
For example, you can validate your regex online: https://regex101.com/
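The ArrayArrayArrayArray output in the question has a simple explanation: preg_match_all fills $result with one sub-array per group, and file_put_contents writes each sub-array as the word "Array". A minimal sketch of writing the captured URLs instead, assuming the original string was an a element whose href is the URL shown in the question (that markup is a reconstruction, not taken from the post):

```php
<?php
// Assumed input: the anchor markup is reconstructed for illustration
$FileURL = '<a href="https://website.com/c4ca4238a0b923820dcc509a6f75849b/2020/11/55650-vaikospinta-54vnt-lape.pdf">55650-vaikospinta-54vnt-lape.pdf</a>';

preg_match_all('/<a[^>]+href=([\'"])(?<href>.+?)\1[^>]*>/i', $FileURL, $result);

// $result has four entries (0, 1, 'href', 2), each of them an array,
// which is why writing it directly produced "Array" four times.
$urls = $result['href'];

// Join the captured URLs into one string before writing them out
$text = implode(PHP_EOL, $urls);
echo $text, PHP_EOL;
// file_put_contents('preg_match.txt', $text);
```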
XPath way:
$href = (string) simplexml_load_string($html)->xpath('//a/@href')[0];
My title may be confusing, so here is the actual question. I am parsing all the title names from $url and printing them, and that works fine. But what if I don't want the first and the third title names? Is it possible to write code in the foreach that skips the first ([0]) and the third ([2]) entries but keeps all the others? If this has already been answered, please point me to it, because I couldn't find anything. Thanks.
This is my code:
include 'lib/simple_html_dom.php';
$url="http://hallofbeorn.com/LotR?CardSet=The+Hunt+for+Gollum";
$html=file_get_html($url);
$array = [];
foreach ($html->find('a[style="margin-bottom:2px;font-size:medium;font-weight:bold;display:inline-block;"]') as $values) {
$array[] = $values->plaintext;
}
print_r($array);
I know I can print the entries one by one, e.g. print_r($array[1]); print_r($array[3]); print_r($array[4]); and so on, but I am asking whether there is a faster way inside the foreach.
You should look at regex.
Try this:
$url="http://hallofbeorn.com/LotR?CardSet=The+Hunt+for+Gollum";
$html=file_get_contents($url);
$pattern = '/<a href="([^"]*)" style="margin-bottom:2px;font-size:medium;font-weight:bold;display:inline-block;">(.*?)<\/a>/m';
preg_match_all($pattern, $html, $matches);
print_r($matches[2]);
A simple if statement can help you:
foreach ($html->find('a[style="margin-bottom:2px;font-size:medium;font-weight:bold;display:inline-block;"]') as $i => $values) {
if($i != 0 && $i != 2) {
$array[] = $values->plaintext;
}
}
print_r($array);
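If the list of positions to skip grows, the chained comparisons get unwieldy. One possible alternative is to filter the finished array by key. This is only a sketch; the sample titles and the $skip list are made up for illustration and stand in for the scraped values.

```php
<?php
// Made-up titles standing in for the scraped plaintext values
$titles = ['Title A', 'Title B', 'Title C', 'Title D', 'Title E'];

// Zero-based positions to leave out (the first and the third entry)
$skip = [0, 2];

// array_flip turns the positions into keys, array_diff_key drops the
// entries with those keys, and array_values renumbers what remains
$kept = array_values(array_diff_key($titles, array_flip($skip)));

print_r($kept);
```

Inside the loop itself, in_array($i, $skip) would do the same job as the chained != checks.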
You can use regex to fetch the data.
$url="http://hallofbeorn.com/LotR?CardSet=The+Hunt+for+Gollum";
$html=file_get_contents($url);
$pattern = '/(?P<cards><a href=".*" style="margin-bottom:2px;font-size:medium;font-weight:bold;display:inline-block;">.*<\/a>)/';
preg_match_all($pattern, $html, $matches);
header('content-type: text/plain; charset=utf-8');
print_r($matches);
I've been trying to use preg_match_all for 30 minutes but it looks like I can't do it.
Basically I have a $var which contains a string of HTML code. For example:
<br>iihfuhuf
<img title="Image: http://www.jlnv2.local/temp/temp513caca536fcd.jpeg"
src="http://www.jlnv2.local/temp/temp513caca536fcd.jpeg">
<img src="http://www.jlnv2.local/temp/temp513caca73b8da.jpeg"><br>
I want to get the src attribute values of img tags that contain /temp/temp[a-z0-9]{13}\.jpeg in their src value.
This is what I have so far:
preg_match_all('!(<img.*src=".*/temp/temp[a-z0-9]{13}\.jpeg"(.*alt=".*")?>)!', $content, $matches);
<img[^>]*src="([^"]*/temp/temp[a-z0-9]{13}\.jpeg)"
<img[^>]* matches the start of an img tag and any attributes before src
src="([^"]*)" matches the src attribute and captures its value as group 1
/temp/temp[a-z0-9]{13}\.jpeg is the filter applied to the src value
For quick RegEx tests use some online tool like http://regexpal.com/
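A short sketch of how the pattern above might be used with preg_match_all on the sample markup from the question. The choice of # as delimiter is only there to avoid escaping the slashes.

```php
<?php
// Sample markup from the question, including the line break inside the first tag
$content = '<br>iihfuhuf
<img title="Image: http://www.jlnv2.local/temp/temp513caca536fcd.jpeg"
src="http://www.jlnv2.local/temp/temp513caca536fcd.jpeg">
<img src="http://www.jlnv2.local/temp/temp513caca73b8da.jpeg"><br>';

// [^>]* also matches newlines, so attributes spread over lines are fine
$pattern = '#<img[^>]*src="([^"]*/temp/temp[a-z0-9]{13}\.jpeg)"#';
preg_match_all($pattern, $content, $matches);

// Group 1 holds the src values only
$sources = $matches[1];
print_r($sources);
```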
All you need to do is add another group to your regular expression. You have to surround everything you want to extract from the match with parentheses:
preg_match_all('!(<img.*src="(.*/temp/temp[a-z0-9]{13}\.jpeg)"(.*alt=".*")?>)!', $content, $matches);
You can find the URLs in $matches[2].
That said, regular expressions are not a reliable way to extract data from HTML. You would be better off using DOMDocument, XPath, or something along those lines.
Try this:
preg_match_all('/src="([^"]+temp[a-z0-9]{13}\.jpeg)"/', $content, $matches);
var_dump($matches);
<?php
$text = '<br>iihfuhuf<img title="Image: http://www.jlnv2.local/temp/temp513caca536fcd.jpeg" src="http://www.jlnv2.local/temp/temp513caca536fcd.jpeg"><img src="http://www.jlnv2.local/temp/temp513caca73b8da.jpeg"><br>';
$pattern = '#src="([^"]+/temp/temp[a-z0-9]{13}\.jpeg)"#';
preg_match_all($pattern, $text, $out);
echo '<pre>';
print_r($out);
?>
Array
(
[0] => Array
(
[0] => src="http://www.jlnv2.local/temp/temp513caca536fcd.jpeg"
[1] => src="http://www.jlnv2.local/temp/temp513caca73b8da.jpeg"
)
[1] => Array
(
[0] => http://www.jlnv2.local/temp/temp513caca536fcd.jpeg
[1] => http://www.jlnv2.local/temp/temp513caca73b8da.jpeg
)
)
Here is a DOMDocument/DOMXPath based example of how to do it. This is arguably the only right way to do it, because unless you are really good at regular expressions there will most likely always be edge cases that will break your logic.
$doc = new DOMDocument;
$doc->loadHTML($content);
// Create the XPath object after loading, so it is bound to the parsed document
$xpath = new DOMXPath($doc);
$candidates = $xpath->query("//img[contains(@src, '/temp/temp')]");
$result = array();
foreach ($candidates as $image) {
$src = $image->getAttribute('src');
if (preg_match('/temp[0-9a-z]{13}\.jpeg$/', $src, $matches)) {
$result[] = $src;
}
}
print_r($result);
$text = '<br>iihfuhuf<img title="Image: http://www.jlnv2.local/temp/temp513caca536fcd.jpeg" src="http://www.jlnv2.local/temp/temp513caca536fcd.jpeg"><img src="http://www.jlnv2.local/temp/temp513caca73b8da.jpeg"><br>';
$pattern = '#src="([^"]+/temp/temp[a-z0-9]{13}\.jpeg)"#';
preg_match($pattern, $text, $match);
$src = array_pop($match);
echo $src;
I have links on some pages that use an old system such as:
<a href='/app/?query=stuff_is_here'>This is a link</a>
They need to be converted to the new system which is like:
<a href='/newapp/?q=stuff+is+here'>This is a link</a>
I can use preg_replace to change some of what I need, but I also need to replace the underscores in the query with +'s. My current code is:
//$content is the page html
$content = preg_replace('#(href)="http://www.site.com/app/?query=([^:"]*)(?:")#','$1="http://www.site.com/newapp/?q=$2"',$content);
What I want to do is run str_replace on the $2 variable, so I tried using preg_replace_callback, and could never get it to work. What should I do?
You have to pass a valid callback as the second parameter: a function name, an anonymous function, etc.
Here is an example:
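function my_replace_callback($match) {
    // $match[2] is the query string; swap underscores for plus signs
    $q = str_replace('_', '+', $match[2]);
    // Re-append the closing quote that the pattern consumed
    return $match[1] . '="http://www.site.com/newapp/?q=' . $q . '"';
}
// The ? and the dots are escaped so they match literally
$content = preg_replace_callback('#(href)="http://www\.site\.com/app/\?query=([^:"]*)(?:")#', 'my_replace_callback', $content);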
Or, with PHP 5.3 or later, using an anonymous function:
$content = preg_replace_callback('#(href)="http://www\.site\.com/app/\?query=([^:"]*)(?:")#', function($match) {
    $q = str_replace('_', '+', $match[2]);
    // Re-append the closing quote that the pattern consumed
    return $match[1] . '="http://www.site.com/newapp/?q=' . $q . '"';
}, $content);
You may also want to try an HTML parser instead of a regex; see "How do you parse and process HTML/XML in PHP?"
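The sample links in the question use single quotes and relative URLs, so here is a sketch of the same callback idea applied to that markup. The pattern is an adaptation made for illustration, not the asker's original: it accepts either quote style and escapes the ? so it matches literally.

```php
<?php
// Sample link taken from the question
$content = "<a href='/app/?query=stuff_is_here'>This is a link</a>";

$converted = preg_replace_callback(
    // Group 1: "href=", group 2: the quote character, group 3: the query value
    '#(href=)([\'"])/app/\?query=([^\'"]*)\2#',
    function (array $m): string {
        // Rebuild the attribute with the new path and + instead of _
        return $m[1] . $m[2] . '/newapp/?q=' . str_replace('_', '+', $m[3]) . $m[2];
    },
    $content
);

echo $converted, PHP_EOL;
```

Because the replacement happens inside the callback, only underscores within the matched query value are touched; underscores elsewhere on the page stay as they are.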
Parsing your document with DOM, finding all "a" tags, and then replacing their href values could be a good way to do this. Someone has already posted a link in the comments showing that regex isn't always the best way to work with HTML.
Anyway, this code should work:
<?php
$dom = new DOMDocument;
// $html contains your HTML string
$dom->loadHTML($html);
foreach ($dom->getElementsByTagName('a') as $node) {
    // look for the href attribute
    if ($node->hasAttribute('href')) {
        $href = $node->getAttribute('href');
        // change the href value; single quotes keep $1 as a backreference
        $node->setAttribute('href', preg_replace('/\/app\/\?query=(.*)/', '/newapp/?q=$1', $href));
    }
}
// save the new HTML
$newHTML = $dom->saveHTML();
?>
Note that I did this with preg_replace, but it can also be done with str_replace or str_ireplace:
$newHref = str_ireplace("/app/?query=", "/newapp/?q=", $href);
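The question also asks for the underscores in the query to become plus signs. A minimal sketch of handling both steps with DOM and plain string functions, using the sample link from the question wrapped in a p element (the wrapper is an assumption, added only to give the parser a small document):

```php
<?php
// Sample link from the question, wrapped in a paragraph for parsing
$html = "<p><a href='/app/?query=stuff_is_here'>This is a link</a></p>";

$dom = new DOMDocument();
$dom->loadHTML($html);

$prefix = '/app/?query=';
foreach ($dom->getElementsByTagName('a') as $node) {
    $href = $node->getAttribute('href');
    // Only rewrite links that start with the old prefix
    if (strpos($href, $prefix) === 0) {
        $query = substr($href, strlen($prefix));
        $node->setAttribute('href', '/newapp/?q=' . str_replace('_', '+', $query));
    }
}

// Read the attribute back to check the result
$newHref = $dom->getElementsByTagName('a')->item(0)->getAttribute('href');
echo $newHref, PHP_EOL;
```

Keep in mind that saveHTML() on a document loaded this way returns a full document with html and body wrappers, not just the original fragment.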
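Or you can simply use preg_match() to collect the matched parts, apply str_replace() to the query part to replace "_" with "+", and then put the link back together:
if (preg_match('#href=([\'"])/app/\?query=([^\'"]*)\1#', $content, $matches)) {
    // $matches[1] is the quote character, $matches[2] the old query string
    $query = str_replace('_', '+', $matches[2]);
    $newLink = 'href=' . $matches[1] . '/newapp/?q=' . $query . $matches[1];
    // preg_match stops at the first link, so only that one is rewritten
    $result = str_replace($matches[0], $newLink, $content);
}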
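Pass arrays to preg_replace as pattern and replacement:
// Note: the second pattern replaces every underscore in $content, not only those inside the query
$content = preg_replace(array('|/app/\?query=|', '|_|'), array('/newapp/?q=', '+'), $content);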
Hi, I'm trying to implement a screen scraping scenario on my website and have the following so far. What I'm ultimately trying to do is replace "ResultsDetails.aspx?" with "results-scrape-details/" in all links in the $results variable, then output the result. Can anyone point me in the right direction?
<?php
$url = "http://mysite:90/Testing/label/stuff/ResultsIndex.aspx";
$raw = file_get_contents($url);
$newlines = array("\t","\n","\r","\x20\x20","\0","\x0B");
$content = str_replace($newlines, "", html_entity_decode($raw));
$start = strpos($content,"<div id='pageBack'");
$end = strpos($content,'</body>',$start) + 6;
$results = substr($content,$start,$end-$start);
$pattern = 'ResultsDetails.aspx?';
$replacement = 'results-scrape-details/';
preg_replace($pattern, $replacement, $results);
echo $results;
Use a DOM tool like PHP Simple HTML DOM. With it you can find all the links you're looking for using a jQuery-like syntax.
// Create DOM object from HTML source
$dom = file_get_html('http://www.domain.com/path/to/page');
// Iterate all matching links
foreach ($dom->find('a[href^=ResultsDetails.aspx]') as $node) {
// Replace only the matching prefix of the href value, keeping the rest
$node->href = str_replace('ResultsDetails.aspx?', 'results-scrape-details/', $node->href);
}
// Output modified DOM
echo $dom->outertext;
The ? char has special meaning in regexes - either escape it and use the same code or replace the preg_replace with str_ireplace() (I'd recommend the latter approach as it is also more efficient).
(and should the html_entity_decode call really be there?)
C.
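A small sketch of both options mentioned in this answer. The sample link and its id=12 parameter are made up for illustration. Two details matter beyond the escaping: preg_replace needs delimiters around the pattern, and both functions return the new string instead of changing $results in place, so the result has to be assigned.

```php
<?php
// Made-up sample standing in for the scraped $results markup
$original = '<a href="ResultsDetails.aspx?id=12">Result 12</a>';

// Option 1: plain string replacement, no regex involved
$results = str_replace('ResultsDetails.aspx?', 'results-scrape-details/', $original);

// Option 2: regex with delimiters, with . and ? escaped to match literally
$viaRegex = preg_replace('/ResultsDetails\.aspx\?/', 'results-scrape-details/', $original);

echo $results, PHP_EOL;
echo $viaRegex, PHP_EOL;
```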