I am extracting the text between two tags from an HTML page with PHP.
The code I use is this:
function GDes($url) {
$fp = file_get_contents($url);
if (!$fp) return false;
$res = preg_match("/<description>(.*)<\/description>/siU", $fp, $title_matches);
if (!$res) return false;
$description = preg_replace('/\s+/', ' ', $title_matches[1]);
$description = trim($description);
return $description;
}
It returns the text between the description tags. My problem is that if the page has two description tags, it returns the first one, which I don't need.
I need to get the second one.
For example, if my HTML is this:
<description>No need to this</description>
<description>I NEED THIS ONE</description>
I need to get the second description tag with the function above.
What changes does the function need?
Use preg_match_all instead. It will create an array with all matches.
You can keep your code as is, just replace preg_match with preg_match_all.
Then you have to use $title_matches[1][1] instead of $title_matches[1] in your preg_replace call, since $title_matches is now a multidimensional array: the first [1] selects the first capture group and the second [1] selects the second match.
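A minimal sketch of that change, for illustration only. The helper name nth_description and its $index parameter are not from the question; they are assumptions made so the example can run on a string instead of a URL.

```php
<?php
// Hypothetical helper: returns the text of the N-th (0-based) <description> tag, or false.
function nth_description($html, $index = 1) {
    // preg_match_all collects every match; $m[1] holds capture group 1 of each match.
    $count = preg_match_all('/<description>(.*)<\/description>/siU', $html, $m);
    if (!$count || !isset($m[1][$index])) {
        return false;
    }
    // Collapse runs of whitespace, as the original function does.
    return trim(preg_replace('/\s+/', ' ', $m[1][$index]));
}

$html = "<description>No need to this</description>\n"
      . "<description>I NEED THIS ONE</description>";
echo nth_description($html), "\n"; // prints the second tag's text
```

Checking isset($m[1][$index]) keeps the function from raising a notice on pages that contain only one description tag.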
Related
I am making a price crawler for a project but am running into a bit of an issue. I am using the below code to extract values from an html page:
$content = file_get_contents($_POST['url']);
$resultsArray = array();
$sqlresult = array();
$priceElement = explode( '<div>value I want to extract</div>' , $content );
Now when I use this to get certain elements I only get back
Finance: {{value * value2}}
I want to get the actual value that would be displayed on the screen e.g
Finance: 7.96
The other php methods I have tried are:
curl
file_get_html (using the simple_html_dom library)
None of these work either :( Any ideas what I can do?
You set <div>value I want to extract</div> as the delimiter, which means PHP splits your string into an array at every place where that text occurs.
In the following code we use the , character as a delimiter:
<?php
$string = "apple,banana,lemon";
$array = explode(',', $string);
echo $array[1];
?>
The output should be this:
banana
In your example you set the value you want to extract as the delimiter, so it is removed from the result. That is why this happens. The delimiter has to be something that sits between the string you want and the strings you don't need.
For example:
<?php
$string = "iDontNeedThis-dontExtractNow-value I want to extract-dontNeedEither";
$priceElement = explode('-', $string);
echo "<div>".$priceElement[2]."</div>";
?>
The code should output this to your HTML page:
<div>value I want to extract</div>
And it will appear on your page like this:
value I want to extract
If you don't need to keep the whole array in a variable, you can store just one element of it instead:
$priceElement = explode('-', $string)[2];
echo $priceElement;
This stores only "value I want to extract", so you won't have to deal with arrays later on.
I am using
preg_match("/\<title\>(.*)\<\/title\>/i",$str,$title);
to match
<title>Example</title>
How can I match this using preg_match?
<title id="ANY_ID">Example</title>
I am using this to get the page title. A title tag may have more attributes than just 'id'.
Please keep that in mind when answering.
OK. Here is complete code to get the page title. I hope it helps others.
function get_title($url){
    $str = file_get_contents($url);
    if ($str === false) return false;
    // collapsing whitespace first supports line breaks inside <title>
    $str = trim(preg_replace('/\s+/', ' ', $str));
    // [^>]* matches zero or more attribute characters, so one pattern
    // handles both <title> and <title id="..."> (with an id or more);
    // preg_match returns 0 when nothing matches and false only on error
    if (!preg_match("/<title\b[^>]*>(.*?)<\/title>/i", $str, $title)) {
        return false;
    }
    return $title[1];
}
You can add (.*) after "title", which matches any characters, including none. Note that the last element of $title ($title[2]) is then your title:
preg_match("/\<title(.*)\>(.*)\<\/title\>/i",$str,$title);
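To make the capture groups concrete, here is a small sketch. The sample string and the variable names $attributes and $text are assumptions for illustration, and both groups are made non-greedy here so they stop at the first possible match.

```php
<?php
$str = '<html><head><title id="ANY_ID" lang="en">Example</title></head></html>';

$attributes = null;
$text = null;
// Group 1: everything between "<title" and ">" (the attributes, possibly empty).
// Group 2: the title text.
if (preg_match('/<title(.*?)>(.*?)<\/title>/is', $str, $title)) {
    $attributes = trim($title[1]);
    $text = $title[2];
}

echo $attributes, "\n"; // the raw attribute string
echo $text, "\n";       // the title text
```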
I'm trying to simulate a bbcode tag, like code below:
[code]this is code to render[/code]
[code attributeA=arg]this is code to render[/code]
[code attribute C=arg anotherAtributte=anotherArg]this is code to render[/code]
As you can see, the code tag can take as many attributes as needed, and there can be many code tags in the same "publishment". So far I have only dealt with the simplest tags, like img, b, a, i. For example:
$result = preg_replace('#\[link\=(.+)\](.+)\[\/link\]#iUs', '$2', $publishment);
That works fine since it returns the final markup. But for the code tag I need the "attributes" and "values" in an array, so that I can build the markup myself according to those attributes, something like this:
$code_tag = someFunction("[code ??=?? ...] content [/code]", $array );
//build the markup myself
$attribute1 = array_key_exists("attribute1", $array) ? $array["attribute1"] : "";
echo "<pre {$attribute1}>" . $array['content'] . "</pre>";
I don't expect you to do it all for me; I just need a pointer in the right direction, because I have never used regex.
Thank you in advance
I like to use preg_replace_callback for such things:
function codecb($matches)
{
$original=$matches[0];
$parameters=$matches[1];
$content=$matches[2];
return "<pre>". $content ."</pre>";
}
preg_replace_callback("#\[code(.*)\](.+)\[\/code\]#iUs", "codecb", $str);
so when you have [code argA=test argB=test]This is content[/code] then in the function "codecb" you will have:
$original = "[code argA=test argB=test]This is content[/code]"
$parameters = " argA=test argB=test"
$content = "This is content"
From there you can run preg_match on $parameters to pick out the arguments and return the replacement for the whole tag.
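One possible way to do that last step is sketched below. The helper names parse_code_attributes and render_code_tag, the lang attribute, and the name=value / name="quoted value" syntax are assumptions for illustration; the question does not define an exact attribute grammar.

```php
<?php
// Hypothetical helper: turns ' argA=test argB="two words"' into an associative array.
function parse_code_attributes($parameters) {
    $attributes = array();
    // name=value pairs; the value is either double-quoted or a run of non-space characters.
    preg_match_all('/([\w-]+)=(?:"([^"]*)"|(\S+))/', $parameters, $pairs, PREG_SET_ORDER);
    foreach ($pairs as $pair) {
        // An unquoted value lands in group 3; otherwise group 2 holds the quoted value.
        $attributes[$pair[1]] = isset($pair[3]) ? $pair[3] : $pair[2];
    }
    return $attributes;
}

// Callback in the style of codecb above: builds the <pre> markup from the parsed attributes.
function render_code_tag($matches) {
    $attributes = parse_code_attributes($matches[1]);
    // "lang" is an assumed attribute name, mapped to a CSS class here.
    $class = isset($attributes['lang'])
        ? ' class="' . htmlspecialchars($attributes['lang']) . '"'
        : '';
    return '<pre' . $class . '>' . htmlspecialchars($matches[2]) . '</pre>';
}

$str = 'Intro [code lang=php title="My snippet"]echo 1 < 2;[/code] outro';
$out = preg_replace_callback('#\[code(.*)\](.+)\[/code\]#iUs', 'render_code_tag', $str);
echo $out, "\n";
```

Escaping the content with htmlspecialchars keeps code inside the tag from being interpreted as HTML by the browser.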
Below is a link crawler that collects the URLs on a page down to a given depth. At the end I added a regular expression to match all the email addresses on the page that was just crawled. As you can see in the second part, it calls file_get_contents on the same page it has just downloaded, which doubles the execution time, bandwidth, etc.
The question is how can I merge those two parts to use the first downloaded page, to avoid getting it again? Thank you.
function crawler($url, $depth = 2) {
$dom = new DOMDocument('1.0');
if (!$parts || !@$dom->loadHTMLFile($url)) {
return;
}
.
.
.
//this is where the second part starts
$text = file_get_contents($url);
$res = preg_match_all("/[a-z0-9]+([_\\.-][a-z0-9]+)*@([a-z0-9]+([\.-][a-z0-9]+)*)+\\.[a-z]{2,}/i", $text, $matches);
}
Replace:
$text = file_get_contents($url);
with:
$text = $dom->saveHTML();
http://www.php.net/manual/en/domdocument.savehtml.php
Alternatively, in the first part of your function, you could save the HTML into a variable using file_get_contents, then pass it to $dom->loadHTML. That way you can then reuse the variable with your regex.
http://www.php.net/manual/en/domdocument.loadhtml.php
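The second approach could look roughly like this. The function name extract_links_and_emails and the sample HTML are made up for illustration, and the email pattern is a simplified variant of the one in the question; in the crawler you would call file_get_contents($url) once and pass the result in.

```php
<?php
// Hypothetical helper: parses links and e-mail addresses from HTML that was downloaded once.
function extract_links_and_emails($html) {
    $empty = array('links' => array(), 'emails' => array());
    if (!is_string($html) || $html === '') {
        return $empty;
    }
    $dom = new DOMDocument('1.0');
    // @ suppresses warnings caused by malformed real-world HTML, as in the question.
    if (!@$dom->loadHTML($html)) {
        return $empty;
    }
    $links = array();
    foreach ($dom->getElementsByTagName('a') as $anchor) {
        $href = $anchor->getAttribute('href');
        if ($href !== '') {
            $links[] = $href;
        }
    }
    // The regex runs on the same string the DOM was built from, so there is no second download.
    preg_match_all('/[a-z0-9]+(?:[_.-][a-z0-9]+)*@[a-z0-9]+(?:[.-][a-z0-9]+)*\.[a-z]{2,}/i', $html, $matches);
    return array('links' => $links, 'emails' => array_values(array_unique($matches[0])));
}

$html = '<html><body><a href="/about">About</a> Contact: info@example.com</body></html>';
$result = extract_links_and_emails($html);
print_r($result);
```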
I need to find the number of pages Google has indexed for a specific domain name. How can I do that from a PHP script?
So,
foreach ($allresponseresults as $responseresult)
{
$result[] = array(
'url' => $responseresult['url'],
'title' => $responseresult['title'],
'abstract' => $responseresult['content'],
);
}
What do I add to get the estimated number of results, and how do I do that?
I know the field is estimatedResultCount, but how do I add it? I access the title, for example, as $result['title'], so how do I get the number and print it?
Thank you :)
I think it would be nicer to Google to use their RESTful Search API. See this URL for an example call:
http://ajax.googleapis.com/ajax/services/search/web?v=1.0&q=site:stackoverflow.com&filter=0
(You're interested in the estimatedResultCount value)
In PHP you can use file_get_contents to get the data and json_decode to parse it.
You can find documentation here:
http://code.google.com/apis/ajaxsearch/documentation/#fonje
Example
Warning: The following code does not have any kind of error checking on the response!
function getGoogleCount($domain) {
$content = file_get_contents('http://ajax.googleapis.com/ajax/services/' .
'search/web?v=1.0&filter=0&q=site:' . urlencode($domain));
$data = json_decode($content);
return intval($data->responseData->cursor->estimatedResultCount);
}
echo getGoogleCount('stackoverflow.com');
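If you want the error checking that the example above leaves out, one option is to separate parsing from fetching, roughly as below. The function name parse_estimated_count and the sample JSON (including its number) are made up for illustration; only the responseData, cursor and estimatedResultCount field names come from the code above.

```php
<?php
// Hypothetical helper: extracts estimatedResultCount from the API's JSON text.
// Returns an int, or null when the response is missing, malformed, or lacks the field.
function parse_estimated_count($json) {
    // file_get_contents returns false on failure, so reject anything that is not a string.
    if (!is_string($json) || $json === '') {
        return null;
    }
    $data = json_decode($json);
    if (!is_object($data) || !isset($data->responseData->cursor->estimatedResultCount)) {
        return null;
    }
    return intval($data->responseData->cursor->estimatedResultCount);
}

// Sample shaped like the fields used above; the value is invented.
$sample = '{"responseData":{"cursor":{"estimatedResultCount":"830000"}}}';
var_dump(parse_estimated_count($sample));
var_dump(parse_estimated_count('not json'));
var_dump(parse_estimated_count(false));
```

Inside getGoogleCount you would pass the result of file_get_contents to this function and treat null as "count unavailable".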
You'd load http://www.google.com/search?q=domaingoeshere.com with cURL and then parse the response, looking for the result-count element that starts with <p id="resultStats".
You'd have the resulting html stored in a variable $html and then say something like
$arr = explode('<p id="resultStats">', $html);
$bottom = $arr[1];
$middle = explode('</p>', $bottom);
Please note that this is untested and a very rough example. You'd be better off parsing the html with a dedicated parser or matching the line with regular expressions.
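As a rough idea of the regular-expression variant, something like the following could work. The function name extract_result_stats and the sample markup are assumptions for illustration; the real page structure may differ and can change at any time.

```php
<?php
// Hypothetical helper: pulls the text of the element with id="resultStats" out of $html.
function extract_result_stats($html) {
    // [^>]* tolerates extra attributes; the tag name is captured so \1 matches its closing tag.
    if (!preg_match('/<(\w+)[^>]*\bid="resultStats"[^>]*>(.*?)<\/\1>/is', $html, $m)) {
        return false;
    }
    // Drop any nested markup and surrounding whitespace.
    return trim(strip_tags($m[2]));
}

// Invented sample; only the id comes from the answer above.
$html = '<div><p id="resultStats" class="sd">About 8,130 results</p></div>';
$stats = extract_result_stats($html);
echo $stats, "\n";
```

Unlike the explode version, this returns false instead of raising a notice when the element is missing, but it will still misbehave if an element of the same tag name is nested inside the target.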
The estimatedResultCount value from the Google AJAX API is not accurate.
And parsing the HTML results page is not a good approach either, because Google blocks you after several searches.
Count the number of results for site:yourdomainhere.com - stackoverflow.com has about 830k
// This gives you the count that you see on the search results web page.
// The HTML content is fetched with file_get_contents.
header('Content-Type: text/plain');
$url = "https://www.google.com/search?q=your url";
$html = file_get_contents($url);
if (FALSE === $html) {
throw new Exception(sprintf('Failed to open HTTP URL "%s".', $url));
}
$arr = explode('<div class="sd" id="resultStats">', $html);
$bottom = $arr[1];
$middle = explode('</div>', $bottom);
echo $middle[0];
Output:
About 8,130 results
Case 2: you can also use the Google API, but its count is different:
https://ajax.googleapis.com/ajax/services/search/web?v=1.0&q=ursitename&callback=processResults
https://ajax.googleapis.com/ajax/services/search/web?v=1.0&q=site:google.com
cursor":{"resultCount":"111,000,000","
"estimatedResultCount":"111000000",