Read page source using PHP with double quotes (") - php

I am trying to read the source code of a page. I just want to read some text that is within a certain division element with the id "wrapper_left".
My problem is that if a double quote (") is used in the first argument of the explode function, it does not work. I tried escaping the string, although I figured this wouldn't do anything.
$source_code = htmlspecialchars(file_get_contents('http://mydomain.com'));
$source_code = explode('<div id="wrapper_left">', $source_code);
echo $source_code[1];
Thanks tons in advance.

Don't bother trying to get this done with explode(), string manipulation, or a regular expression. You need an HTML parser, like DOMDocument:
$doc = new DOMDocument;
$doc->loadHTMLFile( 'http://mydomain.com');
$xpath = new DOMXPath( $doc);
$div = $xpath->query( '//div[@id="wrapper_left"]')->item(0);
echo $div->textContent;
You can see it working in this demo, which, when fed this HTML:
<div id="wrapper_left">Some text</div>
It produces:
Some text
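As for why the original explode() call never matched: htmlspecialchars() rewrites <, > and " as entities before the search runs, so the delimiter no longer appears in the string. The parser approach can also be tried without a network request. A minimal sketch, assuming the markup is already in a string; the helper name textOfDivById is made up for illustration:

```php
<?php
// Hypothetical helper (not from the answer above): returns the text of the
// first <div> with the given id, or null when there is none.
// Assumes $id contains no double quote, since it is pasted into the XPath.
function textOfDivById(string $html, string $id): ?string
{
    $doc = new DOMDocument();
    // Real pages are rarely valid HTML; collect parser warnings quietly.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // XPath selects attributes with "@", hence @id.
    $div = $xpath->query('//div[@id="' . $id . '"]')->item(0);

    return $div === null ? null : $div->textContent;
}

echo textOfDivById('<div id="wrapper_left">Some text</div>', 'wrapper_left'); // prints "Some text"
```

Returning null for a missing element keeps the caller from dereferencing a node that is not there.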


Find the count of particular non alpha numeric character between the html tags by using PHP preg functions

I have a part of HTML string like below which I get from web page scraping.
$scraping_html = "<html><body>
....
<div id='ctl00_ContentPlaceHolder1_lblHdr'>some text here with &. some text here.</div>
....</body></html>";
I want to take count of & between the particular div by using PHP. Is it possible to get using any of the PHP preg functions? Thanks in advance.
The hard part is getting the text nodes (I assume that's where you're stuck). Depending on how reliable it needs to be you have two alternatives (just sample code, not actually tested):
Good old strip_tags():
$plain_text = strip_tags($scraping_html);
Proper DOM parser:
$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($scraping_html);
libxml_use_internal_errors(false);
$xpath = new DOMXPath($dom);
$plain_text = '';
foreach ($xpath->query('//text()') as $textNode) {
    $plain_text .= $textNode->nodeValue;
}
To count, you have e.g. substr_count().
To get the number of & in the given example, use DOMDocument:
$html = <<<EOD
<html><body>
<div id='ctl00_ContentPlaceHolder1_lblHdr'>some text here with &. some text here.</div>
</body></html>
EOD;
$dom = new DOMDocument;
$dom->loadHTML($html);
$ele = $dom->getElementById('ctl00_ContentPlaceHolder1_lblHdr');
echo substr_count($ele->nodeValue, '&');
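The two ideas above (isolate the element, then count) can be combined into one function. This is only a sketch: the name countInElement is invented here, and it looks the element up with XPath rather than getElementById(), which is a substitution on my part.

```php
<?php
// Hypothetical helper: counts occurrences of $needle in the text of the
// element with the given id. Returns 0 when the element is missing.
// Assumes $id contains no double quote, since it is pasted into the XPath.
function countInElement(string $html, string $id, string $needle): int
{
    $dom = new DOMDocument();
    // A bare "&" is invalid HTML and triggers a parser warning; keep it quiet.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $element = $xpath->query('//*[@id="' . $id . '"]')->item(0);
    if ($element === null) {
        return 0;
    }

    // textContent holds decoded text, so "&amp;" in the markup counts as "&".
    return substr_count($element->textContent, $needle);
}

$html = "<html><body><div id='ctl00_ContentPlaceHolder1_lblHdr'>"
      . "some text here with &. some text here.</div></body></html>";
echo countInElement($html, 'ctl00_ContentPlaceHolder1_lblHdr', '&');
```

Because the count runs on the decoded text, an ampersand written as an entity and one written bare are treated the same.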

How do I use str_replace with DomDocument

I am using DomDocument to pull content from a specific div on a page.
I would then like to replace all instances of links with a path equal to http://example.com/test/ with http://example.com/test.php.
$url = "http://pugetsoundbasketball.com/stackoverflow_sample.php";
$doc = new DomDocument('1.0', 'UTF-8');
$doc->loadHtml(file_get_contents($url));
$div = $doc->getElementById('upcoming_league_dates');
foreach ($div->getElementsByTagName('a') as $item) {
    $item->setAttribute('href', 'http://example.com/test.php');
}
echo $doc->saveHTML($div);
As you can see in the example above, str_replace causes problems after I target the upcoming_league_dates div with getElementById. I understand this but unfortunately I don't know where to go from here!
I've tried several different ways including executing the str_replace above the getElementById function (I figured I could replace the strings first and then target the specific div), with no luck.
What am I missing here?
EDIT: UPDATED CODE TO SHOW WORKING SOLUTION
You can't just use str_replace on that node. You need to access it properly first. Through the DOMElement class you can use the ->setAttribute() method to make the replacement.
Example:
$url = "http://pugetsoundbasketball.com/stackoverflow_sample.php";
$dom = new DOMDocument('1.0', 'UTF-8');
$dom->loadHTMLFile($url);
$xpath = new DOMXpath($dom); // use xpath
$needle = 'http://example.com/test/';
$replacement = 'http://example.com/test.php';
// target the link
$links = $xpath->query("//div[@id='upcoming_league_dates']/a[contains(@href, '$needle')]");
foreach($links as $anchor) {
    // replacement of those href values
    $anchor->setAttribute('href', $replacement);
}
echo $dom->saveHTML();
Update: After your revision, your code is now working anyway. This is just to answer your logic replacement (ala str_replace search/replace) on your previous question.
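If a true search/replace on each link is wanted, str_replace() can be applied to the attribute value rather than to the node. A rough sketch under some assumptions: the function name rewriteLinks and the inline sample markup are invented for illustration, and the links are matched anywhere inside the container rather than only as direct children.

```php
<?php
// Hypothetical helper: inside the element with id $containerId, run
// str_replace() on every href and return that element's HTML.
// Assumes $containerId contains no double quote (it is pasted into the XPath).
function rewriteLinks(string $html, string $containerId, string $needle, string $replacement): string
{
    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $container = $xpath->query('//*[@id="' . $containerId . '"]')->item(0);
    if ($container === null) {
        return '';
    }

    // ".//a[@href]" is evaluated relative to the container node.
    foreach ($xpath->query('.//a[@href]', $container) as $anchor) {
        $old = $anchor->getAttribute('href');
        $anchor->setAttribute('href', str_replace($needle, $replacement, $old));
    }

    // Serialize only the container, not the whole document.
    return $dom->saveHTML($container);
}

$sample = '<div id="upcoming_league_dates"><a href="http://example.com/test/">A</a></div>'
        . '<a href="http://example.com/test/">B</a>';
echo rewriteLinks($sample, 'upcoming_league_dates', 'http://example.com/test/', 'http://example.com/test.php');
```

Links outside the container are left alone, and they do not appear in the returned fragment.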

Preg_replace, point after and before selection

<div style="display:none">250</div>.<div style="display:none">145</div>
I'd want:
<div style="display:none">250</div>#.#<div style="display:none">145</div>
or like this:
<div style="display:none">111</div>125<div style="display:none">110</div>
where I'd want
<div style="display:none">111</div>#125#<div style="display:none">110</div>
I'd like a preg_replace to put those hashtags around the number, so I assume the regex would look something like this:
"<\/div>[.]|<\/div>\d{1,3}"
The value can be a number (1-3 digits) or a dot.
Anyhow, I don't know how to preg_replace around the value:
"<\/div>[.]|<\/div>\d{1,3}" replace: $0#
Inserts it after the value..
EDIT
I cannot use an HTML parser, because I cannot find one that does not treat styles / classes as plain text, and I need the values attached to determine if the element is visible or not :(
And yes, it is driving me insane, but I am almost done :)
You really should not be trying to parse HTML with regex. There are only a couple of people I know who can do it. And even if you were one of them, regex still would not be the right tool for the job. Use PHP's DOMDocument, optionally with DOMXPath.
With xpath:
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$textNode = $xpath->query('//text()')->item(1);
$textNode->parentNode->replaceChild($dom->createTextNode('#' . $textNode->textContent . '#'), $textNode);
echo htmlspecialchars($dom->saveHTML());
http://codepad.viper-7.com/KLTLDA
With childnodes:
$dom = new DOMDocument();
$dom->loadHTML($html);
$body = $dom->getElementsByTagName('body')->item(0);
$textNode = $body->childNodes->item(1);
$textNode->parentNode->replaceChild($dom->createTextNode('#' . $textNode->textContent . '#'), $textNode);
echo htmlspecialchars($dom->saveHTML());
http://codepad.viper-7.com/Ii4vPb
In your case,
preg_replace("~</div\s*>(\.|\d{1,3})<div~i", '</div>#$1#<div', $string);
That's assuming no spaces between the divs and the content, and nothing otherwise weird is between.
Note that regex is very brittle, and would fail silently on even the slightest change in HTML.
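To check the pattern against the two samples from the question, something like the following can be run. The wrapper function name wrapBetweenDivs is made up here; the pattern and replacement are the ones given above.

```php
<?php
// Hypothetical wrapper around the preg_replace call shown above: puts "#"
// on both sides of a dot or a 1-3 digit number sitting between two divs.
function wrapBetweenDivs(string $html): string
{
    return preg_replace('~</div\s*>(\.|\d{1,3})<div~i', '</div>#$1#<div', $html);
}

$dot    = '<div style="display:none">250</div>.<div style="display:none">145</div>';
$number = '<div style="display:none">111</div>125<div style="display:none">110</div>';

echo wrapBetweenDivs($dot), "\n";
// <div style="display:none">250</div>#.#<div style="display:none">145</div>
echo wrapBetweenDivs($number), "\n";
// <div style="display:none">111</div>#125#<div style="display:none">110</div>
```

Input that does not match, for example a four-digit number or whitespace around the value, comes back unchanged, which is the silent failure mentioned above.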

PHP - How to replace a phrase with another?

What is the easiest way in PHP to replace this <p><span class="headline"> with this <p class="headline"><span>?
$data = file_get_contents("http://www.ihr-apotheker.de/cs1.html");
$clean1 = strstr($data, '<p>');
$str = preg_replace('#(<a.*>).*?(</a>)#', '$1$2', $clean1);
$ausgabe = strip_tags($str, '<p>');
echo $ausgabe;
Before I alter the HTML from the site, I want to move the class declaration from the span to the <p> tag.
Don't parse HTML with regex!
This class should provide what you need:
http://simplehtmldom.sourceforge.net/
The reason not to parse HTML with regex is if you can't guarantee the format. If you already know the format of the string, you don't have to worry about having a complete parser.
In your case, if you know that's the format, you can use str_replace
str_replace('<p><span class="headline">', '<p class="headline"><span>', $data);
Well, answer was accepted already, but anyway, here is how to do it with native DOM:
$dom = new DOMDocument;
$dom->loadHTMLFile("http://www.ihr-apotheker.de/cs1.html");
$xPath = new DOMXpath($dom);
// remove links but keep link text
foreach($xPath->query('//a') as $link) {
    $link->parentNode->replaceChild(
        $dom->createTextNode($link->nodeValue), $link);
}
// switch classes
foreach($xPath->query('//p/span[@class="headline"]') as $node) {
    $node->removeAttribute('class');
    $node->parentNode->setAttribute('class', 'headline');
}
echo $dom->saveHTML();
On a sidenote, HTML has elements for headings, so why not use a <h*> element instead of using the semantically superfluous "headline" class.
Have you tried using str_replace?
If the placement of the <p> and <span> tags is consistent, you can simply replace one with the other (the search string comes first, then the replacement):
str_replace("part to replace", "replacement", $string);
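A small sketch of the str_replace() route, assuming the markup really does always use exactly this tag sequence. The sample string is invented; the optional fourth argument of str_replace() reports how many replacements were made, which gives a cheap check that the expected markup was actually present.

```php
<?php
// Sample input invented for illustration.
$data = '<p><span class="headline">Title</span></p><p>Body</p>';

// Argument order is search, replacement, subject; $count is filled in by PHP.
$result = str_replace(
    '<p><span class="headline">',
    '<p class="headline"><span>',
    $data,
    $count
);

echo $result, "\n"; // <p class="headline"><span>Title</span></p><p>Body</p>
echo $count, "\n";  // 1
```

A count of zero would signal that the page's markup changed and the literal search string no longer matches.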

Retrieve value of a textarea with PHP

Would anyone perhaps know how to get the value of a specific element in an HTML document with PHP? What I'm doing right now is using file_get_contents to pull up the HTML code from another website, and on that website there is a textarea:
<textarea id="body" name="body" rows="12" cols="75" tabindex="1">Hello World!</textarea>
What I want to do is have my script do the file_get_contents and just pull out the "Hello World!" from the textarea. Is that possible? Sorry for bugging you guys, again, you give such helpful advice :].
Don't be sorry for bugging us, this is a good question I'm happy to answer. You can use PHP Simple HTML DOM Parser to get what you need:
$html = file_get_html('http://www.domain.com/');
$textarea = $html->find('textarea[id=body]', 0); // index 0 returns the first match instead of an array
$contents = $textarea->innertext;
echo $contents; // Outputs 'Hello World!'
If you want to use file_get_contents(), you can do it like this:
$raw_html = file_get_contents('http://www.domain.com/');
$html = str_get_html($raw_html);
...
Although I don't see any need for file_get_contents() here, as you can use the outertext property to get the original, full HTML out if you need it somewhere:
$html = file_get_html('http://www.domain.com/');
$raw_html = $html->outertext;
Just for kicks, you can also do this with a one-line regular expression:
preg_match('~<textarea id="body".*?>(.*?)</textarea>~', file_get_contents('http://www.domain.com/'), $matches);
echo $matches[1]; // Outputs 'Hello World!'
I'd strongly advise against this though as you are a lot more vulnerable to code changes which might break this regular expression.
I'd suggest using PHP's DOM and DOMXPath classes.
$dom = new DOMDocument();
$dom->loadHTMLFile( $url );
$xpath = new DOMXPath( $dom );
$nodes = $xpath->query( '//textarea[@id="body"]' );
$result = array();
foreach( $nodes as $node ) {
    $result[] = $node->textContent;
}
Here $result would contain the value of every textarea with the id body.
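The same lookup can be tried offline on the textarea from the question. This is a sketch only; the function name textareaValues is hypothetical, and the markup is passed in as a string instead of being fetched from a URL.

```php
<?php
// Hypothetical helper: returns the text of every <textarea> with the given id.
// Assumes $id contains no double quote, since it is pasted into the XPath.
function textareaValues(string $html, string $id): array
{
    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $result = array();
    foreach ($xpath->query('//textarea[@id="' . $id . '"]') as $node) {
        $result[] = $node->textContent;
    }

    return $result;
}

$page = '<textarea id="body" name="body" rows="12" cols="75" tabindex="1">Hello World!</textarea>';
print_r(textareaValues($page, 'body'));
```

An empty array means no textarea with that id was found, so the caller can tell "missing" apart from "present but empty".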
