removing href using phpquery problem - php

Hi, I need to remove the link from markup that I scrape from another site. Here is the output source:
<div class="FourDayForecastContainerInner">
<span class="day">Friday</span>
<a href="forecastPublicExtended.asp#Period4" target="_blank">
<img src="./images/wimages/b_rain.gif" class="thumbnail">
</a>
<span class="hi">
<span style="width:24px;">Hi</span>
19 / 66
</span>
<span class="lo">
<span style="width:24px;">Lo</span>
16 / 60
</span>
<span class="description">
Sunny Breaks, showers
</span>
</div>
<div class="FourDayForecastContainerInner">
<span class="day">Saturday</span>
And here is my code. I'm using phpQuery:
$doc = phpQuery::newDocumentHTML( $e );
$containers = pq('.FourDayForecastContainerInner', $doc);
foreach( $containers as $container ) {
$div = pq('span', $container);
$img = pq('img', $container);
$div->eq(0)
->removeAttr('style')
->addClass('day')
->html(
pq( 'u', $div->eq(0) )
->html()
);
$img->eq(0)
->removeAttr('style')
->removeAttr('height')
->removeAttr('width')
->removeAttr('alt')
->addClass('thumbnail')
->html( pq( 'img', $img->eq(0)) );
$div->eq(1)
->removeAttr('style')
->addClass('hi');
$div->eq(3)
->removeAttr('style')
->addClass('lo');
$div->eq(5)
->removeAttr('style')
->addClass('description');
}
print $doc;
I have managed to remove all the attributes (style, height, width, etc.), but I can't seem to remove the a href link.
Thank you so much for your help.

I tried it with your sample code and it works. This is the output
<div class='FourDayForecastContainerInner'>
<span class='day'>Friday</span>
<img src='./images/wimages/b_rain.gif' class='thumbnail'>
<span class='hi'>
<span style='width:24px;'>Hi</span>
19 / 66
</span>
<span class='lo'>
<span style='width:24px;'>Lo</span>
16 / 60
</span>
<span class='description'>
Sunny Breaks, showers
</span>
</div>
<div class='FourDayForecastContainerInner'>
<span class='day'>Saturday</span><div class='FourDayForecastContainerInner'>
<span class='day'>Friday</span>
<img src='./images/wimages/b_rain.gif' class='thumbnail'>
<span class='hi'>
<span style='width:24px;'>Hi</span>
19 / 66
</span>
<span class='lo'>
<span style='width:24px;'>Lo</span>
16 / 60
</span>
<span class='description'>
Sunny Breaks, showers
</span>
</div>
<div class='FourDayForecastContainerInner'>
<span class='day'>Saturday</span>
Your approach is long and tedious. Use regular expressions to strip the link instead.

$html = 'Your HTML code here';
// Remove opening <a ...> tags; \b keeps the pattern from matching <abbr>, <article>, etc.
$html = preg_replace('~<a\b[^>]*>~i', '', $html);
// Remove closing </a> tags.
$html = preg_replace('~</a>~i', '', $html);
echo $html;
This removes the link tags completely while keeping whatever they wrapped.

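Does the following code do what you want? (Add it at the end of the foreach loop.)
// Grab the markup inside the link (the img).
$imghtml = pq('a', $container)->html();
// Put that markup back at the start of the container...
pq($container)->prepend($imghtml);
// ...and drop the link together with its original content.
pq('a', $container)->remove();
Note: phpQuery doesn't seem to support jQuery's detach().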
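If phpQuery is not a hard requirement, the same "keep the content, drop the link" idea can be sketched with PHP's built-in DOM extension. This is only a sketch under assumptions: unwrapLinks() and the unwrap-root wrapper id are names made up for illustration, and non-ASCII input may need an encoding hint before loadHTML() handles it correctly.

```php
<?php
// Hypothetical helper (not part of phpQuery): replaces every <a> element
// with its own children, so the link disappears but its content stays.
function unwrapLinks(string $html): string
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);   // tolerate sloppy scraped markup
    // The wrapper div gives us a known root whose children we serialize later.
    $doc->loadHTML('<div id="unwrap-root">' . $html . '</div>');
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $root  = $xpath->query('//div[@id="unwrap-root"]')->item(0);

    // Copy the query result into an array before changing the tree.
    foreach (iterator_to_array($xpath->query('.//a', $root)) as $link) {
        // Move each child of the link in front of the link itself.
        while ($link->firstChild !== null) {
            $link->parentNode->insertBefore($link->firstChild, $link);
        }
        // The link is now empty and can be removed.
        $link->parentNode->removeChild($link);
    }

    // Serialize only the children of the wrapper, not <html>/<body>.
    $out = '';
    foreach ($root->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

$sample = '<div class="FourDayForecastContainerInner"><span class="day">Friday</span>'
        . '<a href="forecastPublicExtended.asp#Period4" target="_blank">'
        . '<img src="./images/wimages/b_rain.gif" class="thumbnail"></a></div>';
echo unwrapLinks($sample), "\n";
```

The wrapper element is there only so that the result can be serialized without the html and body elements that loadHTML() adds.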

I ran into the same question and I wanted to share my solution. My goal was to remove all tags from the title portion of some SoundCloud embed code. The HTML looked like this:
<object height="81" width="100%">
... a bunch of embed code ...
</object>
<span>
Mike Ink _ Silver
by
MINIMAL
</span>
At the end of the HTML above, you can see that the title has not only one but two links around it. My goal was to strip those out.
Assuming that HTML is assigned to the PHP variable $text, here's how I did it:
$doc = phpQuery::newDocument($text);
$soundcloud_title = strip_tags((string) $doc->find('span'));
print($soundcloud_title);
// outputs: Mike Ink _ Silver by MINIMAL
I know that this doesn't directly answer the question. In fact, I'm using strip_tags to remove the links instead of using phpquery, but I hoped it might help other coders who are looking for the same answers I was.
Happy coding!

Related

PHP replace image src and add a new attribute in image tag from a string containing different html tags [duplicate]

This question already has answers here:
How do you parse and process HTML/XML in PHP?
(31 answers)
Closed 3 years ago.
I have a site where I get product descriptions from the database, decode the HTML in PHP like this, and display it on the front end:
$data['description'] = html_entity_decode($product_info['description'], ENT_QUOTES, 'UTF-8');
It returns html like the following:
<div class="container">
<div class="textleft">
<p>
<span style="font-size:medium">
<strong>Product Name:</strong>
</span>
<br />
<span style="font-size:14px">Some description here Click here to see full details.</span>
</p>
</div>
<div class="imageblock">
<a href="some-link">
<img src="http://myproject.com/image/catalog/image1.jpg" style="width: 500px; height: 150px;" />
</a>
</div>
<div style="clear:both">
</div>
<div class="container">
<div class="textleft">
<p>
<span style="font-size:medium">
<strong>Product Name:</strong>
</span>
<br />
<span style="font-size:14px">Some description here Click here to see full details.</span>
</p>
</div>
<div class="imageblock">
<a href="some-link">
<img src="http://myproject.com/image/catalog/image2.jpg" style="width: 500px; height: 150px;" />
</a>
</div>
<div style="clear:both">
</div>
There could be many images in the product description. I have added just 2 in my example. What I need to do is replace src of every image with src="image/catalog/blank.gif" for all images and add a new attribute
data-src="http://myproject.com/image/catalog/image1.jpg"
for image 1 and
data-src="http://myproject.com/image/catalog/image2.jpg"
for image 2. data-src attribute should get the original src value of each image. How can I achieve that?
I have tried preg_replace as follows:
$data['description'] = preg_replace('((\n)?src="\b.*?")', 'src="image/catalog/blank.gif', $data['description']);
It replaces the src attribute of every image, but how can I add data-src with the original image path? I need this done before the page loads, so is there a way to do it with PHP?
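Simply adjust your regular expression. Capture the text you want using (parentheses), then refer to that capture group in the replacement with $1 or \1.
// The outer parentheses act as the pattern delimiters; the inner pair is capture group 1.
$data['description'] = preg_replace('(src="(.*?)")', 'src="image/catalog/blank.gif" data-src="$1"', $data['description']);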
Demo: https://repl.it/repls/SpottedZanyDiscussion
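To make the capture-group idea concrete, here is a small worked example on one of the img tags from the question. It is a sketch, not the answer's exact code: the ~ delimiters and the (?<![\w-]) guard (which keeps the pattern from matching inside another attribute such as data-src) are my own additions.

```php
<?php
$html = '<a href="some-link"><img src="http://myproject.com/image/catalog/image1.jpg" '
      . 'style="width: 500px; height: 150px;" /></a>';

// Group 1 captures the original URL; $1 re-inserts it as the data-src value.
// The lookbehind guard is an assumption added for safety, not part of the answer.
$result = preg_replace(
    '~(?<![\w-])src="(.*?)"~',
    'src="image/catalog/blank.gif" data-src="$1"',
    $html
);

echo $result, "\n";
// prints: <a href="some-link"><img src="image/catalog/blank.gif" data-src="http://myproject.com/image/catalog/image1.jpg" style="width: 500px; height: 150px;" /></a>
```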
I think this might be what you are looking for:
http://php.net/manual/en/domdocument.getelementsbytagname.php
$data['description'] = html_entity_decode($product_info['description'], ENT_QUOTES, 'UTF-8');
$doc = new DOMDocument();
$doc->loadHTML($data['description']);
$tags = $doc->getElementsByTagName('img');
foreach ($tags as $tag) {
$old_src = $tag->getAttribute('src');
$new_src_url = 'image/catalog/blank.gif';
$tag->setAttribute('src', $new_src_url);
$tag->setAttribute('data-src', $old_src);
}
$data['description'] = $doc->saveHTML();
I haven't tested this, though, so don't just copy and paste.

How to wrap only words in spans php?

I have a text string mixed with html and need to separate words only and wrap them in spans.
String:
$string ='<div class="something">What </div> if it it is is same <div style="color:red;">same </div>';
desired output
<div class="something">
<span class="splits split1">
What
</span>
</div>
<span class="splits split2">
if
</span>
<span class="splits split3">
it
</span>
<span class="splits split4">
it
</span>
<span class="splits split5">
is
</span>
<span class="splits split6">
is
</span>
<span class="splits split7">
same
</span>
<div style="color:red;">
<span class="splits split8">
same
</span>
</div>
I tried everything I could think of: preg_replace with word boundaries, preg_match, str_replace, and combinations of explode, loops and replace. One way or another the output fails: it either breaks the HTML or adds the replacement where it should not be.
Any help is appreciated.
As @MarcB said, life is tastier with DOM here, but since the question is tagged regex, this could be a workaround (not 100% guaranteed, though):
</?\b[^<>]+>(*SKIP)(*F)|\w+
PHP:
preg_replace_callback('~</?\b[^<>]+>(*SKIP)(*F)|\w+~', function($matches) {
static $counter = 0;
return "<span class=\"splits split".(++$counter)."\">{$matches[0]}</span>";
}, $string);
Live demo
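For comparison, the DOM route mentioned above could look roughly like this. It is a sketch under assumptions: wrapWordsInSpans() and the wrap-root wrapper id are hypothetical names, "word" means a run of \w characters as in the regex version, and non-ASCII input may need an encoding hint for loadHTML().

```php
<?php
// Hypothetical helper: wraps every word found in a text node in a numbered span.
function wrapWordsInSpans(string $html): string
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);
    // Known wrapper so only its children are serialized at the end.
    $doc->loadHTML('<div id="wrap-root">' . $html . '</div>');
    libxml_clear_errors();

    $xpath   = new DOMXPath($doc);
    $root    = $xpath->query('//div[@id="wrap-root"]')->item(0);
    $counter = 0;

    // Snapshot the text nodes first, because the loop rewrites the tree.
    foreach (iterator_to_array($xpath->query('.//text()', $root)) as $textNode) {
        // Split into words and the separators between them, keeping both.
        $parts = preg_split('/(\w+)/u', $textNode->nodeValue, -1,
            PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);

        foreach ($parts as $part) {
            if (preg_match('/^\w+$/u', $part)) {
                $new = $doc->createElement('span');
                $new->setAttribute('class', 'splits split' . (++$counter));
                $new->appendChild($doc->createTextNode($part));
            } else {
                $new = $doc->createTextNode($part);   // whitespace, punctuation
            }
            $textNode->parentNode->insertBefore($new, $textNode);
        }
        // The original text node has been fully replaced by the new nodes.
        $textNode->parentNode->removeChild($textNode);
    }

    $out = '';
    foreach ($root->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

$string = '<div class="something">What </div> if it it is is same <div style="color:red;">same </div>';
echo wrapWordsInSpans($string), "\n";
```

Because only text nodes are touched, tag names and attribute values can never be wrapped by accident, which is the main advantage over the regex.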

xpath not returning text if p tag is followed by any other tag

I want to get all the text in the <p> and <h3> tags of the following HTML:
<div class="bodyText">
<p>
<div class="articleBox articleSmallHorizontal channel-32333770 articleBoxBordered alignRight">
<div class="one">
<img src="url" alt="bar" class="img" width="80" height="60" />
</div>
<div class="two">
<h4 class="preTitle">QIEZ-Lieblinge</h4>
<h3 class="title"><a href="url" title="ABC" onclick="cmsTracking.trackClickOut({element:this, channel : 32333770, channelname : 'top_listen', content : 14832081, callTemplate : '_htmltagging.Text', action : 'click', mouseevent : event});">
Prominente Gastronomen </a></h3>
<span class="postTitle"></span>
<span class="district">Berlin</span> </div>
<div class="clear"></div>
</div>
I want this TEXT</p>
<h3>I want this TEXT</h3>
<p>I want this TEXT</p>
<p>
<div class="inlineImage alignLeft">
<div class="medium">
<img src="http://images03.qiez.de/Restaurant+%C3%96_QIEZ.jpg/280x210/0/167.231.886/167.231.798" width="280" height="210" alt="Schöne Lage: das Restaurant Ø. (c)QIEZ"/>
<span class="caption">
Schöne Lage: das Restaurant Ø. (c)QIEZ </span>
</div>
</div>I want this TEXT</p>
<p>I want this TEXT</p>
<p>I want this TEXT<br /> </p>
<blockquote><img src="url" alt="" width="68" height="68" />
"Eigentlich nur drei Worte: Ich komme wieder."<span class="author">Tina Gerstung</span></blockquote>
<div class="clear"></div>
</div>
I want every "I want this TEXT". I used this XPath query:
//div[contains(@class,'bodyText')]/*[local-name()='p' or local-name()='h3']
but it does not give me the text if the <p> tag is followed by any other tag.
It looks like you have div elements contained within your p element, which is not valid and is messing things up. If you use var_dump in the loop, you can see that it does pick up the node, but the nodeValue is empty.
A quick and dirty fix to your html would be to wrap the first div that is contained in the p element in a span.
<span><div class="articleBox articleSmallHorizontal channel-32333770 articleBoxBordered alignRight">...</div></span>
A better fix would be to put the div element outside the paragraph.
If you use the dirty workaround you will need to change your query like so:
$xpath->query("//div[contains(@class,'bodyText')]/*[local-name()='p' or local-name()='h3']/text()");
If you do not have control of the source HTML, you can make a copy of it and remove the offending divs:
$nodes = $xpath->query("//div[contains(@class,'articleBox')]");
$node = $nodes->item(0);
$node->parentNode->removeChild($node);
It might be easier to work with simple_html_dom. Maybe you can try this:
include('simple_html_dom.php');
$dom = new simple_html_dom();
$dom->load($html);
foreach($dom->find("div[class=bodyText]") as $parent) {
foreach($parent->children() as $child) {
if ($child->tag == 'p' || $child->tag == 'h3') {
// blank out the inner text of divs nested inside this p/h3 only
foreach($child->find('div') as $e)
$e->innertext = '';
echo $child->plaintext . '<br>';
}
}
}
This is mixed content. Depending on what defines the position of the element, you can use a number of factors. In this case, simply selecting all the text nodes will probably be sufficient:
//div[contains(@class, 'bodyText')]/(p | h3)/text()
If the union operator within a path step is not allowed in your processor (it is XPath 2.0 syntax, and PHP's DOMXPath only implements XPath 1.0), you can use your original syntax or, where XPath 2.0 is available, this slightly simpler form:
//div[contains(@class, 'bodyText')]/*[local-name() = ('p', 'h3')]/text()
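For PHP's DOMXPath, an XPath 1.0 version of the same idea might look like the sketch below. It assumes, in line with the empty nodeValue reported in the first answer, that the parser may move the text following a misplaced div out of the p element, so it also collects text nodes sitting directly under the bodyText div. extractBodyText() is a hypothetical name.

```php
<?php
// Hypothetical helper: returns the non-empty text directly inside the
// bodyText div or inside its p/h3 children, skipping nested div content.
function extractBodyText(string $html): array
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);   // the markup is known to be invalid
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $body  = "//div[contains(@class, 'bodyText')]";

    // XPath 1.0: self::p / self::h3 instead of (p | h3) inside a path step.
    $query = $body . "/*[self::p or self::h3]/text()[normalize-space()]"
           . " | "
           . $body . "/text()[normalize-space()]";

    $texts = [];
    foreach ($xpath->query($query) as $node) {
        $texts[] = trim($node->nodeValue);
    }
    return $texts;
}

$html = '<div class="bodyText"><p><div class="articleBox">'
      . '<span class="district">Berlin</span></div>I want this TEXT</p>'
      . '<h3>I want this TEXT too</h3><p>And this TEXT</p></div>';
print_r(extractBodyText($html));
```

The normalize-space() predicate only filters out whitespace-only text nodes; it does not alter the text that is returned.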

Find text without any tag in a div element

I need to access the 48.20 Lac(s) text, which sits directly inside the div without any tag of its own; that is why I'm not able to select it.
I need to find it from a PHP file. I've tried $html->find('div.priceDetail') followed by trim(strip_tags($result)), which gave me 48.20 Lac(s) plus unnecessary text.
Since I have to build a generic file, I can't depend on exploding and imploding for one particular fixed case.
<div class="priceDetail">
<b>Total Price :</b>
<img alt="" src="someimage">48.20 Lac(s)
<!-- Per Sq Ft Price -->
<span class="pricePerSqFt">(Price per sq.ft. : Rs. 3,679)</span>
<!-- Code for price Trends -->
<span class="priceGrowth">4 %
<img alt="" src="someimage"
align="absmiddle">
<span class="iconWhatisThis">
<img src="someimage"
class="whatIcon" align="absmiddle">
<span style="" id="StoolTip" class="price_main-c"></span>
</span>
</span>
<div class="tt_top-c">
<span class="priceGrowth"></span>
</div>
<div class="tt_mid-c">
<div class="tt_pointer-c"></div>
<div>
<span class="tt_txt-c">Per sq.ft. price for this property is
<b>higher than the average</b>property price in this locality as per MagicBricks.com
Price Trends.</span>
</div>
<span class="tt_txt-c">
<span class="tp_txt">To know more about this
<a href="#priceTrends" onclick="swithTab('priceTrends', tabbedDivArray);">Click
Here</a>
</span>
</span>
</div>
<div class="tt_bot-c"></div>
</div>
Do as much work as you can with a DOM parser, and then, when you are left with a blob of text, pull out the bit you want with this regex:
([0-9]{1,5}?\.[0-9]{2} Lac\(s\))
Result
48.20 Lac(s)
(Change the 5 in the RegEx to the number of digits you want to allow before the decimal point)
Here is a solution with DOMDocument, probably more robust than regex:
$DOM = new DOMDocument;
$DOM->loadHTML($str);
//Get all the image tags
$elem = $DOM->getElementsByTagName('img');
//Get the first Image
$first = $elem->item(0);
//Get the node after the image
$txt= $first->nextSibling;
//Get the text
echo $txt->nodeValue;
Of course it requires that the text is always located after the first image in the div.
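If relying on the position of the first image feels too fragile, another option is to ask XPath for the text nodes that are direct children of the div. This is a sketch under assumptions: directText() is a hypothetical helper, and the class name passed in is assumed to be a plain token without quotes.

```php
<?php
// Hypothetical helper: returns the trimmed, non-empty text nodes that are
// direct children of any div carrying the given class.
function directText(string $html, string $class): array
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    // Match the class as a whole token, then keep only direct text children.
    $query = sprintf(
        "//div[contains(concat(' ', normalize-space(@class), ' '), ' %s ')]/text()[normalize-space()]",
        $class
    );

    $texts = [];
    foreach ($xpath->query($query) as $node) {
        $texts[] = trim($node->nodeValue);
    }
    return $texts;
}

$str = '<div class="priceDetail"><b>Total Price :</b> '
     . '<img alt="" src="someimage">48.20 Lac(s) <!-- Per Sq Ft Price --> '
     . '<span class="pricePerSqFt">(Price per sq.ft. : Rs. 3,679)</span></div>';
$texts = directText($str, 'priceDetail');
echo $texts[0], "\n";   // prints: 48.20 Lac(s)
```

Text inside the b and span elements is ignored because it belongs to those elements rather than to the div itself.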

PHP Regex: remove div with class "image"

I need to remove all images in a variable using the following pattern. (With PHP).
<div class="float-right image">
<img class="right" src="http://www.domain.com/media/images/image001.png" alt="">
</div>
All the div tags will have an image class, but the float-right might vary. I can't seem to get the regex working, please help me.
Use a DOM parser instead of regex. Example (this one swaps each div for a new img; call removeChild() on the parent instead if you just want to delete it):
<?php
$doc = new DOMDocument();
$doc->loadHTML('<div class="float-right image">
<img class="right" src="http://www.domain.com/media/images/image001.png" alt="">
</div>');
// Copy the live node list into an array so replacing nodes does not disturb the loop.
foreach( iterator_to_array($doc->getElementsByTagName("div")) as $old_img ) {
$img = $doc->createElement("img");
$img->setAttribute('src', 'http://your.new.link');
$img->setAttribute('class', 'right');
// replaceChild() must be called on the node's actual parent, not on the document.
$old_img->parentNode->replaceChild($img, $old_img);
}
echo $doc->saveHTML();
?>
This regex matches your pattern:
(?s)<div class="[^"]*">\W*<img\W*class="[^"]*"\W*src="[^"]*"\W*alt="[^"]*">\W*</div>
I have tested it against several strings. It will work on:
<div class="anything">
<img class="blah" src="anything" alt="blah">
</div>
, where you can replace the "blah" and "anything" strings with anything.
Also, the various \W* in the regex allow for different spacing from string to string.
You said you want to do this in PHP.
This will zap all the matched patterns from a page stored in the $my_html variable.
$my_html=preg_replace('%(?s)<div class="[^"]*">\W*<img\W*class="[^"]*"\W*src="[^"]*"\W*alt="[^"]*">\W*</div>%m', '', $my_html);
I think this is what you were looking for?
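Since the question asks for the divs to be removed and says that only the image class is constant, a DOM-based removal that matches that class as a token might look like this. It is a sketch: removeImageDivs() and the strip-root wrapper id are hypothetical names, and the wrapper exists only so the result can be returned without html and body elements.

```php
<?php
// Hypothetical helper: deletes every div whose class list contains "image".
function removeImageDivs(string $html): string
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);
    $doc->loadHTML('<div id="strip-root">' . $html . '</div>');
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $root  = $xpath->query('//div[@id="strip-root"]')->item(0);

    // Token match: "float-right image" matches, "images" does not.
    $query = './/div[contains(concat(" ", normalize-space(@class), " "), " image ")]';

    // Snapshot the matches before removing anything from the tree.
    foreach (iterator_to_array($xpath->query($query, $root)) as $div) {
        if ($div->parentNode !== null) {
            $div->parentNode->removeChild($div);
        }
    }

    $out = '';
    foreach ($root->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

$my_html = '<p>Intro</p><div class="float-right image">'
         . '<img class="right" src="http://www.domain.com/media/images/image001.png" alt="">'
         . '</div><p>Outro</p>';
echo removeImageDivs($my_html), "\n";
```

Unlike the regex, this does not depend on attribute order or on the div containing exactly one img.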
