I have a string and within that string are some links of the format
<a href="URL">Text</a>
I want to replace that entire section with a different piece of markup
The problem is that while I can get the overall structure of the markup to be replaced, and also the URL, it's not so easy for me to get the "Text". If I knew the entire link, I might do something like
str_replace( $each_link , $my_new_markup , $the_original_string );
and iterate through each link, but I can't, because I can't know exactly what $each_link is going to be.
Is there any way to look for something like this? I am thinking it must have something to do with REGEX but I am totally hopeless at it, and I don't even know if that's the right place to start.
<a href="URL">[WILDCARD of some kind]</a>
You could look at a class like Simple HTML DOM Parser, which lets you cycle through elements, search for a specific inner HTML or other attribute, and then change it.
The code would look something like this:
foreach($html->find('a') as $element) {
    if ($element->innertext == $needle) {
        $element->innertext = $my_new_markup;
    }
}
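If a regex is preferred after all, the "wildcard" idea from the question can be sketched with preg_replace_callback. This is only a minimal sketch: the sample string, the URLs, and the replacement markup are made up for illustration, and a regex like this will trip over unusual attribute orders or nested tags, which is where a DOM parser is the safer choice.

```php
<?php
// Hypothetical sample input; the URLs and link texts are invented for illustration.
$the_original_string = 'See <a href="http://example.com/a">first</a> and '
                     . '<a href="http://example.com/b">second</a>.';

// (.*?) plays the role of the wildcard: a lazy match for whatever text
// sits between the opening and closing tag.
$pattern = '~<a\s+href="([^"]*)"[^>]*>(.*?)</a>~is';

$result = preg_replace_callback($pattern, function (array $m): string {
    $url  = $m[1]; // captured href value
    $text = $m[2]; // captured link text
    // Hypothetical replacement markup; substitute your own here.
    return '<span class="link" data-url="' . $url . '">' . $text . '</span>';
}, $the_original_string);

echo $result, "\n";
```

The callback receives both the URL and the link text for every match, so there is no need to know each full link in advance.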
The requirement is to add an englishText class around all English words on a page. The problem is similar to this, but the JavaScript solutions won't work for me. I require a PHP example to solve this problem. For example, if you have this:
<p>Hello, 你好</p>
<div>It is me, 你好</div>
<strong>你好, how are you</strong>
Afterwards I need to end with:
<p><span class="englishText">Hello</span>, 你好</p>
<div><span class="englishText">It is me</span>, 你好</div>
<strong>你好, <span class="englishText">how are you</span></strong>
There are more complicated cases, such as:
<strong>你好, TEXT?</strong>
<div>It is me, 你好</div>
This should become:
<strong>你好, <span class="englishText">TEXT?</span></strong>
<div><span class="englishText">It is me</span>, 你好</div>
But I think I can sort out these edge cases once I know how to actually iterate over the document correctly.
I can't use JavaScript to solve this because:
This needs to work on browsers that don't support JavaScript
I would prefer to have the classes in place on page load so there isn't any delay in rendering the text in the correct font.
I figured the best way to iterate over the document would be using PHP Simple HTML DOM Parser.
But the problem is that if I try this:
foreach ($html->find('div') as $element)
{
    // make changes here
}
My concern is that the following case will cause chaos:
<div>
    Hello , 你好
    <div>Hello, 你好</div>
</div>
As you can see, it's going to go into the first div and then if I process that node, I will be processing the node within that too.
Any ideas how to get around this and only select the nodes for processing once?
UPDATE
I realise now that what I effectively need is a recursive way to iterate over HTML elements with the ability to change them as I iterate over them.
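You should travel through siblings; that way you won't get in trouble with such cases.
Something like this:
<?php
foreach ($html->find('div') as $element)
{
    // next_sibling() returns a single node (or null), so follow the chain in a loop
    for ($sibling = $element->next_sibling(); $sibling; $sibling = $sibling->next_sibling()) {
        // plaintext is a property, not a method
        echo $sibling->plaintext."\n";
    }
}
?>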
Or, a much easier way in my opinion:
1. Change every <*> to "\n"."<*>" with preg_replace();
2. Make an array of lines like $lines = explode("\n",$html_string);
3. Strip the tags from each line:
foreach($lines as $line){
    $text = strip_tags($line);
    echo $text;
}
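The three steps above can be sketched roughly as follows. The sample markup is the nested div case from the question; the exact regex for "every tag" and the $texts array are my own assumptions, and this will misbehave if an attribute value contains a ">" character.

```php
<?php
// Nested sample markup taken from the question above.
$html_string = '<div>Hello , 你好<div>Hello, 你好</div></div>';

// Step 1: put a newline in front of every tag (opening or closing).
$with_breaks = preg_replace('/(<[^>]+>)/', "\n" . '$1', $html_string);

// Step 2: split into lines, one tag (plus its trailing text) per line.
$lines = explode("\n", $with_breaks);

// Step 3: strip the tag from each line and keep only non-empty text.
$texts = [];
foreach ($lines as $line) {
    $text = trim(strip_tags($line));
    if ($text !== '') {
        $texts[] = $text;
    }
}

print_r($texts);
```

Each piece of text ends up in $texts exactly once, even though one div is nested inside the other, which is the situation the question was worried about.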
I am making a website, and I would like to make a custom Markup type language in PHP. I want the tags to be surrounded with [ and ]. Now, I was thinking about this, like anyone would, and I could do something like this:
function formatMarkup($markup = ''){
    $markup = str_replace('[color=blue]', '<span style="color: blue">', $markup);
    return $markup;
}
Even though that might work, it would be more programmatically correct to do something like explode(), but starting at every [ and ending at every ]. It would be great to find out how. Thank you for your time and effort.
EDIT:
I have decided to use preg_split(). It seems nice, and all, but I cannot get the regex. Here is my code.
EDIT #2:
I have got most of the regex done, but there are unneeded extra keys in the array. How would I fix them? Here is my new code.
I have made my markup language. I used
$split = preg_split("/(\[|\])/", $markup);
to get the individual "tags" and used
foreach($split as $k => $v){
    if(strlen($v) < 1){
        continue;
    }
    // checks for each tag go here
}
to iterate through each of them, skipping any empty values. After that, I do all of my checks, parse the code blocks together, and rebuild the text line by line.
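As a side note on the "extra keys" problem: preg_split can drop the empty entries itself through its flags, so the strlen() check becomes unnecessary. The sketch below is a variation, not the code used above: the pattern captures a whole [tag] rather than the single brackets, and the sample input is hypothetical.

```php
<?php
// Hypothetical input using the [tag] syntax from the question.
$markup = '[color=blue]Hello[/color] world';

// PREG_SPLIT_NO_EMPTY drops the empty strings that otherwise show up
// around delimiters; PREG_SPLIT_DELIM_CAPTURE keeps the captured tags
// in the result so they can be told apart from plain text.
$tokens = preg_split(
    '/(\[[^\]]*\])/',
    $markup,
    -1,
    PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE
);

print_r($tokens);
```

With tags and text as separate tokens, a token that starts with "[" can be treated as a tag and everything else as literal text.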
I’m writing a custom parser/data extractor for some pretty shitty HTML.
Changing the HTML is out of the question.
I will spare you the details of the hoops I’ve had to jump through, but I’ve now come pretty close to my original goal. I’m using a combination of DOMDocument getElementsByTagName, regular expression replace (I know, I know...), and XPath queries.
I need to get all the text out of the body of the document. I would like for the navigation to remain a separate entity, at least in the abstract. Here’s what I’m doing now:
$contentnodes = $xpath->query("//body//*[not(self::a)]/text()|//body//ul/li/a");
foreach ($contentnodes as $contentnode) {
    $type = $contentnode->nodeName;
    $content = $contentnode->nodeValue;
    $output[] = array( $type, $content);
}
This works, except that of course it treats all of the links on the page differently, and I only want it to do that to the navigation.
What XPath syntax can I use so that the first part of that query, before the |, gets all the text nodes of body’s descendants except those in ul > li > a?
Please note that I cannot rely on the presence of p tags or h1 tags or anything sensible like that to make educated guesses about content.
Thanks
Update: @hr_117’s answer below works. I’ve also found that you can chain multiple not() predicates like so:
//body//text()[not(parent::a/parent::li/parent::ul)][not(parent::h1)]
You may try something like this:
//body//text()[not(parent::a/parent::li/parent::ul)]|//body//ul/li/a
//body//*[not(self::a/parent::li/parent::ul)]/text()[normalize-space()]|//body//ul/li/a
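To try the first expression out, something like the following should do. It is only a sketch: the little page in $source is invented (one navigation list and one ordinary inline link), and it uses PHP's built-in DOMDocument and DOMXPath.

```php
<?php
// Hypothetical page: a navigation list plus body text with an ordinary link.
$source = '<html><body><ul><li><a href="/home">Home</a></li></ul>'
        . '<div>Intro <a href="/x">inline link</a> outro</div></body></html>';

$doc = new DOMDocument();
$doc->loadHTML($source);
$xpath = new DOMXPath($doc);

// Text nodes everywhere except directly inside ul > li > a,
// plus the navigation anchors themselves as elements.
$query = '//body//text()[not(parent::a/parent::li/parent::ul)]|//body//ul/li/a';

$output = [];
foreach ($xpath->query($query) as $node) {
    $output[] = [$node->nodeName, $node->nodeValue];
}

print_r($output);
```

The navigation link comes back as an "a" element, while the inline link inside the div is returned as a plain "#text" node like the rest of the content.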
Simple HTML DOM is basically a PHP file you include in your pages that gives you simple web scraping. It's good for the most part, but I can't figure out the manual as I'm not much of a coder. Are there any sites or guides out there that offer easier help for this? (The one at php.net is a bit too complicated for me at the moment.) Is there a better place to ask this kind of question?
The site for it is at: http://simplehtmldom.sourceforge.net/manual.htm
I can scrape stuff that has specific classes like <tr class="group">, but not the stuff that's in between. For example, this is what I currently use:
$url = 'http://www.test.com';
$html = file_get_html($url);
foreach($html->find('tr[class=group]') as $result)
{
    $first = $result->find('td[class=category1]',0);
    $second = $result->find('td[class=category2]',0);
    echo $first.$second;
}
But here is the kind of code I'm trying to scrape.
<table>
    <tr class="Group">
        <td>
            <dl class="Summary">
                <dt>Heading 1</dt>
                <dd>Cat</dd>
                <dd>Bacon</dd>
                <dt>Heading 2</dt>
                <dd>Narwhal</dd>
                <dd>Ice Soap</dd>
            </dl>
        </td>
    </tr>
</table>
I'm trying to extract the content of each <dt> and put it to a variable. Then I'm trying to extract the content of each <dd> and put it to a variable, but nothing I tried works. Here's the best I could find, but it gives me back only the first heading repeatedly rather than going to the second.
foreach($html->find('tr[class=Summary]') as $result2)
{
echo $result2->find('dt',0)->innertext;
}
Thanks to anyone who can help. Sorry if this is not clear or that it's so long. Ideally I'd like to be able to understand these DOM commands more as I'd like to figure this out myself rather than someone here just do it (but I'd appreciate either).
TL;DR: I am trying to understand how to use the commands listed in the manual (url above). The 'manual' isn't easy enough. How do you go about learning this stuff?
I think $result2->find('dt',0) gives you back element 0, which is only the first match. If you omit the index, you get an array of all matches instead. Note also that in your sample HTML the Summary class is on the dl, not the tr, so the selector should target the dl. Something like this:
foreach($html->find('dl[class=Summary]') as $result2)
{
    foreach ($result2->find('dt') as $node)
    {
        echo $node->innertext;
    }
}
You don't strictly need the outer loop, since there's only one dl in your document. You could even leave it out altogether and find each dt in the document, but for tools like this, I think it's a good thing to be both flexible and strict, so you are prepared for multiple lists, but don't accidentally parse dts from anywhere in the document.
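If the goal is to keep each heading together with the items under it, one option is to walk the dt and dd elements in document order and file each dd under the most recent dt. The sketch below swaps in PHP's built-in DOMDocument and DOMXPath instead of Simple HTML DOM, purely so that it runs without the extra library; the $groups and $current names are my own.

```php
<?php
// Sample markup from the question, shortened to the relevant part.
$source = '<table><tr class="Group"><td><dl class="Summary">'
        . '<dt>Heading 1</dt><dd>Cat</dd><dd>Bacon</dd>'
        . '<dt>Heading 2</dt><dd>Narwhal</dd><dd>Ice Soap</dd>'
        . '</dl></td></tr></table>';

$doc = new DOMDocument();
$doc->loadHTML($source);
$xpath = new DOMXPath($doc);

$groups  = [];   // heading text => list of item texts
$current = null; // text of the most recent <dt> seen

// Visit dt/dd children in document order; each dd belongs to the last dt.
foreach ($xpath->query('//dl[@class="Summary"]/*[self::dt or self::dd]') as $node) {
    if ($node->nodeName === 'dt') {
        $current = trim($node->textContent);
        $groups[$current] = [];
    } elseif ($current !== null) {
        $groups[$current][] = trim($node->textContent);
    }
}

print_r($groups);
```

This gives an array keyed by heading ("Heading 1", "Heading 2"), each holding its own dd values, which is usually easier to work with than two separate flat lists.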
I'm currently using this code:
$blog= file_get_contents("http://powback.tumblr.com/post/" . $post);
echo $blog;
And it works. But tumblr has added a script that activates each time you enter a password-field. So my question is:
Can I remove certain parts with file_get_contents? Or just remove everything above the <html> tag? Could I possibly kill a whole div so it won't load at all? And if so, how?
edit:
I managed to do it the simple way, by skipping 766 characters. The script now works as intended!
$blog = file_get_contents("http://powback.tumblr.com/post/" . $post, false, NULL, 766);
After file_get_contents returns, you have in your hands a string. You can do anything you want to it, including cutting out parts of it.
There are two ways to actually do the cutting:
Using string functions like str_replace, preg_replace and others; the exact recipe depends on what you need to do. This approach is kind of frowned upon because you are working at the wrong level of abstraction, but in some cases it has an unmatched performance to time spent ratio.
Parsing the HTML into a DOM tree, modifying it appropriately (this time working at the appropriate level of abstraction), and then turning it back into a string and echoing it. This can be more convenient to work with if your requirements are not dead simple, and it is easier to maintain, but it typically requires more code to be written.
If you want to do something that's most naturally expressed in HTML document terms ("cutting out this <div>"), then don't be tempted by the string functions; go with the second approach.
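A minimal sketch of the second approach, using PHP's built-in DOMDocument and DOMXPath. The $blog string here is a hypothetical stand-in for what file_get_contents would return, and the id "unwanted" is made up; the real page will need its own selector.

```php
<?php
// Hypothetical stand-in for the string returned by file_get_contents().
$blog = '<html><body><script>alert("x");</script>'
      . '<div id="unwanted">noise</div><div id="post">Hello</div></body></html>';

$doc = new DOMDocument();
$doc->loadHTML($blog);
$xpath = new DOMXPath($doc);

// Select every <script> plus one div by id (the id is an assumption).
// iterator_to_array() takes a snapshot so removal cannot disturb iteration.
$doomed = iterator_to_array($xpath->query('//script | //div[@id="unwanted"]'));
foreach ($doomed as $node) {
    $node->parentNode->removeChild($node);
}

// Turn the modified tree back into a string and output it.
$clean = $doc->saveHTML();
echo $clean;
```

Real-world pages are rarely valid HTML, so in practice it may be worth calling libxml_use_internal_errors(true) before loadHTML to keep parser warnings out of the output.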
At that point, $blog is just a string, so you can use normal PHP functions to alter it. Look into these 2:
http://php.net/manual/en/function.str-replace.php
http://us2.php.net/manual/en/function.preg-replace.php
You can parse your output using Simple HTML DOM Parser and display only the contents that you really want to display.