Adding a class to all English text in HTML? - php

The requirement is to add an englishText class around all English words on a page. The problem is similar to this, but the JavaScript solutions won't work for me. I require a PHP example to solve this problem. For example, if you have this:
<p>Hello, 你好</p>
<div>It is me, 你好</div>
<strong>你好, how are you</strong>
Afterwards I need to end with:
<p><span class="englishText">Hello</span>, 你好</p>
<div><span class="englishText">It is me</span>, 你好</div>
<strong>你好, <span class="englishText">how are you</span></strong>
There are more complicated cases, such as:
<strong>你好, TEXT?</strong>
<div>It is me, 你好</div>
This should become:
<strong>你好, <span class="englishText">TEXT?</span></strong>
<div><span class="englishText">It is me</span>, 你好</div>
But I think I can sort out these edge cases once I know how to actually iterate over the document correctly.
I can't use JavaScript to solve this because:
This needs to work on browsers that don't support JavaScript
I would prefer to have the classes in place on page load so there isn't any delay in rendering the text in the correct font.
I figured the best way to iterate over the document would be using PHP Simple HTML DOM Parser.
But the problem is that if I try this:
foreach ($html->find('div') as $element)
{
// make changes here
}
My concern is that the following case will cause chaos:
<div>
Hello , 你好
<div>Hello, 你好</div>
</div>
As you can see, it's going to go into the first div and then if I process that node, I will be processing the node within that too.
Any ideas how to get around this and only select the nodes for processing once?
UPDATE
I realise now that what I effectively need is a recursive way to iterate over HTML elements with the ability to change them as I iterate over them.

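You should travel through the siblings; that way you won't get into trouble with cases like that.
Something like this:
<?php
foreach ($html->find('div') as $element)
{
// next_sibling() returns a single node (or null), so walk the chain one step at a time
for ($sibling = $element->next_sibling(); $sibling; $sibling = $sibling->next_sibling()) {
echo $sibling->plaintext."\n";
}
}
?>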
Or, a much easier way in my opinion:
1. Change every <*> to "\n"."<*>" with preg_replace().
2. Make an array of lines, like $lines = explode("\n",$html_string);
3. Loop over the lines and strip the tags:
foreach($lines as $line){
$text = strip_tags($line);
echo $text;
}
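For the "process every node only once" part of the question, here is a minimal sketch that swaps Simple HTML DOM for PHP's bundled DOM extension (DOMDocument and DOMXPath). It visits text nodes rather than elements, so a nested div is never handled twice. The function name wrapEnglishText and the "letters separated by single spaces" pattern are assumptions made for illustration; punctuation such as the ? in TEXT? is left outside the span and would still need the edge-case handling mentioned in the question.

```php
<?php
// Hypothetical helper (not part of any library): wraps runs of English words
// in <span class="englishText"> using PHP's bundled DOM extension.
function wrapEnglishText(string $html): string
{
    libxml_use_internal_errors(true); // tolerate imperfect markup quietly
    $doc = new DOMDocument();
    // The leading XML declaration is a common trick to make libxml read the input as UTF-8.
    $doc->loadHTML('<?xml encoding="UTF-8"><html><body>' . $html . '</body></html>');
    libxml_clear_errors();

    $body  = $doc->getElementsByTagName('body')->item(0);
    $xpath = new DOMXPath($doc);

    // Snapshot the text nodes first, because the loop below modifies the tree.
    $textNodes = iterator_to_array($xpath->query('.//text()', $body), false);

    foreach ($textNodes as $textNode) {
        // With PREG_SPLIT_DELIM_CAPTURE the odd-numbered pieces are the English runs.
        $pieces = preg_split(
            '/([A-Za-z]+(?: [A-Za-z]+)*)/u',
            $textNode->nodeValue,
            -1,
            PREG_SPLIT_DELIM_CAPTURE
        );
        if (count($pieces) < 2) {
            continue; // no English text in this node
        }

        $fragment = $doc->createDocumentFragment();
        foreach ($pieces as $i => $piece) {
            if ($piece === '') {
                continue;
            }
            if ($i % 2 === 1) {
                $span = $doc->createElement('span');
                $span->setAttribute('class', 'englishText');
                $span->appendChild($doc->createTextNode($piece));
                $fragment->appendChild($span);
            } else {
                $fragment->appendChild($doc->createTextNode($piece));
            }
        }
        // Swap the original text node for the mixed text/span fragment.
        $textNode->parentNode->replaceChild($fragment, $textNode);
    }

    // Serialize only the children of <body>, i.e. the markup that was passed in.
    $out = '';
    foreach ($body->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}
```

Calling wrapEnglishText('<div>Hello, 你好<div>Hello, 你好</div></div>') should wrap each "Hello" exactly once. Note that text inside script or style elements would be wrapped too, so a real page would need to skip those.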

Related

Using Simple HTML DOM to Scrape?

Simple HTML DOM is basically a PHP library you add to your pages which lets you do simple web scraping. It's good for the most part, but I can't figure out the manual as I'm not much of a coder. Are there any sites/guides out there that offer easier help for this? (The one at php.net is a bit too complicated for me at the moment.) Is there a better place to ask this kind of question?
The site for it is at: http://simplehtmldom.sourceforge.net/manual.htm
I can scrape stuff that has specific classes like <tr class="group">, but not for stuff that's in between. For example.. This is what I currently use...
$url = 'http://www.test.com';
$html = file_get_html($url);
foreach($html->find('tr[class=group]') as $result)
{
$first = $result->find('td[class=category1]',0);
$second = $result->find('td[class=category2]',0);
echo $first.$second;
}
But here is the kind of code I'm trying to scrape.
<table>
<tr class="Group">
<td>
<dl class="Summary">
<dt>Heading 1</dt>
<dd>Cat</dd>
<dd>Bacon</dd>
<dt>Heading 2</dt>
<dd>Narwhal</dd>
<dd>Ice Soap</dd>
</dl>
</td>
</tr>
</table>
I'm trying to extract the content of each <dt> and put it to a variable. Then I'm trying to extract the content of each <dd> and put it to a variable, but nothing I tried works. Here's the best I could find, but it gives me back only the first heading repeatedly rather than going to the second.
foreach($html->find('tr[class=Summary]') as $result2)
{
echo $result2->find('dt',0)->innertext;
}
Thanks to anyone who can help. Sorry if this is not clear or that it's so long. Ideally I'd like to be able to understand these DOM commands more as I'd like to figure this out myself rather than someone here just do it (but I'd appreciate either).
TL;DR: I am trying to understand how to use the commands listed in the manual (url above). The 'manual' isn't easy enough. How do you go about learning this stuff?
I think $result2->find('dt',0) gives you back element 0, which is the first. If you omit that, you should be able to get an array (or nodelist) instead. Something like this:
foreach($html->find('tr[class=Summary]') as $result2)
{
foreach ($result2->find('dt') as $node)
{
echo $node->innertext;
}
}
You don't strictly need the outer for loop, since there's only one tr in your document. You could even leave it out altogether to find each dt in the document, but for tools like this I think it's good to be both flexible and strict: you are prepared for multiple rows, but you don't accidentally parse dts from anywhere else in the document.
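If the goal is to keep each dd together with the dt it belongs to, one possible approach is to walk the children of the dl in order and remember the most recent heading. The sketch below uses PHP's bundled DOM extension instead of Simple HTML DOM, so it runs without extra files. The function name groupDefinitions is made up for illustration, and the query assumes the sample markup from the question, where the Summary class sits on the dl rather than on the tr.

```php
<?php
// Hypothetical helper: returns an array of heading => list of dd texts.
function groupDefinitions(string $html): array
{
    libxml_use_internal_errors(true); // tolerate imperfect markup quietly
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath   = new DOMXPath($doc);
    $groups  = [];
    $current = null;

    // Child elements of the dl come back in document order.
    foreach ($xpath->query('//dl[@class="Summary"]/*') as $node) {
        if ($node->nodeName === 'dt') {
            $current = trim($node->textContent);
            $groups[$current] = [];
        } elseif ($node->nodeName === 'dd' && $current !== null) {
            $groups[$current][] = trim($node->textContent);
        }
    }
    return $groups;
}
```

With the table from the question as input, the result should map "Heading 1" to Cat and Bacon, and "Heading 2" to Narwhal and Ice Soap.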

How to do this string replacement in PHP

I have a string and within that string are some links of the format
Text
I want to replace that entire section with a different piece of markup
The problem is that while I can get the overall structure of the markup to be replaced; and also the URL, it's not so easy for me to get the "Text". If I knew the entire link then I might do something like.
str_replace( $each_link , $my_new_markup , $the_original_string );
and iterate through each link, but I can't, because I can't know exactly what $each_link is going to be.
Is there any way to look for something like this? I am thinking it must have something to do with REGEX but I am totally hopeless at it, and I don't even know if that's the right place to start.
[WILDCARD of some kind]
You could look at a class like Simple HTML DOM Parser, which you can use to cycle through elements, searching for a specific inner HTML or other attribute, and then change it.
The code would look something like this:
foreach($html->find('a') as $element) {
if ($element->innertext == $needle) {
$element->innertext = $my_new_markup;
}
}
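Since the question asks about regular expressions, here is a rough sketch of that route using preg_replace_callback, which hands the URL and the link text to a callback that builds the new markup. It assumes the links are plain anchor tags with a double-quoted href, which the question does not spell out; the function name replaceLinks and the callback's output format are made up for illustration. For arbitrary real-world HTML, a parser is the safer choice.

```php
<?php
// Hypothetical helper: replaces every <a href="...">text</a> with whatever
// $build($url, $text) returns.
function replaceLinks(string $input, callable $build): string
{
    // Group 1 captures the href value, group 2 the link text.
    $pattern = '~<a\b[^>]*\bhref="([^"]*)"[^>]*>(.*?)</a>~is';

    return preg_replace_callback(
        $pattern,
        function (array $m) use ($build) {
            return $build($m[1], $m[2]);
        },
        $input
    );
}

// Example: turn each link into a span that keeps the URL in a data attribute.
echo replaceLinks(
    'Go to <a href="http://example.com/x">Text</a> now.',
    function (string $url, string $text): string {
        return '<span data-url="' . htmlspecialchars($url) . '">' . $text . '</span>';
    }
);
```

The callback keeps the replacement logic in one place, so the same function works whatever the new markup needs to look like.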

How do you access Simple DOM selectors?

I can access some of the 'class' items with
$ret = $html->find('articleINfo'); and then print the first key of the returned array.
However, there are other tags I need, like <span id="firstArticle_0">, and I cannot seem to find them.
$ret = $html->find('#span=id[ etc ]');
In some cases something is returned, but it's not an array, or it is an array with empty keys.
Unfortunately I cannot use var_dump to see the object, since var_dump produces 1000 pages of unreadable junk. The code looks like this.
<div id="articlething">
<p class="byline">By Lord Byron and Alister Crowley</p>
<p>
<span class="location">GEORGIA MOUNTAINS, Canada</span> |
<span class="timestamp">Fri Apr 29, 2011 11:27am EDT</span>
</p>
</div>
<span id="midPart_0"></span><span class="mainParagraph"><p><span class="midLocation">TUSCALOOSA, Alabama</span> - Who invented cheese? Everyone wants to know. They held a big meeting. Tom Cruise is a scientologist. </p>
</span><span id="midPart_1"></span><p>The president and his family visited Chuck-e-cheese in the morning </p><span id="midPart_2"></span><p>In Russia, 900 people were lost in the balls.</p><span id="midPart_3">
Simple HTML DOM can easily be used to find a span with a specific class.
If you want all spans with class=location, then:
// create HTML DOM
$html = file_get_html($iUrl);
// get text elements
$aObj = $html->find('span[class=location]');
Then do something like:
foreach($aObj as $key=>$oValue)
{
echo $key.": ".$oValue->plaintext."<br />";
}
It worked for me using your example; my output was:
label=span, class=location: Found 1
0: GEORGIA MOUNTAINS, Canada
Hope that helps. Simple HTML DOM is great for what it does and easy to use once you get the hang of it. Keep trying and you will build up a number of examples that you can reuse over and over again. I've scraped some pretty crazy pages and they get easier and easier.
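The span[class=location] form above is an attribute selector, so by analogy something like span[id=midPart_0] would be the thing to try for the id case in Simple HTML DOM. As a library-free cross-check, here is a minimal sketch with PHP's bundled DOM extension and XPath that collects every span whose id starts with a given prefix. The function name findSpanIdsByPrefix is made up for illustration, and the prefix is assumed not to contain double quotes.

```php
<?php
// Hypothetical helper: returns the id of every span whose id starts with $prefix.
function findSpanIdsByPrefix(string $html, string $prefix): array
{
    libxml_use_internal_errors(true); // tolerate imperfect markup quietly
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $ids   = [];

    // starts-with() is a standard XPath 1.0 function.
    foreach ($xpath->query('//span[starts-with(@id, "' . $prefix . '")]') as $span) {
        $ids[] = $span->getAttribute('id');
    }
    return $ids;
}

print_r(findSpanIdsByPrefix(
    '<span id="midPart_0"></span><p>One</p><span id="midPart_1"></span>',
    'midPart_'
));
```

Printing a short array like this is also a gentler way to inspect results than var_dump on the whole parsed object.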
Try using this. Worked for me very well and extremely easy to use. http://code.google.com/p/phpquery/
The docs on the PHP Simple DOM parser are spotty on deciphering Open Graph meta tags. Here's what seems to work for me:
<?php
// grab the contents of the page
$summary = file_get_html($url);
// Get image possibilities (for example)
$img = array();
// First, if the webpage has an og:image meta tag, it's easy:
if ($summary->find('meta[property=og:image]')) {
foreach ($summary->find('meta[property=og:image]') as $e) {
$img[] = $e->attr['content'];
}
}
?>

How to number things in PHP?

UPDATE:
I know I can use <ol> directly in the output, but I remember using something like:
<?php echo $i++; ?> when I worked on a WordPress blog once. Every time I inserted that tag, a number greater than the previous one appeared, so I basically did:
<?php echo $i++; ?> Text
<?php echo $i++; ?> Text
<?php echo $i++; ?> Text
I'm a front end guy (HTML/CSS) so please excuse this basic question. I just need to know what code in PHP I can use to number some text.
Text
Text
Text
into:
1. Text
2. Text
3. Text
Kind of like what <ol> does in html but in PHP.
Updated answer:
You can use a variable as you already do (the example you are posting should already work). Just initialize it using $i = 0;
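A small sketch of the counter idea, in case it helps to see it outside a template. The function name numberLines is made up for illustration, and the counter starts at 1 here (rather than 0) so that the first printed number is 1.

```php
<?php
// Hypothetical helper: prefixes each item with a running number.
function numberLines(array $items): string
{
    $i   = 1; // start at 1 so the first printed number is 1
    $out = '';
    foreach ($items as $item) {
        // $i++ yields the current value, then increments it for the next item.
        $out .= $i++ . '. ' . $item . "\n";
    }
    return $out;
}

echo numberLines(['Text', 'Text', 'Text']);
```

In a template, the same effect comes from setting $i = 1; once at the top and then echoing $i++ wherever a number is needed.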
Old answer:
You have a fundamental misunderstanding here. PHP is a scripting language, not a markup language. PHP does operations like connecting to data sources, calculating, making additions, changing entries in databases, and so on. PHP code, in short, is a series of commands that are executed. PHP has no design elements, tags and formatting options in itself.
PHP can (and usually does) output HTML (Where you have <ol>) to display things.
You can have an array of arbitrary data in PHP, coming from a file or data source:
$array = array("First chapter", "Second chapter", "Third chapter");
you can output this data as HTML:
echo "<ol>";
foreach ($array as $element) // Go through each array element and output an <li>
echo "<li>$element</li>";
echo "</ol>";
the result being (roughly)
<ol>
<li>First chapter</li>
<li>Second chapter</li>
<li>Third chapter</li>
</ol>
It depends on what type of file you are trying to write. Most often, PHP is writing a webpage in HTML, but not always. In HTML, if you want a numbered list, you should use an ordered list (<ol>).
If you're just writing a text file of some kind, incrementing and outputting a variable (like $i in your example) should work.
You mention WordPress, so it's worth noting that if you worked on a WordPress template before, you were using dozens of special functions in the WordPress library, even though you may not have been completely aware that was what you were doing. A lot of the PHP heavy lifting is hidden and simplified for the templating engine, and if your current project is not built on that engine, you will have to do that logic yourself.

How to keep PHP 'View Source' html output clean [duplicate]

This has been bugging me today after checking the source out on a site. I use PHP output in my templates for dynamic content. The templates start out in html only, and are cleanly indented and formatted. The PHP content is then added in and indented to match the html formating.
<ul>
<li>nav1</li>
<li>nav2</li>
<li>nav3</li>
</ul>
Becomes:
<ul>
<?php foreach($navitems as $nav):?>
<li><?=$nav?></li>
<?php endforeach; ?>
</ul>
When output in html, the encapsulated PHP lines are dropped but the white space used to format them is left in and throws the view source formatting all out of whack. The site I mentioned is cleanly formatted on the view source output. Should I assume they are using some template engine? Also, would there be any way to clean up the kind of templates I have, without manually removing the whitespace and sacrificing readability on the dev side?
That's something that's bugging me, too. The best you can do is using tidy to postprocess the text. Add this line to the start of your page (and be prepared for output buffering havoc when you encounter your first PHP error with output buffering on):
ob_start('ob_tidyhandler');
You can't really get clean output from inlining PHP. I would strongly suggest using some kind of templating engine such as Smarty. Aside from the clean output, template engines have the advantage of maintaining some separation between your code and your design, increasing the maintainability and readability of complex websites.
i admit, i like clean, nicely indented html too. often it doesn't work out the way i want, for the same reasons you're having. sometimes manual indentation and linebreaks are not preserved, or it doesn't work because of subtemplates where you reset indentation.
and the machines really don't care. not about whitespace, not about comments, the only thing they might care about is minified stuff, so additional whitespace and comments are actually counter-productive. but it's so pretty *sigh*
sometimes, if firebug's not available, i just like it for debugging. because of that most of the time i have an option to activate html tidy manually for the current request. be careful: tidy automatically corrects certain errors (depending on the configuration options), so it may actually hide errors from you.
Does "pretty" HTML output matter? You'll be pasting the output HTML into an editor whenever you want to poke through it, and the editor will presumably have the option to format it correctly (or you need to switch editors!).
I find the suggestions to use an additional templating language (because that's exactly what PHP is) abhorrent. You'd slow down each and every page to correct the odd space or tab? If anything, I would go the other direction and lean towards running each page through a tool to remove the remaining whitespace.
The way I do it is:
<ul>
<?php foreach($navitems as $nav):?>
<li><?=$nav?></li>
<?php endforeach; ?>
</ul>
Basically all my conditionals and loop blocks are flush left within the views. If they are nested, I indent inside the PHP start tag, like so:
<ul>
<?php foreach($navitems as $nav):?>
<?php if($nav!== null) : ?>
<li><?=$nav?></li>
<?php endif; ?>
<?php endforeach; ?>
</ul>
This way, I see the presentation logic clearly when I skim the code, and it makes for clean HTML output as well. The output inside the blocks are exactly where I put them.
A warning though, PHP eats newlines after the closing tag ?>. This becomes a problem when you do something like outputting inside a <pre> block.
<pre>
<?php foreach($vars as $var): ?>
<?=$var?>
<?php endforeach; ?>
</pre>
This will output:
<pre>
0 1 2 3 4 5 </pre>
This is kind of a hack, but adding a space after the <?=$var?> makes it clean.
Sorry for the excessive code blocks, but this has been bugging me for a long time as well. Hope it helps, after about 7 months.
The few times I have tidied my output for debugging my generated HTML code, I have used tabs and newlines, i.e.:
print "<table>\n";
print "\t<tr>\n";
print "\t\t<td>\n";
print "\t\t\tMy Content!\n";
print "\t\t</td>\n";
print "\t</tr>\n";
print "</table>\n";
I about fell over when I read "I'm really curious why you think it's important to have generated HTML that's 'readable'". Unfortunately, there were quite a few people on this page (and elsewhere) that think this way... that the browser reads it the same, so why worry about the way the code looks.
First, keeping the "code" readable makes debugging (or working in it in general by you or a developer in the future) much easier in almost all cases.
Furthermore, AND MOST IMPORTANTLY, it's referred to as quality of workmanship. It's the difference between a Yugo and a Mercedes. Yes, they are both cars and they both will take you from point "A" to point "B". But, the difference is in the quality of the product with mostly what is not seen. There is nothing worse than jumping into a project and first having to clean up someone else's code just to be able to make sense of things, all because they figured that it still works the same and have no pride in what they do. Cleaner code will ALWAYS benefit you and anyone else that has to deal with it not to mention reflect a level of pride and expertise in what you do.
If it's REAL important in your specific case, you could do this...
<ul><?php foreach($navitems as $nav):?>
<li><?=$nav?></li><?php endforeach; ?>
</ul>
Although that is worse in my opinion, because your code is less readable, even though the HTML is as you desire.
I don't care how clean the output is - it's the original source code that produced it that has to be easy to parse - for me as a developer.
If I was examining the output, I'll run it through tidy to clean it up, if it were required to take a good look at it - but validators don't care about extra spaces or tabs either.
In fact, I'm more likely to strip whitespace out of the output HTML than put any in - less bytes on the wire = faster downloads. not by much, but sometimes it would help in a high traffic scenario (though of course, gzipping the output helps more).
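As a rough sketch of the "strip it instead" idea, an output-buffer callback can collapse the whitespace between tags before the page is sent. The function name stripInterTagWhitespace is made up for illustration. This is deliberately naive: it would also alter the contents of pre blocks and can change spacing between inline elements, so treat it as a starting point only.

```php
<?php
// Hypothetical helper: removes whitespace that sits between a closing ">"
// and the next opening "<".
function stripInterTagWhitespace(string $html): string
{
    return preg_replace('/>\s+</', '><', $html);
}

// Possible usage at the very top of a page, before any output:
// ob_start('stripInterTagWhitespace');

echo stripInterTagWhitespace("<ul>\n    <li>nav1</li>\n    <li>nav2</li>\n</ul>");
```

Gzipping the response, as mentioned above, usually saves far more bytes than this does.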
Viewing unformatted source is very annoying with multiple nested divs and many records each containing these divs..
I came across this Firefox addon called Phoenix Editor. You can view your source in its editor and then click "format" and it works like a charm!
Try xtemplate http://www.phpxtemplate.org/HomePage. It's not as well documented as I'd like, but I've used it to great effect.
you would have something like this
<?php
$response = new xtemplate('template.htm');
foreach($navitems as $item)
{
$response->assign('stuff',$item);
$response->parse('main.thelist');
}
$response->parse('main');
$response->out('main');
?>
And the html file would contain
<!-- BEGIN: main -->
<html>
<head></head>
<body>
<ul>
<!-- BEGIN: thelist -->
<li>{stuff}</li>
<!-- END: thelist -->
</ul>
</body>
</html>
<!-- END: main -->
I agree, a clean source is very important. When it's well commented and well structured, maintenance on those sources, scripts, or code is very quick and simple. You should look into fragmenting your main page, using require (prior.php, header.php, title.php, content.php, post.php) in the corresponding places. Then write a new function in prior.php that will parse and lay out the html tags using the explode method and a string splitter. Keep an integer for the tab index: whenever </ is in the function's string, decrement the integer; whenever < and > (but not /> or </) are in the string, increment it. It all has to be placed properly. Then use a for loop to build a tab string that indents the contents by that many tabs.
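The tab-index idea described above could be sketched roughly like this. It is only an illustration of the counting scheme: the function name indentHtml is made up, it uses preg_split rather than explode, and it does not know about void elements such as br or img, which would throw the depth off.

```php
<?php
// Hypothetical helper: puts every tag and text run on its own line,
// indented by one tab per level of nesting.
function indentHtml(string $html): string
{
    // Split into tags and the text between them, keeping the tags.
    $tokens = preg_split(
        '/(<[^>]+>)/',
        $html,
        -1,
        PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY
    );

    $depth = 0;
    $lines = [];

    foreach ($tokens as $token) {
        $token = trim($token);
        if ($token === '') {
            continue; // whitespace between tags
        }

        $isTag     = $token[0] === '<';
        $isClosing = strncmp($token, '</', 2) === 0;
        // Self-closing tags, comments, and doctypes do not open a new level.
        $isNeutral = $isTag && (substr($token, -2) === '/>' || strncmp($token, '<!', 2) === 0);

        if ($isClosing) {
            $depth = max(0, $depth - 1); // closing tag: step back out first
        }

        $lines[] = str_repeat("\t", $depth) . $token;

        if ($isTag && !$isClosing && !$isNeutral) {
            $depth++; // opening tag: everything after it is one level deeper
        }
    }

    return implode("\n", $lines);
}

echo indentHtml('<ul><li>nav1</li><li>nav2</li></ul>');
```

Hooked up through ob_start, a function like this would run on every request, so it is probably best kept for debugging rather than production.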
