Extract text including line breaks from XML with PHP - php

When I extract text from an XML file
Here is some text before the
<br/><br/>
line break.
in PHP,
echo $value->description;
I get the text, but not the br tags. How do I get around this?
Thanks.

From experience, you shouldn't even get any text after the <br/> tags. The reason is that text nodes in XML are supposed to have < and > replaced with their htmlentities() counterparts, and all other special characters escaped with htmlspecialchars(). I'm fairly certain that it either causes an error in your XML DOM parser or is at least treated as a new node, an empty node with a line break, I think.
The only solution for this is to read the XML into a string, use a regex to find the <br/> tags (well, all the < and > characters for that matter), and replace them with the escaped values noted above.
Or you can read about CDATA and wrap the tags in a CDATA section instead, but that only applies if you are the one creating the XML file. Otherwise, you should notify the webmaster of the site you got the XML from that the XML is incorrectly created.

First, read the XML file into one string, and then replace '<br/>' with '&lt;br/&gt;'. Then load the replaced string as XML data and process it with XML DOM.
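A minimal sketch of that escape-then-parse approach. The <items>/<item> wrapper is a hypothetical structure (the question does not show the full document), and the sketch assumes <br/> is the only inline tag that needs preserving:

```php
<?php
// Hypothetical sample document; only the <description> content comes from the question.
$xml = '<items><item><description>Here is some text before the<br/><br/>line break.</description></item></items>';

// Escape the inline tags so the parser sees them as text, not as child elements.
$escaped = str_replace('<br/>', '&lt;br/&gt;', $xml);

$doc = simplexml_load_string($escaped);

// The entities are decoded on read, so the string contains literal <br/> again.
$description = (string) $doc->item->description;

echo $description;
```

Because the output contains real <br/> markup, it should only be echoed into an HTML page when the rest of the text is trusted.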

Related

Parse CDATA with tag inside, using XMLReader PHP

I would need to parse an XML file which contains a CDATA tag. Inside this tag, there is another tag that I want to get. How can I achieve this by using XMLReader?
Example:
<glz:Param name="TITLE">
<![CDATA[Yellow <http://www.yellow.it>]]>
</glz:Param>
How can I get the whole info Yellow <http://www.yellow.it>? I can only get 'Yellow'.
This is my code:
// load file, create a reader variable, etc.
if($reader->nodeType == XMLReader::CDATA)
{
echo $reader->value;
}
As per your comments:
The issue is likely that XMLReader correctly fetches the entire content of the CDATA section, but your browser interprets it as HTML again. Check the page source to see if it contains the tag-like part (<http://www.yellow.it>). If so, try
echo htmlentities($reader->value);
or send a header with content-type: text/plain.
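A minimal sketch of that check, assuming a hypothetical <root> wrapper and a made-up namespace URI for the glz prefix (neither is shown in the question). It reads the CDATA node and escapes it before output:

```php
<?php
// The wrapper element and the namespace URI are assumptions for illustration.
$xml = '<root xmlns:glz="http://example.com/glz">'
     . '<glz:Param name="TITLE"><![CDATA[Yellow <http://www.yellow.it>]]></glz:Param>'
     . '</root>';

$reader = new XMLReader();
$reader->XML($xml);

$raw = '';
while ($reader->read()) {
    if ($reader->nodeType === XMLReader::CDATA) {
        // The reader returns the complete CDATA content, angle brackets included.
        $raw = $reader->value;
    }
}
$reader->close();

// Escape for HTML output so the browser does not treat <http://...> as a tag.
$safe = htmlspecialchars($raw, ENT_QUOTES, 'UTF-8');

echo $safe;
```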
You can get it with a string search. As you can see, the string "http://www.yellow.it]]>" is not XML, so you cannot parse it using XMLReader.
Search the string instead.
For example, you can split the string at "http:", which gives you two substrings.
From the second substring, you can get the full link without the trailing ">]]>".
Hope it will be helpful.

PHP and DOM - parsing error an XML with inside entities

I have an XML document:
<title>My title</title>
<text>This is a text and I love it <3 </text>
When I try to parse it with DOM, I have an error because of the "<3":
Warning: DOMDocument::loadXML(): StartTag: invalid element name in Entity...
Do you know how I can escape all the special characters inside while keeping my XML tree? The goal is to use this method: $document->loadXML($xmlContent);
Thanks a lot for your answers.
EDIT: I forgot to say that I cannot modify the XML. I receive it like that and I have to deal with it...
The symbol "<" is reserved in XML markup and has a predefined entity, so it cannot be used literally in text content. It should be replaced with:
&lt;
http://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references
So the input text should be:
<title>My title</title>
<text>This is a text and I love it &lt;3 </text>
An XML document built like that should be rejected, and whoever sends it should replace the reserved characters with their predefined entities. Doing that with tools like htmlentities() and htmlspecialchars(), as Y U NO WORK suggests, is easy and straightforward.
Now, if you really need to parse said data, you need to sanitize it prior to parsing. This is not a recommended behaviour, particularly if you are receiving arbitrary text, but if it is a set of known or predictable characters, regular expressions can do the job.
This one, in particular, will escape a single "<" contained in a "text" element that otherwise consists only of letters, digits or spaces:
$xmlContent = preg_replace('/(<text>[a-zA-Z 0-9]*)<([a-zA-Z 0-9]*<\/text>)/', '$1&lt;$2', $xmlContent);
It is very specific, but it is done on purpose: regular expressions are really bad at matching nested structures, such as HTML or XML. Applying more arbitrary regular expressions to HTML or XML can have wildly unexpected behaviours.
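A minimal sketch of the sanitize-then-parse flow. It assumes a hypothetical <doc> root element (the snippet in the question has none) and uses a variant of the pattern that requires the stray "<" to be present and writes the entity &lt; in its place:

```php
<?php
// The <doc> wrapper is an assumption; loadXML() needs a single root element.
$xmlContent = '<doc><title>My title</title><text>This is a text and I love it <3 </text></doc>';

// Escape one stray "<" inside a <text> element made of letters, digits and spaces.
$xmlContent = preg_replace(
    '/(<text>[a-zA-Z 0-9]*)<([a-zA-Z 0-9]*<\/text>)/',
    '$1&lt;$2',
    $xmlContent
);

$document = new DOMDocument();
$document->loadXML($xmlContent);

// The entity is decoded again when the text is read back.
$text = $document->getElementsByTagName('text')->item(0)->textContent;

echo $text;
```

Anything outside the narrow character class (punctuation, a second "<", nested tags) is left untouched and will still fail to parse, which is the intended trade-off of such a specific pattern.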
XML requires every tag name to start with a letter (or an underscore); nothing else is allowed, so a tag beginning with <3 is not possible.
A workaround for this could be htmlentities() or htmlspecialchars(). But even that won't add a valid character to the beginning, so you should think about either:
Manually add a letter in front of the tag with if
Rework your XML so nothing like that can ever happen.
You need to put the content with special chars inside CDATA:
<text><![CDATA[This is a text and I love it <3 ]]></text>
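As a quick check, assuming the sender can be persuaded to produce the CDATA form, DOMDocument loads it without warnings and returns the text unchanged:

```php
<?php
$document = new DOMDocument();

// Inside a CDATA section the "<" is plain character data, not markup.
$document->loadXML('<text><![CDATA[This is a text and I love it <3 ]]></text>');

$text = $document->documentElement->textContent;

echo $text;
```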

Parsing XML - line feed, carriage return??? I'm confused?

Ok, I have searched for about 3 hours and have decided to post this. I am pulling an XML feed and have one XML element that has a bunch of text forming one paragraph. When I look at the source, though, I see it broken up with carriage returns (as mentioned in the title, I'm not sure if that's the correct term).
Here is the feed I am pulling from: http://jobs.cbizsoft.com/cbizjobs/jobdetail_post.aspx?cid=cbiz_advantech&jobid=Req-0005
I am using php to build the xml file and then jquery/ajax to build the page as needed.
My question is if I can use php to parse the breaks and format the output to look nicer?
Thanks for the help!
Ok, if I understand you correctly, the problem is that when the text is output in your HTML document, the line breaks are gone. This is because in HTML, line breaks (like all white space) are collapsed into a single space, so
<div>Hello World!</div>
and
<div>Hello
World!</div>
produce the same output.
There are several ways you can solve this:
Put the CSS style white-space: pre-line (or pre-wrap) on the surrounding element.
Or use PHP to replace all line breaks with <br>
Or use a markdown library that basically does the same as the second point, but with additional kinds of formatting, such as properly wrapping paragraphs or turning bullet lists into real HTML lists.
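The first option can be sketched in PHP as follows. The sample text is a shortened, hypothetical version of the feed's description, and the inline style attribute stands in for a rule that would normally live in a stylesheet:

```php
<?php
// Hypothetical sample text with real line breaks, as it would arrive from the feed.
$description = "- Support acquisition processes.\n- Analyze acquisition policy.";

// Escape the text, then let CSS preserve the line breaks instead of inserting <br>.
$html = '<div style="white-space: pre-line">'
      . htmlspecialchars($description, ENT_QUOTES, 'UTF-8')
      . '</div>';

echo $html;
```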
You should be using a CDATA section for the description data so that any offending characters are ignored by XML parsers:
<Item name='Description' caption='Description'>
<![CDATA[
- Support acquisition and installation processes and coordinate with multiple SPAWAR and Navy stakeholders.
- Analyze acquisition policy life cycle and provide analytical support.
...
]]>
</Item>
Melaos is correct that carriage returns (and other whitespace characters) are valid within XML documents.
See the nl2br() function, which will put HTML line breaks for each actual line in the text.
Here's a quick example with your XML.
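A minimal sketch of the nl2br() approach, using a shortened, hypothetical version of the Item element rather than the real feed:

```php
<?php
// Shortened sample of the feed's Item element (the content is illustrative).
$xml = <<<XML
<Item name='Description' caption='Description'><![CDATA[- Support acquisition and installation processes.
- Analyze acquisition policy life cycle.]]></Item>
XML;

$item = simplexml_load_string($xml);

// Casting the element to string returns the CDATA content with its line breaks.
$text = trim((string) $item);

// Escape first, then turn each line break into an HTML <br /> tag.
$html = nl2br(htmlspecialchars($text, ENT_QUOTES, 'UTF-8'));

echo $html;
```

Escaping before calling nl2br() matters: done the other way round, the inserted <br /> tags would be escaped as well.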

Insert CDATA into an XML

I'm in a real hurry right now, and I'm begging REGEX masters for help!
I'm receiving an XML document through an HTTP request, and I just can't parse it since it contains some special chars that are not wrapped in CDATA sections.
example XML:
<root>
<node>good node</node>
<node>bad node containing &</node>
</root>
Trying to parse this XML with simplexml_load_string($xml) I get:
Warning: simplexml_load_string() [function.simplexml-load-string]:
Entity: line 3: parser error : xmlParseEntityRef: no name in /..../file.php on line ##
Supposing that the bad nodes will not contain > or <, I need a REGEX that will wrap the text in those nodes in CDATA sections. I guess there will be some lookarounds, I just can't do it quickly.
Thank you!
If you can indeed assume that there will be no < or > characters inside the nodes you want to CDATA-ize, then this should work just fine for your situation:
>(?=[^<&]*&)([^<]*)<
replacing with
><![CDATA[\1]]><
This expression only looks for nodes that contain & characters (whether or not they are part of HTML entities), then wraps the contents of those nodes in a CDATA section. If you need to ignore & characters inside entities, that's a considerable bit tougher, but I'd be willing to give it a look.
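A minimal sketch of that replacement in PHP, run on the sample from the question with a properly closed root element. The replacement is written out in full CDATA syntax and puts back the > and < that the match consumes:

```php
<?php
$xml = "<root>\n<node>good node</node>\n<node>bad node containing &</node>\n</root>";

// Wrap the content of any node whose text contains "&" in a CDATA section.
$fixed = preg_replace('/>(?=[^<&]*&)([^<]*)</', '><![CDATA[$1]]><', $xml);

$root = simplexml_load_string($fixed);

$nodes = [];
foreach ($root->node as $node) {
    $nodes[] = (string) $node;
}

print_r($nodes);
```

Nodes without an ampersand are left as they are, so only the problematic text is wrapped.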

Get Text Content of current URL in php

I am working on fetching content from a URL.
I want to fetch ONLY the text content from this site:
http://en.wikipedia.org/wiki/Asia
How is this possible? I can already fetch the URL and its title using PHP.
I got the URL title using the code below:
$url = getenv('HTTP_REFERER');
$file = file($url);
$file = implode("",$file);
//$get_description = file_get_contents($url);
if (preg_match("/<title>(.+)<\/title>/i", $file, $m)) {
    // Only echo the title when a match was actually found.
    $get_title = $m[1];
    echo $get_title;
}
Could you please help me get the content?
Using file_get_contents I could only get the HTML code. Are there any other possibilities?
Thanks -
Haan
If you just want to get a textual version of an HTML page, then you will have to process it yourself. Fetch the HTML (as you seem to already know how to do) and then process it into plain text with PHP.
There are several approaches to doing this. The first is htmlspecialchars() which will escape all the HTML special characters. I don't imagine this is what you actually want but I thought I'd mention it for completeness.
The second approach is strip_tags(). This will remove all HTML tags from an HTML document. However, it doesn't validate the input it's working with; it just does a fairly simple text replace. This means you will end up with stuff in the textual representation that you might not want (such as the contents of the head section, or the innards of embedded JavaScript and stylesheets).
The other approach is to parse the downloaded HTML with DOMDocument. I've not written code for you (don't have time), but the general procedure would be as follows:
Load the HTML into a DOMDocument object
Get the document's body element and iterate over its children.
For each child, if the child in question is a text node, append it to an output string. If it isn't a text node, then iterate over its children as well to check if any of its children are text nodes (and if not then iterate over those child elements as well and so on). You might also want to check the type of the node further. For example, if you don't want javascript or css embedded in the output then you can check that the tag type is not STYLE or SCRIPT and just ignore it if it is.
The above description is most easily implemented as a recursive function (one that calls itself).
The end result should be a string that contains only the textual content of the downloaded page, with no markup.
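The procedure above can be sketched as a small recursive function. The function name extractText and the sample markup are hypothetical; in real use the HTML would be the downloaded page:

```php
<?php
// Recursively collect text nodes, skipping <script> and <style> elements.
function extractText(DOMNode $node): string
{
    if ($node->nodeType === XML_TEXT_NODE) {
        return $node->nodeValue;
    }
    if ($node->nodeType !== XML_ELEMENT_NODE) {
        return ''; // comments, CDATA and other node types are ignored
    }
    if (in_array(strtolower($node->nodeName), ['script', 'style'], true)) {
        return '';
    }
    $text = '';
    foreach ($node->childNodes as $child) {
        $text .= extractText($child);
    }
    return $text;
}

// Hypothetical sample page standing in for the downloaded HTML.
$html = '<html><head><title>T</title><style>p{color:red}</style></head>'
      . '<body><p>Hello <b>world</b></p><script>var x = 1;</script></body></html>';

libxml_use_internal_errors(true); // real-world HTML is rarely valid; silence warnings
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$body = $doc->getElementsByTagName('body')->item(0);

// Collapse runs of whitespace so the result reads as plain text.
$text = trim(preg_replace('/\s+/', ' ', extractText($body)));

echo $text;
```

Starting from the body element is what keeps the title and other head content out of the result.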
EDIT: Forgot about strip_tags! I updated my answer to mention that as well. I left my DOMDocument approach in my answer though, because as the documentation for strip_tags states, it does no validation of the markup it's processing, whereas DOMDocument attempts to parse it (and can potentially be more robust if a DOMDocument-based text extraction is implemented well).
Use file_get_contents to get the HTML content and then strip_tags to remove the HTML tags, thus leaving only the text.
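A minimal sketch of the strip_tags() step. The markup is a hypothetical stand-in; in real use $html would come from file_get_contents($url):

```php
<?php
// Hypothetical sample; in practice: $html = file_get_contents($url);
$html = '<html><head><title>Asia</title></head>'
      . '<body><h1>Asia</h1><p>Largest <b>continent</b>.</p></body></html>';

// Removes the tags only; text from the head section is kept as well.
$text = strip_tags($html);

echo $text;
```

The title text ends up in the result next to the body text, which is the lack of validation mentioned in the other answer.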
