I have an XML file similar to this:
<?xml version='1.0'?>
<page>
<desc><title>user</title><username>user</username>
<petcount>0</petcount>
<pagedt><![CDATA[<html><body><p><center><h2>I am amazing.</h2></center>
</p></body></html>]]></pagedt></desc>
<petlist></petlist><friends></friends><messages><message><user>Admin</user><link>/admin</link>
<note>Welcome to My website!</note></message></messages></page>
I am trying to get PHP to edit only the text in <pagedt>. So far I have a textarea that displays the content currently in the file, inside a form that submits to PHP_SELF via POST. Any ideas as to how I can edit just the contents of pagedt?
Assuming that PHP can read and write the file, all you need is either an XML parser or a regular expression that matches the inner content of <pagedt>, like this:
$fcontent=file_get_contents("file.xml");
$content=array();
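// "~" is the delimiter so the "/" in </pagedt> needs no escaping; the "s" modifier lets "." match newlines
preg_match('~^(.*<pagedt>)(.*)(</pagedt>.*)$~s', $fcontent, $content);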
//do your stuff on $content[2], which contains what's between <pagedt> and </pagedt>
file_put_contents("file.xml", $content[1].$content[2].$content[3]);
Instead of regular expressions, you can also use String functions like strstr, or combine strpos and substr. But it's no fun then ;)
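For completeness, here is a minimal sketch of the strpos/substr variant mentioned above. The function name replacePagedt is made up for illustration, and it assumes the file contains a single <pagedt> element with no attributes.

```php
<?php
// Hypothetical helper: swap the text between the first <pagedt> and </pagedt>.
function replacePagedt(string $xml, string $newInner): string
{
    $open  = '<pagedt>';
    $close = '</pagedt>';

    $start = strpos($xml, $open);
    if ($start === false) {
        return $xml; // no <pagedt> element, leave the input untouched
    }
    $start += strlen($open);

    $end = strpos($xml, $close, $start);
    if ($end === false) {
        return $xml; // unterminated element, leave the input untouched
    }

    // Keep everything before and after the element's content as-is.
    return substr($xml, 0, $start) . $newInner . substr($xml, $end);
}

$xml = '<page><desc><pagedt><![CDATA[<p>old</p>]]></pagedt></desc></page>';
echo replacePagedt($xml, '<![CDATA[<p>new</p>]]>'), "\n";
```

The caller is responsible for wrapping the new text in a CDATA section (as in the sample call) or escaping it, since this version treats the file as a plain string.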
Have you tried parsing it as XML? I know PHP's DOM library allows you to parse it into an object which you can use to return the pagedt tag and the text contents from there. Try starting here:
http://www.php.net/manual/en/book.dom.php
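A rough sketch of what that could look like with DOMDocument. The helper name setPagedt and the sample string are assumptions for illustration; in practice the XML would come from file_get_contents() and the result would go back out with file_put_contents().

```php
<?php
// Hypothetical helper: replace the CDATA payload of the first <pagedt> element.
function setPagedt(string $xml, string $newHtml): string
{
    $doc = new DOMDocument();
    $doc->loadXML($xml);

    $pagedt = $doc->getElementsByTagName('pagedt')->item(0);
    if ($pagedt === null) {
        return $xml; // nothing to edit
    }

    // Drop the old children, then add the new content as one CDATA section.
    while ($pagedt->firstChild !== null) {
        $pagedt->removeChild($pagedt->firstChild);
    }
    $pagedt->appendChild($doc->createCDATASection($newHtml));

    return $doc->saveXML();
}

$xml = "<?xml version='1.0'?><page><desc><title>user</title>"
     . "<pagedt><![CDATA[<h2>I am amazing.</h2>]]></pagedt></desc></page>";
echo setPagedt($xml, '<h2>Still amazing.</h2>');
```

To pre-fill the textarea, the current value is available as $pagedt->textContent; pass it through htmlspecialchars() before printing it into the form.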
Related
Hi everyone, I have a problem extracting the string from this tag structure:
<text:p>ス<text:span text:style-name="T1">イ</text:span>カ</text:p>
I want to get all the characters inside the text:p and text:span tags.
The output should look like this: スイカ
How can I compose an XPath expression to extract the above?
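Try one of these XPath expressions on the XML in your question (the text prefix has to be bound to a namespace for the document to parse and for the last two expressions to work):
//*[name()='text:p']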
or
//text:p
or
string(//text:p)
The output should be:
スイカ
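In PHP this could be run roughly as follows. The snippet in the question does not show the namespace declaration, so this sketch wraps it in a hypothetical <root> element that binds the text prefix to the OpenDocument text namespace; adjust the URI if your document declares a different one.

```php
<?php
// Hypothetical wrapper: <root> only exists here to declare the "text" prefix.
$xml = '<root xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0">'
     . '<text:p>ス<text:span text:style-name="T1">イ</text:span>カ</text:p>'
     . '</root>';

$doc = new DOMDocument();
$doc->loadXML($xml);

$xpath = new DOMXPath($doc);
// The prefix used in the query must be registered against the same URI.
$xpath->registerNamespace('text', 'urn:oasis:names:tc:opendocument:xmlns:text:1.0');

// string() concatenates every descendant text node of the first match.
$result = $xpath->evaluate('string(//text:p)');
echo $result, "\n"; // prints スイカ
```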
I have a LARGE XML file. I'm troubleshooting some things, and I would like to extract specific nodes from the XML file. I don't want a SimpleXML object, I want to make a new file with the raw string matching what I want (posting this on bash/sed/php).
<?xml version="1.0" encoding="UTF-8"?>
<definition></definition>
<metadata></metadata>
<nodeToRegex>
<nodeImightwant>
<subnode>
<subsubnode1></subsubnode1>
<subsubnodeToCheck>stringCheck</subsubnodeToCheck>
<subsubnode2></subsubnode2>
</subnode>
</nodeImightwant>
<nodeImightwant></nodeImightwant>
<nodeImightwant></nodeImightwant>
</nodeToRegex>
So from this XML file, I want all lines from every node except the nodeToRegex. From nodeToRegex, I only want the nodeImightwant if the stringCheck string equals "aValidString". Can this be done via regex or should I just copy and paste the stuff out of the file? (my regex skills are subpar)
Don't parse XML with regexes. There is no reason you can't repackage/rearrange the data using SimpleXML, but trying to do it with a regex is a recipe for lots of headaches and, ultimately, broken code.
See this classic example for why parsing XML/HTML/XHTML with regexes is the road to madness.
If you insist on using a regex, just replace the nodes you don't want, like this:
// the "s" modifier lets "." match newlines, so the pattern can span a multi-line node
$myxml = preg_replace('~<nodeToRegex>.*?</nodeToRegex>~s', '', $myxml);
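If you go the parser route instead, a minimal sketch with DOMDocument and DOMXPath might look like this. It assumes the real file has a single root element (the hypothetical <root> below, since the sample in the question shows several top-level nodes) and that the check value sits at subnode/subsubnodeToCheck; filterNodes is a made-up name.

```php
<?php
// Hypothetical helper: drop every <nodeImightwant> whose <subsubnodeToCheck>
// is not exactly $wanted, and return the remaining document as a string.
function filterNodes(string $xml, string $wanted): string
{
    $doc = new DOMDocument();
    $doc->loadXML($xml);
    $xpath = new DOMXPath($doc);

    // Snapshot the candidates before removing any of them.
    $candidates = iterator_to_array($xpath->query('//nodeToRegex/nodeImightwant'));
    foreach ($candidates as $node) {
        // Evaluates to '' when the node has no check element at all.
        $check = $xpath->evaluate('string(subnode/subsubnodeToCheck)', $node);
        if ($check !== $wanted) {
            $node->parentNode->removeChild($node);
        }
    }

    return $doc->saveXML();
}

$xml = '<root><definition/><metadata/><nodeToRegex>'
     . '<nodeImightwant><subnode><subsubnodeToCheck>aValidString</subsubnodeToCheck></subnode></nodeImightwant>'
     . '<nodeImightwant><subnode><subsubnodeToCheck>other</subsubnodeToCheck></subnode></nodeImightwant>'
     . '<nodeImightwant/>'
     . '</nodeToRegex></root>';
echo filterNodes($xml, 'aValidString');
```

DOMDocument loads the whole file into memory, so for a very large file a streaming reader such as XMLReader may be the better fit.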
I'm looking for a simple method to get the first table of a webpage and put the whole thing into a string, that is all.
So I need to know how to use preg_match or similar to get the first instance of a table from a DOM object and get that whole thing into a string:
I have a class that downloads webpages as DOM, but I cannot convert the HTML to a string the way I need it.
$nodes = $this->bot->QuerySelector($this->download['DOM'], "//table[1][@class='tyebfghjftsdf-ccfkk']");
Please help
I would use Tidy to convert the page to valid XHTML, then read it with XMLReader (without building a DOM), start echoing data when the <table> tag is found, and stop at the </table> tag. No regular expressions involved.
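A sketch of the reading half of that idea, assuming the page has already been converted to well-formed XHTML (the Tidy step is left out here because the tidy extension is not always installed). firstTable is a made-up helper name.

```php
<?php
// Hypothetical helper: return the first <table> element of an XHTML string,
// markup included, or null when there is none.
function firstTable(string $xhtml): ?string
{
    $reader = new XMLReader();
    $reader->XML($xhtml);

    $table = null;
    // Stream through the nodes and stop at the first opening <table> tag.
    while ($reader->read()) {
        if ($reader->nodeType === XMLReader::ELEMENT && $reader->localName === 'table') {
            $table = $reader->readOuterXml(); // the element and everything inside it
            break;
        }
    }
    $reader->close();

    return $table;
}

$xhtml = '<html><body><p>intro</p>'
       . '<table class="a"><tr><td>first</td></tr></table>'
       . '<table><tr><td>second</td></tr></table>'
       . '</body></html>';
echo firstTable($xhtml), "\n";
```

readOuterXml() hands back the whole element in one call, so there is no need to track the closing tag by hand.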
Consider the xml
<data>
<node1>
some text some text <nested-node>nest node content</nested-node> some text
</node1>
</data>
I want to access the <node1> tag (which I can do), but I want to get its content as follows:
some text some text <nested-node>nest node content </nested-node>
some text
How can I achieve this?
The problem is that your XML on top is not well formed.
You cannot use tags in a text sequence. The parser removes such elements.
XML Escaping
List of escape characters
With this:
$xml = new SimpleXMLElement(file_get_contents('xmltest.xml'));
var_dump($xml->node1);
you can read the node.
I got curious about this question. Is it really impossible to do what the OP wants? @Stony noted that malformed XML makes this impossible, and I wouldn't be surprised if the XML functions don't work with malformed XML in the way the OP wants.
Here's an example of nesting element in XML: http://www.featureblend.com/xml-nesting.html
If you form your XML like this:
<data>
<master_node>
<nested1>First nested node text.</nested1>
<nested2>Second nested node text.</nested2>
<nested3>Third nested node text.</nested3>
</master_node>
</data>
You are able to get all the text contents:
$xml = new SimpleXMLElement(file_get_contents('xmltest.xml'));
var_dump($xml->master_node);
Maybe it's possible to do something similar with JavaScript or plain PHP. Just load the file into a variable and parse it with regex or string search functions. Parsing XML with regex, yes I know, it's like parsing HTML with regex, a big no-no...
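For what it's worth, text mixed with child elements is allowed in XML, and the sample from the question loads in DOMDocument as posted. Here is a minimal sketch of one way to get the inner markup of <node1> back as a string; innerXml is a made-up helper name.

```php
<?php
// Hypothetical helper: serialize the children of the first <$tag> element,
// keeping nested markup such as <nested-node> intact.
function innerXml(string $xml, string $tag): string
{
    $doc = new DOMDocument();
    $doc->loadXML($xml);

    $node = $doc->getElementsByTagName($tag)->item(0);
    if ($node === null) {
        return '';
    }

    $out = '';
    foreach ($node->childNodes as $child) {
        // saveXML($child) serializes one node: text as text, elements with their tags.
        $out .= $doc->saveXML($child);
    }

    return trim($out);
}

$xml = '<data><node1>some text some text '
     . '<nested-node>nest node content</nested-node> some text</node1></data>';
echo innerXml($xml, 'node1'), "\n";
// prints: some text some text <nested-node>nest node content</nested-node> some text
```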
I am working on fetching content from a URL.
I want to fetch ONLY the text content from this site:
http://en.wikipedia.org/wiki/Asia
How is this possible? I can already fetch the page title and URL using PHP.
I got the url title using the below code:
$url = getenv('HTTP_REFERER');
$file = file($url);
$file = implode("",$file);
//$get_description = file_get_contents($url);
if(preg_match("/<title>(.+)<\/title>/i",$file,$m))
$get_title = $m[1];
echo $get_title;
Could you please help me get the content?
Using file_get_contents I could only get the HTML code. Are there any other possibilities?
Thanks -
Haan
If you just want to get a textual version of a HTML page, then you will have to process it yourself. Fetch the HTML (as you seem to already know how to do) and then process it into plain text with PHP.
There are several approaches to doing this. The first is htmlspecialchars() which will escape all the HTML special characters. I don't imagine this is what you actually want but I thought I'd mention it for completeness.
The second approach is strip_tags(). This will remove all HTML tags from an HTML document. However, it doesn't validate the input it's working with; it just does a fairly simple text replace. This means you may end up with content you don't want in the textual representation (such as the contents of the head section, or the innards of embedded JavaScript and stylesheets).
The other approach is to parse the downloaded HTML with DOMDocument. I've not written code for you (don't have time), but the general procedure would be roughly as follows:
Load the HTML into a DOMDocument object
Get the document's body element and iterate over its children.
For each child, if the child in question is a text node, append it to an output string. If it isn't a text node, then iterate over its children as well to check if any of its children are text nodes (and if not then iterate over those child elements as well and so on). You might also want to check the type of the node further. For example, if you don't want javascript or css embedded in the output then you can check that the tag type is not STYLE or SCRIPT and just ignore it if it is.
The above description is most easily implemented as a recursive function (one that calls itself).
The end result should be a string that contains only the textual content of the downloaded page, with no markup.
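The procedure above can be sketched as follows. This is only one possible reading of it: the function names collectText and htmlToText are made up, and the whitespace collapsing at the end is an extra step that is not part of the description.

```php
<?php
// Hypothetical helper: collect the text nodes under $node, skipping <script> and <style>.
function collectText(DOMNode $node): string
{
    if ($node->nodeType === XML_TEXT_NODE) {
        return $node->nodeValue;
    }
    if ($node->nodeType !== XML_ELEMENT_NODE) {
        return ''; // comments, CDATA sections and the like
    }
    if (in_array(strtolower($node->nodeName), ['script', 'style'], true)) {
        return '';
    }

    $text = '';
    foreach ($node->childNodes as $child) {
        $text .= collectText($child); // recurse into nested elements
    }
    return $text;
}

function htmlToText(string $html): string
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid; collect libxml warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $body = $doc->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return '';
    }

    // Collapse the runs of whitespace left behind by the removed markup.
    return trim(preg_replace('/\s+/', ' ', collectText($body)));
}

$html = "<html><head><title>T</title><style>p{color:red}</style></head>\n"
      . "<body>\n<h1>Asia</h1>\n<script>var x = 1;</script>\n"
      . "<p>Largest <b>continent</b>.</p>\n</body></html>";
echo htmlToText($html), "\n";
```

Because it starts at the body element, the head section never reaches the output. A fuller version would probably insert line breaks after block-level elements so that headings and paragraphs do not run together.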
EDIT: Forgot about strip_tags! I updated my answer to mention that as well. I left my DOMDocument approach in the answer though, because as the documentation for strip_tags states, it does no validation of the markup it's processing, whereas DOMDocument attempts to parse it (and a well-implemented DOMDocument-based text extraction can potentially be more robust).
Use file_get_contents to get the HTML content and then strip_tags to remove the HTML tags, thus leaving only the text.
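A minimal sketch of that, using a literal string in place of the downloaded page so that it runs without network access. In real use $html would come from file_get_contents($url).

```php
<?php
// Sample markup standing in for the result of file_get_contents($url).
$html = '<html><head><title>Asia</title><script>var x = 1;</script></head>'
      . '<body><p>Asia is a <b>continent</b>.</p></body></html>';

// strip_tags() removes the tags but keeps everything between them.
$text = strip_tags($html);
echo $text, "\n"; // prints: Asiavar x = 1;Asia is a continent.
```

Note that the title and the script body survive in the output, so some cleanup of the head section is usually still needed.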