This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
How to parse and process HTML with PHP?
I'm trying to scrape a page with PHP using file_get_contents().
This page has some JSON wrapped in a bit of HTML. I'd like to strip out this HTML to be able to use json_decode() on the scraped string so I can deal with the JSON separately.
Is there any clean way to do that? A quick search didn't really lead to anything.
Thanks
Parsing or stripping HTML content is always tricky, because the common regex-based solutions can break when the HTML markup is malformed, and they are painfully slow as well. I would suggest using this little HTML DOM parser class:
http://simplehtmldom.sourceforge.net/
edited & added from subcomment:
Okay, this is a bad one, because the inline JavaScript is not properly wrapped in CDATA tags. Otherwise something like this might work:
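require_once 'simple_html_dom.php';
$html = new simple_html_dom();
$html->load_file('your-external-file');
foreach ($html->find('script') as $obj) {
    // strpos() returns 0 for a match at position 0, so compare against false explicitly
    if (isset($obj->innertext) && strpos($obj->innertext, 'window._jscalls') !== false) {
        echo $obj->innertext;
    }
}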
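To get from the script text to something json_decode() accepts, one option is PHP's built-in DOMDocument. The following is only a sketch: the function name extractJsonFromScript, the sample page, and the assumption that the JSON is the outermost {...} object inside the script containing window._jscalls are all illustrative, since the real page is not shown.

```php
<?php
// Sketch: find the <script> containing $marker and decode the outermost {...} in it.
// Assumes the script holds exactly one JSON object literal; adjust for the real page.
function extractJsonFromScript(string $html, string $marker): ?array
{
    $doc = new DOMDocument();
    // Real-world HTML is often malformed; collect libxml warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($doc->getElementsByTagName('script') as $script) {
        $code = $script->textContent;
        if (strpos($code, $marker) === false) {
            continue;
        }
        $start = strpos($code, '{');
        $end = strrpos($code, '}');
        if ($start === false || $end === false || $end < $start) {
            continue;
        }
        $data = json_decode(substr($code, $start, $end - $start + 1), true);
        if (is_array($data)) {
            return $data;
        }
    }
    return null;
}

// Hypothetical page content standing in for the result of file_get_contents().
$page = '<html><body><script>window._jscalls = {"a": [1, 2], "b": "x"};</script></body></html>';
$data = extractJsonFromScript($page, 'window._jscalls');
```

If the script contains more than one object, or the JSON is not valid on its own, this returns null and a more specific extraction would be needed.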
This question already has answers here:
How do you parse and process HTML/XML in PHP?
(31 answers)
Closed 4 years ago.
I have an XML file that's read using PHP's file_get_contents so other changes can be done to it.
I need to find and remove the nodes <BATCHALLOCATIONS.LIST>...</BATCHALLOCATIONS.LIST> (not just those two lines, but everything between them as well) throughout the entire file.
Since the file is already loaded using file_get_contents I'd like to do this without having to load the file again using simpleXML, or an XML parser or any other method (like DOM).
The node does not have a specific parent and appears randomly.
The XML file is exported from a business accounting application.
Any idea on how to achieve this? Maybe using a Regular Expression to do a search and replace or something like that?
I've been trying to do this using a regular expression and preg_replace, but just can't get things to work.
Here's just a portion of the file. The original runs to 10K+ lines.
This should have worked but doesn't
preg_replace('/^\<BATCHALLOCATIONS.LIST\>(.*?)\<\BATCHALLOCATIONS.LIST\>$/ism','', $newXML);
I'm trying to do this without using any HTML/XML parser.
There's probably a better way to do it, but this will work:
// get your file as a string
$newXML = file_get_contents($file);
$open = '<BATCHALLOCATIONS.LIST>';
$close = '</BATCHALLOCATIONS.LIST>';
// loop so that every occurrence is removed, not just the first one
while (($posStart = stripos($newXML, $open)) !== false
    && ($posEnd = stripos($newXML, $close, $posStart)) !== false) {
    $newXML = substr($newXML, 0, $posStart) . substr($newXML, $posEnd + strlen($close));
}
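The regular expression from the question can also be made to work. It fails for two reasons: with the m flag, ^ and $ require the opening tag to start a line and the closing tag to end one (so indented tags never match), and \<\BATCH... is read as \B (a non-word-boundary assertion) followed by ATCH..., with no / for the closing tag. A minimal sketch of a corrected pattern follows; the sample XML is made up, since the real export is not shown.

```php
<?php
// Hypothetical sample; the real export is not shown in the question.
$xml = "<VOUCHER>\n  <BATCHALLOCATIONS.LIST>\n    <NAME>A</NAME>\n  </BATCHALLOCATIONS.LIST>\n"
     . "  <AMOUNT>10</AMOUNT>\n  <BATCHALLOCATIONS.LIST>B</BATCHALLOCATIONS.LIST>\n</VOUCHER>";

// '#' as delimiter avoids escaping '/'; 's' lets '.' match newlines; 'i' ignores case.
// The dots in the tag name are escaped so they match a literal '.'.
$pattern = '#<BATCHALLOCATIONS\.LIST>.*?</BATCHALLOCATIONS\.LIST>#is';
$newXML = preg_replace($pattern, '', $xml);

// preg_replace() returns null on failure, e.g. if the backtrack limit is hit on huge input.
if ($newXML === null) {
    throw new RuntimeException('preg_replace failed with error code ' . preg_last_error());
}
```

This leaves the surrounding whitespace in place and assumes the nodes are never nested inside one another; for nested nodes the lazy .*? would stop at the first closing tag.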
This question already has answers here:
Notice: Trying to get property of non-object error
(3 answers)
Closed 7 years ago.
I've been getting an error message for the following piece of code (I'm trying to get the content inside the 'article' tags on a certain web page):
function getTextFromLink($url) {
$html = new DOMDocument();
$html->loadHTML($url);
$text = $html->getElementsByTagName('article')->item(0)->textContent;
return $text;
}
It says that I'm trying to get the property of a non-object on the line with
$text = $html->getElementsByTagName('article')->item(0)->textContent;
I'm fairly new to php and DOM; what am I missing here?
You have two problems in your code:
The obvious problem is that $html->getElementsByTagName('article')->item(0) is not an object. Specifically, it is null, since the HTML you're parsing doesn't actually contain any article elements. You could've figured this out yourself by following Devon's advice and viewing the value of $html->getElementsByTagName('article')->item(0) using var_dump().
Now, why doesn't your HTML contain any article elements? Well, the real problem turns out to be that the loadHTML() method will load HTML from a string and parse it. That is to say, when you call $html->loadHTML($url);, PHP will parse the contents of the string variable $url as HTML code, and give you a DOMDocument representing the result. Given that you named the variable $url, I'm pretty sure that's not what you want.
What you actually want to use instead is probably loadHTMLFile(), which actually loads HTML code from a named file (or, apparently, URL), rather than from a PHP string.
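Putting both points together, a corrected version of the function might look like the sketch below. The null return for "could not load" and "no article element" is a design choice made here, not part of the original code, and the libxml error handling is added on the assumption that the page uses HTML5 elements that libxml may warn about.

```php
<?php
// Sketch: load HTML from a file path or URL and return the text of the first <article>,
// or null if the document cannot be loaded or contains no <article> element.
function getTextFromLink(string $url): ?string
{
    $doc = new DOMDocument();
    // Collect parser warnings (malformed markup, unknown tags) instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $loaded = $doc->loadHTMLFile($url);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    if (!$loaded) {
        return null;
    }

    // item(0) is null when there is no match, so check before reading a property.
    $article = $doc->getElementsByTagName('article')->item(0);
    return $article === null ? null : $article->textContent;
}
```

Loading from a URL this way relies on PHP's URL stream wrappers being enabled (allow_url_fopen); a local path works regardless.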
This question already has answers here:
how to use dom php parser
(4 answers)
Closed 9 years ago.
<?php
$html = file_get_contents('http://xpool.xram.co/index.cgi');
echo $html;
?>
I want to get the information inside one tag on a remote web site using PHP, and only that tag.
I found this small snippet, which is great for retrieving the entire page source. However, I want to get only a small section. How can I filter out all the other tags and get only the one tag I need?
I'd suggest using a PHP DOM parser. (http://simplehtmldom.sourceforge.net/manual.htm)
require_once 'simple_html_dom.php';
// file_get_html() returns a parsed DOM object; file_get_contents() would only return a string
$html = file_get_html('http://xpool.xram.co/index.cgi');
$p = $html->find('p'); // Find all p tags.
$specific_class = $html->find('.classname'); // Find elements with classname as class.
$element_id = $html->find('#element'); // Find element with the id element
Read the docs, there are tons of other options available.
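If installing a third-party library is not an option, roughly the same lookups can be done with PHP's built-in DOMDocument and DOMXPath. This is a sketch under assumptions: the helper name extractText and the sample markup are made up for illustration, and in practice $html would be the string returned by file_get_contents().

```php
<?php
// Sketch: return the trimmed text of every element matching an XPath expression.
function extractText(string $html, string $xpath): array
{
    $doc = new DOMDocument();
    // Collect parser warnings for malformed markup instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $results = [];
    foreach ((new DOMXPath($doc))->query($xpath) as $node) {
        $results[] = trim($node->textContent);
    }
    return $results;
}

// Hypothetical markup standing in for the downloaded page.
$sample = '<html><body><p class="intro">Hello</p><div id="element">World</div><p>Bye</p></body></html>';
$paragraphs = extractText($sample, '//p');                  // all p tags
$byId       = extractText($sample, '//*[@id="element"]');   // element with id "element"
```

XPath has no direct equivalent of the CSS class selector; //*[@class="classname"] only matches when the attribute is exactly that one class.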
This question already has answers here:
How do you parse and process HTML/XML in PHP?
(31 answers)
Closed 9 years ago.
I have a PHP regex to find the link tag and extract the CSS address from an HTML page:
'/<link.*?href\s*=\s*["\']([^"\']+)[^>]*>.*?\/>/i'
But it doesn't work well. Can you help me modify this code?
Perhaps
'/<link .*?(href=[\'|"](.*)?[\'|"]|\/?\>)/i'
Then you can access the link with $2.
Not that this is better than the other answer, however just in case you want to see it, I've altered your regex such that it should work as intended:
'/<link.*?href\s*=\s*["\']([^"\']+?)[\'"]/i'
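For reference, this is how that pattern could be applied with preg_match_all(). The sample markup is invented for the example. Note that the pattern does not check rel="stylesheet", so it returns the href of every link tag, and because . does not match newlines here, a link tag whose href is on a later line will be missed.

```php
<?php
// Hypothetical markup; in practice this would be the downloaded page source.
$page = '<head><link rel="stylesheet" href="a.css" />' . "\n"
      . "<link href='b.css' rel=\"stylesheet\"></head>";

$pattern = '/<link.*?href\s*=\s*["\']([^"\']+?)[\'"]/i';

// $matches[1] holds the contents of the first capture group for every match.
preg_match_all($pattern, $page, $matches);
$hrefs = $matches[1];
```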
Finding the hrefs of all stylesheets with a regex can be tricky. You should consider using a PHP HTML parser to get this information.
You can read this article to get more information and then try this code.
// $html is a simple_html_dom object, for example the result of file_get_html()
// Retrieve all links and print their HREFs
foreach ($html->find('link') as $e) {
    echo $e->href . '<br>';
}
// Retrieve all script tags and print their SRCs
foreach ($html->find('script') as $e) {
    echo $e->src . '<br>';
}
PS: Remember that a script tag may not have a src attribute, in which case this prints an empty string.
This question already has answers here:
How do you parse and process HTML/XML in PHP?
(31 answers)
Closed 9 years ago.
I have to create a web service that requests a specific URL, which returns an XML file as its response, and then parses this file in order to save its contents to a MySQL database.
I've heard about SimpleXML, but I'm not sure how to get the website's response into a file, whose path is needed in order to parse the document.
Can somebody at least explain to me how to download the XML and save it to a file? (Ideally with some PHP code.)
I will then (hopefully) find out by myself how to parse it and store its contents.
Here's an example of what my XML will look like (for privacy reasons I can't publish the real URL I'm using...)
Here's a couple of pointers..
To download a file and save it, the easiest way I have found is this:
<?php
file_put_contents('saved.xml', file_get_contents('http://www.xmlfiles.com/examples/simple.xml'));
You can then open the file with the simpleXML library like so:
$xml = simplexml_load_file('saved.xml');
var_dump($xml);
Hope that gives you enough info to get started.
See simpleXML for info on the simpleXML library.
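Note that SimpleXML does not strictly need a file path: simplexml_load_string() parses a string directly, so the save step is optional. The sketch below adds basic error checking around both steps; the helper names parseXml and downloadTo and the sample XML are made up for illustration.

```php
<?php
// Sketch: parse an XML string; returns null if the string is not well-formed XML.
function parseXml(string $xmlString): ?SimpleXMLElement
{
    // Collect parser errors instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    $xml = simplexml_load_string($xmlString);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    return $xml === false ? null : $xml;
}

// Sketch: download a URL to a local file; returns false if either step fails.
function downloadTo(string $url, string $path): bool
{
    $body = @file_get_contents($url);
    return $body !== false && file_put_contents($path, $body) !== false;
}

// Hypothetical document standing in for the web service response.
$xml = parseXml('<note><to>Tove</to><from>Jani</from></note>');
$recipient = (string) $xml->to;
```

Casting to string is needed because $xml->to is a SimpleXMLElement, not plain text.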
You can download and save the xml to a local file by doing this:
$xmlstring = file_get_contents("http://domain.com/webservice/xmlfile.xml");
file_put_contents("path/localxmlfile.xml", $xmlstring);
To parse the XML file, I suggest using the DOMDocument class in combination with the DOMXPath class to query for specific elements.
DOMDocument: http://php.net/manual/de/class.domdocument.php
DOMXPath: http://php.net/manual/de/class.domxpath.php
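A minimal sketch of that combination is shown below. The element names (orders, order, total) and the id attribute are invented for the example, because the real XML is not shown; for a saved file, $doc->load('path/localxmlfile.xml') would replace loadXML().

```php
<?php
// Hypothetical XML standing in for the downloaded document.
$xmlString = '<orders>'
           . '<order id="1"><total>10.50</total></order>'
           . '<order id="2"><total>3.00</total></order>'
           . '</orders>';

$doc = new DOMDocument();
$doc->loadXML($xmlString);
$xpath = new DOMXPath($doc);

// Collect each order's total, keyed by its id attribute.
$totals = [];
foreach ($xpath->query('/orders/order') as $order) {
    // evaluate() with string() returns the text of the <total> child relative to $order.
    $totals[$order->getAttribute('id')] = (float) $xpath->evaluate('string(total)', $order);
}
```

Each collected value could then be bound to a prepared INSERT statement for the MySQL step.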
Hopefully you can find your answer at the link below. It seems to be a related topic.
How do you parse and process HTML/XML in PHP?