Search through multiple hosted XML files for a string - php

I'm working on a website that uses a lot of XML-files as data (150 in total and probably growing). Each page is an XML-file.
What I'm looking for is a way to look for a string through the XML-files. I'm not sure what programming language to use for this XML search engine.
I'm familiar with PHP, JavaScript, JQuery. So I'd prefer using those languages.
Thanks a bunch!
UPDATE: I'm looking for a solution that works quickly.
Ideally, the function returns the tagname that contains the searchstring.
If, for instance, the XML is as follows:
<article-1>This is a great story.</article-1>
If one would search for 'story', it would return 'article-1'.
I'm not quite sure on how to do this with a regular expression.

PHP can do this. Here's an example:
foreach(glob("{foldera/*.xml,folderb/*.xml}",GLOB_BRACE) as $filename) {
$xml = simplexml_load_file($filename);
//use regular expressions to find your string
}
You simply iterate through each file on your server using glob() with a foreach loop.

Sounds like a problem that could be solved with grep and regular expressions. Without knowing what string you're looking for it's not possible to say exactly what you should do, but reading some documentation on grep should get you started down the right path.

Related

Regex PHP Function

Yes, I know that people don't like parsing PHP, use a tokenizer they said, it will be great they said... I want you to know it isn't great, it isn't even fine.
I am working in .NET and using PCRE-NET and want to parse some PHP Functions to see if I can do some PHP tree shaking.
I tried using CodeParser which uses Antlr4 to tokenize, the results I got back were horrible to navigate. Yes it is all there technically, but it is so convoluted that really, Regex is better for what I am looking for.
I have the following regex working:
(?<functionScope>\w+)\s*function\s+(?<functionName>\w+)\s*\((?<functionArguments>(?:[^()]+)*)?\s*\)[\s:]*.*(?<functionBody>{(?:[^{}]+|(?-1))*+})
Try it out: https://regex101.com/r/yU6K45/1
This will break up a PHP File into the individual scopes, functions, arguments and function body. I am now looking at the functionBody and wanting to find all functions used inside that function, which I have here:
(?=[^\=\s])((?<functionClass>[$?\w[\w\d]*)?(?<ClassOperator>::|->|\\)?){0,3}?(?<functionName>\w[\w\d]*)\((?<Arguments>.*)?\)
See it at: https://regex101.com/r/3JzPR5/1
An issue I am having is with named groups. When there is a lot of namespacing, the named groups don't work out well. I am wondering if you have any ideas how to split up the line:
$uri = ExtraLevel\Psr7\UriResolver::resolve(Psr7\Utils::uriFor($config['base_uri']), $uri);
To where I would have something like:
Full match ExtraLevel\Psr7\UriResolver::resolve(Psr7\Utils::uriFor($config['base_uri']), $uri)
Group `functionClass` ExtraLevel\
Group `functionClass2` Psr7\
Group `functionClass3` UriResolver::
Group `functionName` resolve
Group `Arguments` Psr7\Utils::uriFor($config['base_uri']), $uri
Would love to match in a way that won't break when there aren't 3-4 levels.

Changing/deleting html from file_get_contents

I'm currently using this code:
$blog= file_get_contents("http://powback.tumblr.com/post/" . $post);
echo $blog;
And it works. But tumblr has added a script that activates each time you enter a password-field. So my question is:
Can i remove certain parts with file_get_contents? Or just remove everything above the <html> tag? could i possibly kill a whole div so it wont load at all? And if so; how?
edit:
I managed to do it the simple way. By skipping 766 characters. The script now work as intended!
$blog= file_get_contents("powback.tumblr.com/post/"; . $post, NULL, NULL, 766);
After file_get_contents returns, you have in your hands a string. You can do anything you want to it, including cutting out parts of it.
There are two ways to actually do the cutting:
Using string functions like str_replace, preg_replace and others; the exact recipe depends on what you need to do. This approach is kind of frowned upon because you are working at the wrong level of abstraction, but in some cases it has an unmatched performance to time spent ratio.
Parsing the HTML into a DOM tree, modifying it appropriately (this time working at the appropriate level of abstraction) and then turn it back into a string and echo it. This can be more convenient to work with if your requirements are not dead simple and is easier to maintain, but it typically requires more code to be written.
If you want to do something that's most naturally expressed in HTML document terms ("cutting out this <div>") then don't be tempted and go with the second approach.
At that point, $blog is just a string, so you can use normal PHP functions to alter it. Look into these 2:
http://php.net/manual/en/function.str-replace.php
http://us2.php.net/manual/en/function.preg-replace.php
You can parse your output using simple html dom parser and display olythe contents thatyou really want to display

Dealing with XML in PHP

I'm currently working a project that has me working with XML a lot. I have to take an XML response and decrypt each text node and then do various tasks with the data. The problem I'm having is taking the response and processing each text node. Originally I was using the XMLToArray library, and that worked fine I would change the XML into an array and then loop through the array and decrypt the values. However some of the XML response I'm dealing with have repeated tags and the XMLToArray library will only return the last values.
Is there a good way that I can take an XML response and process all the text nodes and easily putting the values into an array that has a similar structure to the response?
Thanks in advance.
I would use SimpleXML.
Here's a small example of using it. It loads and parses XML from http://www.w3schools.com/xml/plant_catalog.xml and then outputs values of "COMMON" and "PRICE" tags of each "PLANT" tag.
$xml = simplexml_load_file('http://www.w3schools.com/xml/plant_catalog.xml');
foreach ( $xml->PLANT as $plantNode ) {
echo $plantNode->COMMON, ' - ', $plantNode->PRICE, "\n";
}
If you have any problems with adapting it to your needs, just give an example of your XML so that we can help with it.
All those XML to array libraries are a remain of the times where PHP 4 would force you to write your own XML parser almost from scratch. In recent PHP versions you have a good set of XML libraries that do the hard job. I particularly recommend SimpleXML (for small files) and XMLReader (for large files). If you still find them complicate, you can try phpQuery.
You might want to give SimpleXML a try. Plus it comes by default in php so you dont need to install
Check out SimpleXML, it may offer a bit more for what you are looking for.

What is the fastest way to convert html table to php array?

are there build in functions in latest versions of php specially designed to aid in this task ?
Use a DOM parser like SimpleXML to split the HTML code into nodes, and walk through the nodes to build the array.
For broken/invalid HTML, SimpleHTMLDOM is more lenient (but it's not built in).
String replace and explode would work if the HTML code is clean and always the same, as soon as you have new attributes it will brake.
So only dependable solution would be using regular expressions or XML/HTML parser.
Check http://php.net/manual/en/book.dom.php
An alternative to using a native DOM parser could be using YQL. This way you dont have to do the actual parsing yourself. The YQL Web Service enables applications to query, filter, and combine data from different sources across the Internet.
For instance, to grab the HTML table with the class example given at
http://www.w3schools.com/html/html_tables.asp
you can do
$yql = 'http://tinyurl.com/yql-table-grab';
$yql = json_decode(file_get_contents($yql));
print_r( $yql->query->results );
I've deliberated shortened the URL so it does not mess up the answer. $yql actually links to the YQL API, adds some options and contains the query:
select * from html
where xpath="//table[#class='example']"
and url="http://www.w3schools.com/html/html_tables.asp"
YQL can return JSON and XML. I've made it return JSON and decoded this then, which then results in a nested structure of stdClass objects and Arrays (so it's not all arrays). You have to see if that fits your needs.
You try out the interactive YQL console to see how it works.
i dont know if this is the faster , but you can check this class (using preg_replace)
http://wonshik.com/snippet/Convert-HTML-Table-into-a-PHP-Array
If you want to convert the html-description of a table, here's how I would do it:
remove all closing tags (</...>) ( http://php.net/manual/de/function.str-replace.php)
split string at opening tags (<...>) using a regular expression ( http://php.net/manual/en/function.split.php)
You have to work out the details on your own, since I do not know if you want to handle different lines as subarrays or you want to merge all lines into one big array or something else.
you could use the explode-function to turn the table cols and rows into arrays.
see: php explode

Using PHP PCRE to fetch div content

I'm trying to fetch data from a div (based on his id), using PHP's PCRE. The goal is to fetch div's contents based on his id, and using recursivity / depth to get everything inside it. The main problem here is to get other divs inside the "main div", because regex would stop once it gets the next </div> it finds after the initial <div id="test">.
I've tryed so many different approaches to the subject, and none of it worked. The best solution, in my oppinion, is to use the R parameter (Recursion), but never got it to work properly.
Any Ideais?
Thanks in advance :D
You'd be much better off using some form of DOM parser - regex really isn't suited to this problem. If all you want is basic HTML dom parsing, something like simplehtmldom would be right up your alley. It's trivial to install (just include a single PHP file) and trivial to use (2-3 lines will do what you need).
include('simple-html-dom.php');
$dom = str_get_html($bunchofhtmlcode);
$testdiv = $dom->find('div#test',0); // 0 for the first occurrence
$testdiv_contents = $testdiv->innertext;

Categories