How to get words from HTML in an array using PHP?

<?php
$html = file_get_contents('http://hypermedia.ids-mannheim.de/');
?>
This code returns the HTML of the website as a string. How do I split the string into individual words? Once I have the words in an array, I would like to detect which ones are German...

$words = explode(' ', strip_tags($html));
or
$words = preg_split("/[\s,]+/", strip_tags($html));
The second one treats not just the space character as a delimiter, but any whitespace (tabs and newlines included) and commas as well.
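To make the difference between the two calls concrete, here is a minimal sketch. The sample markup is made up for illustration (it stands in for the downloaded page), and PREG_SPLIT_NO_EMPTY is an addition of mine to drop empty tokens.

```php
<?php
// Hypothetical sample markup standing in for the downloaded page.
$html = "<p>Hallo Welt,\tdies ist  ein</p>\n<p>Test</p>";
$text = strip_tags($html);

// Splits on single spaces only: tabs, newlines and commas stay glued
// to the words, and a double space produces an empty element.
$bySpace = explode(' ', $text);

// Splits on any run of whitespace or commas and drops empty tokens.
$byRegex = preg_split('/[\s,]+/', $text, -1, PREG_SPLIT_NO_EMPTY);

print_r($bySpace);
print_r($byRegex);
```

Note that strip_tags() inserts nothing where a tag used to be, so words separated only by tags (no whitespace between them) end up joined.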

Work with a regex, something like this:
#([\w]+)#i
A code example:
if (preg_match_all('#([\w]+)\b#i', $text, $matches)) {
    foreach ($matches[1] as $key => $word) {
        echo $word."\n";
    }
}
Then you have to compare each with some kind of dictionary.
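One caveat for German text: without the u modifier, \w typically matches only ASCII letters, digits and underscore, so words with umlauts or ß get cut apart. A possible variant, assuming the text is UTF-8, matches runs of Unicode letters instead. The sample sentence is invented for illustration.

```php
<?php
// Hypothetical UTF-8 sample text.
$text = "Schöne Grüße aus Mannheim, über 100 Wörter!";

// \p{L}+ matches runs of Unicode letters; the u modifier makes PCRE
// treat both the pattern and the subject as UTF-8.
preg_match_all('/\p{L}+/u', $text, $matches);
$words = $matches[0];

print_r($words);
```

Digits are deliberately excluded here, so "100" is not reported as a word.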

I think you need to separate your problem into steps.
First, parse the returned HTML string to find out which parts are tags and structure. You can use DOM for this purpose.
Then you can separate the text content from the tags and split that text into tokens to obtain an array. I don't know the best way, but a simple regex split can do the job.
The interesting part, finding German words, could be done by matching your word list against a dictionary, again using arrays or maps, or, better, using a DB (SQLite may be a better fit than a full RDBMS like MySQL).
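The steps above can be sketched roughly as follows. Everything specific here is an assumption for illustration: the sample markup, the four-word "dictionary", and the choice to skip script and style content. A real dictionary would come from a word list file or a database.

```php
<?php
// Hypothetical sample page; in practice this comes from file_get_contents().
$html = '<html><body><h1>Guten Tag</h1><p>Hello world und willkommen</p>'
      . '<script>var x = 1;</script></body></html>';

// Step 1: parse the markup with DOM (suppress warnings about sloppy HTML).
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

// Step 2: collect text nodes, skipping script and style content, and join
// them with spaces so words from adjacent elements do not run together.
$xpath = new DOMXPath($dom);
$chunks = [];
foreach ($xpath->query('//text()[not(ancestor::script) and not(ancestor::style)]') as $node) {
    $chunks[] = $node->nodeValue;
}
preg_match_all('/\p{L}+/u', implode(' ', $chunks), $m);
$words = $m[0];

// Step 3: look each word up in a dictionary. array_flip() turns the list
// into a map so the lookup is a cheap isset(). Tiny made-up word list.
$germanDictionary = array_flip(['guten', 'tag', 'und', 'willkommen']);
$german = array_values(array_filter($words, function ($word) use ($germanDictionary) {
    // strtolower() is enough for this ASCII sample; mb_strtolower() would be
    // the safer choice for words with umlauts if mbstring is available.
    return isset($germanDictionary[strtolower($word)]);
}));

print_r($german);
```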

Related

How to find certain text within a php variable and then replace all text between characters either side

I have a variable within PHP coming from a form that contains email addresses all separated by a comma (,)
For example:
user#domain1.com,user#domain2.com,user3#domain2.com,user2#domain4.com
What I am trying to achieve is to look at the variable, find for example #domain2.com, and remove everything between the commas on either side of that email.
I know I can use str_replace to replace just the text I'm after, like so:
$emails=str_replace("#domain2.com", "", "$emailscomma");
However, I'm looking to remove the entire email address based on the text I'm asking it to find.
So in this example I want to remove user#domain2.com and user3#domain2.com
Is this possible?
I searched every article I could find, but found nothing that locates a piece of text and then replaces more than just the text it found.

You can of course use regular expressions, but I would suggest an easier way. Operating on arrays is much easier than on strings and substrings. I would convert your string to an array and then filter it.
$emails = "user#domain1.com,user#domain2.com,user3#domain2.com,user2#domain4.com";
// Convert to array (by comma separator)
$emailsArray = explode(',', $emails);
$filteredArray = array_filter($emailsArray, function($email) {
    // filters out all emails with '#domain2.com' substring
    return strpos($email, '#domain2.com') === false;
});
print_r($filteredArray);
Now you can convert the filtered array back to a string. Just use the implode() function.
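Put together, the round trip from string to array and back might look like the sketch below. The helper name removeByDomain is hypothetical, and it compares the end of each address instead of searching for a substring, which is a small change from the strpos() check: it avoids dropping an address that merely contains the text somewhere in the middle.

```php
<?php
// Hypothetical helper: drops every address that ends with $domain.
function removeByDomain(string $emails, string $domain): string
{
    $kept = array_filter(explode(',', $emails), function ($email) use ($domain) {
        // Keep the address unless its tail equals the unwanted domain.
        return substr(trim($email), -strlen($domain)) !== $domain;
    });

    // implode() ignores the gaps array_filter() leaves in the keys.
    return implode(',', $kept);
}

$emails = "user#domain1.com,user#domain2.com,user3#domain2.com,user2#domain4.com";
echo removeByDomain($emails, '#domain2.com'), "\n";
```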

Replace every element of an array in a string with preg_replace() only once, without replacing text which already got replaced

I need to have a function which converts specific words in text into links, so I use preg_replace(). However, I don't know how to skip words that have already been converted into links.
Here's the code:
function highlightWords($content)
{
    $arr = array("php", "and sql", "sql");
    foreach ($arr as $key => $value)
    {
        $content = preg_replace("/\b(".preg_quote($value).")\b/i", '<a href="#">\1</a>', $content, 1);
    }
    return $content;
}
echo highlightWords("This text will highlight PHP and SQL and sql but not PHPRO or MySQL or sqlite");
The function should make 3 separate links total - "php", "and sql", "sql". Unfortunately, the result looks like this:
This text will highlight PHP and <a href="#">SQL</a> and sql but not PHPRO or MySQL or sqlite
How do I "tell" the function not to process words that are between "a" tags?
P.S. The words in the array MUST stay in arbitrary order, so please do not suggest re-ordering the array. I need to modify the preg_replace call.

Use lookarounds to ensure that you don't replace something which is already in an a tag, e.g.
/(?!<a href=\"#\".*)\b(".preg_quote($value).")\b(?!.*<\/a>)/i
(?!<a href=\"#\".*) simply checks that there can't be an opening a tag in front of your search word
(?!.*<\/a>) makes sure that there isn't a closing a tag after your search term
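A different technique that may be more robust here, offered as a sketch and not as the answerer's method: let the pattern first consume any complete existing link and discard it with the PCRE verbs (*SKIP)(*F), so only text outside links can match the word. It assumes the generated links are never nested and always look like <a ...>...</a>.

```php
<?php
function highlightWords(string $content): string
{
    $words = array("php", "and sql", "sql");
    foreach ($words as $word) {
        // First alternative: match an existing link, then force a failure and
        // skip past it. Second alternative: the word itself, outside links.
        $pattern = '~<a\b[^>]*>.*?</a>(*SKIP)(*F)|\b' . preg_quote($word, '~') . '\b~is';

        // $0 is the matched word with its original casing; limit 1 keeps the
        // "replace only once per word" behaviour of the question.
        $content = preg_replace($pattern, '<a href="#">$0</a>', $content, 1);
    }
    return $content;
}

echo highlightWords("This text will highlight PHP and SQL and sql but not PHPRO or MySQL or sqlite"), "\n";
```

Because already linked text is skipped on every pass, the order of the words in the array does not need to change.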

php regular expression breaks

I have the following string in an HTML page.
BookSelector.load([{"index":25,"label":"Science","booktype":"pdf","payload":"<script type=\"text\/javascript\" charset=\"utf-8\" src=\"\/\/www.192.168.10.85\/libs\/js\/books.min.js\" publisher_id=\"890\"><\/script>"}]);
I want to find the src and the publisher_id in the string.
For this I'm trying the following code:
$regex = '#\BookSelector.load\(.*?src=\"(.*?)\"}]\)#s';
preg_match($regex, $html, $matches);
$match = $matches[1];
But it always returns null.
What would my regex be to select only the src?
What would my regex be if I need to parse the whole string passed to BookSelector.load()?

Why your regex isn't working?
First, I'll answer why your regex isn't working:
You're using \B in your regex. PCRE reads \Book as the assertion \B (any position that is not a word boundary, the opposite of \b) followed by the literal ook, which is not what you want. The B should be a plain literal.
Your original text contains escaped quotes (\" in the source), but your regex only looks for a bare " right after src=, so it never matches.
The correct approach to solve this problem
Split this task into several parts, and solve it one by one, using the best tool available.
The data you need is encapsulated within a JSON structure. So the first step is obviously to extract the JSON content. For this purpose, you can use a regex.
Once you have the JSON content, you need to decode it to get the data in it. PHP has a built-in function for that purpose: json_decode(). Use it with the input string and set the second parameter as true, and you'll have a nice associative array.
Once you have the associative array, you can easily get the payload string, which contains the <script> tag contents.
If you're absolutely sure that the order of attributes will always be the same, you can use a regex to extract the required information. If not, it's better to use an HTML parser such as PHP's DOMDocument to do this.
The whole code for this looks like:
// Extract the JSON string from the whole block of text
if (preg_match('/BookSelector\.load\((.*?)\);/s', $text, $matches)) {
    // Get the JSON string and decode it using json_decode()
    $json = $matches[1];
    $content = json_decode($json, true)[0]['payload'];
    // Use DOMDocument to load the string, and get the required values
    $dom = new DOMDocument;
    $dom->loadHTML($content);
    $script_tag = $dom->getElementsByTagName('script')->item(0);
    $script_src = $script_tag->getAttribute('src');
    $publisher_id = $script_tag->getAttribute('publisher_id');
    var_dump($script_src, $publisher_id);
}
Output:
string(40) "//www.192.168.10.85/libs/js/books.min.js"
string(3) "890"

Regex replace matched subexpression (and nothing else)?

I've used regex for ages but somehow I managed to never run into something like this.
I'm looking to do some bulk search/replace operations within a file where I need to replace some data within tag-like elements. For example, converting <DelayEvent>13A</DelayEvent> to just <DelayEvent>X</DelayEvent> where X might be different for each.
The current way I'm doing this is such:
$new_data = preg_replace('|<DelayEvent>(\w+)</DelayEvent>|', '<DelayEvent>X</DelayEvent>', $data);
I can shorten this a bit to:
$new_data = preg_replace('|(<DelayEvent>)(\w+)(</DelayEvent>)|', '${1}X${3}', $data);
But really all I want to do is simulate a "replace text between tags T with X".
Is there a way to do such a thing? In essence I'm trying to prevent having to match all the surrounding data and reassembling it later. I just want to replace a given matched sub-expression with something else.
Edit: The data is not XML, although it does have what appear to be tag-like elements. I know better than to parse HTML and XML with regex. ;)

It is possible using lookarounds:
$new_data = preg_replace('|(?<=<DelayEvent>)\w+(?=</DelayEvent>)|', 'X', $data);
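A small runnable sketch of the lookaround approach, using made-up sample data. It also shows two variations that are my additions: \K, which discards everything matched so far from the reported match, and preg_replace_callback() for the case where X differs per element (lowercasing is just a stand-in for the real computation).

```php
<?php
// Hypothetical sample data with two tag-like elements.
$data = "<DelayEvent>13A</DelayEvent>\n<DelayEvent>7B</DelayEvent>";

// Lookbehind and lookahead assert the surrounding tags without matching
// them, so only the inner text is replaced.
$same = preg_replace('|(?<=<DelayEvent>)\w+(?=</DelayEvent>)|', 'X', $data);

// \K resets the start of the match, so the opening tag is required
// but not replaced.
$viaK = preg_replace('|<DelayEvent>\K\w+(?=</DelayEvent>)|', 'X', $data);

// A callback computes a different replacement for each match.
$perMatch = preg_replace_callback(
    '|(?<=<DelayEvent>)\w+(?=</DelayEvent>)|',
    function ($m) {
        return strtolower($m[0]);
    },
    $data
);

echo $same, "\n", $viaK, "\n", $perMatch, "\n";
```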

xsl to php array

I have an XML file that contains hierarchical data. Now I need to get some of that data into a PHP array. I use XSL to get the data I want formatted as a PHP array. But when I print it, it keeps all the tabs, extra spaces, line breaks, etc., which I need to get rid of to turn it into a flat string (I suppose!) and then convert that string into an array.
In the XSL I output as text and have indent="no" (which does nothing). I've tried to strip \t, \n, \r, etc., but it doesn't affect the output at all.
Is there a really good PHP function out there that can strip out all formatting except single spaces? Or is there more going on here that I don't know about, or another way of doing the same thing?

First off, using XSL output to form your PHP array is fairly inelegant and inefficient. I would highly suggest going with something like the DOMDocument class available in PHP (http://www.php.net/manual/en/class.domdocument.php). If you must stick with your current method, try using regular expressions to remove any unnecessary whitespace.
$string = preg_replace('/\s+/', '', $string);
or
$string = preg_replace('/\s\s+/', ' ', $string);
to preserve single white space.
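One detail worth checking: /\s\s+/ only collapses runs of two or more whitespace characters, so a lone tab or newline between words survives. A possible variation that flattens everything to single spaces is sketched below with made-up input.

```php
<?php
// Hypothetical input with mixed tabs, newlines and repeated spaces.
$string = "  first\t\tsecond\n\n  third \r\n fourth  ";

// Every run of whitespace (even a single tab or newline) becomes one
// space; trim() removes the leading and trailing space.
$flat = trim(preg_replace('/\s+/', ' ', $string));

// For comparison: a single newline is left alone by /\s\s+/.
$partial = preg_replace('/\s\s+/', ' ', "a\nb\n\nc");

var_dump($flat, $partial);
```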

I've created a class for an open-source library that you're welcome to use and look at as an example of how to create an array from XML (and just take out the "good" parts).
USING XML
The crux of the problem is probably keeping the data in XML as long as possible. After the XSL transformation you would therefore have something like:
<xml>
<data>value
with newline
</data>
<data>with lots of whitespace</data>
</xml>
Then you could loop through that data like:
$xml = simplexml_load_string($xml_string);
foreach ($xml as $data)
{
    // use str_replace or a regular expression to replace the values...
    $data_array[] = str_replace(array(" ", "\n"), "", (string) $data);
}
// $data_array is the array you want!
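A self-contained version of that loop, as a sketch: the XML string is a made-up stand-in for the XSL output, and instead of deleting spaces outright it collapses each run of whitespace to one space so the words stay separated.

```php
<?php
// Hypothetical XSL output, still in XML form.
$xml_string = "<xml>\n<data>value\nwith newline\n</data>\n"
            . "<data>with   lots   of whitespace</data>\n</xml>";

$xml = simplexml_load_string($xml_string);

$data_array = [];
foreach ($xml->data as $data) {
    // Cast the SimpleXMLElement to a string, then normalise whitespace.
    $data_array[] = trim(preg_replace('/\s+/', ' ', (string) $data));
}

print_r($data_array);
```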
USING JSON
However, if you can't output XML from the XSL and loop through it, you may want to use XSL to create a JSON string and convert that to an array. The XSL output would look like:
{
"0" : "value
with newline",
"1" : "with lots of whitespace"
}
Then you could loop through that data like:
$json_array = json_decode($json_string, TRUE); // the TRUE is to make an array
foreach ($json_array as $key => $value)
{
    // use str_replace or a regular expression to replace the values...
    $json_array[$key] = str_replace(array(" ", "\n"), "", $value);
}
Either way you'll have to pull the values in PHP because XSLT's handling of spaces and newlines is pretty rudimentary.
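One caution on the JSON route: a raw line break inside a JSON string is not valid JSON, so json_decode() returns null for such input. The sketch below assumes the XSL can emit the newline as the escape sequence \n instead; the sample string is made up.

```php
<?php
// Hypothetical XSL output. Inside single quotes PHP keeps \n as a
// backslash followed by n, which is a valid JSON escape for a newline.
$json_string = '{"0": "value\nwith newline", "1": "with   lots   of whitespace"}';

$json_array = json_decode($json_string, true); // true: decode into an array
if ($json_array === null) {
    throw new RuntimeException('Invalid JSON: ' . json_last_error_msg());
}

// Collapse whitespace in every value and re-index the result.
$flat = array_values(array_map(function ($value) {
    return trim(preg_replace('/\s+/', ' ', $value));
}, $json_array));

print_r($flat);
```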
