I'm trying to find the position of the HTML element in an HTML document.
So I do this:
$filestring = file_get_contents($filename); //get the raw file
$filestring = htmlspecialchars($filestring);
$pos = strpos($filestring, "<head>"); //find the position of <head>
print_r($pos); //print the position
And print_r shows nothing. I think it is due to the special characters, but I don't understand how to fix it.
There is no such thing as <head> in $filestring.
When you use htmlspecialchars, the < and > get replaced with &lt; and &gt;:
http://php.net/manual/en/function.htmlspecialchars.php
So either search for the escaped form:
$pos = strpos($filestring, "&lt;head&gt;");
Or don't use htmlspecialchars when searching for the string.
Why do you use htmlspecialchars?
Do you understand that this function replaces characters like > or < with their entity representations, &gt; and &lt;?
So, the solutions are:
either don't use htmlspecialchars,
or search not for <head> but for &lt;head&gt;.
I think you should remove this line from your code:
$filestring = htmlspecialchars($filestring);
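The answers above can be checked with a minimal sketch. The sample markup is a hypothetical stand-in for the file contents. Note also that strpos returns false when nothing is found, and print_r(false) prints an empty string, which would explain why nothing showed up.

```php
<?php
// Hypothetical sample document; in the question this came from file_get_contents().
$filestring = "<html><head><title>Demo</title></head><body></body></html>";

// Searching the raw markup finds "<head>".
$rawPos = strpos($filestring, "<head>");

// After escaping, "<" and ">" have become "&lt;" and "&gt;",
// so the literal "<head>" no longer exists in the string.
$escaped    = htmlspecialchars($filestring);
$missing    = strpos($escaped, "<head>");
$escapedPos = strpos($escaped, "&lt;head&gt;");

var_dump($rawPos);     // int(6)
var_dump($missing);    // bool(false)
var_dump($escapedPos); // int(12)
```

Comparing with `=== false` rather than printing the result makes the "not found" case visible.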
I use the str_replace function to replace ##ID## in a YouTube URL with this regular expression: (?P<id>[a-z-A-Z_0-9]+)
So I use this code:
<?php
$urlbase = 'https://www.youtube.com/watch?v=##ID##';
$lastchange = str_replace('##ID##', '(?P<id>[a-z-A-Z_0-9]+)', $urlbase);
echo $lastchange;
?>
I get this output in the browser: https://www.youtube.com/watch?v=(?P[a-z-A-Z_0-9]+). It looks like <id> does not show up!
I tried this simple code:
<?php
echo "This is my <id>";
?>
But I just get "This is my" in the browser!
What's the problem, and how can I fix it? Thanks.
<id> is being interpreted as HTML, so your browser parses it and, since it is not a renderable element, shows nothing. Try:
<?php
echo "This is my &lt;id&gt;";
?>
As for the str_replace, it's doing exactly what the function is supposed to be doing. If you're looking to use regular expressions in string replacements, use preg_replace
The tag <id> is being hidden by your browser. It is really there if you view the page source. Maybe you should try:
$urlbase = 'https://www.youtube.com/watch?v=##ID##';
$lastchange = str_replace('##ID##', '(?P<id>[a-z-A-Z_0-9]+)', $urlbase);
echo urlencode( $lastchange );
The problem is with this line:
$lastchange = str_replace('##ID##', '(?P<id>[a-z-A-Z_0-9]+)', $urlbase);
str_replace does not use regex.
You will need preg_replace:
$pattern = '/(?P<id>[a-z-A-Z_0-9]+)/';
$replacement = '##ID##';
$string = $urlbase;
$lastchange = preg_replace($pattern, $replacement, $string);
Also, < and > are reserved characters in HTML and have special meanings. If you want to show them, you must use their entity names, &lt; and &gt; respectively:
<?php
echo "This is my &lt;id&gt;";
?>
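Putting the answers together, here is a hedged sketch of one way to turn the URL template into a pattern and display it safely. Splitting on the placeholder and the sample video ID are my own assumptions, and the character class is written as [a-zA-Z0-9_-] for readability.

```php
<?php
$urlbase = 'https://www.youtube.com/watch?v=##ID##';

// Quote the literal parts of the template so "." and "?" lose their regex meaning,
// then join them back together with the named group in place of ##ID##.
$parts = array_map(function ($part) {
    return preg_quote($part, '~');
}, explode('##ID##', $urlbase));
$pattern = '~^' . implode('(?P<id>[a-zA-Z0-9_-]+)', $parts) . '$~';

// Escape the pattern before echoing it into an HTML page,
// otherwise the browser treats <id> as an unknown tag and hides it.
$display = htmlspecialchars($pattern);
echo $display, "\n";

// Hypothetical URL used only to show the named group in action.
preg_match($pattern, 'https://www.youtube.com/watch?v=abc123XYZ_-', $m);
echo $m['id'], "\n"; // abc123XYZ_-
```

str_replace is still fine for building the pattern text; a preg_* function is only needed once the pattern is actually used for matching.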
Is there any way to write
<b style="color:red">asd</b>
without space after b ?
like this
<bstyle="color:red">asd</b>
I want to use it as a string without spaces, and then display it, and it should still work properly as HTML.
I tried something like
<b style="color:red">asd</b>
but it didn't work.
This is part of a string in which I want to find the 100th space and then split the string there using PHP. It causes problems if I split it in the middle of a tag.
You could use css.
<head>
<style type='text/css'>
b { color:red; }
</style>
</head>
//Other code
<b>abc</b>
Another idea is to handle it in the PHP code. Have you tried this?
<?php
$string = 'asdfasdfa<b style="color:red">asd</b> asdfasdf';
$string2 = str_replace('<b style', '<bstyle', $string);
//And then do the search
$search = strpos($string2, " ", 0);
?>
The only way to have that work is if you use str_replace on the string(s) after removing your spaces, like so:
$str = str_replace('bstyle', 'b style', $str);
where $str is a variable containing the text from which you've removed spaces.
Otherwise, the markup is invalid (and will likely be ignored by the browser).
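An alternative to rewriting the tags is to count only the spaces that lie outside of tags. The helper below is a hypothetical sketch, not something from the answers above, and it assumes attribute values never contain a literal ">".

```php
<?php
// Hypothetical helper: offset of the $n-th space that is not inside a tag, or false.
function nthSpaceOutsideTags(string $html, int $n)
{
    $inTag = false;
    $seen  = 0;
    $len   = strlen($html);
    for ($i = 0; $i < $len; $i++) {
        $ch = $html[$i];
        if ($ch === '<') {
            $inTag = true;               // entering a tag
        } elseif ($ch === '>') {
            $inTag = false;              // leaving a tag
        } elseif ($ch === ' ' && !$inTag && ++$seen === $n) {
            return $i;                   // found the n-th space outside tags
        }
    }
    return false;
}

$string = 'one <b style="color:red">two</b> three four';
$pos    = nthSpaceOutsideTags($string, 2);
$first  = substr($string, 0, $pos);
$second = substr($string, $pos + 1);

echo $first, "\n";  // one <b style="color:red">two</b>
echo $second, "\n"; // three four
```

This keeps the split point out of the middle of a tag, but it can still land between an opening tag and its closing tag, so the pieces may need rebalancing afterwards.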
So what I am trying to do is to match a regular expression which has an opening &lt;p&gt;; tag and a closing &lt;/;p&gt; tag. This is the code I wrote:
<?php
$input = "&lt;p&gt;just some text&lt;/p&gt; more text!";
$input = preg_replace('/&lt;p&gt;[^(&lt;\/p&gt;)]+?&lt;\/;p&gt;/','<p>$1</p>',$tem);
echo $input;
?>
So the code does not seem to replace &lt;p&gt; with <p> or replace &lt;/p&gt; with </p>. I think the problem is in the part where I am checking all characters except '&lt;/p&gt;'. I don't think the code [^(&lt;\/p&gt;)] is grouping all the characters correctly. I think it checks whether any one of the characters is absent, not whether the entire group of characters is absent. Please help me out here.
[] in a regex is a character class; you cannot match strings this way, only single characters or Unicode code points.
If you have escaped HTML entities, you can use htmlspecialchars_decode() to convert them back into characters.
After you have valid HTML, you can use the DOM to parse, traverse and manipulate it.
How do you parse and process HTML/XML in PHP?
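The difference between a character class and a group can be seen in a small sketch. The sample strings are hypothetical and use plain tags rather than entities to keep the patterns short.

```php
<?php
// The class excludes the single characters ( < / p > ), not the string "</p>".
preg_match('/[^(<\/p>)]+/', 'apple', $m);
$classMatch = $m[0];
echo $classMatch, "\n"; // a   (matching stops at the first "p")

// A lazy group captures everything up to the first closing tag instead.
preg_match('/<p>(.+?)<\/p>/', '<p>apple pie</p> more text', $m);
$groupMatch = $m[1];
echo $groupMatch, "\n"; // apple pie
```

The parentheses inside the brackets are just two more excluded characters; they do not group anything there.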
I think I figured it out. Here is the code:
<?php
$input = "<p>text</p>";
$tem = $input;
$tem = htmlspecialchars($input);
$tem = preg_replace('/&lt;p&gt;(.+?)&lt;\/p&gt;/','<p>$1</p>',$tem);
echo $tem;
?>
You don't need to capture the content between p tags, you only need to replace p tags:
$html = preg_replace('~&lt;(/?p)&gt;~', '<$1>', $html);
However, you don't need regex either:
$trans = array('&lt;p&gt;' => '<p>', '&lt;/p&gt;' => '</p>');
$html = strtr($html, $trans);
At least part of the trouble you're having is probably due to the fact that you seem to be playing fast and loose with the semicolons in your HTML entities. They always start with an ampersand, and end with a semicolon. So it's &gt;, not &gt as you have scattered through your post.
That said, why not use html_entity_decode(), which doesn't require abusing regular expressions?
$string = 'shoop &lt;p&gt;da&lt;/p&gt; woop';
echo html_entity_decode($string);
// output: shoop <p>da</p> woop
I'm using PHP to get all the "script" tags from web pages, and then appending text after the </script> that is not always valid html. Because it's not always valid markup I can't just use appendchild/replacechild to add that information, unless I'm misunderstanding how replacechild works.
Anyway, when I do
$script_tags = $doc->getElementsByTagName('script');
$l = $script_tags->length;
for ($i = $l - 1; $i > -1; $i--)
$script_tags_string = $doc->saveXML($script_tags->item($i));
This puts "<![CDATA[" and "]]>" around the contents of the script tag. How can I disable this? Please don't tell me to just delete it afterwards, that's what I'm going to do if I can't find a solution for this.
I have a suspicion that the CDATA is inserted because it would otherwise be invalid XML.
Have you tried using saveHTML instead of saveXML?
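A minimal sketch of the saveHTML suggestion follows. It assumes PHP 5.3.6 or later, where saveHTML() accepts a node argument, and the sample page is hypothetical.

```php
<?php
// Collect parser warnings instead of printing them.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
// Hypothetical sample page with a "<" inside the script body.
$doc->loadHTML('<html><head><script>if (a < b) { go(); }</script></head><body></body></html>');

$script_tags = $doc->getElementsByTagName('script');
$serialized  = array();
for ($i = $script_tags->length - 1; $i > -1; $i--) {
    // Serialise the single node with HTML rules rather than XML rules.
    $serialized[] = $doc->saveHTML($script_tags->item($i));
}

print_r($serialized);
```

With HTML serialisation the script body should come back as raw text, so no CDATA markers need to be stripped afterwards.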
Here is one way I've found to fix this:
Before echoing the document, loop over all script tags and use str_replace to swap "<" and ">" for placeholder strings; make sure those placeholder strings appear only inside script tags.
Then store the result of saveXML() in a variable, and finally use str_replace to turn the placeholders back into "<" and ">".
Here is the code:
<?php
//First loop
foreach($dom->getElementsByTagName('script') as $script){
$script->nodeValue = str_replace("<", "ESCAPE_CHAR_LT", $script->nodeValue);
$script->nodeValue = str_replace(">", "ESCAPE_CHAR_GT", $script->nodeValue);
}
//Obtaining XHTML
$output = $dom->saveXML();
//Second replace
$output = str_replace("ESCAPE_CHAR_LT", "<", $output);
$output = str_replace("ESCAPE_CHAR_GT", ">", $output);
//Print document
echo $output;
?>
As you can see, you are now free to use "<" and ">" in your scripts.
Hope this helps someone.
Let's say I have a page I want to scrape for words with "ice" in them, how can I do this easily? I see a lot of scrapers breaking things down into source code, but I don't need this. I just need something that searches through the plain text on the webpage.
Edit: I basically need something to search for .jpeg and find the entire file name. (it is in plain text on the website, not hidden in a tag)
Anything that matches the following is a word with ice in it:
/(\w*)ice(\w*)/i
(Do note that \w matches 0-9 and _ too. The following might give better results: /\b.*?ice\b.*?/i)
UPDATE
To match file names (must not contain whitespace):
/\S+\.jpeg/i
Example:
<?php
$str = 'Picture of me: 238484534.jpeg and someone else img-of-someone.jpeg here';
$cnt = preg_match_all('/\S+\.jpeg/i', $str, $matches);
print_r($matches);
1. Do you want to read the text inside the HTML tags too, such as attributes and names?
2. Or only the visible part of the web page?
For #1: the solutions are simple and already covered in the other answers.
For #2:
Use PHP's DOMDocument class, and extract and search only the text content.
Documentation here:
http://php.net/manual/en/class.domdocument.php
see this for example:
PHP DOMDocument stripping HTML tags
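Option 2 can be sketched as below, assuming the textContent property of the body element is a good enough approximation of the visible text. The sample markup is hypothetical.

```php
<?php
// Collect parser warnings instead of printing them.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
// Hypothetical page: one file name sits in an attribute, two in the visible text.
$doc->loadHTML('<html><body><p title="hidden.jpeg">Files: test1.jpeg and test2.jpeg</p></body></html>');

// textContent concatenates the text nodes below <body> and ignores tags and attributes.
$body = $doc->getElementsByTagName('body')->item(0);
$text = $body->textContent;

// Same pattern as in the answer above: a run of non-whitespace ending in .jpeg.
preg_match_all('/\S+\.jpeg/i', $text, $matches);
print_r($matches[0]); // test1.jpeg, test2.jpeg
```

Note that textContent also includes the bodies of script and style elements, so those may need to be removed first on real pages.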
Some regex use will be needed for this. Below I use PCRE http://www.php.net/manual/en/ref.pcre.php and the function preg_match_all http://www.php.net/manual/en/function.preg-match-all.php
<?php
$html = <<<EOF
<html>
<head>
<title>Test</title>
</head>
<body>List of files:
<ul>
<li>test1.jpeg</li>
<li>test2.jpeg</li>
</ul>
</body>
</html>
EOF;
$matches = array();
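// Use explicit "/" delimiters so the parentheses form a capture group.
$count = preg_match_all('/([0-9a-zA-Z_-]+\.jpeg)/', $html, $matches);
if ($count > 0) {
// $matches[1] holds the text captured by the first group for each match.
foreach ($matches[1] as $filename) {
print "Filename: {$filename}\n";
}
}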
?>
try this:
preg_match_all('/\w*ice\w*/', 'abc icecream lice', $matches);
print_r($matches);