Parsing HTML code and printing it out - PHP

I have this HTML page (partial code below) with multiple links of the form <a href="https://twitter.com/$name">.
I need to parse out all the $name values and print them on a page.
How can I do this?
<td>Apr 01 2011<br><b>527
</b>
</td>
<td>
<a href="https://twitter.com/al_rasekhoon" class="twitter-follow-button" data-show-count="false" data-lang="" data-width="60px" > al_rasekhoon</a>
</td>
</tr>
<tr class="rowc"><td colspan="11"></td></tr>

You need to loop over your $names array and print a correct <a> tag for every entry in that array. Like this:
<?php foreach ($names as $name) { ?>
<a href="https://twitter.com/<?php echo htmlspecialchars($name) ?>"><?php echo htmlspecialchars($name) ?></a>
<?php } ?>

Sounds like screen scraping, and you need to traverse the DOM for this. Regular expressions would be very unreliable.
DOMDocument may help you, but you might want to look into a library for screen scraping, such as BeautifulSoup (or a PHP equivalent).
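As a minimal sketch of the DOMDocument route, assuming the page has already been fetched into a string and that every profile link starts with https://twitter.com/ as in the question's sample (the function name extractTwitterNames is made up for illustration):

```php
<?php
// Hypothetical helper: collect the user names from all twitter.com links.
function extractTwitterNames(string $html): array
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid; collect parser warnings instead of printing them.
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();

    $names = [];
    foreach ($doc->getElementsByTagName('a') as $a) {
        $href = $a->getAttribute('href');
        // Assumption: profile links look like https://twitter.com/<name>
        if (preg_match('~^https?://(?:www\.)?twitter\.com/([A-Za-z0-9_]+)/?$~', $href, $m)) {
            $names[] = $m[1];
        }
    }
    return $names;
}

$sample = '<table><tr><td><a href="https://twitter.com/al_rasekhoon" class="twitter-follow-button">al_rasekhoon</a></td></tr></table>';
foreach (extractTwitterNames($sample) as $name) {
    echo $name, "\n";
}
```

Because the pattern only accepts a single path segment made of letters, digits and underscores, links to deeper twitter.com pages are skipped rather than printed.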

If I understand correctly, you fetch an HTML page from somewhere and want to extract all linked Twitter users? You can either parse the HTML code or do this with a bit of string splitting. This code is untested but should give you an idea:
$input = '(the html code)';
$links = explode('<a ', $input); //split input by start of link tags
for ($i = 0; $i < count($links); $i++) {
//cut off everything after the closing '>'
$links[$i] = explode('>', $links[$i], 2)[0];
//skip this link if it doesn't go to twitter.com
//(the prefix must match the page exactly; the sample uses https://)
if (strpos($links[$i], 'href="https://twitter.com/') === false) { continue; }
//split by the 'href' attribute and keep everything after 'twitter.com/'
$links[$i] = explode('href="https://twitter.com/', $links[$i], 2)[1];
//cut off everything after the " ending the href attribute
$links[$i] = explode('"', $links[$i], 2)[0];
//now $links[$i] should contain the twitter username
echo $links[$i];
}
Note: if there are other links to Twitter on the page that are not the main page or a user, they will get printed too (e.g. if the page links to the Twitter FAQ). You would need to filter them manually.
PHP sucks, let's do this in Python!
html = '(the html code)'
links = [l.split(">", 1)[0] for l in html.split("<a ")]
twitter_links = [l for l in links if 'href="https://twitter.com/' in l]
twitter_hrefs = [l.split('href="https://twitter.com/', 1)[1] for l in twitter_links]
users = [l.split('"', 1)[0] for l in twitter_hrefs]
print('\n'.join(users))

Related

Multi-step string split using PHP

I have a WordPress site that I am building for a client, and one custom post type field, called resources, will allow users to enter a link and then the text for the link, in the format shown below.
That info then needs to be output as an anchor tag inside an <li>. I am newish to PHP, and this is the code I have so far:
<ul>
<?php
$rawcontent = get_field("resources");
$rawcontent = preg_replace("!</?p[^>]*>!", "", $rawcontent);
$all_links = preg_split("/(\n)/", $rawcontent);
$firstpart = array_pop(explode(',', $rawcontent));
foreach($all_links as $link) {
if(!trim($link)) continue;
echo "<li><a href='$link'>$firstpart</a></li>";
}
?>
</ul>
When I print $rawcontent (resources) before any of my code executes, it appears as:
www.mylink1.com,link copy 1
www.mylink2.com, link copy 2
www.mylink3.com,link copy 3
with the code I have implemented now it comes out as
How can I get this to return just the link for the href and just the link copy part for the anchor text of each anchor tag?
I think this will do it.
First I explode on newlines, just like you do, then I loop over the lines.
Inside the loop I explode each line on the comma.
That gives an array with the link as the first item and the text as the second item.
$str = "www.mylink1.com,link copy 1
www.mylink2.com, link copy 2
www.mylink3.com,link copy 3";
$lines = explode(PHP_EOL, $str);
foreach ($lines as $line) {
$linktext = explode(",", $line);
echo "<li><a href='$linktext[0]'>$linktext[1]</a></li>";
}
https://3v4l.org/9DEoo
I see that your link2 has a space in the text.
You can remove that with trim when you echo.
echo "<li><a href='" . trim($linktext[0]) . "'>" . trim($linktext[1]) . "</a></li>\n";
I added trim on both link and text, it can be good to have. Just in case...
https://3v4l.org/6RkW3
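The same idea can be wrapped in a function. This is only a sketch under a few assumptions: the link itself never contains a comma, the link text may, and the values should be HTML-escaped before output. The name resourcesToListItems is invented for the example.

```php
<?php
// Hypothetical helper: turn "url,text" lines into <li><a> items.
function resourcesToListItems(string $raw): string
{
    $items = '';
    // \R matches \n, \r\n and \r, so the result does not depend on PHP_EOL.
    foreach (preg_split('/\R/', $raw) as $line) {
        if (trim($line) === '') {
            continue; // skip blank lines
        }
        // Limit of 2: only the first comma separates the link from its text.
        $parts = explode(',', $line, 2);
        $url   = trim($parts[0]);
        $text  = trim($parts[1] ?? $parts[0]); // fall back to the URL if no text was given
        $items .= "<li><a href='" . htmlspecialchars($url, ENT_QUOTES) . "'>"
                . htmlspecialchars($text, ENT_QUOTES) . "</a></li>\n";
    }
    return $items;
}

echo resourcesToListItems("www.mylink1.com,link copy 1\nwww.mylink2.com, link copy 2");
```

Splitting on \R instead of PHP_EOL matters when the text was entered on one platform and is processed on another, since PHP_EOL reflects the server's line ending, not the input's.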

Is it possible to change original html text in php?

I am trying to make a "manner friendly" website. We use different declensions depending on gender and other factors. For example:
You did = robili
It did = robilo
She did = robila
Linguistically this is a very simplified (and unfortunate) example! I would like to change the HTML text in a PHP file where appropriate. For example:
<?php
something
?>
html text of the page and somewhere is the word "robil"
<div>we tried to robil^i|o|a^</div>
<?php something ?>
Now I would like to find all occurrences of tokens of the form ^characters|characters|characters^ and replace each one with one of its internal values according to "gender".
It is easy in JavaScript on the client side, but then the visitor sees all this weird "tokenizing" before JavaScript replaces it.
I do not know an elegant solution here.
Or do you have a better idea?
Thanks for advice.
You can add these scripts before and after the HTML:
<?php
// start output buffering
ob_start();
?>
<html>
<body>
html text of the page and somewhere is the word "robil"
<div>we tried to robil^i|o|a^, but also vital^si|sa|ste^, borko^mal|mala|malo^ </div>
</body>
</html>
<?php
$use = 1; // indicate which declension to use (0, 1 or 2)
// get buffered html
$html = ob_get_contents();
ob_end_clean();
// match anything between '^' that's not a control chr or '^', min 3 and max 20 chrs.
if (preg_match_all('/\^[^[:cntrl:]\^]{3,20}\^/',$html,$matches))
{
// replace all
foreach (array_unique($matches[0]) as $match)
{
$choices = explode('|',trim($match,'^'));
$html = str_replace($match,$choices[$use],$html);
}
}
echo $html;
This returns:
html text of the page and somewhere is the word "robil" we tried to
robilo, but also vitalsa, borkomala
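An alternative to the match-then-str_replace loop is preg_replace_callback, which replaces each token in a single pass. The following is a sketch, not the answer's own method; the function name applyDeclension and the fallback to the first choice when the index is out of range are assumptions made for the example.

```php
<?php
// Hypothetical helper: replace every ^a|b|c^ token with the choice at index $use.
function applyDeclension(string $html, int $use): string
{
    return preg_replace_callback(
        // A token is ^...^ containing at least one '|' and no nested '^'.
        '/\^([^\^|]+(?:\|[^\^|]+)+)\^/',
        function (array $m) use ($use): string {
            $choices = explode('|', $m[1]);
            // Assumption: fall back to the first form if the index does not exist.
            return $choices[$use] ?? $choices[0];
        },
        $html
    );
}

echo applyDeclension('<div>we tried to robil^i|o|a^</div>', 2), "\n";
// prints: <div>we tried to robila</div>
```

It can be combined with the output buffering shown above by passing the buffered HTML through the function before echoing it.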

Simple HTML Dom Crawler returns more than contained in attributes

I would like to extract the contents of certain parts of a website using selectors. I am using Simple HTML DOM to do this. However, for some reason more data is returned than is present in the selectors that I specify. I have checked the FAQ of Simple HTML DOM, but did not see anything that could help me out. I wasn't able to find anything on Stack Overflow either.
I am trying to get the contents/hrefs of all h2 class="hed" tags contained within the ul class="river" on this webpage: http://www.theatlantic.com/most-popular/
In my output I am receiving a lot of data from other tags like p class="dek has-dek" that are not contained within the h2 tag and should not be included. This is really strange as I thought the code would only allow for content within those tags to be scraped.
How can I limit the output to only include the data contained within the h2 tag?
Here is the code I am using:
<div class='rcorners1'>
<?php
include_once('simple_html_dom.php');
$target_url = "http://www.theatlantic.com/most-popular/";
$html = new simple_html_dom();
$html->load_file($target_url);
$posts = $html->find('ul[class=river]');
$limit = 10;
$limit = count($posts) < $limit ? count($posts) : $limit;
for($i=0; $i < $limit; $i++){
$post = $posts[$i];
$post->find('h2[class=hed]',0)->outertext = "";
echo strip_tags($post, '<p><a>');
}
?>
</div>
Output can be seen here. Instead of only a couple of article links, I get author information and article details, among other things.
You are not outputting the h2 contents, but the ul contents in the echo:
echo strip_tags($post, '<p><a>');
Note that the statement before the echo does not modify $post:
$post->find('h2[class=hed]',0)->outertext = "";
Change code to this:
$hed = $post->find('h2[class=hed]',0);
echo strip_tags($hed, '<p><a>');
However, that will only do something with the first found h2. So you need another loop. Here is a rewrite of the code after load_file:
$posts = $html->find('ul[class=river]');
foreach($posts as $postNum => $post) {
if ($postNum >= 10) break; // limit reached
$heds = $post->find('h2[class=hed]');
foreach($heds as $hed) {
echo strip_tags($hed, '<p><a>');
}
}
If you still need to clear outertext, you can do it with $hed:
$hed->outertext = "";
You really only need one loop. Consider this:
foreach($html->find('ul.river > h2.hed') as $postNum => $h2) {
if ($postNum >= 10) break;
echo strip_tags($h2, '<p><a>') . "\n"; // the text
echo $h2->parent->href . "\n"; // the href
}
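For comparison, the same selection can be sketched with PHP's built-in DOMDocument and DOMXPath instead of Simple HTML DOM. The sample markup below is invented to resemble the structure described in the question (the real page may differ), and extractHeadlines is a hypothetical name.

```php
<?php
// Hypothetical helper: text of every h2.hed found inside a ul.river.
function extractHeadlines(string $html, int $limit = 10): array
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // tolerate imperfect HTML
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    // XPath 1.0 has no class selector, so match the class name as a whole word.
    $query = '//ul[contains(concat(" ", normalize-space(@class), " "), " river ")]'
           . '//h2[contains(concat(" ", normalize-space(@class), " "), " hed ")]';

    $headlines = [];
    foreach ($xpath->query($query) as $h2) {
        if (count($headlines) >= $limit) {
            break; // limit reached
        }
        $headlines[] = trim($h2->textContent);
    }
    return $headlines;
}

// Invented sample markup, only loosely modelled on the question.
$sample = '<ul class="river">'
        . '<li><h2 class="hed"><a href="/a">First</a></h2><p class="dek has-dek">Teaser</p></li>'
        . '<li><h2 class="hed"><a href="/b">Second</a></h2></li>'
        . '</ul>';
print_r(extractHeadlines($sample));
```

Only the h2 nodes are read, so the p.dek teaser text never reaches the output.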

PHP: automatically go to the next page and scrape

I'm new to PHP and I'm trying to code a tool that scrapes Amazon product titles.
Right now I can scrape the first page, but I need the tool to go on to the next page, until there are no pages left, and do the same scraping task there as on the first page.
Here is the code:
<?php
$file_string = file_get_contents('http://www.amazon.com/s/ref=lp_3737671_pg_1?rh=n%3A1055398%2Cn%3A%211063498%2Cn%3A3206324011%2Cn%3A3737671&page=1&ie=UTF8&qid=1361609819');
preg_match_all('/<span class="lrg bold">(.*)<\/span>/i', $file_string, $links);
for($i = 0; $i < count($links[1]); $i++) {
echo $links[1][$i] . '<br>';
}
?>
Any help is appreciated.
To get the HTML of all pages in one variable, this would do the trick:
<?php
$html = '';
$file_string = file_get_contents('http://www.amazon.com/s/ref=lp_3737671_pg_1?rh=n%3A1055398%2Cn%3A%211063498%2Cn%3A3206324011%2Cn%3A3737671&page=1&ie=UTF8&qid=1361609819');
preg_match_all('/<span class="lrg bold">(.*)<\/span>/i', $file_string, $links);
for($i = 0; $i < count($links[1]); $i++) {
$html .= file_get_contents($links[1][$i]);
}
echo "all pages combined:\n".$html;
?>
However, more than likely your server will time out, run out of memory, or something else will go wrong. To scrape HTML content you'd be better off creating a URL list first, then scraping the URLs one at a time. You could do this via an HTML page that calls the scraper via AJAX.
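Note that the regex in the code above captures title text, not URLs, so the captured values cannot be fetched as pages. One way to walk the result pages is to request page 1, 2, 3 and so on, and stop when a page yields no titles. The sketch below assumes that the page number can be varied in the URL (the question's URL contains page=1, but whether the site honours other values is not confirmed). The fetcher is passed in as a callable so the loop can be tried without network access; scrapeAllTitles and $fetchPage are hypothetical names.

```php
<?php
// Hypothetical helper; $fetchPage is any callable returning the HTML of page N,
// or false / an empty string when the page cannot be fetched.
function scrapeAllTitles(callable $fetchPage, int $maxPages = 50): array
{
    $titles = [];
    for ($page = 1; $page <= $maxPages; $page++) {
        $html = $fetchPage($page);
        if ($html === false || $html === '') {
            break; // request failed or nothing was returned
        }
        // Non-greedy (.*?) so that each span is captured separately.
        if (!preg_match_all('/<span class="lrg bold">(.*?)<\/span>/i', $html, $m)) {
            break; // no titles on this page: assume we ran past the last one
        }
        $titles = array_merge($titles, $m[1]);
    }
    return $titles;
}

// Stand-in for file_get_contents() so the sketch runs offline.
$fakePages = [
    1 => '<span class="lrg bold">Title A</span><span class="lrg bold">Title B</span>',
    2 => '<span class="lrg bold">Title C</span>',
];
$fetch = function (int $page) use ($fakePages) {
    return $fakePages[$page] ?? '';
};

echo implode('<br>', scrapeAllTitles($fetch)), "\n";
```

For a real run, $fetchPage would build the URL for the given page number and call file_get_contents on it. The $maxPages cap keeps a misbehaving site from causing an endless loop.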

Finding and Printing all Links within a DIV

I am trying to find all links in a div and then print those links.
I am using Simple HTML DOM to parse the HTML file. Here is what I have so far; please read the inline comments and let me know where I am going wrong.
include('simple_html_dom.php');
$html = file_get_html('tester.html');
$articles = array();
//find the div with the id abcde
foreach($html->find('#abcde') as $article) {
//find all a tags that have a href in the div abcde
foreach($article->find('a[href]') as $link){
//if the href contains singer then echo this link
if(strstr($link, 'singer')){
echo $link;
}
}
}
What currently happens is that the above takes a long time to load (I never got it to finish). Since it took too long to wait for, I printed what it was doing in each loop, and I found that it's going through things I don't need it to! This suggests my code is wrong.
The HTML is basically something like this:
<div id="abcde">
<!-- lots of html elements -->
<!-- lots of a tags -->
<a href="singer/tom" />
<img src="image..jpg" />
</a>
</div>
Thanks all for any help
The correct way to select a div (or whatever) by ID using that API is:
$html->find('div[id=abcde]');
Also, since IDs are supposed to be unique, the following should suffice:
//find all a tags that have a href in the div abcde
$article = $html->find('div[id=abcde]', 0);
foreach($article->find('a[href]') as $link){
//if the href contains singer then echo this link
if(strstr($link, 'singer')){
echo $link;
}
}
Why don't you use the built-in DOM extension instead?
<?php
$cont = file_get_contents("http://stackoverflow.com/") or die("1");
$doc = new DOMDocument();
@$doc->loadHTML($cont) or die("2"); // @ suppresses warnings about malformed HTML
$nodes = $doc->getElementsByTagName("a");
for ($i = 0; $i < $nodes->length; $i++) {
$el = $nodes->item($i);
if ($el->hasAttribute("href"))
echo "- {$el->getAttribute("href")}\n";
}
gives
... (lots of links before) ...
- http://careers.stackoverflow.com
- http://serverfault.com
- http://superuser.com
- http://meta.stackoverflow.com
- http://www.howtogeek.com
- http://doctype.com
- http://creativecommons.org/licenses/by-sa/2.5/
- http://www.peakinternet.com/business/hosting/colocation-dedicated#
- http://creativecommons.org/licenses/by-sa/2.5/
- http://blog.stackoverflow.com/2009/06/attribution-required/
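The DOM extension can also restrict the search to one div, which is what the question asks for. Here is a sketch using DOMXPath, assuming the ID contains no double quotes (it is interpolated into the XPath query) and that the href itself, rather than the whole tag, is what should contain "singer". The function name findLinksInDiv is invented.

```php
<?php
// Hypothetical helper: hrefs of all links inside <div id="$divId"> that contain $needle.
function findLinksInDiv(string $html, string $divId, string $needle): array
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // tolerate imperfect HTML
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    // Assumption: $divId contains no double quotes.
    $query = sprintf('//div[@id="%s"]//a[@href]', $divId);

    $hrefs = [];
    foreach ($xpath->query($query) as $a) {
        $href = $a->getAttribute('href');
        if (strpos($href, $needle) !== false) {
            $hrefs[] = $href;
        }
    }
    return $hrefs;
}

$sample = '<div id="abcde"><a href="singer/tom"><img src="image.jpg"></a>'
        . '<a href="band/x">x</a></div><a href="singer/outside">o</a>';
print_r(findLinksInDiv($sample, 'abcde', 'singer'));
```

Links outside the div, such as singer/outside in the sample, are never visited because the XPath query starts from the div.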
