Removing portion from scraped array - php

Currently I am scraping a website, and I am trying to remove a portion of the content that I don't want included in the array.
This is the code I have currently:
$content['article'] = $html2->find('.hentry-content',0);
$content['article'] = $content['article']->plaintext;
This returns everything within the .hentry-content class on the website I am gathering content from.
Now the content that gets returned looks like this.
array (
[article] => This is some example filler content please no actual meaning behind random bridge for bridge random you dog tomorrow http://example.com/our-random-mp3.com
)
At the end of this output there is usually a link to a random MP3. Is there any way I can pull just the content portion of the array without the MP3 link being included?

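If the link is inside an <a> tag, this should work, provided $content['article'] still holds the element returned by find() and has not yet been replaced by its plaintext: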
foreach($content['article']->find('a') as $item) {
$item->outertext = '';
}
echo $content['article']->plaintext;

If the returned text only contains one link to the random MP3 file, you could filter it out with:
$url_pattern = '/\bhttps?:\/\/\S+/i';
$content['article'] = preg_replace($url_pattern, '', $content['article']->plaintext);
This will remove all URLs that start with http:// or https:// from the text. Note that the URL pattern from http://code.tutsplus.com/tutorials/8-regular-expressions-you-should-know--net-6149 is anchored with ^ and $, so it only matches when the whole string is a URL and cannot remove a URL embedded in longer text.
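For the plain-text case, a minimal sketch of the same idea is shown here. The helper name remove_urls is hypothetical, the sample string is shortened from the question, and the scalar type declarations assume PHP 7 or later.

```php
<?php
// Hypothetical helper (not part of any library): strips http(s) URLs
// from a plain-text string and tidies the leftover whitespace.
function remove_urls(string $text): string
{
    // Match "http://" or "https://" followed by any run of non-whitespace.
    $without_urls = preg_replace('~https?://\S+~i', '', $text);

    // Collapse the whitespace left behind and trim both ends.
    return trim(preg_replace('/\s+/', ' ', $without_urls));
}

$article = 'This is some example filler content http://example.com/our-random-mp3.com';
echo remove_urls($article), "\n";
```

Because the pattern is not anchored, it removes every URL in the text, not only a trailing one. If the article may contain links that should survive, the pattern would need to be narrowed, for example to URLs ending in .mp3.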

Related

PHP get 5 additional characters after a specified string of a particular page and list them

Note: I have edited the post on 2017/08/20
I'm trying to obtain a list of product page URLs of the form "www.example.com/product/11111/".
There are over 200 different products available, and each of them has its own product page. I want to print out each product page as a PDF file.
On "www.example.com/productlist/", there are URLs that lead to each product's page.
So, what I'm trying to do is:
1. Obtain the URLs that I need from "www.example.com/productlist/"
2. Generate PDF files from the URLs that I have obtained
Insufficient information: you did not provide much information about the code you already have or about how the website will get the 200 URLs, so I can't write the whole code, because it depends on where your website gets the links from.
If you explain more about how the website is supposed to get the links, I will help you put them into an array, save them into a file, and implement the rest of the code!
1-) What I understood is that you want the last 5 characters.
As you just want the last 5 characters, not the whole last part, you can do something like this.
$string = "http://example.com/folder/example/1234567"; // your link
$letters = 5; // number of trailing characters to keep; edit as needed
$code = substr($string, -$letters); // a negative start counts from the end of the string
echo $code; // shows the last 5 characters: 34567
I am always here to help. Good luck!
Start with parse_url().
$parts=parse_url("http://example.com/folder/example/12345");
That will give you an array with a handful of keys. The one you are looking for is path.
Split the path on / and take the last one.
$path_parts=explode('/', $parts['path']);
Your number can now be stored like this:
$number = $path_parts[count($path_parts)-1];
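The parse_url() approach can be sketched as a small function. The name last_path_segment is hypothetical. Unlike the snippet above, this version trims slashes first, on the assumption that product URLs may end with a trailing slash as in the question.

```php
<?php
// Hypothetical helper: returns the last non-empty path segment of a URL,
// or an empty string when the URL has no path.
function last_path_segment(string $url): string
{
    // PHP_URL_PATH makes parse_url() return only the path component.
    $path = parse_url($url, PHP_URL_PATH);
    if (!is_string($path)) {
        return '';
    }

    // Trim leading/trailing slashes so ".../11111/" does not yield an empty last element.
    $segments = explode('/', trim($path, '/'));

    return end($segments);
}

echo last_path_segment('http://www.example.com/product/11111/'), "\n";
```

Taking the whole last segment rather than a fixed five characters keeps working if the product IDs ever change length.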

Create array of specific substrings

We use a custom CMS, built with PHP and MySQL.
I have a customer who embeds YouTube videos in the content of the site. That content is one string, which he can edit with CKEditor. That all works just fine.
He now wants to have those videos displayed on a different location within the same page.
I do not want to create a separate input field in the system just for this, for multiple reasons.
The solution I need is this:
I want to extract the (multiple) <iframe>youtube blah blah</iframe> elements from the content string and create an array of iframe strings. Then I can display them elsewhere on the page.
To avoid displaying the videos in their original location, I can use preg_replace to strip the iframes out of the content string.
However, I have no idea how to fetch those substrings and form that new array in PHP.
I hope you have an idea and that my explanation is clear.
EDIT after getting the answer from Michel
The complete code I am using now:
$string = '<iframe>youtube iframe</iframe>Some cool text in between blahblah<iframe>moreyoutube</iframe>';
//catch the iframes
$iframe=array();
$parts=explode('<iframe',$string);
if (count($parts) > 1){ //make sure a string without iframes does not end up in the array
foreach($parts as $p){
if( strpos($p,'youtube') !== false ){
$v=explode('</iframe>',$p);
$iframe[]= '<iframe'.$v[0].'</iframe>';
}
}
}
//strip out iframes
$string = preg_replace('/<iframe(.*?)<\/iframe>/', '', $string);
This will give you a string without iframes, and an array of iframes to display separately.
Thanks to Michel for the answer.
One way of doing it:
Explode the content string on the opening <iframe.
Loop over the resulting array and use strpos to look for the word youtube (to rule out other iframes on the page).
If you find any, add <iframe and </iframe> back to the result.
$string='<div>blabla</div><iframe src="youtube.org.com.uk.sk"></iframe><div>blahblah</div>';
$iframe=array();
$parts=explode('<iframe',$string);
foreach($parts as $p){
if( strpos($p,'youtube') !== false ){
$v=explode('</iframe>',$p);
$iframe[]= '<iframe'.$v[0].'</iframe>';
}
}
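As an alternative to explode(), the extraction and the stripping can be sketched in one pass with preg_match_all. This is a swapped-in technique, not the method from the answer above. The function name split_youtube_iframes is hypothetical, and the sketch assumes the iframes are not nested and that "youtube" appears somewhere in the tag of every video iframe.

```php
<?php
// Hypothetical helper: returns [array of YouTube iframe strings, content without them].
function split_youtube_iframes(string $content): array
{
    // Non-greedy match of each <iframe ...>...</iframe>; "s" lets . span newlines.
    preg_match_all('~<iframe\b[^>]*>.*?</iframe>~is', $content, $matches);

    // Keep only iframes that mention "youtube" (case-insensitive).
    $iframes = array_values(array_filter($matches[0], function ($iframe) {
        return stripos($iframe, 'youtube') !== false;
    }));

    // Remove exactly those iframes; other iframes stay in the content.
    $remaining = str_replace($iframes, '', $content);

    return [$iframes, $remaining];
}

$content = '<p>a</p><iframe src="https://www.youtube.com/embed/x"></iframe><p>b</p>';
[$iframes, $remaining] = split_youtube_iframes($content);
print_r($iframes);
echo $remaining, "\n";
```

Regular expressions are fragile on arbitrary HTML, so for editor output with unusual markup a DOM parser may be the safer choice.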

PHP Replace tags / placeholders / markers in text string with dynamic values

Basically, what I want to achieve is to dynamically replace {SOME_TAG} with "Text".
My idea was to read all tags like {SOME_TAG} and put them into an array.
Then convert the array keys into variables like $some_tag, and put those into an array.
So, this is how far I got:
//Some code goes here
$some_tag = "Is defined somewhere else.";
$different_tag = 1 + $something;
// Some text {SOME_TAG} appears in a different file, whose contents have been read earlier.
//Some code goes here
preg_match_all('/{\w+}/', $strings, $search);
$search = str_replace(str_split('{}'),"",$search[0]);
$search = array_change_key_case( array_flip($search), CASE_LOWER);
...some code is missing here, which I can't figure out.
The replace array should look something like this:
$replace = array($some_tag, $different_tag);
//Then comes replacing code and output blah blah blah..
How can I make the $replace array contain variables dynamically, depending on the $search array?
Why not something along the lines of:
<?php
$replace = array(
'{TAG_1}' => 'hello',
'{TAG_2}' => 'world',
'{TAG_3}' => '!'
);
$myString = '{TAG_1} {TAG_2}{TAG_3}{TAG_3}';
echo str_replace(array_keys($replace), array_values($replace), $myString);
If I understand correctly:
You're working on trying to create a customizable document, using {TAGS} in order to represent replaceable areas that can be filled in with dynamic information. At some point in time while replacing the {TAGS} with the dynamic information, you want the dynamic information to be stored in automatically generated basic variable names, as $tags.
I'm not sure why you want to convert these tags to basic variables instead of using them entirely as array keys. I would like to point out that this represents a security or functionality hole: what happens if someone puts {REPLACE} in as a tag in your document? Your replace array would get overwritten with dynamic data, and your whole program would fall apart. Either that, or the whole replace array would get dumped in for {REPLACE}, making for a very messy document with perhaps data you don't WANT them to have in it. Perhaps you have this dealt with (I don't have all the context here), but I thought I'd point out the risk factor.
As for a better solution, unless there's some specific need that you're addressing by going through $tags instead of using the $replace array directly, I like @Emissary's answer.
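Keeping the values in an array, as both answers recommend, a minimal sketch of a dynamic lookup could use preg_replace_callback. The function name fill_tags and the sample values are hypothetical. The sketch assumes tags consist of word characters only and are matched case-insensitively against lowercase array keys.

```php
<?php
// Hypothetical helper: replaces {TAG} placeholders with values from an
// array keyed by the lowercase tag name. No variables are created from tags.
function fill_tags(string $template, array $values): string
{
    return preg_replace_callback('/\{(\w+)\}/', function (array $m) use ($values) {
        $key = strtolower($m[1]);

        // Unknown tags are left untouched rather than replaced with a guess.
        return array_key_exists($key, $values) ? (string) $values[$key] : $m[0];
    }, $template);
}

$values = [
    'some_tag'      => 'Is defined somewhere else.',
    'different_tag' => 1 + 2,
];
echo fill_tags('Some text {SOME_TAG} and {DIFFERENT_TAG} and {UNKNOWN}', $values), "\n";
```

Because the callback only reads from $values, a tag such as {REPLACE} in the document cannot overwrite anything in the program.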

How to get a specific part, or div of a website

What I would like to do: get the text headline from the top post on http://reddit.com/r/worldnews and output it to a webpage of mine that will only have that text on it.
In the end, I would like to grab the text from that webpage that I made using AppleScript cURL and output it.
I am making a script that when I click the button it will tell me the top post.
edit If you can think about any way, I would like to do the same thing, but for Facebook notifications.
edit I have PHP grabbing the site and outputting here: http://colejohnsoncreative.com/personal/ai/worldnews.php This is the code that I am using:
<?php
// Get a file into an array. In this example we'll go through HTTP to get
// the HTML source of a URL.
$lines = file('http://www.reddit.com/r/worldnews');
// Loop through our array, show HTML source as HTML source; and line numbers too.
foreach ($lines as $line_num => $line) {
echo "Line #<b>{$line_num}</b> : " . htmlspecialchars($line) . "<br />\n";
}
// Another example, let's get a web page into a string. See also file_get_contents().
$html = implode('', file('http://www.example.com/'));
// Using the optional flags parameter since PHP 5
$trimmed = file('somefile.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
?>
So I get all of the site's code to output, but all I need for the project is
<a class="title " href="http://www.dailymail.co.uk/news/article-2219477/Cannabis-factory-couple-gave-400-000-drug-dealing-fortune-poor-Kenyans-jailed-years.html" >British couple who spent most of the money they made from canabis growing on paying for life changing operations and schooling for people in a poor Kenyan village gets sent to prison for 3 years.</a>
and everything else I need to throw away, how can I do that?
If you're in a shell, you can wget the page.
From PHP, you could fetch the page with file_get_contents.
From Java, you could get it with URLConnection.
Once you have it, use whatever language you want to look through the text of the page for what you want, and do whatever you like with it.
You're going to have to do some parsing, so match the pattern you want. The simplest approach is to use something like strpos to get the position of the elements around what you want, or to use a regex.
Do they have an RSS feed? If so, you should use that.
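As an alternative to strpos or a regex, the parsing step can be sketched with PHP's bundled DOM extension. The function name first_title_text is hypothetical, and the sketch assumes the headline is the first <a> element whose class attribute contains "title", as in the markup quoted in the question. Fetching the page is left out so that the example runs on a sample string.

```php
<?php
// Hypothetical helper: returns the text of the first <a> whose class list
// contains "title", or null when there is no such link.
function first_title_text(string $html): ?string
{
    $doc = new DOMDocument();

    // Real-world markup is rarely valid, so collect parser warnings silently.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Match "title" as a whole class token, even with extra spaces or classes.
    $xpath = new DOMXPath($doc);
    $nodes = $xpath->query('//a[contains(concat(" ", normalize-space(@class), " "), " title ")]');
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }

    return trim($nodes->item(0)->textContent);
}

$html = '<html><body><a class="title " href="http://example.com/a">First headline</a></body></html>';
echo first_title_text($html), "\n";
```

In the question's script, the string passed in would be the result of file_get_contents on the page URL. The site's markup may change at any time, which is one more reason to prefer an RSS feed if one exists.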

String replace the contents of a div

What I want to do:
I have a div with an id. Whenever ">" occurs I want to replace it with ">>". I also want to prefix the div with "You are here: ".
Example:
<div id="bbp-breadcrumb">Home > About > Contact</div>
Context:
My div contains breadcrumb links for bbPress, but I'm trying to match its format to a site-wide breadcrumb plugin that I'm using for WordPress. The div is produced by a PHP function and output as HTML.
My question:
Do I use PHP or JavaScript to replace the symbols, and how do I go about getting the contents of the div in the first place?
Find the code that's generating the >, and either set the appropriate option (breadcrumb_separator or so) or modify the PHP code to change the separator.
Modifying supposedly static text with JavaScript is not only a maintenance nightmare, extremely brittle, and might lead to a strange rendering (as users see your site being modified if their system is slow), but will also not work in browsers without (or with disabled) JavaScript support.
You could use CSS to add the "You are here: " text:
#bbp-breadcrumb:before {
content: "You are here: ";
}
Browser support:
http://www.quirksmode.org/css/beforeafter_content.html
You could change the > to >> with JavaScript:
var htmlElement = document.getElementById('bbp-breadcrumb');
// innerHTML serializes a literal > in text as &gt;, so match the entity to leave the tags intact
htmlElement.innerHTML = htmlElement.innerHTML.split('&gt;').join('&gt;&gt;');
I don't recommend altering content like this; it is really hacky. You'd better change the output rendering of the breadcrumb plugin if possible. Within WordPress this should be doable.
You can use a regex to match the breadcrumb content, make the changes to it, and put it back in its context.
Check if this helps you:
$the_existing_html = 'somethis before<div id="bbp-breadcrumb">Home > About > Contact</div>something after'; // let's say this is your current html.. just added some context
echo $the_existing_html, '<hr />'; // output.. so that you can see the difference at the end
$pattern ='|<div(.*)bbp-breadcrumb(.*)>(.*)<\/div>|sU'; // find some text that is in a div that has "bbp-breadcrumb" somewhere in its attribute list
$all = preg_match_all($pattern, $the_existing_html, $matches); // match that pattern
$current_bc = $matches[3][0]; // get the text inside that div
$new_bc = 'You are here: ' . str_replace('>', '>>', $current_bc); // add the prefix and double every > separator (search for '&gt;' instead if the markup contains the entity)
$the_final_html = str_replace($current_bc, $new_bc, $the_existing_html); // replace the initial breadcrumb with the new one
echo $the_final_html; // output to see where we got
