Lacking photos from external URL - PHP

I am fetching photos from an external URL on the server side. I am using the simple PHP DOM library for this, as per an SO suggestion. But the results fall short: for some sites I am not able to get all the photos.
$url below holds an example external site that does not give me all the images.
$url = "http://www.target.com/c/baby-baby-bath-bath-safety/-/N-5xtji#?lnk=nav_t_spc_3_inc_1_1";
$html = file_get_contents($url);
$doc = new DOMDocument();
@$doc->loadHTML($html); // suppress warnings from malformed HTML
$tags = $doc->getElementsByTagName('img');
foreach ($tags as $tag) {
    echo $imageUrl = $tag->getAttribute('src');
    echo "<br />";
}
Is it possible to get functionality/accuracy similar to Firefox's
Firefox -> Tools -> Page Info -> Media
option? I just want this to be more accurate, since the existing library is not fetching all the images. I also tried file_get_contents(), which likewise does not fetch all the images.

You could use regular expressions to get the images' src attributes. DOMDocument builds the whole DOM structure in memory, which you don't need here. Once you have the URLs, use file_get_contents() and write the data to files. Also raise max_execution_time if you are going to parse many pages.
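For example, a rough sketch of that approach; example.com stands in for the real page, and a simple pattern like this will not catch every way an img tag can be written (srcset, lazy-loaded images, unquoted attributes):
set_time_limit(300); // same effect as raising max_execution_time for this run

$html = file_get_contents("http://www.example.com/");

// Capture the src value of every img tag, whether single- or double-quoted.
preg_match_all('/<img[^>]+src=["\']([^"\']+)["\']/i', $html, $matches);

foreach ($matches[1] as $imageUrl) {
    echo $imageUrl . "<br />";
}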

Download images from remote server
function save_image($sourcePath, $targetPath)
{
    $in  = fopen($sourcePath, "rb");
    $out = fopen($targetPath, "wb");
    while ($chunk = fread($in, 8192)) {
        fwrite($out, $chunk, 8192);
    }
    fclose($in);
    fclose($out);
}

$src = "http://www.example.com/thumbs/thumbs-t2/1/ts_11083.jpg"; // image source
$target = dirname(__FILE__)."/images/pic.jpg"; // where to save the image under a new name
save_image($src, $target);
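Tying the two answers together: once the image URLs are collected, you could loop over them and call save_image() for each one. A minimal sketch, assuming $matches[1] holds absolute image URLs (as in the regex sketch above) and that the images/ directory already exists:
// Minimal sketch: $matches[1] is assumed to hold absolute image URLs.
foreach ($matches[1] as $i => $imageUrl) {
    $target = dirname(__FILE__) . "/images/pic_" . $i . ".jpg";
    save_image($imageUrl, $target);
}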

Related

php file_get_contents from different URL if first one not available

I have the following code to read an XML file which works well when the URL is available:
$url = 'http://www1.blahblah.com'."param1"."param2";
$xml = file_get_contents($url);
$obj = simplexml_load_string($xml);
How can I change the above code to cycle through a number of different URLs if the first one is unavailable for any reason? I have a list of 4 URLs, all containing the same file, but I'm unsure how to go about it.
Replace your code with something like this, for example:
// instead of a single variable, use an array of links
$urls = ['http://www1.blahblah.com'."param1"."param2",
         'http://www1.anotherblahblah.com'."param1"."param2",
         'http://www1.andanotherblahblah.com'."param1"."param2",
         'http://www1.andthelastblahblah.com'."param1"."param2"];

// try to get the content for each of your links in turn
foreach ($urls as $url) {
    $xml = file_get_contents($url);
    // do your thing if the content was read without failure, then break the loop
    if ($xml !== false) {
        $obj = simplexml_load_string($xml);
        break;
    }
}
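Note that file_get_contents() emits a PHP warning whenever a URL cannot be reached. If you would rather fall through to the next URL silently, and cap how long each attempt may take, a small variant of the loop above could look like this (the 5-second timeout is an arbitrary example value):
// Illustrative variant: silence the warning on failure and limit the wait per URL.
$context = stream_context_create(['http' => ['timeout' => 5]]);

foreach ($urls as $url) {
    $xml = @file_get_contents($url, false, $context);
    if ($xml !== false) {
        $obj = simplexml_load_string($xml);
        break;
    }
}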

How to scrape multiple divs?

Hello, I've got a bunch of divs I'm trying to scrape the content values from, and I've managed to successfully pull out one of the values, result! However, I've hit a brick wall: I now want to pull out the div that comes after it within the code I already have. I'd appreciate any help.
Here is the bit of code I'm currently using.
foreach ($arr as &$value) {
    $file = $DOCUMENT_ROOT . $value;
    $doc = new DOMDocument();
    $doc->loadHTMLFile($file);
    $xpath = new DOMXpath($doc);
    $elements = $xpath->query("//*[contains(@class, 'covGroupBoxContent')]//div[3]//div[2]");
    if (!is_null($elements)) {
        foreach ($elements as $element) {
            $nodes = $element->childNodes;
            foreach ($nodes as $node) {
                $maps = $node->nodeValue;
                echo $maps;
            }
        }
    }
}
I simply want them all to have separate outputs that I can echo out.
I recommend you use Simple HTML DOM. Beyond that I need to see a sample of the HTML you are scraping.
If you are scraping a website outside your domain I'd recommend saving the source HTML to a file for review and testing. Some websites combat scraping, thus what you see in the browser is not what your scraper would see.
Also, I'd recommend setting a random user agent via ini_set(). If you need a function for this I have one.
<?php
$html = file_get_html($url);
if ($html) {
    $myfile = fopen("testing.html", "w") or die("Unable to open file!");
    fwrite($myfile, $html);
    fclose($myfile);
}
?>
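As for the random user agent mentioned above, a minimal sketch of such a helper might look like the following. The function name and the user-agent strings are just illustrative; ini_set('user_agent', ...) is what file_get_contents(), and therefore Simple HTML DOM's file_get_html(), sends for plain HTTP requests.
// Illustrative helper: pick one of a few example user-agent strings at random
// and set it for subsequent HTTP requests made through the stream wrappers.
function set_random_user_agent() {
    $agents = array(
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:115.0) Gecko/20100101 Firefox/115.0",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Safari/537.36"
    );
    ini_set('user_agent', $agents[array_rand($agents)]);
}

set_random_user_agent();
$html = file_get_html($url); // this request now goes out with the random user agent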

How do I retrieve text from a doc file with PHP?

I am trying to retrieve text from a doc file using php. This is the code that I am using:
function read_doc() {
    foreach (glob("*.doc") as $filename) {
        $file_handle = fopen($filename, "r"); // open the file
        $stream_text = @fread($file_handle, filesize($filename));
        $stream_line = explode(chr(0x0D), $stream_text);
        $output_text = "";
        foreach ($stream_line as $single_line) {
            $line_pos = strpos($single_line, chr(0x00));
            if (($line_pos !== FALSE) || (strlen($single_line) == 0)) {
                $output_text .= "";
            } else {
                $output_text .= $single_line . " ";
            }
        }
        $output_text = preg_replace("/[^a-zA-Z0-9\s\,\.\-\n\r\t@\/\_\(\)]/", "", $output_text);
        echo $output_text;
    }
}
I get this result:
HYPERLINK mailtoAnother@email.us Another@email.us Y, dXiJ(x(I_TS1EZBmU/xYy5g/GMGeD3Vqq8K)fw9 xrxwrTZaGy8IjbRcXI u3KGnD1NIBs RuKV.ELM2fiVvlu8zH (W uV4(Tn 7_m-UBww_8(/0hFL)7iAs),Qg20ppf DU4p MDBJlC5 2FhsFYn3E6945Z5k8Fmw-dznZxJZp/P,)KQk5qpN8KGbe Sd17 paSR 6Q
Is there some solution which would clear this up so it returns just a string of text from the doc file?
Doc files are hard to handle with vanilla PHP.
Using https://github.com/alchemy-fr/PHP-Unoconv I did accomplish what you need. It will actually detect the different formats and produce a nice XML for you. Docs can be found here.
There are also a lot of examples on the web if you search for "unoconv" + "php".
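Under the hood this relies on the unoconv command-line tool (which in turn needs LibreOffice/OpenOffice). If you only want plain text and don't mind shelling out yourself, a minimal sketch along those lines, assuming unoconv is installed and on the PATH and with the .doc path as a placeholder, could look like this:
// Minimal sketch, assuming the unoconv CLI is installed and on the PATH.
// Converts sample.doc to sample.txt next to it, then reads the text back.
$doc = '/path/to/sample.doc';
$txt = preg_replace('/\.doc$/', '.txt', $doc);

// -f selects the output format; unoconv writes the result next to the input file.
shell_exec('unoconv -f txt ' . escapeshellarg($doc));

$text = file_exists($txt) ? file_get_contents($txt) : '';
echo $text;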
Parsing an MS Word doc is tough to do with code.
This is because MS embeds a lot of data into their format, making it look like gibberish as you echo out the parsed words/paragraphs.
I recommend you try a package library (from Packagist) to help you with this: Word-Doc-Parser.
It can be easily installed via Composer if you have it on your system.

SimpleXMLElement out of memory with more files

I built a script which uses multi cURL and extracts the source code from multiple URLs. It works fine with a few links, but if I add as many as I want, say 1000 URLs, then after about 100 parsed URLs my script crashes because it runs out of memory (RAM).
My code:
...
try {
    $xml = new SimpleXMLElement($data);
    foreach ($xml->url as $url_list) {
        $url = $url_list->loc;
        $newurls[] = $url;
        unset($xml);
    }
} catch (Exception $e) {
    echo "invalid";
}
}
My $data variable holds the source code of each URL, fetched with multi cURL.
Is there a way to clear the memory or something?
I have already tried allocating more memory in php.ini; the problem is not there.

Integrating two small PHP scripts into one

I am using the PHP Simple HTML DOM parser. I have a list of URLs (urls.txt) which I need to download as plain text. What I am trying to achieve here is to iterate over the URLs, extract the HTML/text, and write the extracted text into a text file (plain.txt) incrementally. I have written two separate scripts, but I need more insight into integrating them into a single one in order to automate the process. Thank you.
<?php
include('simple_html_dom.php');
$Handler = fopen("urls.txt", "a+");
$Urls = fgets($Handler);
while (!feof($Handler)) {
    $Urls = fgets($Handler);
    echo $Urls . "<br />\n";
}
fclose($Handler);
?>

<?php
$html = file_get_html('http://example.com')->plaintext;
$Dump = fopen("plain.txt", "a+");
fwrite($Dump, $html);
fclose($Dump);
?>
You can create a function for the second script:
function func($url) {
    $html = file_get_html($url)->plaintext;
    $Dump = fopen("plain.txt", "a+");
    fwrite($Dump, $html);
    fclose($Dump);
}
and then your first script becomes:
include('simple_html_dom.php');

$Handler = fopen("urls.txt", "r"); // reading the URL list is enough here
while (($Urls = fgets($Handler)) !== false) {
    func(trim($Urls)); // strip the trailing newline before fetching
}
fclose($Handler);
