Duplicate detection code not working - PHP

I have a fairly simple piece of code here: I add a bunch of links to the database, then check each link for a 200 OK.
<?php
function check_alive($url, $timeout = 10) {
    $ch = curl_init($url);
    // Set request options
    curl_setopt_array($ch, array(
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_NOBODY => true,
        CURLOPT_TIMEOUT => $timeout,
        CURLOPT_USERAGENT => "page-check/1.0"
    ));
    // Execute request
    curl_exec($ch);
    // Check if an error occurred
    if(curl_errno($ch)) {
        curl_close($ch);
        return false;
    }
    // Get HTTP response code
    $code = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);
    // Page is alive if 200 OK is received
    return $code === 200;
}
if (isset($_GET['cron'])) {
    // database connection
    $c = mysqli_connect("localhost", "paydayci_gsa", "", "paydayci_gsa");

    //$files = scandir('Links/');
    $files = glob("Links/*.{*}", GLOB_BRACE);
    foreach($files as $file)
    {
        $json = file_get_contents($file);
        $data = json_decode($json, true);
        if(!is_array($data)) continue;

        foreach ($data as $platform => $urls)
        {
            foreach($urls as $link)
            {
                //echo $link;
                $lnk = parse_url($link);
                $resUnique = $c->query("SELECT * FROM `links_to_check` WHERE `link_url` like '%".$lnk['host']."%'");
                // If no duplicate insert in database
                if(!$resUnique->num_rows)
                {
                    $i = $c->query("INSERT INTO `links_to_check` (link_id,link_url,link_platform) VALUES ('','".$link."','".$platform."')");
                }
            }
        }
        // at the very end delete the file
        unlink($file);
    }

    // check if the urls are alive
    $select = $c->query("SELECT * FROM `links_to_check` ORDER BY `link_id` ASC");
    while($row = $select->fetch_array()){
        $alive = check_alive($row['link_url']);
        $live = "";
        if ($alive == true)
        {
            $live = "Y";
            $lnk = parse_url($row['link_url']);
            // Check for duplicate
            $resUnique = $c->query("SELECT * FROM `links` WHERE `link_url` like '%".$row['link_url']."%'");
            echo $resUnique;
            // If no duplicate insert in database
            if(!$resUnique->num_rows)
            {
                $i = $c->query("INSERT INTO links (link_id,link_url,link_platform,link_active,link_date) VALUES ('','".$row['link_url']."','".$row['link_platform']."','".$live."',NOW())");
            }
        }
        $c->query("DELETE FROM `links_to_check` WHERE link_id = '".$row['link_id']."'");
    }
}
?>
I'm trying not to add duplicate URLs to the database, but they are still getting in. Have I missed something obvious in my code? I have looked over it a few times and can't see anything standing out.

If you are trying to enforce unique values in a database, you should rely on the database itself to enforce that constraint. You can add an index (assuming you are using MySQL or a variant, which is what the syntax appears to be) like this:
ALTER TABLE `links` ADD UNIQUE INDEX `idx_link_url` (`link_url`);
One thing to be aware of is leading/trailing whitespace, so use trim() on the values. You should also strip trailing slashes with rtrim() so every URL is stored in a consistent form and you don't get duplicates.
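As a rough sketch of both ideas together (the table/column names are taken from your query, and link_id is assumed to be AUTO_INCREMENT), you can normalize each URL and insert it with a prepared statement, letting the unique index reject anything already stored:
<?php
// Minimal sketch, not drop-in code: table/column names come from the question,
// and link_id is assumed to be AUTO_INCREMENT so it can be omitted here.
mysqli_report(MYSQLI_REPORT_OFF); // make execute() return false instead of throwing
$c = mysqli_connect("localhost", "paydayci_gsa", "", "paydayci_gsa");

// Trim whitespace and strip trailing slashes so the same page is always
// stored in exactly one form.
function normalize_url($url) {
    return rtrim(trim($url), '/');
}

$url      = normalize_url(" http://example.com/some-page/ ");
$platform = "example-platform";

$stmt = $c->prepare(
    "INSERT INTO `links` (link_url, link_platform, link_active, link_date)
     VALUES (?, ?, 'Y', NOW())"
);
$stmt->bind_param("ss", $url, $platform);

if (!$stmt->execute()) {
    // Error 1062 = duplicate entry: the UNIQUE index on link_url rejected the
    // row, which is exactly what we want for a duplicate URL.
    if ($stmt->errno !== 1062) {
        die("Insert failed: " . $stmt->error);
    }
}
With the unique index in place you could also use INSERT IGNORE or INSERT ... ON DUPLICATE KEY UPDATE instead of checking the error code.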

Related

Foreach loop inside while loop - it never ends

So, I have one curl API call which works fine when I do the foreach outside the while loop. Once I move the foreach inside (because I need the values inside), it becomes an infinite loop.
This is the setup
$query = "SELECT id, vote FROM `administrators` WHERE type = 'approved'";
$result = $DB->query($query);
$offset = 0;
$length = 5000;
$ch = curl_init();
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
do {
curl_setopt($ch, CURLOPT_URL, "https://api.gov/data?api_key=xxxxxxxxxx&start=1960&sort[0][direction]=desc&offset=$offset&length=$length");
$jsonData = curl_exec($ch);
$response = json_decode($jsonData);
foreach($response->response->data as $finalData){
$allData[] = $finalData;
}
$offset += count($response->response->data);
} while ( count($response->response->data) > 0 );
curl_close($ch);
while($row = $DB->fetch_object($result)) {
foreach ( $allData as $key => $finalData1 ) {
// rest of the code
}
}
Once I run the page it loops forever, or until my browser crashes. If I move foreach ($allData as $key => $finalData1) { } outside the while() {} there is no such problem.
Any ideas on what the problem could be here?
UPDATE: here is the // rest of the code
$dataValue = str_replace(array("--","(s)","NA"), "NULL", $finalData1->value);

if($frequency == "dayly") {
    if($dataValue) {
        $query = "UPDATE table SET $data_field = $dataValue WHERE year = $finalData1->period AND id = $row->id LIMIT 1";
    }
}

if(isset($query))
    $DB->query($query);
unset($query);
One of the issues could be that, inside // rest of the code, you have duplicate variable names, thus overwriting the current positions of your arrays and loops.
However, you should change your approach to something like
$rows = Array();
while($row = $DB->fetch_object($result)) $rows[] = $row;

foreach ($rows as $row) {
    foreach ($allData as $key => $finalData1) {
        // rest of the code
    }
}
That way you read the whole result set from the database up front and can free it before you continue.
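For example, with plain mysqli (the $DB wrapper in your code may expose different method names, so this is just an assumption), buffering the rows first looks roughly like this:
// Sketch only: assumes a plain mysqli connection in $mysqli, and that
// $allData was already filled by the earlier do/while cURL loop.
$result = $mysqli->query("SELECT id, vote FROM `administrators` WHERE type = 'approved'");

// Buffer every row into a plain PHP array, then release the result set so
// the queries issued inside the loop can't interfere with an open cursor.
$rows = $result->fetch_all(MYSQLI_ASSOC);
$result->free();

foreach ($rows as $row) {
    foreach ($allData as $finalData1) {
        // ... build and run the UPDATE for $row['id'] / $finalData1 here ...
    }
}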

Optimize PHP MySQL fetch data from API and update table

I have a table in MySQL with approx. 20 million rows.
id | word_eng | word_indic
I have to translate the English word (word_eng) into an Indian language (word_indic) using the Google Translate API.
I have written PHP code which spawns multiple curl requests, fetches data from the API, and updates the table. But this process is quite slow, about 100 to 200 words per second.
I am using RollingCurl for the multi-curl requests.
What's the best way to make it as fast as possible?
Below is my code. I am running this as a cron job.
<?php
include_once('db.php');
include_once('functions.php');
include_once('rolling-curl-master/RollingCurl.php');

$table = $argv[1];

$q = "SELECT * from $table where word_indic is null limit 500000";
$result = $conn->query($q); $n = 0;

$urls = array();
while ($row = $result->fetch_assoc())
{
    $id = $row['id'];
    $word = rawurlencode(getName($row['name_eng']));
    //getName is a simple function which does some trimming and cleaning up of string
    $url = 'https://www.google.com/inputtools/request?text='.rawurlencode($word).'&ime=transliteration_en_te&id='.rawurlencode($id);
    array_push($urls, $url);
}
//print_r($urls);
unset($url);

$rc = new RollingCurl("request_callback");
// the window size determines how many simultaneous requests to allow.
$rc->window_size = 300;

foreach ($urls as $url)
{
    // add each request to the RollingCurl object
    $request = new RollingCurlRequest($url);
    $rc->add($request);
}
$rc->execute();

function request_callback($response, $info)
{
    // parse the page title out of the returned HTML
    if (preg_match("~<title>(.*?)</title>~i", $response, $out)) {
        $title = $out[1];
    }
    //echo "<b>$title</b><br />";
    //print_r($info);

    $parts = parse_url($info['url']);
    parse_str($parts['query'], $query);
    $id = $query['id'];
    $text = $query['text'];
    //echo "<hr>";

    $trans = json_decode($response)[1][0][1][0];

    global $conn; global $table; global $urls; global $n;

    if ($trans != '' and !preg_match('/[a-z]/', $trans))
    {
        $conn->query("update $table set word_indic='$trans' where id='$id'"); $n++;
    }
}
?>

How to do a continuous mysql insert using foreach?

I am trying to insert multiple URLs (url_array) into the database.
First, I need to get some information from the URLs (URL title, images, description). I am doing this using SimpleHtmlDom.
So I am placing this in a foreach loop. My hope is that it inserts after each iteration until the end of the array.
If there is a bad URL, it should skip to the next URL in the array.
I also want that at the end (the last iteration) a JSON success message is passed back through jQuery.
With the code I have below, it sometimes inserts only 2 or 3 URLs even though there are still more URLs in the array.
Here is my code:
$txturls = $_POST['bulkurls'];
$urlsArray = array_map('trim', explode(',', $txturls));
//var_dump($urlsArray);

$i = 0;
$len = count($urlsArray);

foreach($urlsArray as $url){
    $curl = curl_init($url);
    curl_setopt($curl, CURLOPT_NOBODY, true);
    $result = curl_exec($curl);

    if ($result !== false)
    {
        $statusCode = curl_getinfo($curl, CURLINFO_HTTP_CODE);
        if ($statusCode == 404)
        {
            $checkExist = "URL Not Exists";
        }
        else
        {
            $checkExist = "URL Exists";
        }
    }
    else
    {
        $checkExist = "URL not Exists";
    }

    if($checkExist === "URL Exists"){
        $html = SimpleHtmlDom::file_get_html($url);

        foreach($html->find('title') as $element)
        {
            $urltitle = $element->plaintext;
        }

        $tags = get_meta_tags($url);
        $description = $tags['description'];
        if(strlen($description) < 1){$description = $urltitle .' '.'check title';};

        $images = array();
        foreach($html->find('meta[property=og:image]') as $element) {
            if(!preg_match('/blank.(.*)/i', $element->content) && filter_var($element->content, FILTER_VALIDATE_URL))
            {
                $images[] = url_to_absolute($linkurl, $element->content);
            }
        }

        foreach ($images as $ext){
            //$imagesize = getimagesize(''.$ext.'');
            if (pathinfo($ext, PATHINFO_EXTENSION)) {
                $image = "<img src=\"$ext\" alt=\"$urltitle\">";
                break;
            }
        }

        $linkurl = $url;

        $checkIfUrlExist = $this->_modellinks->checkUrlExist($linkurl);
        if($checkIfUrlExist == false){
            $this->_modellinks->addBulkLinks($data['log_username'], $ipaddress, $linkurl, $urltitle, $subafrolinks, $keywords, $description, $image);
        }
    } else {
        continue;
    }

    if ($i == $len - 1) {
        // last
        $output = json_encode(array('type'=>'message', 'text' => 'Thanks for your submission! Your URL has been submitted successfully And its under review . <strong>You will be notified when your post is published online. This usually takes less than 1 hour.<br> Here is your preview link'));
        die($output);
    }
    $i++;
}
Since it is inserting 2-3 URLs, your insert query as well as your foreach loop are clearly working.
When you say that there are still more URLs left in the array, are you talking about $urlsArray?
Have you checked whether the flag $checkExist is set to 'URL Exists' for the remaining URLs?
To do that, try pushing the URLs where $checkExist == 'URL Exists' into a new array, $arGoodUrl, and then work on that. You'll get a better idea then.
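A minimal sketch of that idea (the $arGoodUrl name is just illustrative, and the status check is simplified to "anything that is not an error or a 404"):
// Sketch: first collect only the URLs that actually respond into $arGoodUrl,
// then run the scraping/insert loop over that array instead of $urlsArray.
$arGoodUrl = array();

foreach ($urlsArray as $url) {
    $curl = curl_init($url);
    curl_setopt($curl, CURLOPT_NOBODY, true);
    $result = curl_exec($curl);
    $statusCode = curl_getinfo($curl, CURLINFO_HTTP_CODE);
    curl_close($curl);

    if ($result !== false && $statusCode != 404) {
        $arGoodUrl[] = $url;
    }
}

var_dump($arGoodUrl); // compare against $urlsArray to see which URLs were dropped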

Pulling NHL Standings from XML Table with PHP

I'm working on a project in which I pull various statistics about the NHL and insert them into an SQL table. Presently, I'm working on the scraping phase: I have found an XML parser that I've implemented, but I cannot for the life of me figure out how to pull information from it. The table can be found here -> http://www.tsn.ca/datafiles/XML/NHL/standings.xml.
The parser supposedly generates a multi-dimensional array, and I'm simply trying to pull all the stats from the "info-teams" section, but I have no idea how to pull that information from the array. How would I go about pulling the number of wins Montreal has? (Solely as an example for the rest of the stats.)
This is what the page currently looks like -> http://mattegener.me/school/standings.php
Here's the code:
<?php
$strYourXML = "http://www.tsn.ca/datafiles/XML/NHL/standings.xml";

$fh = fopen($strYourXML, 'r');
$dummy = fgets($fh);
$contents = '';
while ($line = fgets($fh)) $contents .= $line;
fclose($fh);

$objXML = new xml2Array();
$arrOutput = $objXML->parse($contents);
print_r($arrOutput[0]); //This print outs the array.

class xml2Array {

    var $arrOutput = array();
    var $resParser;
    var $strXmlData;

    function parse($strInputXML) {
        $this->resParser = xml_parser_create();
        xml_set_object($this->resParser, $this);
        xml_set_element_handler($this->resParser, "tagOpen", "tagClosed");
        xml_set_character_data_handler($this->resParser, "tagData");

        $this->strXmlData = xml_parse($this->resParser, $strInputXML);
        if(!$this->strXmlData) {
            die(sprintf("XML error: %s at line %d",
                xml_error_string(xml_get_error_code($this->resParser)),
                xml_get_current_line_number($this->resParser)));
        }

        xml_parser_free($this->resParser);
        return $this->arrOutput;
    }

    function tagOpen($parser, $name, $attrs) {
        $tag = array("name" => $name, "attrs" => $attrs);
        array_push($this->arrOutput, $tag);
    }

    function tagData($parser, $tagData) {
        if(trim($tagData)) {
            if(isset($this->arrOutput[count($this->arrOutput)-1]['tagData'])) {
                $this->arrOutput[count($this->arrOutput)-1]['tagData'] .= $tagData;
            }
            else {
                $this->arrOutput[count($this->arrOutput)-1]['tagData'] = $tagData;
            }
        }
    }

    function tagClosed($parser, $name) {
        $this->arrOutput[count($this->arrOutput)-2]['children'][] = $this->arrOutput[count($this->arrOutput)-1];
        array_pop($this->arrOutput);
    }
}
?>
Add this search function to your class and play with this code:
$objXML = new xml2Array();
$arrOutput = $objXML->parse($contents);

// first param is always 0
// second is 'children' unless you need info like last updated date
// third is which statistics category you want, for example
// 6 => the array you want that has wins and losses
print_r($arrOutput[0]['children'][6]);

// using the search function: if key NAME is 'Montreal' anywhere in the array,
// the result will be Montreal's array
$search_result = $objXML->search($arrOutput, 'NAME', 'Montreal');

// first param is always 0
// second is the key name
echo $search_result[0]['WINS'];

function search($array, $key, $value)
{
    $results = array();
    if (is_array($array))
    {
        if (isset($array[$key]) && $array[$key] == $value)
            $results[] = $array;

        foreach ($array as $subarray)
            $results = array_merge($results, $this->search($subarray, $key, $value));
    }
    return $results;
}
Beware: this search function is case sensitive, so it needs modifications such as a fuzzy/percentage match on the key or value; changing the capital M in Montreal to lowercase will return an empty result.
Here is the code I sent you working in action, pulling the data from the same link you are using:
http://sjsharktank.com/standings.php
I have actually used the exact same XML file for my own school project. I used DOMDocument. The foreach loop gets the value of each attribute of team-standing and stores the values. The code clears the contents of the standings table and then re-inserts the data. I guess you could do an update statement, but this assumes you never did any other data entry into the table.
try {
    $db = new PDO('sqlite:../../SharksDB/SharksDB');
    $db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
} catch (Exception $e) {
    echo "Error: Could not connect to database. Please try again later.";
    exit;
}

$query = "DELETE FROM standings";
$result = $db->query($query);

$xmlDoc = new DOMDocument();
$xmlDoc->load('http://www.tsn.ca/datafiles/XML/NHL/standings.xml');
$searchNode = $xmlDoc->getElementsByTagName("team-standing");

foreach ($searchNode as $searchNode) {
    $teamID = $searchNode->getAttribute('id');
    $name = $searchNode->getAttribute('name');
    $wins = $searchNode->getAttribute('wins');
    $losses = $searchNode->getAttribute('losses');
    $ot = $searchNode->getAttribute('overtime');
    $points = $searchNode->getAttribute('points');
    $goalsFor = $searchNode->getAttribute('goalsFor');
    $goalsAgainst = $searchNode->getAttribute('goalsAgainst');
    $confID = $searchNode->getAttribute('conf-id');
    $divID = $searchNode->getAttribute('division-id');

    $query = "INSERT INTO standings ('teamid','confid','divid','name','wins','losses','otl','pts','gf','ga')
              VALUES ('$teamID','$confID','$divID','$name','$wins','$losses','$ot','$points','$goalsFor','$goalsAgainst')";
    $result = $db->query($query);
}

How many URLs can I download at one time using cURL?

I've tested this cURL code to download multiple pages simultaneously. But I want to know what the maximum permissible limit is, if any, for simultaneous downloads:
<?php
class Footo_Content_Retrieve_HTTP_CURLParallel
{
    /**
     * Fetch a collection of URLs in parallel using cURL. The results are
     * returned as an associative array, with the URLs as the key and the
     * content of the URLs as the value.
     *
     * @param array<string> $addresses An array of URLs to fetch.
     * @return array<string> The content of each URL that we've been asked to fetch.
     **/
    public function retrieve($addresses)
    {
        $multiHandle = curl_multi_init();
        $handles = array();
        $results = array();

        foreach($addresses as $url)
        {
            $handle = curl_init($url);
            $handles[$url] = $handle;

            curl_setopt_array($handle, array(
                CURLOPT_HEADER => false,
                CURLOPT_RETURNTRANSFER => true,
            ));

            curl_multi_add_handle($multiHandle, $handle);
        }

        // execute the handles
        $result = CURLM_CALL_MULTI_PERFORM;
        $running = false;

        // set up and make any requests..
        while ($result == CURLM_CALL_MULTI_PERFORM)
        {
            $result = curl_multi_exec($multiHandle, $running);
        }

        // wait until data arrives on all sockets
        while($running && ($result == CURLM_OK))
        {
            if (curl_multi_select($multiHandle) > -1)
            {
                $result = CURLM_CALL_MULTI_PERFORM;

                // while we need to process sockets
                while ($result == CURLM_CALL_MULTI_PERFORM)
                {
                    $result = curl_multi_exec($multiHandle, $running);
                }
            }
        }

        // clean up
        foreach($handles as $url => $handle)
        {
            $results[$url] = curl_multi_getcontent($handle);
            curl_multi_remove_handle($multiHandle, $handle);
            curl_close($handle);
        }

        curl_multi_close($multiHandle);

        return $results;
    }
}
Original source:
http://css.dzone.com/articles/retrieving-urls-parallel-curl
There is no hard limit, but you must consider your server's internet connection, bandwidth, memory, CPU, and so on.
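If you want to cap resource usage anyway, one simple approach is to feed the URLs to the retriever in fixed-size batches. A minimal sketch using the class above (the batch size of 50 is just an illustrative value to tune):
<?php
// Sketch: process the URL list in chunks so only a limited number of
// transfers are in flight at once. Assumes the class above is already loaded.
$retriever = new Footo_Content_Retrieve_HTTP_CURLParallel();

$urls = array(/* ... hundreds or thousands of URLs ... */);
$results = array();

// 50 concurrent transfers per batch is an arbitrary example value;
// adjust it against your bandwidth, memory and CPU.
foreach (array_chunk($urls, 50) as $batch) {
    // array union keeps the URL => content mapping from each batch
    $results += $retriever->retrieve($batch);
}

// $results now maps each URL to its downloaded body, same as a single call.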
