PHP crawling data from website - php

I am trying to crawl a lot of data from a website, but I am struggling with it. The site has an a-z index and a 1-20 page index, so my script has a bunch of nested loops and DOM handling. On the first run it crawled and saved about 10,000 rows, but now I am at around 15,000 and it only adds around 100 per run.
That is probably because it has to skip the rows it has already inserted (I added a check for that). I can't think of an easy way to skip whole pages, as the page count varies a lot (one letter has 18 pages, another only 2).
Originally I checked the database for a record with the given ID and inserted the row if there was none. I assumed that would be slow, so now I retrieve all rows before the script starts and check with an in_array(), assuming that's faster. But it just won't work.
So my crawler is navigating 26 letters, up to 20 pages per letter, and then up to 50 links per page, which is on the order of 26 x 20 x 50 = 26,000 page loads in the worst case.
I thought of running it letter by letter, but that won't really work, as I am still stuck at "a" and can't just hop to "b" without missing records from "a".
I hope I have explained the problem well enough for someone to help me. My code looks roughly like this (I have removed some parts, but the important pieces should all be here):
function in_array_r($needle, $haystack, $strict = false) {
foreach ($haystack as $item) {
if (($strict ? $item === $needle : $item == $needle) || (is_array($item) && in_array_r($needle, $item, $strict))) {
return true;
}
}
return false;
}
/* CONNECT TO DB */
mysql_connect()......
$qry = mysql_query("SELECT uid FROM tableName");
$all = array();
while ($row = mysql_fetch_array($qru)) {
$all[] = $row;
} // Retrieving all the current database rows to compare later
foreach (range("a", "z") as $key) {
for ($i = 1; $i < 20; $i++) {
$dom = new DomDocument();
$dom->loadHTMLFile("http://www.crawleddomain.com/".$i."/".$key.".htm");
$finder = new DomXPath($dom);
$classname="table-striped";
$nodes = $finder->query("//*[contains(concat(' ', normalize-space(@class), ' '), ' $classname ')]");
foreach ($nodes as $node) {
$rows = $finder->query("//a[contains(@href, '/value')]", $node);
foreach ($rows as $row) {
$url = $row->getAttribute("href");
$dom2 = new DomDocument();
$dom2->loadHTMLFile("http://www.crawleddomain.com".$url);
$finder2 = new DomXPath($dom2);
$classname2="table-striped";
$nodes2 = $finder2->query("//*[contains(concat(' ', normalize-space(@class), ' '), ' $classname2 ')]");
foreach ($nodes2 as $node2) {
$rows2 = $finder2->query("//a[contains(@href, '/loremipsum')]", $node2);
foreach ($rows2 as $row2) {
$dom3 = new DomDocument();
//
// not so important variable declarations..
//
$dom3->loadHTMLFile("http://www.crawleddomain.com".$url);
$finder3 = new DomXPath($dom3);
//2 $finder3->query() right here
$query231 = mysql_query("SELECT id FROM tableName WHERE uid='$uid'");
$result = mysql_fetch_assoc($query231);
//Doing this to get category ID from another table, to insert with this row..
$id = $result['id'];
if (!in_array_r($uid, $all)) { // if not exist
mysql_query("INSERT INTO')"); // insert the whole bunch
}
}
}
}
}
}
}

$uid is never defined in the code you posted. Also, this query makes no sense:
mysql_query("INSERT INTO')");
You should turn on error reporting:
ini_set('display_errors',1);
error_reporting(E_ALL);
After each query, append or die(mysql_error()); so that failures are not silent.
Also, I might as well say it, because if I don't someone else will: don't use the mysql_* functions. They are deprecated and were removed entirely in PHP 7. Try PDO or mysqli.
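One more thing that may be worth checking: the fetch loop in the question reads from $qru while the query result is stored in $qry, so $all probably stays empty and the "already inserted" check never matches. Separately, the recursive in_array_r() scan gets slower with every stored row. A minimal sketch of a keyed lookup follows; the uid values and the $existingUids / $crawled arrays are made-up stand-ins for the database column and the crawl results, not part of the original script.

```php
<?php
// Stand-in for the uid column already stored in the database (made-up values).
$existingUids = ['a1', 'a2', 'b7'];

// Flip the list so every uid becomes an array key; isset() on a key is a
// hash lookup and does not scan the whole array the way in_array() does.
$seen = array_flip($existingUids);

// Stand-in for the uids found during one crawl run (made-up values).
$crawled = ['a2', 'c3', 'a1', 'c3', 'd4'];

$toInsert = [];
foreach ($crawled as $uid) {
    if (isset($seen[$uid])) {
        continue;          // already stored, or already queued in this run
    }
    $seen[$uid] = true;    // remember it so repeats within this run are skipped
    $toInsert[] = $uid;    // a real crawler would run its INSERT here
}

print_r($toInsert);
```

If the uid column also has a UNIQUE index, the database can reject duplicates on its own, which would make the in-memory check an optimization rather than the only safeguard.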

Related

I don't know what to do with the probability draw

I am currently giving a score to one or more people and then drawing more than one person based on that score. However, I keep getting a runtime error that I cannot resolve. Please let me know how to solve this problem.
This is my code:
$selectChildrens = array();
for($i=0;$i<$recuTotal;$i++){
$random = rand(0,sizeof($childSelectArray)-1);
$selectChild = $childSelectArray[$random];
$sameCheck = 0;
if(sizeof($selectChildrens) == 0){
array_push($selectChildrens,$selectChild);
while(($key = array_search($selectChild,$childSelectArray)) != NULL){
unset($childSelectArray[$key]);
}
$recuTotal—;
$i=0;
}else{
array_push($selectChildrens,$selectChild);
while(($key2 = array_search($selectChild,$childSelectArray)) != NULL){
unset($childSelectArray[$key2]);
}
$recuTotal—;
$i=0;
}
}
Your attempt to decrement $recuTotal does not use valid syntax.
Use this line instead: $recuTotal--;
You are using a long dash, but the decrement operator is two hyphens.
As for your array_search() lines, compare with !== false. It does matter here: array_search() returns the key of the match, and key 0 is loosely equal to NULL, so with != NULL the loop stops without removing the first element.
Lastly, you can use !sizeof($selectChildrens) as a shorter if condition.
Though I'll admit I didn't fully comb through your code to see what it is doing, this is a DRYer code block that will perform the same way:
$selectChildrens = array();
for ($i = 0; $i < $recuTotal; $i++) {
    $random = rand(0, sizeof($childSelectArray) - 1);
    $selectChild = $childSelectArray[$random];
    $sameCheck = 0;
    array_push($selectChildrens, $selectChild);
    while (($key = array_search($selectChild, $childSelectArray)) !== false) {
        unset($childSelectArray[$key]);
    }
    $recuTotal--;
    $i = 0;
}
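A possible source of the remaining runtime error, offered as a guess: unset() leaves gaps in the array keys, so rand(0, sizeof(...) - 1) can land on an index that no longer exists. The sketch below reindexes the pool with array_values() before every draw. The drawDistinct() helper and the sample pool are hypothetical and only illustrate the idea of a score-weighted draw in which each person appears in the pool once per point.

```php
<?php
// Hypothetical helper: draw up to $count distinct values from a pool in which
// a value may appear several times (more copies = higher chance of selection).
function drawDistinct(array $pool, int $count): array
{
    $picked = [];
    while ($count > 0 && count($pool) > 0) {
        $pool = array_values($pool);                      // close gaps in the keys
        $choice = $pool[mt_rand(0, count($pool) - 1)];    // every index now exists
        $picked[] = $choice;
        // Remove every copy of the chosen value so it cannot be drawn again.
        $pool = array_filter($pool, function ($value) use ($choice) {
            return $value !== $choice;
        });
        $count--;
    }
    return $picked;
}

// Made-up pool: 'ann' has three entries, 'bob' and 'cid' one each.
$winners = drawDistinct(['ann', 'ann', 'ann', 'bob', 'cid'], 2);
print_r($winners);
```

The loop ends either when enough people were drawn or when the pool runs dry, so asking for more winners than there are distinct people simply returns everyone.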

Function Errors in PHP Script

I've been wrestling with a really cool script someone gave me, trying to adapt it to my site. I'm getting closer, but I'm still getting two errors that have me puzzled.
First: Warning: Invalid argument supplied for foreach()...
This is the foreach statement:
foreach ($Topic as $Topics)
It follows a function:
function generate_menu_items($PopTax, $Topics, $Current_Topic)
I THINK the problem relates to the middle value in the function - $Topics. I don't understand how it's derived. My guess is it's supposed to be an array of all the possible topics (represented by $MyTopic in my database). But I'm not that familiar with functions, and I don't understand why he put the function and foreach BEFORE the database queries. (However, there is a more general DB query that establishes some of these values higher up the food chain.)
Here's the second problem: Fatal error: Call to undefined function render_menu()...
Can anyone tell me how and where I should define this function?
Let me briefly explain what this script is all about. First imagine these URL's:
MySite/topics/animal
MySite/topics/animal-homes
MySite/topics/animal-ecology
MySite/topics/mammal-ecology
MySite/topics/bird-ecology
Two key values are associated with each URL: $PopTax (popular name) and $MyTopic. For the first three URL's, $PopTax = Animal, while the other two are Mammal and Bird. $MyTopic = Ecology for the last three rows. For the first two, $MyTopic = Introduction and Homes.
The ID's for both values (Tax_ID and Topic_ID) are simply the first three letters of the name (e.g. Mam = Mammal, Eco = Ecology). Also, Life is the Parent of Animal, which is the Parent of Vertebrate, which is the Parent of Mammal.
Now I'm just trying to pull it all together to create a little index in the sidebar. So if you visit MySite/topics/animal-ecology, you'd see a list of ALL the animal topics in the sidebar...
Animals
Animal Classification
Animal Homes
Animal Ecology
As you can see there are some case and plural differences (animal vs Animals), though I don't think that really relates to the problems I'm having right now.
But I'm not sure if my code just needs to be tweaked or if there's something grotesquely wrong with it. Something doesn't look right to me. Thanks for any tips.
$Tax_ID = 'Mam'; // Mam represents Mammal
$Current_Topic = 'Homes';
function generate_menu_items($PopTax, $Topics, $Current_Topic)
{
$menu_items = array();
foreach ($Topic as $Topics)
{
$url = "/topics/$PopTax[PopTax]-$Topic[MyTopic]";
$title = "$PopTax[PopTax] $Topic[MyTopic]";
$text = $Topic['MyTopic'];
if ($Topic === 'People') {
$url = "$PopTax[PopTax]-and-$Topic[MyTopic]";
$title = "$PopTax[PopTax] and $Topic[MyTopic]";
$text = "$PopTax[PopTax] & $Topic[MyTopic]";
}
if ($Topic === 'Movement' && $PopTax['Parent'] == 'Ver' && $PopTax['PopTax'] != 'Human') {
$url = "$PopTax[PopTax]-locomotion";
$title = "$PopTax[PopTax] Locomotion";
$text = "Locomotion";
}
$menu_items[] = array(
'url' => strtolower($url),
'title' => ucwords($title),
'text' => ucfirst($text),
'active' => ($Topic['MyTopic'] === $Current_Topic)
);
}
return $menu_items;
}
function generate_menu_html($menu_items)
{
$list_items = array();
foreach ($menu_items as $item)
{
if ($item['active']) {
$list_items[] = "<li><span class=\"active\">$item[text]</b></span></li>";
} else {
$list_items[] = "<li>$item[text]</li>";
}
}
return '<ol>' . implode("\n", $list_items) . '</ol>';
}
$stm = $pdo->prepare("SELECT T.Topic_ID, T.MyTopic
FROM gz_topics T
JOIN gz_topics_poptax TP ON TP.Topic_ID = T.Topic_ID
WHERE TP.Tax_ID = :Tax_ID");
$stm->execute(array('Tax_ID' => $Tax_ID));
// Fetch all rows (topics) as an associative array
$Topics = $stm->fetchAll(PDO::FETCH_ASSOC);
// Get the DB row for the taxon we're dealing with
$stm = $pdo->prepare("SELECT Tax.ID, Tax.PopTax, Tax.Parent
FROM gz_poptax Tax
WHERE Tax.ID = :Tax_ID");
$stm->execute(array('Tax_ID' => $Tax_ID));
// Fetch a single row, as the query should only return one row anyway
$PopTax = $stm->fetch(PDO::FETCH_ASSOC);
// Call our custom functions to generate the menu items, and render them as a HTML list
$menu_items = generate_menu_items($PopTax, $Topics, $Current_Topic);
$menu_html = render_menu($menu_items);
// Output the list to screen
echo $menu_html;
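You want foreach ($Topics as $Topic). The collection comes first and the loop variable second, so read it as "for each $Topic in $Topics". As written, $Topic is undefined inside the function, which is why PHP reports an invalid argument for foreach().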
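As for the second error, the posted script defines generate_menu_html() but calls render_menu(), so calling the function by the name it was defined under should resolve it. Below is a trimmed sketch of both functions with the loop order corrected. The $PopTax and $Topics rows are made-up stand-ins for the PDO results, and the special cases for People and Movement are left out for brevity. Note that those special cases compare $Topic (an array) with a string; they would probably need $Topic['MyTopic'] instead.

```php
<?php
// Made-up rows standing in for the two PDO queries.
$PopTax = ['PopTax' => 'Animal', 'Parent' => 'Lif'];
$Topics = [['MyTopic' => 'Homes'], ['MyTopic' => 'Ecology']];

function generate_menu_items(array $PopTax, array $Topics, string $Current_Topic): array
{
    $menu_items = [];
    foreach ($Topics as $Topic) {      // collection first, loop variable second
        $name = $Topic['MyTopic'];
        $menu_items[] = [
            'url'    => strtolower("/topics/{$PopTax['PopTax']}-$name"),
            'title'  => ucwords("{$PopTax['PopTax']} $name"),
            'text'   => ucfirst($name),
            'active' => ($name === $Current_Topic),
        ];
    }
    return $menu_items;
}

function generate_menu_html(array $menu_items): string
{
    $list_items = [];
    foreach ($menu_items as $item) {
        $text = htmlspecialchars($item['text']);
        $list_items[] = $item['active']
            ? "<li><span class=\"active\">$text</span></li>"
            : "<li>$text</li>";
    }
    return '<ol>' . implode("\n", $list_items) . '</ol>';
}

$menu_items = generate_menu_items($PopTax, $Topics, 'Homes');
$menu_html  = generate_menu_html($menu_items);   // same name as the definition
echo $menu_html;
```

Functions declared at the top level of a file can be called before or after their definition, which is why the original author could place them above the database queries.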

php sql find and insert in empty slot

I have a game script set up, and when it creates a new character I want it to find an empty address for that player's house.
The two relevant table fields it inserts are 'city' and 'number'. The 'city' is a random number out of 10, and the 'number' can be 1-250.
What it needs to do though is make sure there's not already an entry with the 2 random numbers it finds in the 'HOUSES' table, and if there is, then change the numbers. Repeat until it finds an 'address' not in use, then insert it.
I have a method set up to do this, but I know it's shoddy- there's probably some more logical and easier way. Any ideas?
UPDATE
Here's my current code:
$found = 0;
while ($found == 0) {
$num = (rand()%250)+1; $city = (rand()%10)+1;
$sql_result2 = mysql_query("SELECT * FROM houses WHERE city='$city' AND number='$num'", $db);
if (mysql_num_rows($sql_result2) == 0) { $found = 1; }
}
You can either do this in PHP as you do or by using a MySQL trigger.
If you stick to the PHP way, then instead of generating a number every time, do something like this
$found = 0;
$cityarr = array();
$numberarr = array();
// create the cityarr
for ($i = 1; $i <= 10; $i++)
    $cityarr[] = $i;
// create the numberarr
for ($i = 1; $i <= 250; $i++)
    $numberarr[] = $i;
// shuffle the arrays
shuffle($cityarr);
shuffle($numberarr);
// iterate until you find an unused one
foreach ($cityarr as $city) {
    foreach ($numberarr as $num) {
        $sql_result2 = mysql_query("SELECT * FROM houses
            WHERE city='$city' AND number='$num'", $db);
        if (mysql_num_rows($sql_result2) == 0) {
            $found = 1;
            break;
        }
    }
    if ($found) break;
}
This way you don't check the same value more than once, and you still check in random order.
But you should really consider fetching all your records before the loops, so you only run one query. That would also improve performance a lot.
For example:
$found = 0;
$taken = array();
$cityarr = array();
$numberarr = array();
for ($i = 1; $i <= 10; $i++) {
    $taken[$i] = array();
    $cityarr[] = $i;
}
for ($i = 1; $i <= 250; $i++)
    $numberarr[] = $i;
// one query: remember which numbers are taken in each city
$records = mysql_query("SELECT city, number FROM houses", $db);
while ($rec = mysql_fetch_assoc($records)) {
    $taken[$rec['city']][] = $rec['number'];
}
// shuffle so the free address is still picked at random
shuffle($cityarr);
shuffle($numberarr);
foreach ($cityarr as $city) {
    foreach ($numberarr as $num) {
        if (!in_array($num, $taken[$city])) { // this number is not taken in this city
            $cityNotTaken = $city;
            $numberNotTaken = $num;
            $found = 1;
            break;
        }
    }
    if ($found) break;
}
echo 'City ' . $cityNotTaken . ' number ' . $numberNotTaken . ' is not taken!';
I would go with this method :-)
Doing it the way you describe can cause problems when only a few houses (or even just one) are left. It could take ages for the script to find an empty house.
What I recommend is inserting all 2500 records into the database (every combination of city 1-10 with number 1-250) and marking each one as empty or not (or creating a link table between user and house) and matching on that.
With MySQL you can then select a random empty entry from the database in no time!
Because it's only 2500 records, you can use ORDER BY RAND() LIMIT 1 to get a random row. I don't recommend this when you have many more records.
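The same "choose among the free slots only" idea can also be sketched in plain PHP. The pickFreeAddress() helper and the "city-number" string format are assumptions made for illustration; in a real script the taken list would come from the houses table.

```php
<?php
// Hypothetical helper: return a random free [city, number] pair, or null when
// every address is in use. $taken holds "city-number" strings such as "3-17".
function pickFreeAddress(array $taken, int $cities = 10, int $numbers = 250): ?array
{
    $takenSet = array_flip($taken);            // keyed lookup instead of in_array()
    $free = [];
    for ($city = 1; $city <= $cities; $city++) {
        for ($num = 1; $num <= $numbers; $num++) {
            if (!isset($takenSet["$city-$num"])) {
                $free[] = [$city, $num];
            }
        }
    }
    if (count($free) === 0) {
        return null;                           // the whole map is occupied
    }
    return $free[array_rand($free)];           // uniform choice among free slots
}

// Made-up occupied addresses.
$address = pickFreeAddress(['1-1', '4-200', '10-250']);
echo 'City ' . $address[0] . ' number ' . $address[1] . ' is free';
```

Because only free slots are candidates, the running time stays the same whether one house or two thousand are left. Two players created at the same moment could still be handed the same slot, so a unique index on (city, number) remains a sensible safety net.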

PHP Script Optimization - MySql Search and Sort

I have developed a fairly simple script that searches a database and then sorts the results based on the search terms used, trying to put the most relevant first.
This ran fine on my local machine, and before I added the sorting it also ran okay on the web server I have hired, but once the sorting went in, search times on the web server increased greatly.
What I'm posting below I have optimized as much as I know how, so I'm looking for help with a better sort algorithm and maybe even a better way of querying the database; anything to help speed up sort times!
Some information about what I'm working with: I need to allow searches of 3 letters or more, for example cat or car, and I can't change the minimum word length for natural language search on the MySQL server, so I can't use MySQL's natural language searching. That is why I am running the queries I currently have.
Also, an average search can easily return anywhere between 100 and 15000 results, with the databases holding around 20000 entries.
Any help will be greatly appreciated.
<?php
require_once 'config.php';
$bRingtone = true;
$aSearchStrings = $_POST["searchStrings"];
$cConnection = new mysqli($dbhost, $dbuser, $dbpass, $dbname);
if (mysqli_connect_errno())
{
exit();
}
$sTables = array("natural", "artificial", "musical", "created");
$aQueries = array();
foreach ($sTables as $sTable)
{
$sQuery = "SELECT filename, downloadPath, description, imageFilePath, keywords FROM `$sTable` WHERE";
$sParamTypes = "";
$aParams = array();
$iCount = 0;
foreach ($aSearchStrings as $sString)
{
$sParamTypes .= "ss";
$aParams[] = "%,$sString%";
$aParams[] = "$sString%";
$sQuery .= $iCount++ == 0 ? " (keywords LIKE ? OR keywords LIKE ?)" : " AND (keywords LIKE ? OR keywords LIKE ?)";
}
array_unshift($aParams, $sParamTypes);
$aQueries[$sQuery] = $aParams;
}
$aResults = array();
foreach ($aQueries as $sQuery => $aParams)
{
if ($cStmt = $cConnection->prepare($sQuery))
{
$aQueryResults = array();
call_user_func_array(array($cStmt, 'bind_param'), $aParams);
$cStmt->execute();
$cStmt->bind_result($sFileName, $sDownloadPath, $sDescription, $sImageFilePath, $sKeywords);
while($cStmt->fetch())
{
if ($bRingtone)
{
$sFileName = $_SERVER['DOCUMENT_ROOT'] . "/m4r/" . str_replace(".WAV", ".M4R", $sFileName);
if (file_exists($sFileName))
{
$sDownloadPath = str_replace("Sounds", "m4r", str_replace(".WAV", ".M4R", $sDownloadPath));
$aResults[$sDownloadPath] = array($sDownloadPath, $sDescription, $sImageFilePath, $sKeywords, $aSearchStrings);
}
}
}
$aResults = array_merge($aResults, $aQueryResults);
$cStmt->close();
}
}
$cConnection->close();
$aResults = array_values($aResults);
function in_arrayi($needle, $haystack) {
return in_array(strtolower($needle), array_map('strtolower', $haystack));
}
function keywordSort($a, $b)
{
if ($a[0] === $b[0]) return 0;
$aKeywords = explode(",", $a[3]);
$bKeywords = explode(",", $b[3]);
foreach ($a[4] as $sSearchString)
{
$aFound = in_arrayi($sSearchString, $aKeywords);
$bFound = in_arrayi($sSearchString, $bKeywords);
if ($aFound && !$bFound)
{
return -1;
}
else if ($bFound && !$aFound)
{
return 1;
}
}
return 0;
}
usort($aResults, "keywordSort");
foreach ($aResults as &$aResult)
{
unset($aResult[3]);
unset($aResult[4]);
}
echo json_encode($aResults);
?>
Sorting large quantities of data while having to split the field code-side will be slow. Rather than optimizing, I'd seriously recommend another way of doing it, such as full-text indexing. It's really quite neat once it's working.
If full-text really isn't an option, I'd recommend splitting the keywords off into a separate table. That way, you can sort based on a count after grouping. For example ...
SELECT d.*, COUNT(k.id) AS keywordcount
FROM data d
INNER JOIN keywords k ON (d.id = k.dataid)
WHERE k.value IN ('keyword1', 'keyword2', 'keyword3')
GROUP BY d.id
ORDER BY keywordcount DESC
As a PS, you can probably speed things up by UNIONing the selects and ordering the combined result, rather than running them all independently.
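If the sort has to stay in PHP for now, one thing that may help is moving the expensive work out of the comparator. keywordSort() explodes and lowercases both keyword lists on every comparison, and usort() makes many comparisons per row. The sketch below computes a sort key once per row instead. The matchKey() helper and the sample rows are hypothetical; the key has one character per search string ('1' if that term is an exact keyword, '0' otherwise), so comparing keys as strings gives the same "earlier search terms win" ordering as the original comparator.

```php
<?php
// Hypothetical helper: build a sort key such as "10" for one row.
function matchKey(string $keywords, array $searchStrings): string
{
    // Lowercase and flip once so each lookup is a keyed isset().
    $set = array_flip(array_map('strtolower', explode(',', $keywords)));
    $key = '';
    foreach ($searchStrings as $search) {
        $key .= isset($set[strtolower($search)]) ? '1' : '0';
    }
    return $key;
}

$searchStrings = ['cat', 'car'];
// Made-up result rows: [download path, comma-separated keywords].
$rows = [
    ['a.m4r', 'car,engine'],
    ['b.m4r', 'cat,car'],
    ['c.m4r', 'catalog'],
    ['d.m4r', 'Cat,purr'],
];

// Compute every key exactly once, before sorting.
foreach ($rows as &$row) {
    $row['key'] = matchKey($row[1], $searchStrings);
}
unset($row); // break the reference left behind by the loop

// Higher keys first: "11" > "10" > "01" > "00".
usort($rows, function ($a, $b) {
    return strcmp($b['key'], $a['key']);
});

print_r(array_column($rows, 0));
```

With 15000 results this replaces tens of thousands of explode() and strtolower() passes with one pass per row, though measuring on the real server is the only way to know how much it helps.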

Most efficient way to compare/match two large arrays?

I am writing a very process-intensive function in PHP that needs to be as optimized as it can get for speed, as it can take up to 60 seconds to complete in extreme cases. This is my situation:
I am trying to match an array of people to an XML list of jobs. The array of people have keywords that I have already analyzed, delimited by spaces. The jobs are from a large XML file.
It's currently setup like this:
$matches = new array();
foreach($people as $person){
foreach($jobs as $job){
foreach($person['keywords'] as $keyword){
$count = substr_count($job->title, $keyword);
if($count > 0) $matches[$job->title] = $count;
}
}
}
I do the keywords loop a few times with different categories. It does what I need it to do, but it feels very sloppy and the process can take a very, very long time depending on the number of people/jobs.
Is there a more efficient, or faster, way of doing this?
$matches = new array();
foreach($people as $person){
foreach($jobs as $job){
foreach($person['keywords'] as $keyword){
$count = substr_count($job->title, $keyword);
if($count > 0) $matches[$job->title] = $count;
}
}
}
Truthfully, your method is a bit sloppy, but I assume that's because you have some specially formatted data that you have to work around? Although other than just being sloppy, I see a bit of lost data in the way you're processing things that I don't think was intentional.
I see that you're not just checking "is the keyword in the job title", but "how many times is the keyword in the job title", and then you're storing this. This means that for the job title friendly friend of the friend company, the "keyword" friend shows up 3 times, and thus $matches["friendly friend of the friend company"] = 3. Since you're declaring $matches before you begin your $people foreach loop, though, you keep overwriting this value any time a new person has a matching keyword. In other words, if the first person has the keyword "friend", then $matches["friendly friend of the friend company"] is set to 3. Then if the second person has the keyword "friendly", this value is overwritten and $matches["friendly friend of the friend company"] now equals 1.
I think what you wanted to do was count how many people have a keyword which is contained in the job title. In this case, rather than counting how many times $keyword appears in $job->title, you should just see if it appears, and respond accordingly.
$matches = array();
foreach ($people as $person) {
    foreach ($jobs as $job) {
        foreach ($person['keywords'] as $keyword) {
            if (strpos($job->title, $keyword) !== FALSE) { /* "If $keyword exists in $job->title" */
                if (!isset($matches[$job->title])) $matches[$job->title] = 0;
                $matches[$job->title]++; /* Increment "number of people who match" */
            }
        }
    }
}
Another possibility is that you wanted to know how many keywords a given person had which matched a given job title. In this case you'd want a separate array per person. This is done with a slight modification (keyed by each person's index in $people, since an array cannot itself be an array key):
$matches = array();
foreach ($people as $personKey => $person) {
    $matches[$personKey] = array();
    foreach ($jobs as $job) {
        foreach ($person['keywords'] as $keyword) {
            if (strpos($job->title, $keyword) !== FALSE) { /* "If $keyword exists in $job->title" */
                if (!isset($matches[$personKey][$job->title])) $matches[$personKey][$job->title] = 0;
                $matches[$personKey][$job->title]++; /* Increment "number of keywords which match" */
            }
        }
    }
}
Or, alternatively, you could return to counting how many times a keyword matches, since per person this is actually a meaningful value ("how well does the job match"):
$matches = array();
foreach ($people as $personKey => $person) {
    $matches[$personKey] = array();
    foreach ($jobs as $job) {
        foreach ($person['keywords'] as $keyword) {
            if ($count = substr_count($job->title, $keyword)) { /* if(0) = false */
                if (!isset($matches[$personKey][$job->title])) $matches[$personKey][$job->title] = 0;
                $matches[$personKey][$job->title] += $count; /* Increase "number of keywords which match" by $count */
            }
        }
    }
}
Essentially, before tackling the problem of making your loop more efficient, you need to figure out what it is your loop is really trying to accomplish. Figure this out, and then your best bet for increasing the efficiency is to decrease the number of loop iterations to a minimum and use as many built-in functions as possible, since these are implemented in C (a compiled and therefore faster-running language).
You could use an index of the words in the job titles to make the lookup more efficient:
$jobsByWords = array();
foreach ($jobs as $job) {
    preg_match_all('/\w+/', strtolower($job->title), $words);
    foreach ($words[0] as $word) {
        if (!isset($jobsByWords[$word])) $jobsByWords[$word] = array();
        $jobsByWords[$word][] = $job; // objects are passed by handle, so no & reference is needed
    }
}
Then you just iterate the people and check if the keywords are in the index:
$matches = array();
foreach ($people as $person) {
    foreach ($person['keywords'] as $keyword) {
        $keyword = strtolower($keyword);
        if (isset($jobsByWords[$keyword])) {
            foreach ($jobsByWords[$keyword] as $job) {
                $matches[$job->title] = true;
            }
        }
    }
}
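Here is a small end-to-end sketch of the word index approach, combined with per-person counts. The job titles, the people, and their keywords are made-up sample data, and keying $matches by the person's array key is an assumption about how $people is organized. One behavioral difference to be aware of: the index matches whole words only, so the keyword "friend" no longer matches the title word "friendly" the way substr_count() did.

```php
<?php
// Made-up sample data; real jobs would be read from the XML file.
$jobs = [
    (object) ['title' => 'Senior PHP Developer'],
    (object) ['title' => 'Friendly Barista'],
    (object) ['title' => 'PHP Trainer'],
];
$people = [
    'alice' => ['keywords' => ['php', 'python']],
    'bob'   => ['keywords' => ['barista', 'friend']],
];

// Build the index once: lowercase word => list of job titles containing it.
$jobsByWords = [];
foreach ($jobs as $job) {
    preg_match_all('/\w+/', strtolower($job->title), $words);
    foreach (array_unique($words[0]) as $word) {
        $jobsByWords[$word][] = $job->title;
    }
}

// Each keyword is now a single keyed lookup instead of a scan over all jobs.
$matches = [];
foreach ($people as $name => $person) {
    $matches[$name] = [];
    foreach ($person['keywords'] as $keyword) {
        $keyword = strtolower($keyword);
        foreach ($jobsByWords[$keyword] ?? [] as $title) {
            $matches[$name][$title] = ($matches[$name][$title] ?? 0) + 1;
        }
    }
}

print_r($matches);
```

The work drops from people x jobs x keywords string searches to one pass over the jobs plus one lookup per keyword, which is where most of the speedup would come from on large inputs.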
