With tens, hundreds, or even thousands of machines making up a Hadoop cluster, machines and especially hard disks fail at a significant rate. It's not uncommon to find that approximately 2% to 5% of the nodes in a large Hadoop cluster have some kind of fault, meaning they are operating either suboptimally or simply not at all. In addition to faulty servers, there can sometimes be errant user MapReduce jobs, network failures, and even errors in the data.

Child task failures

It's common for child tasks to fail for a variety of reasons: incorrect or poorly implemented user code, unexpected data problems, temporary machine failures, and administrative intervention are a few of the more common causes. A child task is considered to be failed when one of three things happens:

- It throws an uncaught exception.
- It exits with a nonzero exit code.
- It fails to report progress to the tasktracker for a configurable amount of time.

When a failure is detected by the tasktracker, it is reported to the jobtracker in the next heartbeat. The jobtracker, in turn, notes the failure and, if additional attempts are permitted (the default limit is four attempts), reschedules the task to run. The task may be run either on the same machine or on another machine in the cluster, depending on available capacity. Should multiple tasks from the same job fail repeatedly on the same tasktracker, the tasktracker is added to a job-level blacklist that prevents any other tasks from the same job from executing on that tasktracker. If multiple tasks from different jobs repeatedly fail on a specific tasktracker, it is added to a global blacklist for 24 hours, which prevents any tasks from being scheduled on that tasktracker.
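The third condition, failing to report progress, is the one that most often surprises developers whose tasks are slow but healthy. The sketch below is a minimal illustration, assuming the MR1-era Java API and property names (mapred.map.max.attempts, mapred.reduce.max.attempts, and mapred.task.timeout; newer releases rename these, e.g. mapreduce.task.timeout), of how a mapper can report progress explicitly and how a job can tune the retry and timeout limits described above. The class and the expensiveLookup() helper are hypothetical placeholders, not something from this chapter.

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class LongRunningMapper extends Mapper<LongWritable, Text, Text, LongWritable> {

      @Override
      protected void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException {
        // Hypothetical per-record work that may take longer than the progress timeout.
        String result = expensiveLookup(value.toString());

        // Telling the framework the attempt is alive resets the progress timer,
        // so a slow-but-healthy task is not declared failed and rescheduled.
        context.progress();
        context.setStatus("processed record at offset " + key.get());

        context.write(new Text(result), new LongWritable(1));
      }

      private String expensiveLookup(String record) {
        return record; // placeholder for the slow operation
      }

      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // MR1-era property names; verify against your Hadoop version.
        conf.setInt("mapred.map.max.attempts", 4);     // attempts before the task is failed
        conf.setInt("mapred.reduce.max.attempts", 4);
        conf.setLong("mapred.task.timeout", 600000L);  // progress timeout in milliseconds

        Job job = new Job(conf, "long-running-example");
        job.setJarByClass(LongRunningMapper.class);
        job.setMapperClass(LongRunningMapper.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

Incrementing a counter or calling setStatus() also counts as progress, which is why jobs that update counters regularly rarely trip the timeout even when individual records are slow.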
Tasktracker/worker node failures

The next obvious failure condition is the loss of the tasktracker daemon or the entire worker node. The jobtracker, after a configurable amount of time with no heartbeats, will consider the tasktracker dead along with any tasks it was assigned. Tasks are rescheduled and will execute on another tasktracker; the client application is completely shielded from the internal failure and, as far as it's concerned, the job appears to simply slow down for some time while tasks are retried.

Jobtracker failures

The loss of the jobtracker is more severe in Hadoop MapReduce. Should the jobtracker (meaning either the process or the machine on which it runs) fail, its internal state about currently executing jobs is lost. Even if it immediately recovers, all running tasks will eventually fail. This effectively means the jobtracker is a single point of failure (SPOF) for the MapReduce layer, a current limitation of Hadoop MapReduce.

HDFS failures

For jobs whose input or output dataset is on HDFS, it's possible that HDFS could experience a failure. This is the equivalent of the filesystem used by a relational database experiencing a failure while the database is running. In other words, it's bad. If a datanode process fails, any task that is currently reading from or writing to it will follow the HDFS error handling described in Chapter 2. Unless all datanodes containing a block fail during a read, or the namenode cannot find any datanodes on which to place a block during a write, this is a recoverable case and the task will complete. When the namenode fails, tasks will fail the next time they try to make contact with it. The framework will retry these tasks, but if the namenode doesn't return, all attempts will be exhausted and the job will eventually fail. Additionally, if the namenode isn't available, new jobs cannot be submitted to the cluster since job artifacts (such as the JAR file containing the user's code) cannot be written to HDFS, nor can input splits be calculated.

YARN

Hadoop MapReduce is not without its flaws. The team at Yahoo! ran into a number of scalability limitations that were difficult to overcome given Hadoop's existing architecture and design. In large-scale deployments such as Yahoo!'s Hammer cluster, a single 4,000-plus node Hadoop cluster that powers various systems, the team found that the resource requirements on a single jobtracker were just too great. Further, operational issues such as dealing with upgrades and the single point of failure of the jobtracker were painful. YARN (or "Yet Another Resource Negotiator") was created to address these issues.

Rather than have a single daemon that tracks and assigns resources such as CPU and memory and handles MapReduce-specific job tracking, these functions are separated into two parts.