Geo-distributed HDFS

Geo-distributed Hadoop Distributed File System (HDFS) is widely used across datacentres around the world, with an implementation in Apache Hadoop. HDFS overcomes constraints on where a dataset can live while leaving room to improve efficiency, but it runs into performance problems when data spans multiple datacentres. Performance can be improved with either the MapReduce framework or the Rout (Pig Latin) approach, yet each involves a trade-off between fine-tuning and flexibility, so both solutions still need improvement.

The need for cloud computing grows every day as more electronic devices and services come online. Cloud computing, however, is not an omnipotent structure that can magically store and analyse data without constraint. "The cloud" is just a set of datacentres bound geographically and legally; in the US, for example, census data for each state may only reside within that state's physical boundaries. For this and other reasons, datasets end up spread across datacentres on different continents. Splitting the data makes analysis more complex, lowering the efficiency of current solutions. With growing demand for handling big data, solutions have been derived to minimise the cost and time of complex analytics, but they address only the storage of data across datacentres; HDFS is one of them.

The main advantage of HDFS is that its implementation overcomes the dataset's constraints while offering ways to increase performance. These constraints are physical, logical and technical. Physically, datasets must be stored in particular locations; multinational companies, for example, record and store their customers' data regionally. Logically, different entities are responsible for storing the data. Technically, the data must be replicated so that it survives hardware failure. By running HDFS over many datacentres in many locations, the data is spread across them, resolving the physical constraint; each entity can access a nearby datacentre rather than a single distant server, easing the logical constraint; and because copies exist in several centres, the data is not destroyed even if one centre fails.

With the constraints overcome, the next question is how efficiently the data can be analysed. Consider finding the average age in the United Kingdom, divided into four regions: Scotland, Wales, Northern Ireland and England. With the dataset split across four datacentres, each holding 100 gigabytes, the naive way to calculate the average age is to transfer all the sub-datasets back to one datacentre and execute the calculation there. This is costly and inefficient, because transfers between datacentres are expensive. The standard tool for dealing with this issue is the MapReduce framework, which divides the dataset into slices and processes all the sub-datasets in parallel: the job executes on each sub-dataset to find a local result, and those results are then combined. Because the output of each local job is smaller than its input, combining results uses far less bandwidth, reducing the cost; a minimal sketch of this pattern follows below.
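The following is a minimal sketch of that map-and-combine pattern, not Hadoop's actual API: each datacentre reduces its sub-dataset to a tiny (sum, count) pair, and only those pairs cross datacentre boundaries. The region names and age lists are illustrative placeholders.

    # A minimal sketch (hypothetical, not Hadoop's real API): compute the UK-wide
    # average age by aggregating locally in each datacentre and shipping only a
    # small (sum, count) pair per region instead of the raw records.
    from typing import Dict, List, Tuple

    def map_local(ages: List[int]) -> Tuple[int, int]:
        # Runs inside one datacentre: reduces its whole sub-dataset to two numbers.
        return sum(ages), len(ages)

    def reduce_global(partials: List[Tuple[int, int]]) -> float:
        # Runs at the coordinating datacentre: combines the tiny partial results.
        total_sum = sum(s for s, _ in partials)
        total_count = sum(c for _, c in partials)
        return total_sum / total_count

    # Placeholder data standing in for the four regional sub-datasets.
    ages_by_region: Dict[str, List[int]] = {
        "Scotland": [34, 51, 78, 12],
        "Wales": [45, 23, 67],
        "Northern Ireland": [29, 38],
        "England": [41, 55, 19, 63, 30],
    }

    partials = [map_local(ages) for ages in ages_by_region.values()]
    print(f"Average age: {reduce_global(partials):.1f}")

Only the two-number partial results cross datacentre boundaries, which is why the combined bandwidth is a tiny fraction of the 400 gigabytes the naive approach would move.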
There are many ways to implement MapReduce, each with its own performance characteristics, but they require complicated coding and administration and offer no guidance on the best practical option. To address this, a team of researchers at Purdue University developed a system called GM-R for finding the execution path of a MapReduce job. Using a special runtime environment, GM-R evaluates the data's characteristics, the datacentres' infrastructure and, most importantly, the cost objective of the code. The system builds a data transformation graph (DTG) and then optimises the execution path over it, reducing cost by up to 55% compared with an inefficient method (a simplified illustration of this kind of cost-based plan selection appears at the end of this article). The whole approach is transparent and easily customised for increased performance, and by fine-tuning the constraints on the data, efficiency improves further. However, GM-R is complex and inflexible for complicated jobs, and its inability to join datasets is a clear trade-off of this approach.

A more flexible approach to analysing geo-distributed big data is Rout, an extension of Pig Latin also developed by researchers at Purdue University. Pig Latin rests on the principle of applying operators to relations to yield new relations between data points; it provides the general basis for analysing big data, while Rout extends it to make analysis across many datacentres simpler to implement. With its simpler execution, Rout has a clear advantage in the flexibility of its optimisation, and it also accounts for resource utilisation in each datacentre. However, Rout relies on a data-transfer heuristic, so its filtering is coarser than MapReduce's, making it less fine-tuned. Further testing shows that Rout's execution time is on average 50% lower than plain Pig Latin's, with no outliers, making Rout an easy choice for generic big data analytics of average accuracy.

As the industrial revolution marches on, demand for cloud computing will undoubtedly increase. Each of the approaches is plausible for current use, but HDFS will need to become ever more efficient to handle the vast amounts of data to come. With two award-winning solutions from Purdue, the next question is whether they will be enough to satisfy future big data needs, or whether the formidable challenge of combining the two must be overcome. On the security side, the report does not say whether HDFS provides proper redundancy and reliability, which makes any long-term decision about HDFS insufficiently informed.
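To make the plan-costing idea concrete, here is a simplified sketch, not GM-R's actual implementation or its DTG format: two hypothetical execution plans for the average-age job are costed by the bytes they move between datacentres, and the cheaper one is chosen. The record sizes and the two plans are assumptions for illustration only.

    # Simplified, hypothetical illustration of cost-based plan selection in the
    # spirit of a data transformation graph; not GM-R's real code or interface.
    RECORD_BYTES = 64    # assumed size of one raw record
    PARTIAL_BYTES = 16   # assumed size of one (sum, count) partial result

    def cost_ship_raw(records_per_centre):
        # Plan A: copy every raw record to a single datacentre, then aggregate.
        return sum(n * RECORD_BYTES for n in records_per_centre)

    def cost_ship_partials(records_per_centre):
        # Plan B: aggregate locally, ship one small partial result per datacentre.
        return len(records_per_centre) * PARTIAL_BYTES

    records_per_centre = [1_500_000_000, 1_600_000_000, 400_000_000, 900_000_000]
    plans = {
        "ship raw records": cost_ship_raw(records_per_centre),
        "ship partial aggregates": cost_ship_partials(records_per_centre),
    }
    best = min(plans, key=plans.get)
    print(f"Cheapest plan: {best} ({plans[best]:,} bytes transferred)")

A system like GM-R explores a far richer space of plans (where to copy data, when to aggregate, which datacentre combines results), but the underlying principle of choosing the execution path with the lowest estimated cost is the same.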