Performance optimization in distributed DataWarehouses with MapReduce: Treatment of data partitioning and distribution problems
Authors: Sara Riahi, Azzeddine Riahi
We are currently facing an explosion of structured and unstructured data, produced on a massive scale by a variety of digital sources. On the one hand, applications generate data from logs, sensor networks, GPS traces, and so on; on the other, users produce large amounts of data such as photographs, videos, and music. According to IBM, 2.5 trillion bytes of data are generated every hour. Forecasts put the total above 40 zettabytes by 2020, whereas only one zettabyte of digital data was generated between 1940 and 2010. Several closely linked concepts currently dominate the IT field: "Cloud Computing", "Big Data", "NoSQL", and "MapReduce". Storing and analyzing these masses of data imposes strict constraints on researchers in the field. The forecast growth in the volume of data to be processed exceeds the limits of traditional technologies, namely relational databases and data warehouses. Our problem is how to manage the rapid growth of geographically distributed data volumes and how to handle the scalability this demands. Which technologies and programming models have been proposed to overcome the problems caused by this deluge of data?