
Big Data marketing versus reality

When reading current publications, the IT world seems to have moved to Big Data. Forget about databases, with Big Data tools you can do everything – in no-time and on large volumes in addition. For every problem there is a Big Data tool: Hadoop, Spark, Hive, HBase, Kafka,…

The reality is slightly different. This post aims to explain why Big Data technology does complement databases – OLTP and OLAP alike. And that Big Data use-cases are orthogonal to Business Intelligence questions. In a follow up post, a future-proof architecture will be presented, meeting the business needs by combining various technologies in a way to get all advantages and none of the disadvantages.

The foundation of all the arguments will be the CAP theorem of distributed storage. It says that Consistency (a read returns the most recent data), Availability (every request receives a response) and Partition tolerance (the system keeps working when nodes cannot reach each other) are competing requirements, and no single system can guarantee all three at all times. Example: if the network interconnect between two cluster nodes is down, either the query fails (availability is sacrificed) or there is the danger that the returned data is not the most recent (the redundant copy is outdated while the interconnect is down).

Transactions versus Parallel Processing

It is important to notice that the above rule talks about distributed storage. But variations of the CAP theorem apply to distributed processing as well. An obvious one: processing data in a single transaction yet fully in parallel with zero synchronization is a logical contradiction.

Hence, the first important difference between databases and Big Data tools is: database-like transactions are supported by databases only. For example, nobody would ever run an ERP system using Big Data technologies.

The argument can also be turned upside down by saying that a database can be run on a cluster, but it will never scale linearly with the number of nodes. Twice the number of nodes does not provide twice the throughput (a 100% increase) but less, maybe a 50% increase. That is due to the fact that transactions need synchronization between all nodes. As a consequence, a database cluster will consist of a few but powerful servers, whereas a Big Data cluster is based on a large number of small and cheap computers.

The above argument about transactions does not apply if the Big Data environment is used only to analyze data, but there are other reasons…

Joins versus Aggregations in the Big Data world

Assume a computer cluster consists of 100 nodes and the data is evenly distributed across them. In such a setup, counting the number of rows can be distributed: each node counts its own dataset and returns one value, and another process sums up the individual counts. Expressed in SQL:

    select count(*) from table (partition1, partition2);

and

    select sum(counts) from (
        select count(*) as counts from partition1
        union all
        select count(*) as counts from partition2
    );

return the same result.

This is an example of an almost perfect workload for a cluster: each node works on its own data, and each node's result set is tiny.

Another example is a join between two tables, a sales order table and the customer master data. Very likely the sales orders located on a single node belong to arbitrary customers, hence the required customer records need to be read from the other nodes prior to the join. In the worst case, every single node needs to request the customer data from the 99 other nodes. In that case the entire customer data would be copied over the network 99(!) times (each node's 1/100th of the customer data is sent to the other 99 nodes, and that happens on all 100 nodes).

This join strategy is called broadcasting. It makes sense if the broadcasted table is small or, better still, if the entire table is already available on all nodes from the get-go.
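To make this concrete, many distributed SQL engines let the query author request this strategy explicitly. Below is a minimal sketch in Spark SQL syntax; the table names sales_orders and customers and their columns are made up for the example:

    -- Ask the engine to ship the (small) customers table to every node
    -- instead of moving the much larger sales order data around.
    SELECT /*+ BROADCAST(c) */
           o.order_id,
           o.amount,
           c.customer_name
    FROM   sales_orders o
    JOIN   customers    c
      ON   o.customer_id = c.customer_id;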

The more common strategy is to reshuffle the data. Here both datasets are partitioned by the join condition and each node requests one partition. As the end result, node1 would have all sales orders of customer1 together with the master data of customer1. Now all the data is local and the join can be performed quickly. But to get there, the customer master data and the sales order data each had to be sent across the network once, to reshuffle them onto the various compute nodes.

The only way to avoid reshuffling is to partition both datasets by the join condition from the start, when creating and loading the tables. This is a first indication that in the Big Data world the data model has to be purpose-built for one (type of) query.
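A sketch of what such purpose-built partitioning could look like, using Hive-style bucketing and the same hypothetical table names as above: both tables are bucketed on the join key at creation time, so rows of the same customer end up in the same bucket, and the later join needs no reshuffle (provided the engine is configured to exploit the bucketing).

    -- Both tables are bucketed on the join key when they are created and loaded.
    CREATE TABLE customers (
        customer_id   BIGINT,
        customer_name STRING
    )
    CLUSTERED BY (customer_id) INTO 64 BUCKETS;

    CREATE TABLE sales_orders (
        order_id    BIGINT,
        customer_id BIGINT,
        amount      DECIMAL(18,2)
    )
    CLUSTERED BY (customer_id) INTO 64 BUCKETS;

    -- A join on customer_id can now be processed bucket by bucket, locally.
    SELECT o.order_id, c.customer_name, o.amount
    FROM   sales_orders o
    JOIN   customers    c
      ON   o.customer_id = c.customer_id;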

Apache Cassandra is a good example where the data model serves one query only.
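In Cassandra's CQL the partition key is chosen for exactly one access path. A minimal sketch with hypothetical names: the table below answers "all orders of one customer" efficiently, and nothing else.

    -- The partition key (customer_id) determines which node stores a row,
    -- so only queries filtering on customer_id hit a single partition.
    CREATE TABLE sales_orders_by_customer (
        customer_id bigint,
        order_id    bigint,
        order_date  date,
        amount      decimal,
        PRIMARY KEY ((customer_id), order_id)
    );

    -- Efficient: reads one partition on one node.
    SELECT * FROM sales_orders_by_customer WHERE customer_id = 4711;

    -- A lookup by material would require a second, differently partitioned table.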

The different usage of Partitions

The above statement about purpose-built data models becomes obvious when the sales order data is joined either with the customer master and/or with the material master data. The data cannot be partitioned by customerID and materialID at the same time. In theory it is technically possible via sub-partitioning, but with every additional column to join on, the number of partitions multiplies, so it grows exponentially. And the cluster will be busy reshuffling the data across the network instead of processing it.
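As a rough illustration, Hive-style multi-level partitioning creates one partition per distinct value combination, so the partition count is the product of the cardinalities; the table name and figures below are made up:

    -- One partition per (customer_id, material_id) combination:
    -- with 10,000 customers and 1,000 materials that is up to
    -- 10,000,000 partitions for a single table.
    CREATE TABLE sales_orders (
        order_id BIGINT,
        amount   DECIMAL(18,2)
    )
    PARTITIONED BY (customer_id BIGINT, material_id BIGINT);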