Clustering Basics

No Masters

In BlobCity’s clustering architecture there are no masters. All nodes in a cluster are equal.
Queries on a cluster can be initiated on any node in the cluster. The node on which the
query is executed becomes the master node for that query.


Heterogeneous Nodes

Nodes in BlobCity do not have to all be of the same size and configuration. Each node in
the cluster could be of a different CPU, RAM and Disk size.


Initiate a query on any node

Since all nodes are equal, a transaction or database query can be initiated on any of the
nodes. The node that receives the query becomes the master node for that query. The job
of the query master is to handle the query request and query response to the requesting
application. It is also responsible for routing the query to other nodes and executing the
query on all nodes in the cluster. If the query requires aggregation or sorting operations,
the same are performed by the query master.

All Queries Run on all Nodes

In BlobCity all queries are run on all nodes. It is the job of the query master to make sure
each query runs on every other node in the cluster. Each node responds with its own
response for the query. The query master then aggregates the responses and returns an
aggregated response to the requesting application.

Distributed Aggregations

Aggregation queries such as SUM, COUNT, AVERAGE are all run in a fully distributed
manner inside BlobCity. Each node individually aggregates data present on itself and
returns the aggregate value to the query master. The query master then aggregates the
returned aggregates and send back the final aggregates to the requesting application.

Partially Distributed Sorting

Queries that require sorting of responses, execute the sorting partially on the individual
nodes in the cluster followed by final sorting on the query master. Each node responds
back with a sorted sub-set of part of the response. The query master then performs a full
sorting after receiving individually sorted responses from each node in the cluster.

The query master should be a sufficiently sized node to hold all response data in-memory
and must have good compute to perform fast sorting of all data. The query master
essentially must do a full sort of on all the data received from the cluster nodes. However
since each node is returning a sorted response, it does save a few sort iterations.

Initiate Demanding Queries on Highest Configuration Nodes

When requesting large data as query response like in the case of select * from table
queries or fetching data for large time ranges like a year’s worth of data or more, it is
important to run such queries on the node that has the maximum RAM and maximum
compute. The query master essentially needs to save the complete query result-set in
memory before it can return the same to the application. If the query master does not have
enough RAM space to save the complete query response, the query may never succeed, or
the node may cause excessive disk-io by using the computers swap space to execute the
query, thereby significantly slowing down other disk based operations.

When running sort queries on large response sets, the CPU power available on the query
master is also important. It is recommend to run the sort queries on the nodes with the
highest amount of CPU power and having sufficient RAM to hold the complete query
result set.