A Distributed In-Memory Database Solution for Mass Data Applications

Release Date: 2010-12-20    Authors: Dong Hao, Luo Shengmei, Zhang Hengsheng

    Analyzing and processing mass data is a common function of telecommunications and Internet service applications. As the number of service users increases, traditional on-disk database systems struggle to satisfy demand for mass data processing. Typical service applications include fast querying of large-scale user databases for social networking services, data processing of mass logs, and data analysis and mining of mass databases. In these tasks, response time plays a critical role.


    An in-memory database, which relies on main memory for data storage, has been widely used in recent years. In contrast to database management systems that employ a disk-optimized storage mechanism, main memory databases are faster: their internal optimization algorithms are simpler and execute fewer CPU instructions, and accessing data in main memory is faster and more predictable than accessing data on disk[1]. Storage services require high availability, good performance, and strong consistency[2]. In-Memory Databases (IMDBs)[3] satisfy these requirements and have emerged as a way of improving the performance of short transactions[4]. Since an IMDB is accessed directly in memory, response time is shorter and transaction throughput is higher than in a Disk-Resident Database (DRDB). This is especially important for real-time applications where transactions need to be completed within a specified timeframe[5].


    The number of service application users is many times higher than in the past, but memory capacity and CPU processing limitations of a single computer mean that an IMDB system often cannot handle the mass data in these applications. For example, the main memory of a computer is normally 4 GB to 8 GB, while about 100 GB is needed for a database of 100 million users in which every user record occupies 1 KB. In many cases, an IMDB system on a single computer cannot store all the data of certain types of applications. For some applications, logic processing is complicated and processing time is lengthy, so assigning all these processing tasks to one computer is not ideal.
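The sizing argument above can be made concrete with a rough capacity estimate (a back-of-the-envelope sketch using the figures quoted in the text; the helper name is ours, not part of the system):

```python
def nodes_needed(num_records, record_bytes, node_memory_bytes, replicas=1):
    """Estimate how many data nodes are required to hold a dataset in memory."""
    total = num_records * record_bytes * replicas
    # Round up: a partially filled node still counts as a whole node.
    return -(-total // node_memory_bytes)

# 100 million 1 KB records on 8 GB nodes, without replication:
print(nodes_needed(100_000_000, 1024, 8 * 1024**3))  # 12
```

This is why a single 4-8 GB machine cannot hold the example database, and why the data must be spread across multiple nodes.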


    To improve data processing efficiency, a Distributed In-Memory Database (DIMDB) system is proposed. The DIMDB system in this paper supports expanded Structured Query Language (SQL) grammar with a key-value storage schema.


1 Design Goals of DIMDB System
    The design of DIMDB has three goals:


    (1) High Performance
    A DIMDB system should be capable of high-performance data access and processing. Many current telecom and Internet service applications produce mass data every day, and this data should be easily accessible and efficiently processed. Mass data might include detailed call records of telecommunications systems, user subscription information, web access logs of Internet Service Providers (ISPs), and monitoring data derived from sensor networks. These kinds of mass data could be stored in a DIMDB system for ease of access and high-performance processing.


    (2) High Scalability
    DIMDB is a distributed system with multiple data nodes to accommodate mass data applications. As the amount of data and processing increases, new data nodes can be added online without interrupting the running service. If the DIMDB system has more data nodes than necessary, redundant nodes can be removed.


    (3) High Reliability
    Data stored in one node always has one or more duplicate copies in other nodes. If a data node fails, an application can use the duplicate data on other nodes. The DIMDB management node is an active/standby pair for High Availability (HA).
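The duplicate-copy failover described above can be sketched as follows (a minimal illustration; the placement table and node names are hypothetical, and the paper does not specify the replication protocol):

```python
# Each key maps to an ordered list of nodes holding a copy: primary first,
# then replicas. Reads fall through to the first live node.
placement = {"user:1001": ["dn1", "dn2"]}
node_up = {"dn1": False, "dn2": True}  # dn1 has failed

def read(key):
    """Return the first live node holding a copy of `key`, or None."""
    for node in placement.get(key, []):
        if node_up.get(node):
            return node
    return None

print(read("user:1001"))  # dn2 (the replica serves the read)
```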


2 Architecture of a DIMDB System
    Figure 1 shows the basic architecture of a DIMDB system consisting of three elements: DIMDB-Master, DIMDB-Client, and DIMDB-Data Node (DIMDB-DN).

 

 

 

 

2.1 DIMDB-Master
    The DIMDB-Master:

  • Allows applications to configure data distribution policies;
  • Allows DIMDB-Client to query configured policies;
  • Activates or deactivates DIMDB-DNs and queries the status of DIMDB-DNs.

 

 

    The functional structure of DIMDB-Master is shown in Figure 2.

 


    There are five modules in DIMDB-Master: the main control module, HTTP communication module, configuration module, node management module, and policy storage.


    The main control module is the key functional module of DIMDB-Master. It interacts with other modules to complete essential functions of DIMDB-Master.


    The policy storage is used to save data distribution policies.
The HTTP communication module provides the M1 interface for interaction with DIMDB-Client. The M1 interface is used to receive query requests from DIMDB-Client and to send data distribution information back to it.


    The configuration module interacts with the external operation and maintenance platform to configure data distribution policies on each DIMDB-DN through the M2 interface.


    The node management module connects to each DIMDB-DN using the M3 interface. It monitors the status of DIMDB-DNs and activates or deactivates them. Status information includes CPU usage, memory usage and capacity, and whether the DIMDB-DN is operational.
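The status record tracked by the node management module might be modeled as follows (an illustrative sketch; field names and thresholds are our assumptions, not part of the M3 interface):

```python
from dataclasses import dataclass

@dataclass
class NodeStatus:
    """Per-node status fields the node management module could monitor."""
    node_id: str
    cpu_usage: float      # fraction of CPU in use, 0.0-1.0
    mem_used: int         # bytes of memory in use
    mem_capacity: int     # total bytes of memory
    operational: bool

def overloaded(s, cpu_limit=0.9, mem_limit=0.9):
    """Flag an operational node whose CPU or memory usage exceeds a limit."""
    return s.operational and (
        s.cpu_usage > cpu_limit or s.mem_used / s.mem_capacity > mem_limit
    )
```

A check like this could feed the scalability mechanism of Section 1: when nodes stay overloaded, new DIMDB-DNs are activated.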

 

 

2.2 DIMDB-Client
    The functional modules of DIMDB-Client are shown in Figure 3.

 


    The API module in DIMDB-Client is used by upper-level applications to call the data operation functions of DIMDB-Client.


    The statement parser transforms a received SQL-like statement into two parts: a database operation statement and an input condition (used for requesting data distribution information).


    The execution engine is a dynamic link library connected to DIMDB-DN. It is provided by the database management system on a data node and is used to execute the statement.


    The policy acquisition module is used to obtain the data distribution policy from DIMDB-Master according to the input condition.

 

2.3 DIMDB-DN
    DIMDB-DN is a conventional in-memory database used for data storage. It is a stable and efficient database management system supporting Open Database Connectivity (ODBC) and Java Database Connectivity (JDBC) interfaces.


3 Enhanced SQL-Like Language and Operation Flows

 

3.1 Enhanced SQL-Like Language
    Data operation and query in the proposed system are carried out using an enhanced SQL-like language. This language mimics SQL syntax for creating tables, loading data into tables, and querying tables. The enhanced SQL-like language also allows data distribution information to be embedded in statements. When DIMDB-Client receives an SQL-like statement, the statement parser transforms it into a normal SQL statement and an input condition. The input condition is a kind of key-value range pair. The policy acquisition module then sends the condition to DIMDB-Master and receives DIMDB-DN data distribution information through the C2 interface. This information includes the node IP addresses and ports mapped to each key-value sub-range pair. According to the information received, DIMDB-Client rewrites the SQL statement and divides it into sub-statements. DIMDB-Client connects to the DIMDB-DNs, executes the statements through the execution engine, and collects the data operation results.


    Enhanced SQL-like language used in the proposed system includes the condition for data distribution information. For example, the following statement creates a TABLE t1 with additional conditions:


    CREATE TABLE t1(Index int not null, Name char(50) not null, Age int, Height float) {Key=Index}         (1)


    The following statement inserts a record into TABLE t1:


    INSERT INTO t1 (Index, Name, Age, Height) values (1001, "Tom", 20, 1.72) {Key=Index}               (2)


    When receiving statements (1) and (2), DIMDB-Client extracts the key-value pair and sends it to DIMDB-Master, which returns the data node information according to the key-value pair. DIMDB-Client then performs the data operations on the DIMDB-DNs according to the received data node information.
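The extraction of the key-value condition can be sketched with a small parser (a hypothetical illustration; the paper does not specify how the statement parser is implemented):

```python
import re

# Split an enhanced SQL-like statement into the plain SQL part and the
# {Key=..., Minvalue=..., Maxvalue=...} condition at the end of the statement.
_COND = re.compile(r"\{([^}]*)\}\s*$")

def parse(statement):
    """Return (sql, condition_dict) for an enhanced SQL-like statement."""
    m = _COND.search(statement)
    sql = statement[:m.start()].strip()
    cond = dict(p.split("=") for p in m.group(1).replace(" ", "").split(","))
    return sql, cond

sql, cond = parse('INSERT INTO t1 (Index, Name, Age, Height) '
                  'values (1001, "Tom", 20, 1.72) {Key=Index}')
print(cond)  # {'Key': 'Index'}
```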


    If the application needs to query data in TABLE t1, a statement is sent to DIMDB-Client as follows:


    SELECT * FROM t1 WHERE Age=25 {Key=Index, Minvalue=1001, Maxvalue=1100}                (3)


    After receiving query statement (3), DIMDB-Client extracts the key-value pair Key=Index, Minvalue=1001, Maxvalue=1100 and sends it to DIMDB-Master. DIMDB-Master then returns the data node information for DIMDB-DN_1 and DIMDB-DN_2, with corresponding key-value sub-range pairs Key=Index, Minvalue=1001, Maxvalue=1050; and Key=Index, Minvalue=1051, Maxvalue=1100. DIMDB-Client then divides and rewrites the query statement according to the received data node information. The first sub-statement, sent to DIMDB-DN_1, is:


    SELECT * FROM t1 WHERE Age=25 and (Index>=1001 and Index<=1050)                        (4)
And the second, sent to DIMDB-DN_2, is:
    SELECT * FROM t1 WHERE Age=25 and (Index>=1051 and Index<=1100)                       (5)


    After obtaining query results from the two DIMDB-DNs, DIMDB-Client combines the results and returns them to the application.
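The split-and-merge step above can be sketched as follows (a minimal illustration; the `rewrite` helper is an assumption, not the paper's implementation, and the sub-ranges are those of the statement (3) example):

```python
def rewrite(sql, key, sub_ranges):
    """Append a key-range predicate to `sql` for each (lo, hi) sub-range,
    yielding one sub-statement per DIMDB-DN."""
    return [f"{sql} and ({key}>={lo} and {key}<={hi})" for lo, hi in sub_ranges]

subs = rewrite("SELECT * FROM t1 WHERE Age=25", "Index",
               [(1001, 1050), (1051, 1100)])
print(subs[0])  # SELECT * FROM t1 WHERE Age=25 and (Index>=1001 and Index<=1050)

# DIMDB-Client would run each sub-statement on its node and concatenate
# the row sets before returning them to the application.
```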

 

3.2 Operation Flows
    The operation flow of the DIMDB system is shown in Figure 4.

 


    Step 1: The application calls the API interface in DIMDB-Client with the parameter of an enhanced SQL-like statement.


    Step 2: The statement is sent to the statement parser module.


    Step 3: After the statement is analyzed, the normal SQL statement is saved and the key-value range pair is sent to the policy acquisition module.


    Steps 4 and 5: The policy acquisition module interacts with DIMDB-Master to obtain the data distribution information. The input is the key-value range pair, and the output is the IP addresses and ports of DIMDB-DN for each divided key-value sub-range pair.


    Step 6: The data distribution information is sent to API.


    Step 7: The API composes multiple SQL statements and calls the function of the execution engine with the new statements and data node information as parameters.


    Steps 8.1-8.n: The execution engine connects to multiple DIMDB-DNs and executes the SQL statements.


    Steps 9.1-9.n: The execution engine obtains the data results.


    Step 10: The execution engine sends the data results to API.


    Step 11: The API combines all the data results and returns them to the application.
With these 11 steps, the operation is finished.


4 Experiments
    In this section, some experiments are provided to evaluate the performance and scalability of the DIMDB system.

 

4.1 Experiment Environment
    The configuration of the experiment environment is listed in Table I.

 

 

4.2 Experimental Results
    The bar chart in Figure 5 shows the query time for different total data scales: 100 k, 1 M, and 10 M records across the databases of all DIMDB-DNs. Different colored columns denote different scales. The query efficiency for large-scale data in the DIMDB system is notably better than in traditional databases.

 


    As shown in Figure 6, query time differs between the DIMDB-DNs in the system. The query time for DIMDB-DN 3 is notably shorter than that for DIMDB-DNs 1 and 2. When the scale of data increases, more DIMDB-DNs should be added to the system.

 


    The above illustrates the effectiveness of the prototype presented in this paper. The efficiency of the DIMDB system is expected to improve with optimizations such as table indexing and more reasonable data distribution policies. When expanded to hundreds of data nodes, the DIMDB system could be used in telecommunications and Internet applications requiring mass data processing.


5 Conclusion
    In this paper, limitations of current in-memory database systems are analyzed, and a DIMDB system is proposed to improve data processing ability in mass data applications. An enhanced language similar to SQL is used, which has the advantage of a key-value storage schema. DIMDB could be widely used in applications that require mass data processing. Future research will focus on optimizing data distribution policies and table partitioning schemas, and improving the overall stability of the system.

 

    Acknowledgment: Thank you to Tang Jue for his support and patient guidance, and to Wang Zhiping, Zhou Yang, Ye Xiaowei and Lin Xiangdong for their contributions to this work.

 

References
[1] Telecommunication Systems Signs up as a Reseller of TimesTen; Mobile Operators and Carriers Gain Real-Time Platform for Location-Based Services [N]. Business Wire, 2002-06-24.
[2] CAMARGOS L, PEDONE F, SCHMIDT R. A Primary-Backup Protocol for In-Memory Database Replication [C]//Proceedings of 5th IEEE International Symposium on Network Computing and Applications (NCA'06), Jul 24-26, 2006, Cambridge, MA, USA. Los Alamitos, CA, USA: IEEE Computer Society, 2006: 204-211.
[3] GARCIA-MOLINA H, SALEM K. Main Memory Database Systems: An Overview [J]. IEEE Transactions on Knowledge and Data Engineering, 1992, 4(6): 509-516.
[4] BLOTT S, KORTH H F. An Almost-Serial Protocol for Transaction Execution in Main-Memory Database Systems [C]//Proceedings of the 28th International Conference on Very Large Data Bases (VLDB'02), Aug 20-23, 2002, Hong Kong, China. San Francisco, CA, USA: Morgan Kaufmann Publishers, 2002: 706-717.
[5] DENG Kun, LIAO Guoqiong, HUANG Yukun, et al. A Novel Storage Architecture of In-Memory Databases Supporting Real-Time E-Commerce Applications [C]//Proceedings of the International Conference on Management of E-Commerce and E-Government (ICMECG’08), Oct 17-19, 2008, Nanchang, China. Washington, DC, USA: IEEE, 2008: 252-256.

    The DIMDB-Client receives data processing instructions and access requests from upper-level applications. After analyzing an instruction, it requests data distribution information from DIMDB-Master, which returns the information according to the configured data distribution policy. The DIMDB-Client then sends commands to the specific DIMDB-DNs for processing and data access.

[Abstract] In this paper, a Distributed In-Memory Database (DIMDB) system is proposed to improve processing efficiency in mass data applications. The system uses an enhanced language similar to Structured Query Language (SQL) with a key-value storage schema. The design goals of the DIMDB system are described and its system architecture is discussed. The operation flow and the enhanced SQL-like language are also discussed, and experimental results are used to test the validity of the system.

[Keywords] distributed in-memory system; enhanced key-value schema; mass data application