This work was funded by the National Key Technology R & D Program of China under Grant No. 2008BAH37B07, the National Natural Science Foundation of China under Grant No. 60970148, and the National Basic Research Program of China ("973" Program) under Grant No. 2007CB310806.
Cloud computing is an online network service delivery model where users obtain customized, on-demand services from the Internet. It encompasses services delivered over the Internet as well as the hardware and software in datacenters that provide such services[1]. Cloud computing is a kind of distributed computing—an evolution of parallel computing, distributed computing, and grid computing. It can be implemented as Software as a Service (SaaS), utility computing, Platform as a Service (PaaS), or Infrastructure as a Service (IaaS). Current cloud computing applications include GoogleDocs[2] and cloud service infrastructures of Microsoft and Amazon[3-4].
The main goal of cloud computing is to provide efficient computing services. One component of cloud infrastructure is the data center, which provides safe, reliable data services. Storage security, therefore, is a primary issue. A common approach to data protection is to encourage users to store their encrypted data on servers. However, as the amount of encrypted data in the cloud grows, retrieval becomes problematic.
1 Encrypted Storage Technologies in Cloud Storage Applications
Security of cloud storage applications includes authentication, data encryption and storage, security management, and security log and audit.
For authentication, Access Control Service (ACS) employs user identity authentication and authorization to prevent unauthorized access. Operations permitted on a document may be limited by the administrator or document owner, but the administrator does not have access to users’ encrypted data and can only carry out management-related operations such as user management, data backup and hotspot migration.
Encrypted storage means saving specified encrypted folders and files. Confidentiality of sensitive data must be ensured during the process of storage and transmission.
Security management involves maintaining user information and authorities, including user account registration and cancellation, authority granting, and reclaiming authority for data recovery in emergency situations.
Security logging and auditing involves recording important security events relating to users and system. This information is monitored and acted upon by the administrator.
Encrypted storage is perhaps the most important storage security service for users. It is fundamental for guaranteeing confidentiality of user data stored on a shared platform.
Networked storage systems and devices have to ensure the confidentiality of sensitive data, and also implement encrypted data sharing technology. The protection of user privacy requires that storage security is fulfilled based on a complete trust in the storage system. Therefore, encrypted storage technologies must be studied, and long-term storage and sharing mechanisms for end-to-end encrypted storage technologies and encryption keys should be supported to guarantee the confidentiality and privacy of user data and to improve the security of keys, distribution efficiency and flexibility of encryption policy. Encrypted retrieval plays a critical role in the storage of massive encrypted data; it is also a must-to-be-solved issue for encrypted storage.
2 Encrypted Information Retrieval Technologies
Study in encrypted information retrieval began in 2000. Since then, several algorithms have emerged. Song et al propose practical algorithms for searches on encrypted data[5]; Boneh et al propose public key encryption with keyword search[6]; and Park et al propose public key encryption with conjunctive field keyword search[7].
2.1 Linear Search Algorithm
In the linear search algorithm[5], a symmetric encryption algorithm is used to encrypt the plaintext. For the ciphertext of each keyword under symmetric encryption scheme, a pseudo-random sequence is generated with a length less than that of the ciphertext. Meanwhile, a check sequence is generated based on the pseudo-random sequence and the ciphertext. The sum of the lengths of the pseudo-random sequence and check sequence equals the length of the ciphertext. Finally, the pseudo-random sequence and check sequence are used to encrypt the ciphertext again by modulo 2 addition. When searching, a user submits the ciphertext sequence under symmetric encryption scheme. On the server side, modulo 2 addition with each sequence is performed to each sequence If the results satisfy the checking, the sequence is the encryption of the ciphertext under symmetric encryption; otherwise, the sequence is not the encryption of the ciphertext.
The linear search algorithm is a one-time pad-based retrieval algorithm for encrypting information, so it is strongly resistant to statistical analysis. But it has a fatal flaw in that the ciphertext must be matched each time. It is therefore not applicable to retrieval scenario with large data collection.
2.2 Public Key Encryption with Keyword Search
Boneh et al propose public key encryption with keyword search. This is applicable when storage and computing resources of a user are insufficient, and data is obtained by accessing a remote database. Storage and computing resources are often distributed asymmetrically. That is to say, a user's computation and storage capacities are not able to meet their needs. In a mobile scenario, demand for data storage and retrieval increases (for example, email service). Hence, it is necessary to protect data privacy. Encrypted data may come from different sources; public key encryption is a good encryption method that works well.
This algorithm generates a public key and private keys, then encrypts the keywords of the plaintext with the public key to generate searchable ciphertext.
2.3 Public Key Encryption with Conjunctive Field Keyword Search
Park et al propose a public key encryption method with conjunctive field keyword search. Such encryption can resist statistical attacks that trouble simpler search methods. The key for each encryption consists of a group of reverse hash sequences generated in advance, and the encrypted index is put into a Bloom filter. During the search process, the reverse hash sequence key is first used to generate multiple trapdoors, then Bloom detection is conducted. The returned encrypted document can then be decrypted.
This algorithm is a good solution for multi-user encrypted information retrieval where new users may join and old users exit. But it involves the generation of many key sequences. As the number of retrievals grows, the computational complexity increases linearly. This cannot be sustained in practice.
Retrieval models of all the above mentioned algorithms are Boolean, so retrieved documents cannot be ranked according to their relevance to query keywords, especially in cloud storage applications where large amounts of data are involved, many documents may contain one query keyword. Obtaining the most relevant documents remains a problem. It is still doubtful whether the classical vector space model can be applied for ranking the relevance of encrypted documents.
2.4 Confidentiality-Preserving Rank-Ordered Search
Swaminathan et al propose the confidentiality-preserving rank-ordered search[8]. In this algorithm, the frequencies of all keywords in a document are encrypted using the order preserving encryption. When a query is submitted to the server, the server calculates and retrieves the encrypted documents containing the ciphertext of keywords. It then ranks the corresponding ciphertext according to the frequencies of keywords; and finally, returns the encrypted document with the highest evaluation to the user for decryption.
This method can rank encrypted documents in cases where there are multiple relevant documents, and return the most relevant document. However, it is not applicable when the query contains several keywords. Also, it only uses word frequencies of the document and not the document frequency inverse to the keywords. So the vector space model can not be directly applied. A fully homomorphic encryption algorithm[9] can be used to encrypt the word frequency and solve this problem.
3 A Fully Homomorphic Encryption-Based Retrieval Method
Studies in encrypted information retrieval show that ranking of results is an important measure of the performance of a retrieval algorithm. In the scenario of secure cloud storage, the number of encrypted documents will increase exponentially. Accurate ranking is therefore an objective performance requirement for retrieval systems. The main purpose of accurate ranking is to improve retrieval efficiency and Quality of Service (QoS) of retrieval systems. Analyses show existing encrypted information retrieval algorithms can guarantee retrieval in terms of precision and recall, but do not take precision into account.
To address this problem, a fully homomorphic encryption-based retrieval method for cloud storage applications is introduced. The vector space model is used in information retrieval to calculate the relevance of retrieved documents to the query. Terms in a document are encrypted by a given public key scheme while homomorphic encryption is performed on the term frequency and inverted document frequency of keywords. Upon retrieval, the encrypted documents and index entries’ ciphertext are uploaded to the server.
Retrieval and ranking based on fully homomorphic encryption is shown in Figure 1. Before a query is submitted, the retrieval statement is split into search words and stems to generate a sequence of keywords in plaintext. The plaintext is then encrypted. When the cloud server retrieves the ciphertext sequence, it submits encrypted query words.
The document is represented by the weight vector of each term. The weight is the normalized product of word frequency and the logarithm of the inverted document frequency. It can be obtained from the term frequency and the inverted document frequency. The document vector can be calculated with Formula (1):
The query is represented in the same way. After all relevant documents have been retrieved their relevance to the query words is calculated, as the inner product of the corresponding vectors of both the document and the query. Then the relevance is ranked by the score. Finally, the ranked documents are returned to the user. Upon receiving the encrypted documents, the user decrypts them using his private key.
Plaintext data encrypted with fully homomorphic encryption algorithm can be effectively retrieved without any restoration operation; that is, the most relevant documents are returned to the user. This not only protects the user's data security, but also improves retrieval performance.
4 Conclusion
This paper analyzes the significance of retrieval technologies over encrypted information in cloud storage applications, and discusses the research status quo and problems of existing encrypted information retrieval technologies. It proposes a fully homomorphic encryption-based retrieval method, and explains its basic principles. Experimental results show this method can improve retrieval efficiency in comparison with other encrypted information retrieval algorithms.
References
[1] ARMBRUST M, FOX A, GRIFFITH R, et al. Above the Clouds: A Berkeley View of Cloud Computing [R]. Berkeley, CA, USA: University of California, 2009.
[2] Google docs - Online Documents, Spreadsheets, Present [EB/OL]. [2009-07-29]. http://docs.google.com.
[3] Windows Azure Platform [EB/OL]. [2009-07-12]. http://www.microsoft.com/windowsazure/.
[4] Amazon Elastic Compute Cloud [EB/OL]. [2009-06-24]. http://aws.amazon.com/ec2.
[5] SONG D, WAGNER D, PERRIG A. Practical Techniques for Searches on Encrypted Data [C]//Proceedings of the IEEE Symposium on Security and Privacy(S&P’00), May 14-17,2000, Berkeley, CA,USA. Piscataway, NJ, USA: IEEE, 2000: 44-55.
[6] BONEH D, CRESCENZO G, OSTROVSKY R, et al. Public Key Encryption With Keyword Search [C]//Advances in Cryptology. Proceedings of the 23rd Annual International Conference on the Theory and Applications of Cryptographic Techniques (EUROCRYPT’04), May 2-6, 2004, Interlaken, Switzerland. LNCS 3027. Berlin, Germany: Springer-Verlag, 2004: 506-522.
[7] PARK D, KIM K, LEE P. Public Key Encryption With Conjunctive Field Keyword Search [C]//Proceedings of the 2004 Workshop on Information Security Applications (WISA’04), Oct 29-31, 2004, Wuhan, China. LNCS 3325. Berlin, Germany: Springer-Verlag, 2004: 73-86.
[8] SWAMINATHAN A, MAO Y, SU G M, et al. Confidentiality-Preserving Rank-Ordered Search [C]//Proceedings of the 2007 ACM Workshop on Storage Security and Survivability (StorageSS’07), Oct 29, 2007, Alexandria, VA, USA. New York, NY, USA:ACM, 2007: 7-12.
[9] GENTRY C. Fully Homomorphic Encryption Using Ideal Lattices [C]//Proceedings of the 41st Annual ACM Symposium on Theory of Computing
(STOC’09), May 31 - Jun 2, 2009, Bethesda, MD, USA. New York, NY, USA: ACM, 2009: 169-178.
[Abstract] Problems with data security impede the widespread application of cloud computing. Although data can be protected through encryption, effective retrieval of encrypted data is difficult to achieve using traditional methods. This paper analyzes encrypted storage and retrieval technologies in cloud storage applications. A ranking method based on fully homomorphic encryption is proposed to meet demands of encrypted storage. Results show this method can improve efficiency.
[Keywords] cloud storage; vector space model; relevance ranking