Abhishek Mishra is a student at HNLU Raipur
Keywords
Artificial Intelligence, Federated Learning, Data Protection, Finance
Introduction
The world has seen a rapid rise in artificial intelligence (“AI”) technology in the past two years. Cut-throat competition in the development of AI systems has created a high demand for diverse data on which to train AI models. India, a country of 720 million internet users, has proven to be a goldmine for data, a fact the Indian government (“CG”) knows well, as can be inferred from its plan to create a large database of anonymised non-personal data. Further, it is no surprise that AI models like those of OpenAI and TikTok have benefitted immensely from the training datasets they were able to build from user interactions in India. In light of these facts, there should be no doubt that data is being centralised for the training of AI models. Certain sensitive personal data should not be kept in such silos, given the attendant security and privacy risks. This blog article suggests federated learning (“FL”) as a mandatory measure in certain data-sensitive sectors of the economy, under the Digital Personal Data Protection Bill, 2022 (“Bill”).
Training of AI Models
AI systems are based on machine learning (“ML”) algorithms. At the base of complex AI systems/models lies a web of neural networks that enables deep learning. Neural networks are an artificial construct loosely modelled on the cognitive architecture of the human brain. Such a network accepts data and learns to process it by assigning weights and biases to individual labels or attributes. As a continuous stream of data is fed to the network, the model is trained: the weights and biases it assigns become progressively more accurate. These learned weights and biases constitute the parameters of the AI system.
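A minimal sketch in Python may make this concrete (the data and values here are purely hypothetical): a single-"neuron" model repeatedly adjusts its weight and bias against the error on each incoming example, and those adjusted values are its learned parameters.

```python
# Illustrative sketch only: a single-"neuron" model learning a weight and
# a bias from a stream of (input, label) examples via gradient descent.
import numpy as np

rng = np.random.default_rng(0)
w, b = 0.0, 0.0                    # the model's parameters
lr = 0.1                           # learning rate

# Hypothetical data stream: the true relationship is y = 2x + 1 plus noise.
for _ in range(1000):
    x = rng.uniform(-1.0, 1.0)
    y = 2 * x + 1 + rng.normal(0, 0.05)
    y_hat = w * x + b              # prediction with the current parameters
    err = y_hat - y
    w -= lr * err * x              # nudge the weight against the error
    b -= lr * err                  # nudge the bias against the error

print(f"learned w={w:.2f}, b={b:.2f}")   # converges towards w=2, b=1
```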
However, the learning process does not stop with the release of the AI system in the market. The models keep improving with the data generated by their interactions with users. In this continuous learning process, the concern is the centralised nature of the training dataset: the model sends this data to a central database and learns in real time in a global environment. In such a case, risks to data privacy naturally arise, and they are exacerbated by the vulnerability of AI models to cyber-attacks.
OpenAI, the developer of ChatGPT, admits that it retains ‘certain data’ from users’ interactions with its ChatGPT and DALL-E models. It claims that it takes steps to reduce the amount of personal information in the collected dataset before the data is used to improve its models. However true that claim might be, the data is still kept in a silo, locked away in the possession of a corporate entity. This is certainly not the ideal circumstance as far as data privacy and security are concerned. One tool to mitigate these risks can be found in federated learning regimes.
FL: What is it and how can it help?
FL is not a new concept; it was first introduced by Google in 2016. In an FL regime for ML models, a user’s data never leaves the interface device/network. A default model is transmitted to the users, who then train it using their personal data, either deliberately or in the background as part of a service agreement. This trained model is then transferred to a higher node. In a secure architecture, this higher node will not inspect any individual model update before averaging it with a thousand or more similarly trained models.
What is transferred in an FL environment is not the personal data of the user but the trained parameters of the ML model. For example, the model may learn from one user’s data the weight it should assign to a particular label or attribute; only that weight parameter is then transferred to a higher echelon or node. In such an environment, the user’s personal data is never transferred, and multiple users across the network collaboratively train the AI model’s parameters.
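By way of illustration, the following Python sketch (hypothetical data and a deliberately simple model, in the spirit of Google's federated averaging scheme) shows several clients training local copies of a default model on private data; only the resulting parameters leave each client and are averaged into the global model.

```python
# Illustrative FedAvg-style sketch: clients train local copies of a simple
# model on private data; only the learned parameters ever leave a client.
import numpy as np

rng = np.random.default_rng(1)

def local_train(params, data, lr=0.1, epochs=5):
    """Train a linear model y = w*x + b on one client's private data."""
    w, b = params
    for _ in range(epochs):
        for x, y in data:
            err = (w * x + b) - y
            w -= lr * err * x
            b -= lr * err
    return np.array([w, b])        # only the parameters are sent upward

# Hypothetical private datasets held by three separate clients.
clients = [[(x, 2 * x + 1) for x in rng.uniform(-1, 1, 50)] for _ in range(3)]

global_params = np.zeros(2)        # the "default model" pushed to clients
for _ in range(10):                # a few federated rounds
    updates = [local_train(global_params, d) for d in clients]
    global_params = np.mean(updates, axis=0)   # the higher node averages

print(global_params)               # approaches [2.0, 1.0]
```

Note that the raw (x, y) pairs never appear outside `local_train`; the server only ever handles two numbers per client per round.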
These trained models are then transferred over encrypted protocols designed to prevent identification of the original training interface/user. Such an architecture can comply with even the most stringent data protection laws, such as the General Data Protection Regulation (“GDPR”), and yet enable the construction of cross-industry, cross-data, and cross-domain AI systems. It is thus no surprise that companies like Apple are working on FL architectures so that the data used to improve their apps never leaves the device. Further, the decentralised learning environment of FL ensures greater variance in the training data, resulting in better generalisation and globalisation of AI system training. The system needs to communicate and aggregate the model updates in a secure, efficient, scalable, and fault-tolerant way; it is only the combination of research with this infrastructure that makes the benefits of FL possible.
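One well-known way of hiding individual updates during aggregation is pairwise masking, the idea underlying secure-aggregation protocols proposed by Google researchers. The toy Python sketch below (illustrative only, without the key-exchange and dropout-recovery machinery a real protocol needs) shows how masks shared between client pairs cancel in the sum, so the server learns only the aggregate.

```python
# Toy sketch of pairwise-masking secure aggregation (illustrative only):
# every pair of clients shares a random mask; one adds it and the other
# subtracts it, so each upload looks random but the sum stays exact.
import numpy as np

rng = np.random.default_rng(2)
updates = [rng.normal(size=4) for _ in range(3)]   # true client updates

masked = [u.copy() for u in updates]
for i in range(len(updates)):
    for j in range(i + 1, len(updates)):
        mask = rng.normal(size=4)  # secret shared only by clients i and j
        masked[i] += mask          # client i adds the pairwise mask
        masked[j] -= mask          # client j subtracts the same mask

# The server sees only the masked vectors, yet their sum is unchanged.
assert np.allclose(sum(masked), sum(updates))
```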
It is pertinent to point out that FL alone is no guarantee of data privacy. However, when combined with measures such as differential privacy and homomorphic encryption, FL can significantly mitigate privacy risks; establishing secure multiparty computation architectures is therefore of high significance. Further, it is possible that in some cases the trained model may have personal data embedded within it, but even then it is likely to be in a much-reduced form and to pose a much lower privacy risk than the original data.
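A sketch of one standard recipe for combining FL with differential privacy (assumed here for illustration, not prescribed by any particular law or deployment): each client update is clipped to a bounded norm and perturbed with Gaussian noise before aggregation, so that no single user's contribution can dominate or be reconstructed from the result.

```python
# Sketch of one common FL + differential-privacy recipe (illustrative):
# clip each client update's norm, then add calibrated Gaussian noise.
import numpy as np

rng = np.random.default_rng(3)

def privatise(update, clip=1.0, noise_std=0.1):
    norm = np.linalg.norm(update)
    clipped = update * min(1.0, clip / max(norm, 1e-12))  # bound influence
    return clipped + rng.normal(0, noise_std, size=update.shape)

client_updates = [rng.normal(size=4) for _ in range(100)]
aggregate = np.mean([privatise(u) for u in client_updates], axis=0)
print(aggregate)   # close to the true mean, but no single update is exposed
```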
Implementation in specific sectors
Certain sectors could benefit greatly from AI, but data privacy concerns do not allow the widespread deployment of AI-enabled systems within them. The most important of these are the healthcare and finance sectors. We shall now briefly examine how FL might work in each.
Healthcare: There is no doubt that healthcare data is highly sensitive personal information and cannot be shared with centralised databases maintained by third-party entities. With FL, however, AI systems can be trained on the health records of an individual or a small group of individuals at a local node.
For instance, a group of hospitals in a region may decide to collaborate on analytics in order to maximise the potential of their combined datasets. The hospitals would set up a central server to perform the aggregation, with each hospital simply sending its locally trained weights to a node (or directly to the server). Where a node is used, the locally trained model can be securely transferred for collaborative learning with other local models from across the operational area of the AI system. The server then computes the optimal generalised model and distributes this global model to all parties. No party accesses another’s data; after the global computation, each participant receives only the resulting optimal model. This process ensures that individuals’ data remains either with them or with the hospital (or any other local organisation that forms the basic node).
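The server's aggregation step might look like the following Python sketch (the hospital record counts and parameter values are hypothetical), where each hospital's local weights are averaged in proportion to the number of records it trained on before the global model is redistributed:

```python
# Hypothetical aggregation step at the hospitals' central server: local
# weights are averaged in proportion to each hospital's record count.
import numpy as np

# (locally trained parameters, number of local records) per hospital node
local_models = [
    (np.array([0.9, 1.2]), 500),
    (np.array([1.1, 0.8]), 2000),
    (np.array([1.0, 1.0]), 1500),
]

total_records = sum(n for _, n in local_models)
global_model = sum(params * (n / total_records) for params, n in local_models)
print(global_model)   # the only artefact redistributed to every hospital
```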
The best use of this regime would be in AI systems designed for the diagnosis of symptoms and diseases. Further, drawing on data from diverse genetic and geographic demographics would result in more accurate and better-generalised training of the model. Article 9 of the GDPR explicitly prohibits the processing of data concerning the health of the data subject, subject to only a few exceptions, making it clear that centralisation of medical histories cannot be allowed. The position in India should not be any different.
Finance: Let us examine one more sector with heightened data privacy and security risks. In the finance industry, customers expect high standards of data protection and integrity for their personal data. From a regulatory point of view, violations of the guidelines also carry severe financial consequences. Nevertheless, from the point of view of value creation, it is essential for banks to evaluate customer data using statistical methods and algorithms. Banks are thus caught between maintaining data protection and pursuing their own business model. Here, a mandatory FL architecture may prove to be a game changer.
The trend toward business ecosystems has existed for years, including in the financial industry, and the exchange of data is the focus of any such ecosystem. The basic principle of open banking is the use of standardised interfaces (i.e., APIs) for sharing data and putting it to joint use. Shared data enables new services and offers several advantages to the participants in such an ecosystem: fraud detection, lending, and personalised services, for example, can all be offered on the basis of consolidated data. But this is exactly where one of the biggest challenges of open banking lies: customer (personal) data must be handled and evaluated in a trustworthy manner, so it is not possible to simply send this data between banks for evaluation.
The use of FL would enable the application of machine learning methods to customer data while preserving privacy. Each bank takes on the role of a decentralised computing unit and evaluates its own local data. Only the resulting AI models leave the bank after training and are made available to the participants in open banking on a central marketplace. In this way, the participants in the open banking ecosystem receive joint insights from the data and can incorporate them into their own value creation.
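As a hypothetical sketch of this flow in Python (synthetic data, and a naive parameter-averaging marketplace assumed purely for illustration), each bank fits a fraud-detection model on its own transactions, and only the fitted coefficients are published:

```python
# Hypothetical open-banking flow: each bank fits a fraud model on its own
# synthetic "transactions"; only fitted coefficients reach the marketplace.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(4)

def bank_local_model(n=1000):
    X = rng.normal(size=(n, 3))                  # private transaction features
    y = (X @ np.array([1.5, -2.0, 0.5]) + rng.normal(0, 0.5, n) > 0).astype(int)
    clf = LogisticRegression().fit(X, y)
    return clf.coef_.ravel(), clf.intercept_[0]  # parameters, never the data

# Three banks publish parameters; a naive marketplace model averages them.
coefs, intercepts = zip(*(bank_local_model() for _ in range(3)))
marketplace_coef = np.mean(coefs, axis=0)
marketplace_bias = np.mean(intercepts)
print(marketplace_coef, marketplace_bias)
```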
A mandatory FL regime?
Having discussed the benefits of FL, we may now consider the viability of a mandatory FL regime in targeted sectors. There is no doubt that an optimally trained AI model that has been exposed to diverse datasets will make invaluable contributions to the advancement of specific sectors of the economy. It must be noted that healthcare and finance are merely two examples of sectors that involve highly sensitive personal data and would benefit from the wider adoption of AI-based APIs. Research may identify other areas that would similarly benefit from collaborative learning over personal data under an FL regime.
There is a unique opportunity for the CG to utilise delegated legislation to bring in sector-specific regulations that balance privacy rights while spurring economic growth. There is a general consensus that the 2022 Bill is primarily skeletal legislation that confers extensive rule-making powers on the CG. While this extensive delegation has been widely criticised, it does provide the opportunity for targeted regulation. The CG ought to use its powers under Clause 26 of the Bill to introduce regulations specific to the sectors identified above and others, much akin to the Information Technology (Reasonable Security Practices and Procedures and Sensitive Personal Data or Information) Rules, 2011.
It must be noted that the added benefits of such regulations may include lower bandwidth utilisation, freeing up a large amount of internet traffic, since only model parameters, not raw data, traverse the network. They would also encourage competition and innovation, as FL architectures save much of the capital investment required to set up data silos. In addition, since computation and training are performed on the end device (or a device at the local level), there is no longer a requirement for high-performance servers. This would naturally benefit start-ups and small technology firms, which is one of the stated goals of the Digital India programme and the upcoming Digital India Act.
Conclusion
Developers have been using centralised data to train AI models post-deployment, triggering data privacy concerns. A viable solution can be found in FL, which has already been successfully deployed to train AI-based applications like Google’s Gboard. FL uses the end device to train a default AI model and uploads only the parameters to a higher node; the central server aggregates the received models and issues a global update in the form of an efficient, well-trained model. The CG ought to use the vast rule-making power it has under the DPDP Bill to make sector-specific regulations mandating the use of FL when training AI models that operate on highly sensitive personal data, such as in healthcare and finance. Further research may identify other sectors where a centralised database poses a significant threat to data security and privacy, and assess the feasibility of FL in them.