 
| Title: | LLMs and data privacy | 
| Subject: | Computer science, Software engineering | 
| Level: | Advanced | 
| Description: | LLMs are seeing a surge in adoption thanks to their powerful capabilities to automate many tasks from simple natural language requests (prompts). This also holds for industrial-grade application scenarios, e.g. requirements analysis, automated code generation, and more. However, LLMs have also gained attention in the security community due to revealed vulnerabilities and/or data privacy issues. In fact, especially the top-of-the-class models (e.g. ChatGPT-5) rely on self-improvement mechanisms that keep track of the submitted prompts and data and then perform corresponding analyses on them. When dealing with confidential data, these security issues represent a major hindrance to the use of LLMs. As a matter of fact, companies are forced either to use corporate versions of the tools that guarantee no data sharing, or to adopt their own proprietary solution. Unfortunately, the former alternative still leaves open the risk of data leakage, while the latter considerably limits the performance of the local LLM due to the computational resources needed to train and maintain such models. This thesis investigates possible alternatives to preserve data privacy without giving up the full power provided by last-generation LLMs. In particular, this work will study possible solutions to: - apply a data "masking" procedure to encode the input data; - create prompts for LLMs using the masked data; - let the target LLM perform the required computations on the masked data; - reconstruct the real data from the masked output returned by the LLM (see the illustrative sketch after this table). For this idea to work correctly, it is essential that the masking strategy preserves the logical structure of the original data, so that the obtained results can be mapped back in a reliable and correct way. Moreover, the masking should not be "straightforward" (e.g. a fixed encoding strategy), to avoid the risk of someone reconstructing the original confidential data. Thesis objectives: - a literature survey on existing approaches to preserving data privacy for LLM inputs; - the development of a conceptual solution for handling a set of well-defined data operations and corresponding masking strategies; - a prototypical implementation of the concept to validate the proposed solutions. | 
| Start date: | 2026-01-01 | 
| End date: | 2026-06-30 | 
| Prerequisites: | - basic knowledge of LLMs; - adequate knowledge of programming languages (e.g. Python or similar); - basic knowledge of data structures and operations typically used in AI computations (graphs, vectors, etc.). | 
| IDT supervisors: | Riccardo Rubei | 
| Examiner: | Antonio Cicchetti | 
| Comments: | This thesis is suitable for 1 or 2 students. In order to make the thesis suitable for the Software Engineering specialization, the data taken into account shall be linked to SE stages (e.g. requirements, testing, etc.). | 
| Company contact: | |
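
To make the mask/prompt/unmask round trip in the description concrete, the following is a minimal Python sketch, assuming the sensitive terms are known in advance. All identifiers here (`mask`, `unmask`, `call_llm`, the `ENT_...` placeholder format, and the sample requirement) are illustrative assumptions, not part of the thesis specification.

```python
# Minimal sketch of the mask -> prompt -> unmask round trip described
# above. Function names, the placeholder format, and the stubbed LLM
# call are illustrative assumptions, not prescribed by the thesis.
import secrets

def mask(text: str, sensitive_terms: list[str]) -> tuple[str, dict[str, str]]:
    """Replace each sensitive term with a fresh random placeholder, so the
    masking is not a fixed encoding; return masked text and the reverse map."""
    mapping: dict[str, str] = {}
    # Longest terms first, so shorter terms never clobber parts of longer ones.
    for term in sorted(sensitive_terms, key=len, reverse=True):
        placeholder = f"ENT_{secrets.token_hex(4)}"
        mapping[placeholder] = term
        text = text.replace(term, placeholder)
    return text, mapping

def unmask(text: str, mapping: dict[str, str]) -> str:
    """Map the placeholders found in the LLM output back to the real data."""
    for placeholder, term in mapping.items():
        text = text.replace(placeholder, term)
    return text

def call_llm(prompt: str) -> str:
    """Stub standing in for the remote model; a real version would call
    an LLM API, which only ever sees the placeholders."""
    return f"Rewritten requirement: {prompt}"

requirement = "The PaymentService must notify AcmeBank within 5 seconds."
masked, mapping = mask(requirement, ["PaymentService", "AcmeBank"])
reply = call_llm(masked)        # confidential names never leave the machine
print(unmask(reply, mapping))   # original data restored locally
```

Because the placeholders are drawn at random per session, the mapping cannot be reconstructed from the masked text alone, in line with the requirement that the masking must not be a fixed encoding; at the same time the placeholders keep their position in the sentence, preserving the logical structure needed to map results back reliably.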