Bachelor and Master Theses

To apply for this thesis, please contact the thesis supervisor(s) listed below.
Title: Synthetic Data Generation with applications in testing and development of AI in Fashion Retail
Subject: Computer science, Embedded systems, Robotics, Software engineering
Level: Basic, Advanced
Description:

Main idea
Data-driven algorithms are becoming increasingly important for retailers, since they can significantly improve both supply chain efficiency and customer experience. As a result, the data collected by retailers has grown enormously in value, and with that value the consequences of a data leak have become correspondingly severe. A way to develop algorithms and software without directly using proprietary data is therefore highly desirable.

The goal of this thesis is to use the Synthetic Data Vault (SDV) for this purpose. Directly from the README.md file of SDV:

The Synthetic Data Vault (SDV) is a Synthetic Data Generation ecosystem of libraries that allows users to easily learn single-table, multi-table, and timeseries datasets to generate new Synthetic Data that has the same format and statistical properties as the original dataset.

Synthetic data can then be used to supplement, augment, and in some cases replace real data when training Machine Learning models. Additionally, it enables the testing of Machine Learning or other data-dependent software systems without the risk of exposure that comes with data disclosure.

Underneath the hood it uses several probabilistic graphical modeling and deep learning based techniques. To enable a variety of data storage structures, we employ unique hierarchical generative modeling and recursive sampling techniques.
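To make the idea concrete, here is a deliberately simplified sketch of what "learn a single table, then sample new rows with the same format and statistics" means. This is not SDV's actual implementation (SDV uses copulas, deep generative models, and hierarchical techniques); it fits only independent Gaussian marginals per numeric column, and all names in it are hypothetical.

```python
# Toy illustration of single-table synthetic data generation:
# fit per-column marginals on the real table, then sample fresh rows.
import numpy as np
import pandas as pd

def fit_marginals(df: pd.DataFrame) -> dict:
    """Record mean/std for each numeric column of the real table."""
    return {col: (df[col].mean(), df[col].std(ddof=0)) for col in df.columns}

def sample_synthetic(params: dict, n_rows: int, seed: int = 0) -> pd.DataFrame:
    """Draw new rows column by column from the fitted marginals."""
    rng = np.random.default_rng(seed)
    return pd.DataFrame(
        {col: rng.normal(mu, sigma, n_rows) for col, (mu, sigma) in params.items()}
    )

# Fabricated toy table standing in for a real retail table.
real = pd.DataFrame({"price": [10.0, 12.0, 11.0, 13.0], "qty": [1.0, 2.0, 2.0, 3.0]})
synthetic = sample_synthetic(fit_marginals(real), n_rows=100)
# Same columns, similar statistics, but no real row is reused.
```

Unlike this sketch, SDV also models correlations between columns and relationships between tables, which is precisely what makes it useful for the multi-table and timeseries steps below.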

Progression
1. Generate a dataset with a single table
o Choose 2-3 tables with different characteristics (justify the choice!)
o Select 2-3 promising algorithmic configurations (justify the choice!)
o Test and draw conclusions using both model and business metrics
2. Generate a dataset with multiple tables
o Choose 2-3 sets of related tables with different characteristics (justify the choice!)
o Select 2-3 promising algorithmic configurations (justify the choice!)
o Test and draw conclusions using both model and business metrics
3. Generate a timeseries dataset (optional)
o Choose 2-3 timeseries with different characteristics (justify the choice!)
o Select 2-3 promising algorithmic configurations (justify the choice!)
o Test and draw conclusions using both model and business metrics
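Since every step above repeats the same pattern (a few datasets × a few configurations, each scored on model and business metrics), the experiments can be organized as a single grid. The sketch below only shows that scaffolding; the table names and configuration names are assumed examples, and the scoring function is a placeholder where SDV models and metrics would actually be called.

```python
# Hypothetical experiment scaffold: one row of results per
# (dataset, configuration) pair, so conclusions in steps 1-3
# are drawn from the same, comparable grid.
import itertools
import pandas as pd

tables = ["transactions", "articles"]        # assumed example datasets
configs = ["gaussian_copula", "ctgan"]       # assumed example configurations

def run_experiment(table: str, config: str) -> dict:
    # Placeholder: fit synthesizer `config` on dataset `table`,
    # then compute one model metric and one business metric.
    return {"table": table, "config": config,
            "model_score": 0.0, "business_score": 0.0}

results = pd.DataFrame(
    [run_experiment(t, c) for t, c in itertools.product(tables, configs)]
)
print(results)  # one row per (table, config) pair
```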

In general, the quality-over-quantity principle should apply: it is better to complete a few steps well than to rush through all of them poorly.

Performance Metrics
There should be metrics at two levels: model metrics and business metrics.

Model Metrics
The SDV library has an evaluation framework and several metrics that capture how well the model fits the original dataset. Select the relevant procedure and metrics for the problem at hand, and always use these metrics as a necessary but not sufficient condition.
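The concrete metrics should come from SDV's own evaluation framework, but as an illustration of what a model metric measures, here is a minimal two-sample Kolmogorov-Smirnov statistic computed per column (0 means the real and synthetic distributions are indistinguishable, 1 means they are completely disjoint). This is a generic stand-in, not SDV's API, and the variable names are hypothetical.

```python
# Minimal model-metric sketch: compare the distribution of a real
# column against its synthetic counterpart with a two-sample KS statistic.
import numpy as np

def ks_statistic(a: np.ndarray, b: np.ndarray) -> float:
    """Largest gap between the empirical CDFs of the two samples."""
    a, b = np.sort(a), np.sort(b)
    grid = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, grid, side="right") / len(a)
    cdf_b = np.searchsorted(b, grid, side="right") / len(b)
    return float(np.max(np.abs(cdf_a - cdf_b)))

rng = np.random.default_rng(0)
real_col = rng.normal(0.0, 1.0, 500)
good_synth = rng.normal(0.0, 1.0, 500)     # same distribution
bad_synth = rng.normal(5.0, 1.0, 500)      # badly shifted distribution
print(ks_statistic(real_col, good_synth))  # small value
print(ks_statistic(real_col, bad_synth))   # close to 1
```

As the proposal says, a good score on such a metric is necessary but not sufficient: a model can match marginal distributions while still failing on the business metrics below.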

Business Metrics
Business metrics relate to the application at hand. It would be useful to have a few metrics that measure how well the generated data serves our use cases:
• development on the local machine
• testing
• research collaborations
• publishing of open source code and data
• ...

One approach could be to define a downstream task, say a simple classification or regression problem, and run the same experiments using both the real data and our generated synthetic data.
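The downstream-task idea above can be sketched as "train on synthetic, test on real": fit the same classifier once on real training data and once on synthetic data, then compare accuracy on held-out real data. The sketch below uses fabricated toy data and a deliberately simple nearest-centroid classifier in pure numpy; in the thesis, the synthetic set would come from SDV and any standard classifier would do.

```python
# Hedged sketch of the proposed business metric: if a model trained on
# synthetic data scores close to one trained on real data, the synthetic
# data preserved the signal this task needs.
import numpy as np

rng = np.random.default_rng(0)

def make_data(n: int):
    """Fabricated toy data: two Gaussian classes in 2-D."""
    x0 = rng.normal([0, 0], 1.0, size=(n, 2))
    x1 = rng.normal([3, 3], 1.0, size=(n, 2))
    return np.vstack([x0, x1]), np.repeat([0, 1], n)

def fit_centroids(X, y):
    """Nearest-centroid classifier: one mean point per class."""
    return np.stack([X[y == c].mean(axis=0) for c in (0, 1)])

def accuracy(centroids, X, y):
    """Fraction of points whose nearest centroid matches their label."""
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return float((d.argmin(axis=1) == y).mean())

X_real, y_real = make_data(200)   # stands in for the real table
X_syn, y_syn = make_data(200)     # stands in for the SDV output
X_test, y_test = make_data(100)   # held-out real data

acc_real = accuracy(fit_centroids(X_real, y_real), X_test, y_test)
acc_syn = accuracy(fit_centroids(X_syn, y_syn), X_test, y_test)
# Compare acc_syn against acc_real: a small gap suggests the synthetic
# data is good enough for this use case.
```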

Literature
SDV - Synthetic Data Vault
The SDV GitHub page contains good tutorials and links to scientific papers, which should be enough to guide the M.Sc. project.

Start date:
End date:
Prerequisites:

AI, machine learning, programming

IDT supervisors: Mobyen Uddin Ahmed
Examiner: Shahina Begum
Comments:
Company contact:

H&M Misbah Uddin