  1. Entity Matching: developing machine learning, deep learning, and language model techniques to improve the accuracy, explainability, automation, and unsupervised evaluation of entity matching.
  2. Time Series Analytics: techniques for supporting time series analytics in real-world scenarios.
  3. Fairness: efficient application of mitigation techniques for guaranteeing fairness in databases.
  4. Tabular Language Models: developing techniques for the automatic analysis of tabular records via Large Language Models.

Entity Matching

Development of Machine Learning and Deep Learning Techniques for Entity Matching

Understanding if entries in a dataset refer to the same real-world entity (i.e., entity matching – EM) is a challenging task even for human experts. Our research in this area concerns the development of:

  1. Explaining and making the entity matching process explainable;
  2. Automatic techniques for performing EM;
  3. Unsupervised evaluation of the EM process.

Explaining and Making the Entity Matching Process Explainable

State-of-the-art approaches based on Machine Learning (ML) and Deep Learning (DL) models are highly accurate but suffer from low interpretability. From the user's perspective, these models act as oracles. This is a critical problem in many operational scenarios where traceability, scrutiny, and users' confidence in the model are fundamental requirements alongside model accuracy. The research in this area has concerned:

  • A multi-facet analysis of the components of pre-trained and fine-tuned BERT architectures applied to an EM task.
    • Matteo Paganelli, Francesco Del Buono, Andrea Baraldi, Francesco Guerra: Analyzing How BERT Performs Entity Matching. Proc. VLDB Endow. 15(8): 1726-1738 (2022)
    • Matteo Paganelli, Donato Tiano, Francesco Guerra: A multi-facet analysis of BERT-based entity matching models. VLDB J. 33(4): 1039-1064 (2024)
    • Riccardo Benassi, Francesco Guerra, Matteo Paganelli, Donato Tiano: Explaining Entity Matching with Clusters of Words. ICDE 2024: 2325-2337
  • Landmark Explanation, a generic and extensible framework that extends the capabilities of a post-hoc perturbation-based explainer to the EM scenario. Landmark Explanation generates perturbations that take advantage of the particular schemas of EM datasets, thus producing explanations that are more accurate and more interesting to users than those generated by competing approaches.
    • Andrea Baraldi, Francesco Del Buono, Matteo Paganelli, Francesco Guerra: Landmark Explanation: An Explainer for Entity Matching Models. CIKM 2021: 4680-4684
    • Andrea Baraldi, Francesco Del Buono, Matteo Paganelli, Francesco Guerra: Using Landmarks for Explaining Entity Matching Models. EDBT 2021: 451-456
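The landmark idea can be illustrated with a minimal sketch: fix one record of the pair as the "landmark", perturb the other record one attribute at a time, and read each attribute's contribution from the resulting score drop. The toy Jaccard-overlap scorer below is an invented stand-in for a trained EM model, and this single-attribute ablation is a simplification of the token-level, LIME-based perturbations used by the actual system.

```python
# Toy sketch of landmark-style perturbation explanation for EM.
# `toy_match_score` is a hypothetical stand-in for a trained matcher.

def toy_match_score(left: dict, right: dict) -> float:
    """Jaccard overlap of attribute tokens -- a stand-in for a real EM model."""
    lt = {t for v in left.values() for t in str(v).lower().split()}
    rt = {t for v in right.values() for t in str(v).lower().split()}
    return len(lt & rt) / len(lt | rt) if lt | rt else 0.0

def landmark_attributions(landmark: dict, record: dict, score=toy_match_score):
    """Perturb `record` one attribute at a time, keeping `landmark` fixed;
    the score drop approximates each attribute's contribution to the match."""
    base = score(landmark, record)
    contributions = {}
    for attr in record:
        perturbed = dict(record)
        perturbed[attr] = ""          # ablate the attribute's tokens
        contributions[attr] = base - score(landmark, perturbed)
    return contributions

left = {"name": "canon eos 5d", "brand": "canon"}
right = {"name": "eos 5d mark", "brand": "canon"}
attrs = landmark_attributions(left, right)
print(attrs)   # the "name" attribute contributes most to the match
```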

Automatic techniques for EM

The research studies the application of automated machine learning (AutoML) approaches to the problem of Entity Matching (EM). This would make the existing, highly effective Machine Learning (ML) and Deep Learning (DL) based approaches for EM usable also by non-expert users, who do not have the expertise to train and tune such complex systems. To address this issue, we introduce a new component, the EM adapter, to be pipelined with standard AutoML systems, which preprocesses the EM datasets to make them usable by automated approaches.

  • Matteo Paganelli, Francesco Del Buono, Marco Pevarello, Francesco Guerra, Maurizio Vincini: Automated Machine Learning for Entity Matching Tasks. EDBT 2021: 325-330
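The adapter's core job can be sketched as turning record pairs into a flat feature table that any generic (Auto)ML classifier can consume. The per-attribute similarity features below are an illustrative assumption, not the design from the paper:

```python
# Minimal sketch of an "EM adapter": flatten record pairs into
# per-attribute similarity features for a generic (Auto)ML classifier.
import difflib

def em_adapter(pairs):
    """pairs: list of (left_record, right_record) dicts sharing a schema.
    Returns one flat feature row (dict) per pair."""
    rows = []
    for left, right in pairs:
        row = {}
        for attr in left:
            a, b = str(left[attr]), str(right.get(attr, ""))
            # string similarity and length gap per shared attribute
            row[f"{attr}_sim"] = difflib.SequenceMatcher(None, a, b).ratio()
            row[f"{attr}_len_diff"] = abs(len(a) - len(b))
        rows.append(row)
    return rows

pairs = [({"title": "iphone 13", "price": "799"},
          {"title": "iphone 13 128gb", "price": "799"})]
features = em_adapter(pairs)
print(features[0])
```

The resulting rows can then be fed, together with match/non-match labels, to an off-the-shelf AutoML system as an ordinary binary classification dataset.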

Evaluating the EM process

Evaluation is a bottleneck in data integration processes: it is performed by domain experts through onerous manual data inspections. This task is particularly heavy in real business scenarios, where the large amount of data makes checking all integrated tuples infeasible. Our idea is to address this issue by providing the experts with unsupervised measures. This research is carried out with the University of Padua.

  • Matteo Paganelli, Francesco Del Buono, Francesco Guerra, Nicola Ferro: Evaluating the integration of datasets. SAC 2022: 347-356
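As an illustration of what an unsupervised signal can look like (this specific measure is an assumption for exposition, not one from the paper): a predicted match relation should be close to transitive, so pairs that violate transitivity can be surfaced for manual inspection without any ground truth.

```python
# Illustrative unsupervised quality signal: transitivity violations in
# predicted matches. If a~b and b~c are predicted but a~c is not, the
# pair (a, c) is flagged for expert inspection -- no labels required.
from itertools import combinations

def transitivity_violations(matches):
    """matches: set of frozensets {a, b} predicted to be the same entity."""
    neighbours = {}
    for a, b in (tuple(m) for m in matches):
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    violations = set()
    for node, nbrs in neighbours.items():
        for x, y in combinations(nbrs, 2):
            if frozenset((x, y)) not in matches:
                violations.add(frozenset((x, y)))
    return violations

preds = {frozenset(("r1", "r2")), frozenset(("r2", "r3"))}
print(transitivity_violations(preds))   # flags the r1/r3 pair
```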

Time Series Analytics

Developing techniques for (Multivariate) Time Series Analytics. Our research in this area concerns the development of:

  1. Explainable techniques for clustering time series
  2. Forecasting irregularly sampled time series
  3. Preprocessing the datasets for the application of self-supervised learning

Explainable techniques for clustering time series

Development of Time2Feat, an end-to-end machine learning system for multivariate time series clustering that enhances accuracy and interpretability by extracting interpretable features, applying dimensionality reduction, and allowing domain specialists to semi-supervise the process with a small set of labeled data.

  • Angela Bonifati, Francesco Del Buono, Francesco Guerra, Miki Lombardi, Donato Tiano: Interpretable Clustering of Multivariate Time Series with Time2Feat. Proc. VLDB Endow. 16(12): 3994-3997 (2023)
  • Angela Bonifati, Francesco Del Buono, Francesco Guerra, Donato Tiano: Time2Feat: Learning Interpretable Representations for Multivariate Time Series Clustering. Proc. VLDB Endow. 16(2): 193-201 (2022)
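The pipeline shape can be sketched in a few lines: extract interpretable per-channel statistics, reduce dimensionality, then cluster. The feature set and models below are deliberately simplified assumptions; Time2Feat uses a much richer feature library plus feature selection and optional semi-supervision.

```python
# Simplified sketch of a feature-based multivariate time series
# clustering pipeline (features -> reduction -> clustering).
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

def extract_features(X):
    """X: array (n_series, n_channels, n_timesteps) -> interpretable stats."""
    feats = [X.mean(axis=2), X.std(axis=2),
             X.min(axis=2), X.max(axis=2),
             np.diff(X, axis=2).mean(axis=2)]   # mean first difference (trend)
    return np.concatenate(feats, axis=1)        # (n_series, 5 * n_channels)

def cluster_series(X, n_clusters=2, n_components=2, seed=0):
    feats = extract_features(X)
    reduced = PCA(n_components=n_components, random_state=seed).fit_transform(feats)
    return KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(reduced)

rng = np.random.default_rng(0)
flat = rng.normal(0.0, 0.1, size=(5, 2, 50))       # low-mean series
shifted = rng.normal(5.0, 0.1, size=(5, 2, 50))    # high-mean series
labels = cluster_series(np.vstack([flat, shifted]))
print(labels)   # the two groups fall into different clusters
```

Because each feature is a named statistic of a named channel, the cluster assignments can be explained back to the specialist in terms of those features.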

Forecasting irregularly sampled time series

The development of techniques for forecasting irregularly sampled time series focuses on handling uneven intervals, missing values, and gaps, improving prediction accuracy through methods such as interpolation, resampling, and specialized machine learning algorithms.
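One of the baseline strategies mentioned above, resampling plus interpolation, can be sketched as follows: project the irregular observations onto a regular grid and fill the gaps, so that a standard forecaster can be applied afterwards. The timestamps and values are invented for illustration.

```python
# Sketch: map an irregularly sampled series onto a regular 5-minute grid,
# filling gaps with time-aware linear interpolation.
import pandas as pd

stamps = pd.to_datetime(["2024-01-01 00:00", "2024-01-01 00:07",
                         "2024-01-01 00:11", "2024-01-01 00:30"])
series = pd.Series([1.0, 2.0, 4.0, 3.0], index=stamps)

# Bin onto a regular grid, then interpolate proportionally to elapsed time.
regular = series.resample("5min").mean().interpolate(method="time")
print(regular)
```

The trade-off is that interpolation invents values in long gaps; specialized models that consume the raw irregular timestamps avoid this distortion.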

Application of self-supervised learning

The research aims to bridge the gap between self-supervised and supervised techniques in time series forecasting.


Fairness

Efficient application of mitigation techniques for guaranteeing fairness in databases. The research concerns the development of an enhanced algorithmic mitigation technique for incorporating fairness constraints into machine learning models, offering flexibility in the choice of fairness metrics, improved scalability, and faster execution while maintaining accuracy and fairness guarantees.

  • Andrea Baraldi, Matteo Brucato, Miroslav Dudík, Francesco Guerra, Matteo Interlandi: FairnessEval: a Framework for Evaluating Fairness of Machine Learning Models. EDBT 2025 (demonstration)
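To make the notion of a fairness metric concrete, here is one measure such an evaluation framework would typically report, the demographic parity difference (the gap in positive-prediction rates across sensitive groups). The metric choice and data are illustrative assumptions, not FairnessEval's actual interface.

```python
# Demographic parity difference: the max gap in positive-prediction
# rate between sensitive groups (0.0 = parity).

def demographic_parity_difference(y_pred, groups):
    """y_pred: 0/1 predictions; groups: sensitive attribute per sample."""
    counts = {}
    for pred, g in zip(y_pred, groups):
        n, pos = counts.get(g, (0, 0))
        counts[g] = (n + 1, pos + int(pred))
    rates = {g: pos / n for g, (n, pos) in counts.items()}
    return max(rates.values()) - min(rates.values())

y_pred = [1, 1, 0, 1, 0, 0, 0, 1]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]
print(demographic_parity_difference(y_pred, groups))   # 0.75 - 0.25 = 0.5
```

A mitigation technique then trades some accuracy to shrink this gap toward zero, which is exactly the accuracy/fairness balance the research targets.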

Tabular Language Models

The research aims to develop techniques for the automatic analysis of tabular records using Large Language Models (LLMs), focusing on leveraging LLMs to understand, interpret, and extract insights from structured data in tables. The scenario we are investigating involves analyzing non-financial data from companies, specifically related to Environmental, Social, and Governance (ESG) factors, to uncover trends, assess corporate responsibility, and generate actionable insights for sustainability and governance practices.
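A common first step when applying LLMs to tabular records is serializing each row into text that a model can consume. The template and the ESG-style fields below are invented for illustration, and no specific model API is assumed:

```python
# Sketch: serialize a tabular record into a textual prompt for an LLM.
# Template and column names are hypothetical.

def serialize_row(row: dict, task: str) -> str:
    """Render a record as '<task>\nRecord: col is val; ...' for prompting."""
    cells = "; ".join(f"{col} is {val}" for col, val in row.items())
    return f"{task}\nRecord: {cells}"

row = {"company": "ACME", "scope1_emissions_t": 1200, "board_independence": "40%"}
prompt = serialize_row(row, "Classify the ESG risk of the following record.")
print(prompt)
```

The serialized prompt can then be sent to an LLM to classify, summarize, or cross-check the record against the company's other disclosures.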