Data quality challenges with missing values and mixed types in joint sequence analysis

Date Published	12/2017
Publication Type	Conference Paper

Authors	Alina Lazar, Ling Jin, C. Anna Spurlock, Annika Todd-Blick, Kesheng Wu, Alex Sim
DOI	10.1109/BigData.2017.8258222
Abstract	The goal of this paper is to investigate the impact of missing values in categorical time series sequences on common data analysis tasks. Being able to more effectively identify patterns in socio-demographic longitudinal data is an important component in a number of social science settings. However, performing fundamental analytical operations, such as clustering these data based on similarity patterns, is challenging due to the categorical and multi-dimensional nature of the data, and their corruption by missing and inconsistent values. To study these data quality issues, we employ longitudinal sequence data representations, a similarity measure designed for categorical and longitudinal data, and state-of-the-art clustering methodologies that rely on hierarchical algorithms. The key to quantifying the similarity and difference among data records is a distance metric. Given the categorical nature of our data, we employ an “edit” type distance using Optimal Matching (OM). Because each data record has multiple variables of different types, we investigate the impact of mixing these variables in a single similarity measure. Comparing variables with binary values and those with multiple nominal values, we find that overcoming missing data problems is harder in the nominal domain than in the binary domain. Additionally, artificial clusters introduced by the alignment of leading missing values can be resolved by tuning the missing value substitution cost parameter.
Proceedings Title	2017 IEEE International Conference on Big Data (Big Data)
Conference Name	2017 IEEE International Conference on Big Data (Big Data)
Year of Publication	2017
Publisher	IEEE
Conference Location	Boston, MA, USA
Organizations	Energy Markets and Planning Department, Energy Technologies Area, Energy Analysis Division
Research Areas	Sustainable Energy Systems Energy & Data Behavior Analytics EA Energy Markets & Planning Metrics and evaluation Efficiency & Load Flexibility Energy Affordability Behavior Analytics 2
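
The abstract describes an Optimal Matching (OM) edit distance in which the substitution cost for missing values is a tunable parameter. The following is a minimal sketch of that idea, not the authors' implementation: the missing-state marker `*`, the function names `om_distance` and `pairwise_om`, the cost values, and the toy sequences are all assumptions made for illustration. It treats "missing" as one more state, so two aligned missing values match at zero cost, which is one way shared leading gaps can make sequences look alike.

```python
from typing import List, Sequence

MISSING = "*"  # hypothetical marker for a missing state (an assumption)


def om_distance(a: Sequence[str], b: Sequence[str], indel: float = 1.0,
                sub_cost: float = 2.0, missing_sub_cost: float = 2.0) -> float:
    """Edit-type (Optimal Matching) distance between two categorical sequences.

    missing_sub_cost is charged when exactly one of the two aligned
    states is MISSING; identical states, including two MISSING states,
    cost nothing.
    """
    def sub(x: str, y: str) -> float:
        if x == y:
            return 0.0
        if x == MISSING or y == MISSING:
            return missing_sub_cost
        return sub_cost

    # Standard dynamic program over prefixes, keeping one row at a time.
    prev = [j * indel for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * indel] + [0.0] * len(b)
        for j in range(1, len(b) + 1):
            curr[j] = min(prev[j] + indel,                        # delete a[i-1]
                          curr[j - 1] + indel,                    # insert b[j-1]
                          prev[j - 1] + sub(a[i - 1], b[j - 1]))  # substitute
        prev = curr
    return prev[-1]


def pairwise_om(seqs: List[Sequence[str]], **costs: float) -> List[List[float]]:
    """Symmetric matrix of OM distances, e.g. as input to hierarchical clustering."""
    return [[om_distance(s, t, **costs) for t in seqs] for s in seqs]


# Toy sequences: the first has two leading missing values.
gappy = ["*", "*", "A", "A"]
full = ["A", "A", "A", "A"]
print(om_distance(gappy, full))                        # 4.0
print(om_distance(gappy, full, missing_sub_cost=0.5))  # 1.0
```

Lowering `missing_sub_cost` in this toy case pulls the sequence with leading gaps closer to the complete one. Which setting suits a real data set is an empirical question, and the resulting distance matrix would still need to be passed to a hierarchical clustering routine, which is not shown here.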