Once the raw data has been collected and stored in a dataset accessible to data analysts and data scientists, the focus shifts to data cleaning and processing. This involves testing the data for soundness and fixing errors; designing and implementing strategies to deal with missing values and with outlying or influential observations; and conducting low-level exploratory data analysis and visualisation to determine which data transformations and dimension reduction approaches the final analysis will need. Analysts should be prepared to spend up to 80% of their time processing and cleaning the data.
Data Science Report Series #7: Data Preparation (Draft), by Patrick Boily, Jennifer Schellinck, and Shintaro Hagiwara.
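
The cleaning steps named in the summary above (soundness testing, handling missing values, flagging outlying observations) can be illustrated with a minimal sketch. It is not taken from the report: the toy `ages` data, the plausibility bounds in `is_sound`, median imputation, and Tukey's 1.5 × IQR rule are all assumptions chosen for illustration, and a real project would pick strategies suited to its own data.

```python
from statistics import median, quantiles

# Hypothetical toy records (not from the report); None marks a missing value.
ages = [34, 29, None, 41, -5, 38, 95, 33, None, 36]


def is_sound(value, low=0, high=110):
    """Soundness check: an age must lie in a plausible range (assumed bounds)."""
    return value is None or low <= value <= high


# 1. Soundness: record invalid entries, then treat them as missing.
unsound = [v for v in ages if not is_sound(v)]
cleaned = [v if is_sound(v) else None for v in ages]

# 2. Missing values: impute with the median of the observed values
#    (one simple strategy among many).
observed = [v for v in cleaned if v is not None]
fill = median(observed)
imputed = [fill if v is None else v for v in cleaned]

# 3. Outlying observations: flag values outside Tukey's 1.5 * IQR fences,
#    computed on the observed (non-imputed) values.
q1, _, q3 = quantiles(observed, n=4)
iqr = q3 - q1
outliers = [v for v in observed if v < q1 - 1.5 * iqr or v > q3 + 1.5 * iqr]

print("unsound entries:", unsound)
print("imputed series:", imputed)
print("flagged outliers:", outliers)
```

Flagging an observation is kept separate from removing it here: whether an outlying value is an error or a genuine but influential observation is a judgement the analyst still has to make.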