Data cleaning, also termed as data cleansing or scrubbing, deals with detecting and removing errors and inconsistencies to improve the quality of data. One can perform accurate, decisive analysis or make conclusion only if the data has been passed through several data cleaning processes. It is generally observed that unstructured, messy, garbage data is the biggest deterrent to productive analysis.
This first blogpost of two part series deals with providing importance of data cleansing as well as explain different problems associated with data cleaning and their respective solutions.
Implications of Data Cleaning:
- Data Accuracy – Keeping data clean ensures accurate and up to date records. It also adds significant value to your marketing efforts and helps you understand your prospects and potential customers.
- Data Driven Decision making – In today’s environment necessity for clean and accurate information is greater than ever because businesses put major focus on data based business decisions due to its’ reliability. Moreover, insights gained from data leads to faster decisions.
- Segmentation –Greater accuracy of data assists in target segment profiling, meaning that your marketing campaigns will be more focused, relevant and subsequently more likely to be successful. Understandably, using obsolete data for your targeting activities will only lead to misdirected campaigns.
- Improved ROI – Improving the quality of data results in improved customer insight and adds considerable value to marketing efforts by improving ROI.
Simple problem example and Solution :
For our case we will focus on how to handle inconsistent data which web Analysts face more often while dealing with Google Analytics:
Data can be in form of integers, characters, measurements, observations or even just descriptions of things. If I talk with reference to Google Analytics, it might be in form of integers (for example visits, unique visitors, Page views, Time on Page, Revenue, conversion rate) or characters (for example Keyword, Geographic location, Page path and events)
Keyword is extremely imperious from business perspective since it lets you know visitor intent in form of search query. In GA data, we often come across keywords in different forms:
- Similar keyword searched by visitor that is in uppercase or lowercase
- Misspelt Keywords like “ecommererce” and “eCommerce”
- Similar words with /without additional letters like “S” (for example : “web Analytics consultant” and “web Analytics consultants”
Google Analytics considers each keyword as unique and shows individual level data for these otherwise same keywords. However you may want to look at this data at an aggregate level by grouping them together.
To sort out this data mess up issue we have several modernized solutions out there but we would prefer using Open Refine (OR). Now one question each analyst could think of is, Why Open Refine?
Open Refine is a great tool for dealing with messy data and turning it quickly into a more accurate dataset, which then allows for the fun to begin with the analysis and visualization.
Core components of Open Refine help users to correct inconsistencies, change data formats, extend data sets with data from web sources, other databases and much more. Refine also brings “a new extensions architecture, a reconciliation framework for linking records to other databases (like Freebase) and a ton of new transformation commands and expressions. This tool allows non-programmers who deal with lots of data, including students and journalists doing research, to manipulate and sort data much more quickly.
So coming back to our problem statement where we wanted to group together certain set of keywords (based on case sensitiveness, misspelling and words with/without additional letters or spaces), let us see how GR can come to our rescue.
Taking case 3 from our problem example set above where keyword is entered in two different ways: “web analytics consultant”, and “web analytics consultants “(see trailing s in second keyword). How we can group these same keywords using GR?
Open refine has a feature known as clustering which creates single group of all possible clusters. The clustering feature works by trying to group the choices in the text facet, so that choices that look similar get grouped together. For instance, operating on our example data above we get following two clusters:
web analytics consultant cluster where – No. of rows containing “web analytics consultant” keyword is – 8
web analytics consultants cluster where – No. of rows containing “web analytics consultants” keyword is – 2
Now we can group together these two clusters and make it one by typing a new name in “New Cell Value” field as shown in open refine interface diagram (A) below:
You can also select different clusters that you want to “merge” and enter a separate name of newly created group/cluster.
Moreover, you can adjust the way clustering feature groups the choices of clusters, i.e. you can define the criteria of how certain clusters look similar. When the feature is set to “conservative” – it takes into consideration same set of words even if they are ordered in random fashion (e.g. when set to conservative it might consider ” web analytics consultant ” and “consultant, web analytics ” to be similar). When the feature is “liberal” – it takes into consideration more generic words around targeted words (e.g. when set to liberal it might consider ” web analytics consultant “,” Google analytics consultant “, and ” Google Adwords consultant ” to all be similar). How conservative or liberal the feature should work depends on your data, but it is safer to be conservative and cautious in merging. Of course, you can always undo each cluster and edit operation.
In my next post, I’ll talk about more complex problems and how to resolve complexity using open refine.
Latest posts by Mrugesh Soni (see all)
- An Exclusive Guide on Building Your First Lead Scoring System - June 23, 2014
- Cleaning Google Analytics Data using Open Refine - June 4, 2013