This was the start of the Data Science Delhi Meetup group and the first meetup at IndicInfo’s new location at Spaze iTech Park. Despite the day falling in the middle of a long weekend, 10 enthusiasts made it to the meetup to discuss their ideas about Data Science.
The meetup comprised multiple sessions and presentations.
Rajat Bhalla talked about the initial days of BigData, when ETL and Data Warehousing were the prevalent technologies; he covered the design of an Enterprise Data Warehouse and how it helps large enterprises.
Anurag Shrivastava explained where data science can be applied in business by giving examples from the insurance industry. Anurag also explained the importance of access to high-quality historical data for data science tools to work well.
There was a Q&A session with Narinder Kumar, a Hadoop trainer and R programmer with long Java programming experience. Narinder covered techniques such as classification and regression after explaining the meaning of machine learning at great length.
Details of the content presented are given below.
Session: Data Warehouse and ETL – Rajat Bhalla
BigData has emerged recently but it has its roots in the data warehouses that have been prevalent since the 1970s. They had been handling traditional data (data collated from various transactional systems) ever since, but with the arrival of new types of data (blogs, videos, data from social networks, etc.) on the horizon, the traditional warehousing and analytical approaches started to become insufficient. That is when BigData arrived on the scene and glamourised everything. But the foundation of BigData is still in data warehouses. So how about a little tour of data warehouses and ETL?
A Data Warehouse is essentially a relational database, but it differs from a traditional database in a lot of ways. It usually contains historical data and is designed for query and analysis, unlike a traditional database, which is intended for transaction processing. The RDBMS environment in a data warehouse has two main components: an ETL solution (to be explained later in this post) and an OLAP (online analytical processing) engine which allows operations like roll-up, drill-down, slicing, dicing, etc.
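To make the OLAP operations concrete, here is a minimal Python sketch of a roll-up and a slice over a toy fact table. The table, dimensions, and figures are invented for illustration; a real OLAP engine works over far larger cubes.

```python
from collections import defaultdict

# Toy fact table: one row per sale, with dimensions (region, quarter)
# and one measure (amount).
sales = [
    {"region": "North", "quarter": "Q1", "amount": 100},
    {"region": "North", "quarter": "Q2", "amount": 150},
    {"region": "South", "quarter": "Q1", "amount": 80},
    {"region": "South", "quarter": "Q2", "amount": 120},
]

def roll_up(rows, dimension):
    """Aggregate the 'amount' measure up to a single dimension."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[dimension]] += row["amount"]
    return dict(totals)

def slice_(rows, dimension, value):
    """Fix one dimension to a single value (an OLAP 'slice')."""
    return [row for row in rows if row[dimension] == value]

print(roll_up(sales, "region"))                           # {'North': 250, 'South': 200}
print(roll_up(slice_(sales, "quarter", "Q1"), "region"))  # {'North': 100, 'South': 80}
```

Drill-down is simply the reverse of roll-up: moving from these aggregated totals back towards the individual rows.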
A data warehouse has four main characteristics:
1. Subject oriented: data warehouses are intended to analyse data, so they are built with a context (subject) in mind, for example, sales. This would help the organisation get answers to questions like which user segment purchased the maximum number of products.
2. Integrated: in order for data from disparate sources to make it into the warehouse, the data needs to be integrated and all inconsistencies resolved. More on this in the Transformation phase of ETL.
3. Non volatile: the data in the warehouse doesn’t undergo change. It is meant to be a repository of all the data and is not meant for deletion or modification. A warehouse won’t be worth its name if it doesn’t have historical data.
4. Time variant: a data warehouse is supposed to accumulate data over time so that changes over time can be analysed.
The above image illustrates:
- the data sources (various transactional systems, legacy systems) that are brought together on the staging area for cleaning, transforming, integrating, etc.
- the data warehouse which contains the raw data from the staging area, subjective summary of the same, and metadata
- the users who run analytics and reporting tools on the warehouse
There is an alternative architecture to the above where data marts are created from the warehouse and the reporting/analytics is done on them. These data marts are usually specific to departments / lines of business (like sales, HR, etc.) and contain only relevant data.
Extraction, Transformation, and Loading (ETL)
Broadly speaking, ETL is about extracting the relevant data from various data sources, integrating it together, and then finally populating the data warehouse with it. Let’s look at each step in a little more detail.
As mentioned earlier, Extraction is the process of extracting relevant data from data sources for inclusion into the data warehouse. In terms of logical extraction, one of the following methods is used:
- Full extraction: the entire data set is extracted for further use in ETL
- Incremental extraction: the data set that has changed from the time of last extraction is extracted
- Update notification: the source system notifies the extraction process of the data set to be extracted
In terms of physical extraction, one of the following methods is used:
- Online extraction: the data is extracted from the data sources while they are in use
- Offline extraction: the data is extracted from a copy of the data sources, typically generated using binary logs, redo logs, etc.
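As an illustration of incremental extraction, here is a small Python sketch against an in-memory SQLite source. The `orders` table and its `last_updated` column are invented for the example; the idea is simply to pull only the rows that changed after the last extraction run.

```python
import sqlite3

# Toy source system: an orders table with a last_updated timestamp column.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL, last_updated TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 50.0, "2024-01-01"), (2, 75.0, "2024-01-05"), (3, 20.0, "2024-01-09")],
)

def incremental_extract(conn, since):
    """Pull only the rows changed after the last extraction run."""
    cur = conn.execute(
        "SELECT id, amount, last_updated FROM orders WHERE last_updated > ?",
        (since,),
    )
    return cur.fetchall()

# Only orders touched after 2024-01-03 are extracted.
rows = incremental_extract(conn, "2024-01-03")
print(rows)  # [(2, 75.0, '2024-01-05'), (3, 20.0, '2024-01-09')]
```

A full extraction would simply drop the `WHERE` clause and read the whole table each run.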
In the Transformation step, the data extracted from various data sources is integrated and any discrepancies are ironed out. Some typical examples would be:
- Format revision (numeric and string formats are reconciled)
- Decoding of fields (tacit information like M for Male and F for Female is made explicit)
- Calculated and derived values (summaries of sales calculated, age derived from date of birth, etc.)
- Splitting of single fields / Merging of information (Full name split into first and last name or vice versa)
- Unit of measurement conversion (conversions of metres into feet, kilograms into pounds, etc.)
- Date/Time conversion (conversion of mm-dd-yy into dd-mm-yyyy, etc.)
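A few of the transformations above can be sketched in Python. The record layout and field names here are hypothetical, chosen only to illustrate decoding, splitting, unit conversion, and date conversion in one pass.

```python
def transform(record):
    """Apply a few typical ETL transformations to a raw extracted record.
    Field names are hypothetical, chosen for illustration."""
    out = {}
    # Decoding of fields: make tacit codes explicit.
    out["gender"] = {"M": "Male", "F": "Female"}[record["gender_code"]]
    # Splitting of single fields: full name into first and last name.
    out["first_name"], out["last_name"] = record["full_name"].split(" ", 1)
    # Unit of measurement conversion: metres into feet.
    out["height_ft"] = round(record["height_m"] * 3.28084, 2)
    # Date conversion: mm-dd-yy into dd-mm-yyyy
    # (two-digit years are assumed to be 19xx here, for simplicity).
    mm, dd, yy = record["dob"].split("-")
    out["dob"] = f"{dd}-{mm}-19{yy}"
    return out

raw = {"gender_code": "F", "full_name": "Asha Verma",
       "height_m": 1.6, "dob": "04-21-90"}
print(transform(raw))
```

In a real pipeline each of these rules would be driven by metadata rather than hard-coded, but the shape of the work is the same.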
In the Loading step, the data after extraction and due transformation is loaded into the warehouse. This mirrors the extraction process from a logical perspective: the data that was obtained via full extraction / incremental extraction / update notification is merged with the relevant data in the data warehouse.
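The merge described above amounts to an upsert: insert rows with new keys, overwrite rows whose keys already exist. A minimal sketch, using a Python dict keyed by primary key to stand in for a warehouse table:

```python
# Toy warehouse table keyed by primary key.
warehouse = {
    1: {"amount": 50.0},
    2: {"amount": 60.0},
}

def load(warehouse, batch):
    """Merge an extracted-and-transformed batch into the warehouse (upsert)."""
    for key, row in batch.items():
        warehouse[key] = row  # overwrite if present, insert if new
    return warehouse

# An incremental batch: key 2 was updated at the source, key 3 is new.
incremental_batch = {2: {"amount": 75.0}, 3: {"amount": 20.0}}
load(warehouse, incremental_batch)
print(warehouse)  # {1: {'amount': 50.0}, 2: {'amount': 75.0}, 3: {'amount': 20.0}}
```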
ETL vs ELT
Of late, a few organisations have been experimenting with ELT instead of ETL. The reasons cited are:
- The entire data set is used as part of extraction and load, as opposed to select data sets in ETL. This increases the breadth of data available for analysis and allows for changes in requirements
- Existing hardware can be used (commodity hardware has become cheap) and no specific high-performance hardware is required
However, few ELT tools are available these days, and organisations prefer the tried and tested route.
Business Intelligence and Data Mining
The logical next step after building a data warehouse is to leverage it to generate insights and glean knowledge out of this data. Business Intelligence uses specialised tools to allow analysts run what-if scenarios, slice/dice data to look at new paradigms, etc. Data mining, on the other hand, is more of an automated process that looks for patterns inside the data. This is used frequently in fraud detection, knowledge discovery, etc.
A frequent example of data mining is the “beer and diapers” anecdote. It is rumoured that in the 1980s, Walmart’s data mining system threw up a pattern showing young men buying beer and diapers on Friday evenings. Walmart is said to have placed the two together, resulting in increased sales of both. The veracity of this story is doubtful, but it has been oft quoted as an example of data mining.
In the end, Rajat quoted Andrew McAfee from one of his TED Talks:
Economies don’t run on energy, capital, or labour. They run on IDEAS!
BI and Data Mining tools running on our data warehouses ensure that we never run out of ideas and are able to leverage each inflexion point in the lifecycle of an organisation.
Session: Analytics in Business – Anurag Shrivastava
Analytics has been around for more than 30 years. BigData is becoming significant due to the emergence of new types of data like blogs, photos, videos, etc.
Information about the past is available via various kinds of reports. Real-time information is available via alerts through various mediums. Combining and extrapolating the two gives us information about the future.
However, what the above is missing is “insight”. Using advanced data models, we can gain insight into what and why about past data. It helps us in generating meaningful recommendations about the present and predict / prepare for the best / worst that can happen in the future.
An example of the above is the complex algorithm running behind Amazon’s recommendation engine. The above insight capability would also help a bank identify which among its loan seekers are going to pay back the loan and which would default.
Consider the application of analytics in some industries below:
- Fraud detection, credit scoring
- Promotions, shelf management, demand forecasting
- Call centre staffing
Predictive Analysis and Data Mining
Predictive Analytics is a broad term describing a variety of statistical and analytical techniques used to develop models that predict future events or behaviours.
Data Mining is a component of predictive analytics that entails analysis of data to identify trends, patterns, or relationships in the data.
Consider the example of the insurance industry. Insurance companies typically prefer customers who would not file insurance claims. Even before they grant insurance cover to a customer, insurance companies can calculate the probability of the customer filing claims. Based on that probability, the company can choose to cover or not cover the customer. Apart from data on the customer, external data is also used in such analytical models. For example, people living in mountainous regions or on treacherous terrain would have a higher probability of filing claims.
BigData and Data Science can also help companies with Churn Prediction, i.e. predicting whether a person will stop patronising a company, service, etc. Input to such a modelling algorithm would be the customer’s behaviour in the months before their subscription ends. For example:
- Visits to price comparison sites
- Calling the call centre a couple of times
- Dissatisfied with the service
- Questions asked on Facebook, Twitter, etc.
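Signals like those above can be combined into a churn score. The sketch below uses hand-picked weights and thresholds purely for illustration; a real model would learn them from historical customer data rather than hard-code them.

```python
# Hypothetical churn signals and hand-picked weights (for illustration only;
# a real model would learn these weights from historical customer data).
WEIGHTS = {
    "price_comparison_visits": 0.3,
    "call_centre_calls": 0.2,
    "dissatisfaction_flag": 0.4,
    "social_media_questions": 0.1,
}

def churn_score(customer):
    """Combine behavioural signals into a score between 0 and 1."""
    score = 0.0
    # Each count is capped at an assumed "saturation" level before weighting.
    score += WEIGHTS["price_comparison_visits"] * min(customer["price_comparison_visits"] / 5, 1)
    score += WEIGHTS["call_centre_calls"] * min(customer["call_centre_calls"] / 3, 1)
    score += WEIGHTS["dissatisfaction_flag"] * customer["dissatisfied"]
    score += WEIGHTS["social_media_questions"] * min(customer["social_questions"] / 4, 1)
    return round(score, 2)

at_risk = {"price_comparison_visits": 6, "call_centre_calls": 2,
           "dissatisfied": 1, "social_questions": 1}
print(churn_score(at_risk))  # 0.86
```

Customers whose score crosses a chosen threshold would then be flagged for a retention offer.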
Another use of BigData can be in Accident Prevention. Consider the table below.
(Table of past accident data: each row includes a value range of 70 – 90; the remaining columns did not survive extraction.)
This is only a small subset of the data that can easily be culled from past accidents. If a smart algorithm is let loose on the entire data set, it can identify the scenarios, and estimate the probabilities, that would result in fatal accidents. Law enforcement agencies and the traffic police can then take adequate measures to avert such accidents.
Similarly, banks and other financial institutions can analyse their existing data to create a model that can identify demographic and other attributes typical of loan defaulters thereby helping the institutions make better decisions when approving loan requests.
Finally, Anurag mentioned a good book on the importance of analytics: Analytics at Work by Tom Davenport. He highlighted an excerpt from the book that identifies the success factors for analytics to work in an organisation:
D for accessible, high-quality Data
E for an Enterprise orientation
L for analytical Leadership
T for strategic Targets
A for Analysts
Session: Question and Answer – Anurag Shrivastava and Narinder Kumar
Anurag: What is R programming language?
Narinder: R is a programming language suited primarily for statistical work and BigData. Other languages (like Java, C#, etc.) do not suffice for this kind of work and don’t offer the capability to handle the needs of BigData.
Anurag: What is machine learning? How does R language support it?
Narinder: Machine learning is one of the most important aspects of data science: the program learns by itself. For example, Google marking mails as spam, recommendations by Amazon, etc. The R language bridges the gap between machine learning and BigData. It helps the data scientist in identifying the right machine learning algorithm to use and in actually using it.
Anurag: How difficult is R to learn for a Java / C# programmer?
Narinder: Java and C# are both object-oriented languages while R is a functional programming language. It has a learning curve and requires a certain mindset. R is a language used primarily by statisticians, while other languages are primarily meant for programmers. The best practices of R are not as widespread as those of other languages.
Anurag: Should one use purely R or work with hybrids?
Narinder: This is more of an operational decision. The idea is not to write thousands of lines of code to implement something. One should be able to arrive at a solution with minimal code. Python also works well with R but has a different ecosystem than R in terms of APIs and support. R, Python, Octave can be used but the most suited is R and the least Octave.
Anurag: What is the difference between supervised learning and unsupervised learning?
Narinder: Supervised learning is telling the computer what to learn. For example, the variables that govern the price of a house (area, locality, age of the house) are fed into the system along with values for each. The program then extrapolates the price of a house given a new set of values.
Unsupervised learning is when the program learns on its own. For example, Google News looking at trending keywords in news and organising news based on the same.
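The house-price example of supervised learning above can be sketched in a few lines: fit a line through labelled (area, price) examples by least squares, then predict for an unseen house. The data here is invented and deliberately tiny, with one feature instead of the several mentioned above.

```python
# Minimal supervised-learning sketch: fit price = a * area + b by
# ordinary least squares on labelled examples. Toy, perfectly linear data.
areas  = [1000, 1500, 2000, 2500]   # sq ft
prices = [100,  150,  200,  250]    # price in arbitrary units

n = len(areas)
mean_x = sum(areas) / n
mean_y = sum(prices) / n
# Slope: covariance of (area, price) divided by variance of area.
a = (sum((x - mean_x) * (y - mean_y) for x, y in zip(areas, prices))
     / sum((x - mean_x) ** 2 for x in areas))
b = mean_y - a * mean_x

def predict(area):
    """Extrapolate the price of a house from its area."""
    return a * area + b

print(predict(1800))  # ~180 for this perfectly linear toy data
```

The "supervision" is the labelled training data; given new values of the input variable, the fitted model extrapolates the output, just as described above.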
Anurag: What is Google Prediction API?
Narinder: I was a part of the beta testing of Google Prediction API. Google lets you upload your data and they provide you insights based on it. You can program your own variables and the output you are looking for. Google would let you use their infrastructure for it. It is a kind of PaaS (platform as a service). Huge data sets can be ingested into Google BigStorage and then Google BigQuery run on the same.
Anurag: What is the relationship between Hadoop and R?
Narinder: The algorithms for analytics and statistical analysis have existed since the 1980s. What is new is the large amount of data from different sources (social networks, videos, blogs, etc.) that did not exist earlier. Combine this data with traditional data and you have the data set that is fed into BigData analytics. R brings the intelligence of the algorithms and Hadoop provides the capability to do intensive analysis on the large data set.