The Data Lake – understanding the concept

June 8, 2019

As data capture has grown, so have the techniques for handling the data. Over the past 10 years or so, the Data Lake has appeared in the business world as part of the data capture concept.

When I started out, data was distributed all over the place, and business analysts had to ask for extracts from various departments to get an overall view of the company. It was time-consuming.

Next came the large data warehouse, taking in data from all over the company into a central store. However, it could take years to get that data into the data warehouse. At one place I worked, it took a minimum of 2 years to absorb data into the data warehouse. The delay was caused by the need to model the data and understand it completely before it could be absorbed. Data modelers would have to work out whether new tables were needed, and BAs would have to justify the business cost of storing the data. On top of this, existing reports were expected to use the data from the data warehouse, so they all had to be rebuilt to use the new data structure.

As companies evolved to produce even more data, the data warehouse wait time increased significantly. Waiting for centralized data, however, did not tie in well with the corporate strategy of knowing what is going on around the company. At this point the Data Lake concept came into being. The Data Lake is basically a collection point for all data from around a company, in any type of data structure. Data does not need to be refined to end up in the Data Lake; good and bad data alike are collected. Visually, the term pictures the departments that generate data as streams that feed the lake.
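To make the "collect everything, refine nothing" idea concrete, here is a minimal sketch in Python. It assumes the lake is just a folder tree on disk; the function name land_raw and the department/date folder layout are hypothetical choices for illustration, not a standard. The point is that the payload is stored byte for byte, with no parsing or validation.

```python
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def land_raw(lake_root: Path, department: str, filename: str, payload: bytes) -> Path:
    """Store a payload exactly as received: no parsing, validation, or modeling.

    The <department>/<YYYY-MM-DD>/<filename> layout is an assumed convention.
    """
    day = datetime.now(timezone.utc).strftime("%Y-%m-%d")
    target_dir = lake_root / department / day
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / filename
    target.write_bytes(payload)  # good and bad data alike are kept
    return target


if __name__ == "__main__":
    lake = Path(tempfile.mkdtemp(prefix="lake_"))
    # Two departments, two unrelated formats, one of them containing bad data.
    land_raw(lake, "sales", "orders.csv", b"id,amount\n1,9.99\n2,not-a-number\n")
    land_raw(lake, "support", "tickets.json", b'{"ticket": 17, "status": "open"}')
    for path in sorted(lake.rglob("*")):
        if path.is_file():
            print(path.relative_to(lake))
```

Nothing in this sketch knows what a CSV or JSON file is, which is why data can land quickly: the cost of understanding it is deferred to whoever reads it later.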

As the data collects in the Data Lake, some of it will eventually make its way into the enterprise data warehouse based on need and cost justification. The Data Lake approach creates a single source of data for people in a company to access. Data scientists can look at what is being captured and see whether any of it is of use to what they are trying to analyze.
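The step where lake data graduates into the warehouse can be sketched as a small Python function. This is only an illustration under stated assumptions: the raw data is a CSV of orders with id and amount columns, and the function name promote_orders is hypothetical. Rows that fit the warehouse's typed structure are accepted; everything else stays behind as rejects.

```python
import csv
import io


def promote_orders(raw_csv: str) -> tuple[list[dict], list[dict]]:
    """Split raw lake rows into warehouse-ready rows and rejects.

    Assumed warehouse schema for this sketch: id is an int, amount is a float.
    """
    accepted, rejected = [], []
    for row in csv.DictReader(io.StringIO(raw_csv)):
        try:
            # The typed row is built completely before it is appended,
            # so a row with one bad field is never partially accepted.
            accepted.append({"id": int(row["id"]), "amount": float(row["amount"])})
        except (KeyError, TypeError, ValueError):
            rejected.append(row)  # kept as raw strings for later inspection
    return accepted, rejected


if __name__ == "__main__":
    ok, bad = promote_orders("id,amount\n1,9.99\n2,not-a-number\n")
    print("accepted:", ok)
    print("rejected:", bad)
```

The modeling effort that used to delay the warehouse still happens here, but only for the data that has earned its place.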

Pros of Data Lake:

  • Centralized repository of company data which in theory makes it easier to find data.
  • Quick to capture data, as it is not refined in any way.
  • Allows the data source departments to focus on supporting their applications / business and not on providing formal data extracts that have to be absorbed by a data warehouse or other team.
  • No need to wait on another department’s resource availability to get access to its data.

Cons of Data Lake:

  • Resources have to be hired to support the collection of data into the data lake and the sharing of it.
  • Failure to capture good searchable metadata on the data being stored in the Data Lake would prevent the data from being discovered at a later date.
  • The people associated with the original data generation are not part of the Data Lake team, which means first-hand knowledge of the data on the Data Lake team is limited to non-existent. Data knowledge is totally reliant on the metadata captured at the time the data is stored.
  • Useful and not-so-useful data are both captured, as the focus is on capturing data.
  • Dependent on cheap storage to justify the large volume of data stored, and on the resources to support the physical storage, networks, etc.
  • Secure data should not end up in a Data Lake due to the risk that it may be exposed.
  • Not for operational reporting where reports have to be generated within 24 hours of the data being created.
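The metadata concern in the cons above can be illustrated with a small Python sketch of a searchable catalog. The names CatalogEntry and Catalog, and the fields they carry, are hypothetical; real data catalogs record far more. The idea is simply that each file landed in the lake gets a description and tags at the time it is stored, because that is all a later searcher will have to go on.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    """Metadata captured when a file is stored (fields are assumptions)."""
    path: str
    department: str
    description: str
    tags: set[str] = field(default_factory=set)


class Catalog:
    """In-memory catalog; a stand-in for whatever metadata store is used."""

    def __init__(self) -> None:
        self._entries: list[CatalogEntry] = []

    def register(self, entry: CatalogEntry) -> None:
        self._entries.append(entry)

    def search(self, term: str) -> list[CatalogEntry]:
        """Case-insensitive match on description text or an exact tag."""
        needle = term.lower()
        return [
            e for e in self._entries
            if needle in e.description.lower()
            or needle in {t.lower() for t in e.tags}
        ]


if __name__ == "__main__":
    catalog = Catalog()
    catalog.register(CatalogEntry(
        "sales/2019-06-08/orders.csv", "sales",
        "Daily order export", {"orders", "revenue"},
    ))
    catalog.register(CatalogEntry(
        "support/2019-06-08/tickets.json", "support", "Open support tickets",
    ))
    for hit in catalog.search("orders"):
        print(hit.path)
```

A file registered with an empty description and no tags would never come back from search, which is exactly how data gets lost in a lake.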

In summary, the Data Lake concept is just a fancy way of describing a centralized raw data store built from data provided by different departments in a company. A Data Warehouse can pull data from the Data Lake at a later date, once the need to store it formally has been identified.