ETL Process
Week 11: Reading
Etract, Transform, and Load (ETL)
Learning Outcomes
- Learn how the Extract, Transform, and Load Process Works.
- Learn how the 10 steps of ETL.
Lesson Materials
The Extract, Transform, and Load (ETL) process contains ten basic steps. They are:
- Determine which data you want by qualifying the:
- The granularity of data:
- Granularity is a unit of measurement that determines the accuracy of the data. For example, collection the data daily is a finer grain measure of information than collecting data weekly or monthly.
- The granularity of locality:
- Locality is also a unit of measurement. It that determines whether you collect data by city, country, or state. The more precise the location, the finer the granularity of locality.
- Determine the platform, structure, character set, and encoding of source data:
- The data platform:
- The data platform dictates much about the data:
- It qualifies the basic encoding, which may be ASCII (American Standard Code for Information Interchange) or EBCDIC (Extended Binary Coded Decimal Interchange Code). ASCII encoding is the basis for Unix, Linux, and Microsoft Windows and many other operating systems. EBCDIC is the basis for the IBM mainframe systems.
- Within each encoding schema there are character sets. Character sets define how we represent data. Character sets support given languages, like English, Arabic, Chinese, and Japanese.
- Structures are built from sets of characters arranged to have meaning and scope. Sets of structures may be organized in various files. The files can be arranged in many formats, like position specific, comma separated values (CSVs), tab separated values (TSVs), Extensible Markup Language (XML), JavaScript Object Notation (JSON).
- Determine how you are going to read, store, write, and process data.
- Determine the common target data structure into which you will transform the original data sets.
- Determine how to transform the data into a consistent transient data structure target.
- Determine where you will store the transient data.
- Determine the target data warehouse repository.
- Determine the required programming language to move the data from the transient repository into the target data warehouse repository.
- Determine the best way to import the data into the data warehouse repository.
- Determine when you will load the data into the data warehouse repository.