Notes on The Kimball Group Reader Chapter 1: The Reader at a Glance

  • Author:: [[Ralph Kimball]], [[Margy Ross]]
  • Reading Status:: #complete
  • Review Status:: #[[complete]]
  • Tags:: #books #[[dimensional modeling]]
  • Source:: link
  • Roam Notes URL:: link
  • Anki Tag:: kimball_group_reader kimball_group_reader_ch_1
  • Anki Deck Link:: link
  • Setting up for Success
    • 1.1 Resist the Urge to Start Coding ([[Ralph Kimball]], DM Review, November 2007) (Location 944)
      • Before writing any code, doing any modeling, or making any purchases for your data warehouse, make sure you have a good answer to each of the following 10 questions:
        • [[Business Requirements]]: do you understand them? (Most fundamental and far-reaching question)
        • [[Strategic Data Profiling]]: are data assets available to support business requirements?
        • [[Tactical Data Profiling]]: Is there executive buy-in to support business process changes to improve data quality?
        • [[Integration]]: Is there executive buy-in and communication to define common descriptors and measures?
        • [[Latency]]: Do you know how quickly data must be published by the data warehouse?
        • [[Compliance]]: which data is compliance-sensitive, and where must you have protected chain of custody?
        • [[Data Security]]: How will you protect confidential or proprietary data?
        • [[Archiving Data]]: How will you do long-term archiving of important data and which data must be archived?
        • [[Business User Support]]: Do you know who the business users are, what they require, and what their skill levels are?
        • [[IT Support]]: Can you rely on existing licenses in your organization, and do IT staff have skills to support your technical decisions?
    • 1.2 Set Your Boundaries ([[Ralph Kimball]], DM Review, December 2007) (Location 1003) #[[Business Requirements]] #[[setting boundaries]]
      • This article discusses how to set clear boundaries for your data warehousing project so that you avoid taking on too many requirements.
  • Tackling DW/BI Design and Development
    • This group of articles focuses on the big issues that are part of every DW/BI system design. (Location 1071)
    • 1.3 Data Wrangling ([[Ralph Kimball]], DM Review, January 2008) (Location 1074) #[[data wrangling]] #[[data extraction]] #[[data staging]] #[[change data capture]]
      • [[data wrangling]] is the first stage of the data pipeline from operational sources to final BI user interfaces in a data warehouse. It includes [[change data capture]], [[data extraction]], [[data staging]], and [[data archiving]] (Location 1076)
      • [[change data capture]] is the process of figuring out exactly what data changed on the source system that you need to extract. Ideally this step is done on the source production system. (Location 1080). Two approaches:
        • Using a change_date_time field in the source: a good option, but will miss record deletion and any override of the trigger producing the change_date_time field. (Location 1089)
        • Production system daemon capturing every input command: This detects data deletion but there are still DBA overrides to worry about. (Location 1089)
        • Ideally you will also get your source production system to provide a reason code for each data change, which tells you how the changed attribute should be treated as a [[slowly changing dimension (SCD)]]. (Location 1098) #[[change data capture]]
        • If you can’t do [[change data capture]] on source production system, you’ll have to do it after extraction, which means downloading larger data sets. (Location 1103)
          • Consider using a [[cyclic redundancy checksum (CRC)]] to significantly improve the performance of the data comparison step here; a sketch follows at the end of this section.
      • [[data extraction]]: The transfer of data from the source system into the DW/BI environment. (Location 1117)
        • Two main goals in the [[data extraction]] step:
          • Remove proprietary data formats
          • Move data into [[flat files]] or [[relational tables]] (eventually everything is loaded into relational tables, but flat files can be processed very quickly)
      • [[data staging]]: [[Ralph Kimball]] recommends staging ALL data: save the data the DW/BI system just received, in the original target format you chose, before doing anything else to it. (Location 1126)
      • [[data archiving]]: this is important for compliance-sensitive data where you have to prove data received hasn’t been tampered with. Techniques here include using a [[hash code]] to show data hasn’t changed. (Location 1129)
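      • To make the comparison step concrete, here is a minimal Python sketch of fingerprint-based change detection after extraction. All names and data are hypothetical, and a SHA-256 row hash stands in for the CRC the article suggests; the same fingerprints can double as the tamper-evidence mechanism mentioned under [[data archiving]]:

```python
import hashlib

def row_fingerprint(row: dict) -> str:
    """Concatenate column values in a fixed key order and hash them.
    A CRC would be cheaper to compute; SHA-256 is used here because it
    also serves the tamper-evidence use case in data archiving."""
    payload = "|".join(str(row[col]) for col in sorted(row))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def detect_changes(previous: dict, current: dict):
    """Compare {natural_key: fingerprint} maps from two extracts.
    Returns (inserted_keys, updated_keys, deleted_keys)."""
    inserted = current.keys() - previous.keys()
    deleted = previous.keys() - current.keys()
    updated = {k for k in current.keys() & previous.keys()
               if current[k] != previous[k]}
    return inserted, updated, deleted

# Hypothetical customer extracts from two consecutive loads:
yesterday = {"C001": row_fingerprint({"name": "Ann", "city": "Oslo"}),
             "C002": row_fingerprint({"name": "Bob", "city": "Rome"})}
today = {"C001": row_fingerprint({"name": "Ann", "city": "Bergen"}),  # changed
         "C003": row_fingerprint({"name": "Eve", "city": "Lima"})}    # new

print(detect_changes(yesterday, today))  # ({'C003'}, {'C001'}, {'C002'})
```

      • Note that the deletion of C002 is caught here, which a change_date_time field on the source would miss.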
    • 1.4 Myth Busters ([[Ralph Kimball]], DM Review, February 2008) (Location 1135) #[[dimensional modeling]]
      • Addresses various myths related to [[dimensional modeling]]
      • Myth: A dimensional model could be missing key relationships that exist only in a true relational view. (Location 1142)
        • In fact, dimensional models contain all the data relationships that normalized models have.
      • Myth: dimensional models are not sufficiently extensible and do not accommodate changing [[business requirements]]. (Location 1149)
        • It’s the opposite: normalized models are much harder to change when data relationships change. [[slowly changing dimension (SCD)]] techniques provide the basis for models to meet changing [[business requirements]].
      • Myth: dimensional models don’t capture data at sufficient level of granularity / detail (Location 1177)
        • In fact, models should capture measurement events in [[fact tables]] at the lowest possible grain.
    • 1.5 Dividing the World ([[Ralph Kimball]], DM Review, March 2008) (Location 1188)
      • Two main entities in [[dimensional modeling]] (Kimball estimates 98% of data can be immediately and obviously categorized as one of these):
        • [[dimension ([[dimensional modeling]])]]: the basic stable entities in our environment, such as customers, products, locations, marketing promotions, and calendars. In end user BI tools, dimensions are primarily used for constraints and row headers.
        • [[fact ([[dimensional modeling]])]]: Numeric measurements or observations gathered by all of our transaction processing systems and other systems. In end user BI tools, facts are primarily used for computations.
          • [[fact table grain]]: Description of measurement in physical, real-world terms – a description of what each row in the [[fact table]] represents. (Location 1209) There is sometimes a temptation to add facts not true to the grain to shortcut a query, but this often introduces complexity and confusion for business users. (Location 1218)
          • A [[fact ([[dimensional modeling]])]] should be additive whenever possible – it should make sense to add facts across records. A common example here is storing extended price (i.e. price * quantity) instead of just price in a fact table where the measurement is a retail sale (Location 1224); a worked sketch appears at the end of this section.
      • A distinct characteristic of [[dimensional modeling]] is not using [[normalized data]]. Normalized models are great in transaction processing systems, but they are not understandable by business users. Dimensional models, correctly designed, contain exactly the same data and reflect the same business rules, but are more understandable. [[understandability]] is a central goal of a BI system used by business users. (Location 1249)
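      • A minimal sketch of the additivity point above, with hypothetical line items: storing the extended amount makes the fact safely additive across any slice, while the raw unit price is not.

```python
# Each row sits at the declared grain: one retail sale line item.
sales = [
    {"product": "pen",  "quantity": 3, "unit_price": 2.00},
    {"product": "book", "quantity": 1, "unit_price": 15.00},
]
for row in sales:
    # Store the extended amount, not just the unit price.
    row["extended_amount"] = row["quantity"] * row["unit_price"]

total_revenue = sum(r["extended_amount"] for r in sales)  # 21.00 -- meaningful
summed_prices = sum(r["unit_price"] for r in sales)       # 17.00 -- not a business quantity
print(total_revenue, summed_prices)
```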
    • 1.6 Essential Steps for the Integrated [[Enterprise Data Warehouse (EDW)]] ([[Ralph Kimball]], DM Review, April 2008 and May 2008) (Location 1254) #[[data integration]]
      • This section provides an overall architecture for building an integrated [[Enterprise Data Warehouse (EDW)]], which supports [[Master Data Management (MDM)]] and has the mission of providing a consistent business analysis platform for an organization. (Location 1258)
      • The essential act of the [[Enterprise Data Warehouse (EDW)]] is [[drilling across]]: gathering results from separate [[business process subject area]]s and combining them into a single analysis (a sketch appears at the end of this section). (Location 1282)
      • A key prerequisite to developing the [[Enterprise Data Warehouse (EDW)]] is significant commitment and support from top-level management on the value of data integration. (Location 1303)
      • Having an existing [[Master Data Management (MDM)]] project is a good sign of executive buy-in for data integration, and significantly simplifies data warehouse [[data integration]]. (Location 1308)
      • [[conformed dimensions]] and [[conformed facts]] provide the basis for [[data integration]] (Location 1316)
        • [[conformed dimensions]]: two dimensions are conformed if they contain one or more common fields whose contents are drawn from the same domains. (Location 1318) Typical examples: customer, product, service, location, employee, promotion, vendor, and calendar. (Location 1367)
        • [[conformed facts]]: numeric measures that have the same business and mathematical interpretations so that they may be compared and computed against each other consistently. (Location 1320)
      • [[enterprise data warehouse (EDW) bus matrix]]: two-dimensional matrix with [[business process subject area]] on the vertical axis and [[dimension tables]] on horizontal axis. (Location 1324) An X in the matrix represents where a subject area uses a dimension. It helps you prioritize development of separate subject areas and identify possible scope of [[conformed dimensions]]. "The columns of the bus matrix are the invitation list to the conformed dimension design meeting." (Location 1333) This is an important item to send to senior management to review before conformed dimension design meetings. "If senior management is not interested in what the bus matrix implies, then to make a long story short, you have no hope of building an integrated EDW." (Location 1335)
        • Note that the different stakeholders don’t have to give up their domain specific private attributes that they need – stakeholders just need to agree on the [[conformed dimensions]]. (Location 1341)
        • Even when you get full senior management buy-in, there is a lot of operational management involved in the [[Enterprise Data Warehouse (EDW)]], including two key roles: the [[dimension manager]] (builds and distributes a conformed dimension to the rest of the enterprise) and the [[fact provider]] (downstream client of the dimension manager who receives and utilizes the conformed dimension, almost always while managing one or more fact tables within a subject area). (Location 1347)
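      • A minimal sketch of [[drilling across]] with hypothetical data: each [[business process subject area]] is queried separately, grouped by the same conformed dimension attribute, and the result rows are then merged on that shared row header.

```python
# Separate queries against two subject areas, both grouped by the
# conformed "product" dimension attribute (hypothetical data):
shipments = {"pen": 120, "book": 40}   # units shipped per product
returns = {"pen": 3, "stapler": 1}     # units returned per product

# Merge on the shared row header -- possible only because both fact
# tables conform to the same product dimension.
merged = {}
for product in shipments.keys() | returns.keys():
    merged[product] = {"shipped": shipments.get(product, 0),
                       "returned": returns.get(product, 0)}

for product, row in sorted(merged.items()):
    print(product, row["shipped"], row["returned"])
```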
    • 1.7 Drill Down to Ask Why [[Ralph Kimball]], DM Review, July 2008 and August 2008 (Location 1481) #[[decision making]]
      • Important to understand how your data warehousing system drives decision-making, not just your technical architecture.
      • [[Bill Schmarzo]]'s architecture for decision making, aka the [[analytic application process]]: (Location 1489)
        1. Publish reports.
        2. Identify exceptions.
        3. Determine causal factors. Seek to understand the “why” or root causes behind the identified exceptions. Main ways you might do this: #[[causality]] #[[determining causality]]
          • Get more detail
          • Get a comparison
          • Search other data sets
          • Search the web for information about the problem
        4. Model alternatives. Provide a backdrop to evaluate different decision alternatives.
        5. Track actions. Evaluate the effectiveness of the recommended actions and feed the decisions back to both the operational systems and DW, against which published reporting will occur, thereby closing the loop.
    • 1.8 Slowly Changing Dimensions [[Ralph Kimball]], DM Review, September 2008 and October 2008 (Location 1557) #[[slowly changing dimension (SCD)]]
      • The Original Three Types of [[slowly changing dimension (SCD)]] cover all the responses required for a revised or updated description of a dimension member (Location 1569)
        • [[type 1 slowly changing dimension (SCD)]]: Overwrite
        • [[type 2 slowly changing dimension (SCD)]]: Add a New Dimension Record
        • [[type 3 slowly changing dimension (SCD)]]: Add a New Field
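      • A minimal type 2 sketch in Python, with a hypothetical customer dimension: each change closes out the current row and inserts a new version under a fresh surrogate key, so history is preserved and facts loaded after the change pick up the new key.

```python
from datetime import date

# Hypothetical customer dimension with type 2 housekeeping columns.
dimension = [
    {"customer_sk": 1, "customer_id": "C001", "city": "Oslo",
     "row_effective": date(2020, 1, 1),
     "row_expiration": date(9999, 12, 31), "is_current": True},
]

def apply_type2_change(dim, customer_id, attribute, new_value, change_date):
    current = next(r for r in dim
                   if r["customer_id"] == customer_id and r["is_current"])
    if current[attribute] == new_value:
        return  # no actual change
    # Close out the old row...
    current["row_expiration"] = change_date
    current["is_current"] = False
    # ...and add the new version under the next surrogate key.
    dim.append({**current, attribute: new_value,
                "customer_sk": max(r["customer_sk"] for r in dim) + 1,
                "row_effective": change_date,
                "row_expiration": date(9999, 12, 31),
                "is_current": True})

apply_type2_change(dimension, "C001", "city", "Bergen", date(2024, 5, 1))
# dimension now holds two rows for C001; the history of the move survives.
```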
    • 1.9 Judge Your BI Tool through Your Dimensions – [[Ralph Kimball]], DM Review, November 2008 (Location 1650) #[[BI tools]] #[[BI tool selection]] #[[dimension tables]]
      • [[dimension tables]] implement the [[UI]] of your BI system: they provide the labels, the groupings, the drill-down paths.
      • This article describes [[requirements]] a BI tool should be able to meet with dimensions:
        • Assemble a BI query or report request by first selecting [[dimension table attributes]] and then selecting [[facts (dimensional modelling)]] to be summarized.
        • [[drilling down]] by adding a row header
        • Browse a dimension to preview permissible values and set constraints
        • Restrict the results of a dimension browse with other constraints in effect
        • [[drilling across]] by accumulating measures under labels defined by conformed dimension attributes
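      • As one concrete illustration of the drill-down requirement above: [[drilling down]] is nothing more than adding another dimension attribute as a row header to the grouping. A minimal sketch with hypothetical pre-joined rows:

```python
from collections import defaultdict

# Hypothetical rows with the fact already joined to its dimensions.
rows = [
    {"region": "West", "product": "pen",  "sales": 10.0},
    {"region": "West", "product": "book", "sales": 30.0},
    {"region": "East", "product": "pen",  "sales": 5.0},
]

def summarize(rows, row_headers):
    """Group the sales fact by the chosen row headers."""
    totals = defaultdict(float)
    for r in rows:
        totals[tuple(r[h] for h in row_headers)] += r["sales"]
    return dict(totals)

print(summarize(rows, ["region"]))             # top-level report
print(summarize(rows, ["region", "product"]))  # drill down: one more row header
```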
    • 1.10 [[fact tables]] – [[Ralph Kimball]], DM Review, December 2008 (Location 1707)
      • [[fact tables]] contain the fundamental measurements of the enterprise and are the target of most data warehouse queries.
      • Design rules for [[fact tables]]:
        • Stay true to the [[fact table grain]] – take care in defining the [[grain (dimensional modelling)]] – what a single record in the fact table represents. This is the first and most important design step. It ensures the [[foreign keys]] in the fact table are grounded and precise.
        • Build up from the lowest possible [[fact table grain]]. This ensures you have the most complete set of [[dimension tables]] that can describe the fact table and enables detailed [[drilling down]] for the user.
      • 3 types of [[fact tables]]: (Location 1741)
        • [[transaction grain [[fact table]]]]: Measurement taken at a single instance (e.g. each cash register beep). A transaction can occur a millisecond after the previous one, or next month, or never – transaction grain tables are unpredictably sparse or dense.
        • [[periodic snapshot grain [[fact table]]]]: Facts cover a predefined span of time. Powerful guarantee: all reporting entities will appear in each snapshot, even if there is no activity – it’s predictably dense and applications can rely on certain key combinations being available.
        • [[accumulating snapshot grain [[fact table]]]]: Rows represent a predictable process with a well-defined beginning and end (e.g. order processing, claims processing).
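      • Hypothetical example rows for the three grains (a sketch, not a full design):

```python
transaction_row = {        # one row per cash register beep
    "date_key": 20240501, "product_key": 17, "store_key": 3,
    "quantity": 2, "extended_amount": 4.00,
}

periodic_snapshot_row = {  # one row per account per month, activity or not
    "month_key": 202405, "account_key": 88,
    "ending_balance": 1250.00, "fees_charged": 0.00,
}

accumulating_snapshot_row = {  # one row per order line, revisited as the
    "order_line_id": 501,      # process moves toward completion
    "order_date_key": 20240501, "ship_date_key": 20240503,
    "delivery_date_key": None,  # not yet delivered
    "days_to_ship": 2,
}
```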
    • 1.11 Exploit Your [[fact tables]] – [[Ralph Kimball]], DM Review, January/February 2009 (Location 1765)
      • This article describes basic ways to exploit the 3 main fact table designs in the front room and in the back room.
      • Front Room: [[aggregate navigation]] – choosing to give the user pre-aggregated data at run time, without the end user knowing the difference. Seamlessly provide aggregated and detailed atomic data.
      • Front Room: [[drilling across]] Multiple Fact Tables at Different Grains – you can do this as long as you choose [[conformed dimensions]] for the answer set row headers that exist for all the fact tables in your integrated query.
      • Front Room: Exporting Constraints to Different Business Processes – building connections to other [[business process subject area]]s in the [[UI]] so you can explore related data in a single click or swipe.
      • Back Room: [[fact table surrogate keys (FSKs)]] – sometimes you want to add a surrogate key to a fact table for one of the following benefits (a sketch follows this list):
        • Uniquely and immediately identify single fact records.
        • FSKs are assigned sequentially, so a load job inserting new records will create FSKs in a contiguous range.
        • An FSK allows updates to be replaced by insert-deletes.
        • An FSK can become a foreign key in a fact table at a lower grain.
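      • A minimal sketch of the insert-delete benefit, with a hypothetical in-memory fact table: because each row carries a sequentially assigned FSK, a correction can be loaded as a fast insert plus a delete instead of an in-place update.

```python
fact_sales = {}   # fsk -> fact row (hypothetical stand-in for the table)
next_fsk = 1

def insert_fact(row):
    """Assign the next sequential FSK, so each load job's new records
    occupy a contiguous FSK range."""
    global next_fsk
    fact_sales[next_fsk] = row
    next_fsk += 1
    return next_fsk - 1

def replace_fact(old_fsk, corrected_row):
    """Replace an update with an insert of the correction plus a delete."""
    new_fsk = insert_fact(corrected_row)
    del fact_sales[old_fsk]
    return new_fsk

fsk = insert_fact({"product_key": 17, "quantity": 2, "extended_amount": 4.00})
replace_fact(fsk, {"product_key": 17, "quantity": 3, "extended_amount": 6.00})
print(fact_sales)  # {2: {'product_key': 17, 'quantity': 3, 'extended_amount': 6.0}}
```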