Designing the Star Schema Database
Version 1.1
By Craig Utley
Introduction
Creating a Star Schema Database is one of the most important, and sometimes the final, step in creating a data warehouse. Given how important this process is to building a data warehouse, it is important to understand how to move from a standard, on-line transaction processing (OLTP) system to a final star schema. Please note that a general term is relational data warehouse and may cover both star and snowflake schemas.
This paper attempts to address some of the issues that many who are new to the data warehousing arena find confusing, such as:
- What is a Data Warehouse? What is a Data Mart?
- What is a Star Schema Database?
- Why do I want/need a Star Schema Database?
- The Star Schema looks very denormalized. Won’t I get in trouble for that?
- What do all these terms mean?
This paper will attempt to answer these questions, and show developers how to build a star schema database to support decision support within their organizations.
Terminology
Usually, readers of technical articles are bored with terminology that comes either at the end of a chapter or is buried in an appendix at the back of a book. Here, however, I have the thrill of presenting some terms up front. The intent is not to bore readers earlier than usual, but to present a baseline off of which to operate. The problem in data warehousing is that the terms are often used loosely by different parties. The definitions presented here represent how the terms will be used throughout this paper.
OLTP
OLTP stands for Online Transaction Processing. This is a standard, normalized database structure. OLTP is designed for transactions, which means that inserts, updates, and deletes must be fast. Imagine a call center that takes orders. Call takers are continually taking calls and entering orders that may contain numerous items. Each order and each item must be inserted into a database. Since the performance of the database is critical, database designers want to maximize the speed of inserts (and updates and deletes). To maximize performance, some businesses even limit the number of records in the database by frequently archiving data.
OLAP and Star Schema
OLAP stands for Online Analytical Processing. OLAP is a term that means many things to many people. Here, the term OLAP and Star Schema are basically interchangeable. The assumption is that a star schema database is an OLAP system. An OLAP system consists of a relational database designed for the speed of retrieval, not transactions, and holds read-only, historical, and possibly aggregated data.
While an OLAP/Star Schema may be the actual data warehouse, most companies build cube structures from the relational data warehouse in order to provide faster, more powerful analysis on the data.
Data Warehouse and Data Mart
Data Warehouses and Data Marts differ in scope only. This means that they are built using the exact same methods and procedures, so the process is the same while only their intended scope varies.
A data warehouse (or mart) is way of storing data for later retrieval. This retrieval is almost always used to support decision-making in the organization. That is why many data warehouses are considered to be DSS (Decision-Support Systems). While some data warehouses are merely archival copies of data, most are used to support some type of decision-making process. The primary benefit of taking the time to create a star schema, and then possibly cube structures, is to speed the retrieval of data and format that data in a way that it is easy to understand. This means that a star schema is built not for transactions but for queries.
Both a data warehouse and a data mart are storage mechanisms for read-only, consolidated, historical data. Read-only means that the person looking at the data won't be changing it. If a user wants to look at the sales yesterday for a certain product, they should not have the ability to change that number. Of course if the number is wrong, it should be corrected, but more on that later.
"Consolidated" means that the data may have come from various sources. Many companies have purchased different vertical applications from various vendors to handle such tasks as human resources (HR), accounting/finance, inventory, and so forth. These systems may run on multiple operating systems and use different database engines. Each of these applications may store their own copy of an employee table, product table, and so on. A relational data warehouse must take data from all these systems and consolidate it so it is consistent, which means it is in a single format.
The "historical" part means the data may be only a few minutes old, but often it is at least a day old. A data warehouse usually holds data that goes back a certain period in time, such as five years. In contrast, standard OLTP systems usually only hold data as long as it is "current" or active. An order table, for example, may move order data to an archive table once the order has been completed, shipped, and received by the customer.
The data in data warehouses and data marts may also be aggregated. While there are many different levels of aggregation possible in a typical data warehouse, a star schema may have a "base' level of aggregation, which is one in which all the data is aggregated to a certain point in time.
For example: assume a company sells only two products: dog food and cat food. Each day, the company records the sales of each product. At the end of a couple of days, the data looks like this:
| | | Quantity Sold |
| Date | Order Number | Dog Food | Cat Food |
| 4/24/99 | 1 | 5 | 2 |
| | 2 | 3 | 0 |
| | 3 | 2 | 6 |
| | 4 | 2 | 2 |
| | 5 | 3 | 3 |
| | | | |
| 4/25/99 | 1 | 3 | 7 |
| | 2 | 2 | 1 |
| | 3 | 4 | 0 |
Table 1 Clearly, each day contains several transactions. This is the data as stored in a standard OLTP system. However, the data warehouse might not record this level of detail. Instead, it could summarize, or aggregate, the data to daily totals. The records in the data warehouse might look something like this:
| | Quantity Sold |
| Date | Dog Food | Cat Food |
| 4/24/99 | 15 | 13 |
| 4/25/99 | 9 | |
Table 2 This summarization of data reduces the number of records by aggregating the individual transaction records into daily records that show the number of each product purchased each day.
In this simple example, it is easy to drive Table 2 simply by running a query against Table 1. However, many complexities enter the picture that will be discussed later.
Aggregations
There is no magic to the term "aggregations." It simply means a summarized, typically additive value. The level of aggregation in a star schema depends on the scenario. Many star schemas are aggregated to some base level, called the grain, although this is becoming somewhat less common as developers rely on cube building engines to summarize to a base level of granularity.
OLTP Systems
OLTP, or Online Transaction Processing, systems are standard, normalized databases. OLTP systems are optimized for inserts, updates, and deletes; in other words, transactions. Transactions in this context can be thought of as the entry, update, or deletion of a record or set of records.
OLTP systems achieve greater speed of transactions through a couple of means: they minimize repeated data, and they limit the number of indexes. The minimization of repeated data is one of the primary drivers behind normalization.
When examining an order, systems typically break orders down into an order header and then a series of detail records. The header contains information such as an order number, a bill-to address, a ship-to address, a PO number, and other fields. An order detail record is usually a product number, a product description, the quantity ordered, the unit price, the total price, and other fields. Here is what an order might look like:

Figure 1 The data stored for this order looks very different. If stored in a flat structure, the detail records look something like this:
| Order Number | Order Date | Customer ID | Customer Name | Customer Address | Customer City |
| 12345 | 4/24/99 | 451 | ACME Products | 123 Main Street | Louisville |
| Customer State | Customer Zip | Contact Name | Contact Number | Product ID | Product Name |
| KY | 40202 | Jane Doe | 502-555-1212 | A13J2 | Widget |
| Product Description | Category | SubCategory | Product Price | Quantity Ordered | Etc… |
| ¼” Brass Widget | Brass Goods | Widgets | $1.00 | 200 | Etc… |
Table 3 Notice, however, that for each detail, much of the data is being repeated: the entire customer address, the contact information, the product information, and so forth. All of this information is needed for each detail record, but the system should have to store all the customer and product information for each record. Relational technology allows each detail record to tie to a header record, without having to repeat the header information in each detail record. The new detail records might look like this:
| Order Number | Product Number | Quantity Ordered |
| 12473 | A4R12J | 200 |
Table 4 A simplified logical view of the tables might look something like this: