Data Mining with SQL Server 2005 by ZhaoHui Tang, Jamie MacLennan


Data Mining with SQL Server 2005

  • Author: ZhaoHui Tang, Jamie MacLennan
  • Softcover: 480 pages
  • Publisher: Wiley(September 26, 2005)
  • Book dimensions: 9.2 x 7.4 x 1.1 inches
  • ISBN: 0471462616
  • ITBookworm score: 4.0 /5

Synopsis

The only word for this book is ‘tremendous’. Learn data mining from the experts at Microsoft. I have never read a book on Microsoft data mining that digs into the detail this one does. It’s not what I’d consider a tutorial, though it does have some step by step walkthroughs. This book is mainly a thorough explanation of the Microsoft data mining framework, its algorithms, and its methods. Before I read this book, I knew a few things about mining, but I never knew the difference between all of the algorithms. I really don’t feel that they’re all explained equally well, but they’re all explained well enough and like I said, some in very good detail. I really wish it was more of an end to end tutorial for beginners, but I have still learned more from this book than any other. Another area I felt where this book could have done a better job is in covering the new features of SQL Server 2005. Every now and then it hints at new features, but I do think they could have done a better job at highlighting them. I think it’s that one detail that makes this more of a book on Microsoft data mining that happens to take place in SQL Server 2005 than it is a book specifically on data mining in SQL Server 2005.

Who is this book really for?

This book is not for someone who knows absolutely nothing about data mining or databases. It does spend time going over the basics, but you just won’t get the detailed instruction here that you would from a book designed to be an actual tutorial and teach you from the ground up. If you already have some experience with mining, and you want to really understand the details of each of the algorithms, then this book is for you. If you are a beginner however, this book would be very good for you to read in conjunction with a tutorial on data mining. So basically, if you’re involved with data mining at all, you need this book.

Writing Style

The style of this book is very straightforward. There aren’t any amusing stories, or jokes. The examples are really good and they get the point across, but I wouldn’t call them amusing. I’d say the style is somewhat that of BOL or MSDN in that it just presents the information and doesn’t make any real attempt to be interesting. That’s not necessarily a bad thing though. Unnecessary corniness is annoying, and this is a very difficult subject to grasp so there may not be any real room for a lot of extra talk. Then again, it’s always nice to read an interesting book no matter what the topic. If you’re interesting in data mining though, it’s not a boring read, just not amusing.

Condition

This book is made very well. I bent the binding and stacked books on top of it open and it withstood the abuse quite well. This is a book that will last quite a while.

Table of Contents

Chapter 1: Introduction to Data Mining

This first chapter really sets an excellent tone for the book. This is really a very complete introduction to data mining in general and if you really don’t know much about it at all, then you’ll walk away from this chapter knowing what it’s all about. It describes what it’s for, what it does, where it’s applicable, etc. It also lists all the major players in the market. Here are some specifics. It starts out with a good definition and some quality examples to help you understand what data mining actually is. It then gets into business problems that can be solved using data mining including churn analysis, cross-selling, fraud detection, etc. Next you get into some mining tasks like classification, clustering, and association. These are some of the most basic terms in mining, and they’re fully explained with easy to understand examples. Then you get into a very good discussion on mining techniques. It starts out with data flow, but very quickly moves into a detailed description of the whole data mining project cycle. The different steps are listed and discussed in some detail. Sometimes I think too much information is given, but so rarely the case in books anymore that I’m not going to complain about it. What it does mean though is that you’ll be able to use this book as a reference as your skills increase. Now you get into a very good discussion on the mining market as it stands today. The major players (vendors) are all given attention with a very good description of what they do and how good they are at it. The natural flow of this chapter leads you next to a discussion on current issues in the mining field. Along with this, you get into mining standards. OLE DB for DM and the Java API are both discussed. PMML is explained with even a code sample of the XML it produces so you can get an idea of what you’re working with. This is all in answer to the ‘current problems’ section. Many problems for new technologies like this stem from a lack of standards, and data mining has reached its time in the marketplace. It’s definitely time to standardize. Next you get into mining methodologies like Crisp-DM and CWM. That takes you straight into new trends in the mining field. You really get a sense of where things are going and why you should care. In all, this is an excellent chapter. If you’re one of those people who gets excited about a new book and reads the first couple chapters and then loses interest, then this chapter will definitely suit your needs. In fact, you could probably pass a job interview quite nicely on what you learn in this chapter alone.

Chapter 2: OLE DB for Data Mining

Starts out with a nice intro to OLE DB. Many DBAs really don’t have much of a grasp on what OLE DB really is, and how it fits into the scheme of things, and this gives kind of a ground up explanation of what it is and its structure. The authors also make it a special point to reveal Microsoft’s theory on data mining. That theory is simple: data mining has been around for a long time, but so far has been so complicated that it has been restricted to Ph.D.s in statistics. No mere mortal has been able to get a firm grasp on it. Microsoft is trying to change all that by bringing it to the masses just like they did with databases. Make no mistake though. No matter how simple they try to make data mining it’s still a very complicated subject, and the Ph.D.s will always be better at it than we are. At least they’re trying to make it more understandable, and for that, I commend them. Basic concepts come into play next. The case, case key, nested key, etc are all explained well… though mostly in relational terms, so a knowledge of relational databases is necessary. It then briefly goes into the different column types you can have in your case and then moves directly into a discussion of the different stages of the model lifecycle. Next you get into an introduction to DMX (Data Mining eXtenstions), the language used to create and query mining models. There’s a really interesting note on the methods of descretization that I really hope they elaborate on later in the book. The four methods are Clusters, EqualAreas, Thresholds, Automatic, and a brief description of each is given. DMX for model creation is discussed, with enough examples to get a taste. Next, DMX for model training is given a little ink. There are only a couple examples, but again, you’ll get the idea. DMX for model prediction is actually given quite a bit of space. It gives brief examples of query syntax, and joins. Next, and probably the biggest section of the entire chapter, is prediction functions. All types of functions are discussed. For the most part though, it’s just a table listing each one with a brief description of its usage. There is a little discussion in between each one, and the tables are broken down by category for example, functions on scalar columns, or table columns, etc. Next there’s a brief description of singleton queries with a couple examples. And while not a lot of time is given to them, you again at least get the idea of what’s going on with them. Now you get into several pages on understanding schema rowsets. All the different types of rowsets are discussed with tables of the functions used to work with them.

Chapter 3: Using SQL Server Data Mining

This is kind of a long chapter that starts out walking you through the BI interface of Yukon. It’s just a basic walkthrough that shows what all the windows and panes do. It also explains a very important aspect of developing in this environment, and that’s the difference between offline and immediate mode. I really don’t know why it’s not offline and online mode, but the whole point is that one mode is connected directly to the DB and makes changes instantly, and the other is on your workstation and changes have to be pushed to the server. I’ll let you decide which is which. I do like the way that it gives you step by step directions on how to start each mode, and even how to switch between them. Next it gets into setting up data sources for your projects. It discusses what data sources are, the security modes available, and gives a quick walkthrough on how to create one. It also discusses data source views (DSVs) and how to create them. It then moves into named calculations and shows some samples of different mathematical functions, case expressions, etc. It’s all pretty basic though, so any decent DBA will fly through this section. Next it discusses working with named queries and how to create them. Creating and editing models is given some attention next. The data mining wizard is discussed in some decent detail and screenshots are given to show what the different screens look like. Next different aspects of working inside the interface are covered. It shows step by step how to create a model and work with it in different ways. Now you get into processing the model. It talks about what processing is and what the different options are for processing a model. One topic I found very interesting was the discussion on understanding the model viewers and the mining accuracy chart. It gives some very nice discussion on where it would be useful to use mining models and how the viewers and charts can be read. Next, there are several small sections on the prediction builder, reporting, and the management studio.

Chapter 4: Microsoft Naïve Bayes

This is kind of a short chapter, but it gets right to it. It starts out giving an intro to the Naïve Bayes algorithm… who created it, what it’s used for, etc. It really gives a very good description and even shows you the formula it uses and talks you through an explanation. It gives plenty of examples of each point to make sure you understand the concepts it’s explaining. To actually understand what this algorithm does, you’ll never find a better work. Next, it goes into the parameters the algorithm takes. Being such a simple algorithm it doesn’t really have a lot of parameters it can take, but each one is discussed adequately. Now there’s a brief section on the DMX needed to work with it. There’s nothing special about it, but it would be incomplete if it didn’t show you how to use it too. MY favorite section of the chapter is this last one on understanding the Naïve Bayes content. It shows you the 4 different views from the model viewer and explains how to read each one. It also gives you some hints about working with the GUI where appropriate. In all, this is a very worthwhile chapter, and I can honestly say I learned something. I may not run out and become a data mining guru, but I now know what this algorithm is all about which was the point of this section.

Chapter 5: Microsoft Decision Trees

This starts out well enough with examples of what this algorithm is all about, who created it, and how it basically works. It then moves into tree growth and gives some pretty good examples. Don’t the wrong idea though. This is a complicated topic, and this algorithm is either harder to really grasp, or it just isn’t explained as well as the last chapter was. I think this is a good chapter, I just don’t think that it’s geared toward the absolute beginner. It then goes into some advice on working with many states, avoiding overtraining of your models, incorporating prior knowledge, etc. These are all pretty small sections, so they’re clearly more advice than solid proven techniques with coded examples. Regression is discussed next and it gives the classic regression formula. Next the algorithm parameters are discussed and given enough attention to make it worth your while. Now comes the point of using decision trees. DMX queries and syntax are discussed. The different models are talked about next… regression and association. Following the same pattern as the last chapter, it next goes into interpreting the model inside the model viewer. In all, I would say that the examples here are kind of incomplete and that’s one of the things that led to my lack of strong understanding of this algorithm. Many times when an example is given, it’s over long before it has a chance of actually explaining what’s going on.

Chapter 6: Microsoft Time Series

Again, this chapter follows the same basic pattern that all the other algorithm chapters do. This chapter is actually much easier to understand than the last one. So, either time series is much easier to understand, or the author just explains it much better than he did in the last chapter. I will say however, that the examples here are much more plentiful and complete. That said, it starts out talking about the basics of time series with plenty of examples to guide you. Within the first couple pages you will gain a firm understanding of time series and when you would use it. Of course, make no mistake again… there is a level of complication and it quickly starts talking about autoregression (AR). If you’re not schooled in statistics you may not be able to follow along with the AR matrix given. AR trees and seasonality and how SQL Server 2005 handles them are also covered. Next you get into making historical predictions and caching predictions. Next it gets into the algorithm parameters and what they mean. Then DMX with plenty of examples is discussed. The chapter finishes up with how to interpret the model inside the model viewer. In all, this is an excellent chapter, and it’s not necessary to understand everything to be able to use this algorithm effectively.

Chapter 7: Microsoft Clustering

Let me start by saying that the concept of clustering is much easier than actually getting in there a digging deeply into the subject. The author does a very good job of explaining how objects can be clustered and why. The basics are very easy to understand, but they may leave you in the dust when they start talking about ‘simple Euclidean distance’. My favorite statement in the entire book is found on page 190. “Clustering relies on guessing and lying.” It then goes into hard and soft clustering (k-means and expectation maximization). Discrete and scalable clustering is now discussed and they do a pretty good job. Now the algorithm parameters are gone over like in the other chapters. Using clustering models and DMX are both covered after that, and the DMX portion goes into many different aspects like probability, histogram, and likelihood. It finishes up again with understanding cluster models inside the model viewer.

Chapter 8: Microsoft Sequence Clustering

You mathematicians are going to love this. If you’re not a math geek however, you may get a little lost in the dust. By the time you finish this chapter you’ll either be best friends, or best enemies with the heart of the Microsoft Sequence Clustering algorithm—the Markov Chain. The Markov Chain is very integral here, and the bulk of the chapter is dedicated to understanding its different components. Again though, this isn’t an easy topic to master. The chapter is full of equations with brief explanations that don’t really do that much for you. Perhaps fully understanding the Markov Chain is out of the scope of this book, and you’re not really meant to. All the same, for those of you who understand it, here it is, and the rest of us will just have to shut up and take their word for it. Next, we’re already to the algorithm parameter section. That should tell you how much time was taken up with Mr. Markov. Now we get into DMX queries. There are actually a good number of examples given and it does a good job of showing you how to actually use DMX against this algorithm. Finishes up with interpreting the model from the model viewer.

Chapter 9: Microsoft Association Rules

From the previous chapter, sequence clustering is basically the study of sequences of events and objects. It’s not enough to know that milk and bread were purchased together, what’s important also is which one was bought first. So, web clicks and the order you navigated through the site is what’s really important. This chapter, association rules, is really only concerned with which items are purchased together, not the order in which they were bought. That’s the difference in the two algorithms as I understand it. This chapter does a really nice job of explaining the specifics starting with of course, the basic concepts like itemset, probability, importance and the like. This is another one of those chapters that really explains things well, so you shouldn’t have any trouble with this one. Algorithm parameters are next and there are actually a full page of them. DMX queries are covered but there’s not a lot of attention given to them so there’s either not that much to say, or the author felt that this wasn’t the place to get into the advanced mechanisms. And again we finish up with interpreting the model in the viewer.

Chapter 10: Microsoft Neural Network

This is a pretty good introduction to a complicated topic. It gives a good explanation of what the neural network does and the basics of how it compares with the other algorithms. Combination and activation are covered with some nice diagrams that are actually not the easiest to understand, but you get the basic idea without having to understand everything. Next you get into backpropagation and conjugate gradient. Here come more complicated equations that most of us won’t get anything out of, but it’s a good explanation of the concepts. I do feel it could benefit from some more examples, but we take what we can get. Normalization and mapping are next with some more nice equations. Now there’s a full page of the topology of the neural network. Next, the algorithm parameters are discussed with the DMX queries following. There really isn’t much time dedicated to DMX though. It now finishes with interpreting the model inside the viewer. This is the last chapter on the algorithms.

Chapter 11: Mining OLAP Cubes

Starts out by introducing OLAP by comparing it to OLTP. The basic principles of both are given. It’s hard to tell if this is trying to be a beginner’s intro, or just a refresher for pros. All the same, it’s a very high-level view so you won’t be using this to actually teach you modeling. Next it gets into understanding star and snowflakes. There’s really not much time devoted to it, but I think the discussion is quite adequate. Then, specifics are discussed like dimensions and hierarchy, measures and measure groups, etc. Processing, storage, and proactive caching are all discussed next. Again, these are all high-level views, so if you want any real deep detail you’ll need to go either to BOL or to a more specialized resource like a whitepaper. Querying cubes is discussed here with a very brief intro to MDX. This really isn’t meant to be a tutorial on MDX because there’s really not much instruction given at all. Understanding the UDM is given some attention next, and it gives the benefits and explains how it can be used, but doesn’t really tell you how to use it. A very important section here is on understanding the relationship between OLAP and mining. These two things are definitely related and I think a lot of people just don’t understand the relationship very well. The rest of the chapter goes over using the wizard to create mining models. It’s really a nice step by step walkthrough with screenshots and everything. It’s the closest thing this book comes to being an actual beginner’s tutorial.

Chapter 12: Data Mining with SQL Server Integration Services

This is a fabulous chapter for SSIS. There is no other place, including BOL, where you can find the detailed information and screenshots of each of the data mining transformations. It starts out with a very brief overview of SSIS and quickly moves into the transformations. There’s not much more to say… definitely read this chapter and you won’t regret it.

Chapter 13: SQL Server Data Mining Architecture

This chapter is for data mining and OLAP admins. It starts out talkinga bout XMLA and the XMLA APIs. There are 2 APIs, discover and execute, and it devotes a section to each of them to discuss the specifics individually. There’s even some decent sample code given. It then goes into XMLA and Analysis Services and processing architecture. Next a very important section… administration. It goes over server config and security. There’s not a lot said about security, but the high-level overview is given, and if you want a more detailed account, you may check BOL.

Chapter 14: Programming SQL Server Data Mining

This chapter is a very nice intro to all of the APIs that can be used to program against data mining. It gives a very nice explanation of each of the APIs telling what they are and when they’re used. It then goes into various programming topics with plenty of example code to illustrate. Basically, you’re winding up here by showing everything that you’ve already done with wizards in code.

Chapter 15: Implementing a Web Cross-Selling Application

Lays the basic groundwork for coding a web app that cross-sells products. Cross-selling is like when you make recommendations based off of previous purchases, or clicks.

Chapter 16: Advanced Forecasting Using Microsoft Excel

Pretty self-explanatory here. It discusses how to work with Excel for advanced forecasting. I’m not going into too many specifics though… if you get this far you’ll know whether you need this chapter or not.

Chapter 17: Extending SQL Server Data Mining

Discusses the plug-in algorithm framework so that you can extend the functionality as needed.

Chapter 18: Conclusion and Additional Resources

Just about 4 pages recapping the concepts from the rest of the book and giving some very good recommendations for further reading.

-Sean McCown, ITBookworm.com

Tags: ,

Leave a Reply

Your email address will not be published. Required fields are marked *

*