I don’t usually blog about politics here, but when bad data management and bad people mix, it’s time for a post…
Toronto Star reporter Robyn Doolittle has reported that my world famous (infamous?) mayor Rob Ford, may have lost all the data from his previous election campaign.
Councillor Doug Ford claims the mayor’s former campaign manager, Nick Kouvalis, is refusing to turn over valuable 2010 voter database information.
Kouvalis, who also served for a time as Ford’s chief of staff, is now working for the John Tory campaign. The man who actually ran the database aspect of Rob Ford’s first mayoral campaign says the Fords were given everything right after the election.
“I made two DVDs with all of the data from the campaign — entire voters’ list with contact info, supporters, non-supporters, signs, volunteers, all voter contact records, etc. — and gave them both to Doug Ford,” said Conservative data expert Mitch Wexler.
If it is in fact gone, it would be a serious blow to the mayor’s re-election hopes. Numerous political strategists involved in the 2010 race say what helped set Ford apart was that voter intelligence, much of it collected by Ford himself over his 10 years as a councillor in Etobicoke.
I’ll try to not to comment on the use of the term “voter intelligence.” Just in case you’ve been hiding under a rock (not a crack rock, I presume) our mayor has been in a heap of trouble (NSFW) since he was elected. Actually, even before he was elected. This isn’t a partisan thing when I say I’m not a fan of my mayor. This is all about not respecting his behaviour. But back to the data thing….
Where Rob Ford’s Data Management Went Wrong
Well, pretty much every single thing he has done has been wrong. At least it feels that way. And sounds and views that way. But if we focus on today’s issue of his reported data loss, I’m thinking he messed up by:
- Giving source data to an external party without a backup. When Ford handed over those record boxes full of 10 years of handwritten notes, he lost his source data. All data deserves protection, even handwritten notes. We in IT sometimes ignore paper data, but we shouldn’t. It’s still data.
- Storing personally identifiable and sensitive data insecurely. I’m betting those file boxes where sitting next to his desk. Sure, his desk is in city hall and I’m betting they have decent physical security. But file boxes aren’t exactly locked cabinets. They also have a way of getting disposed of incorrectly.
- Outsourcing data and database management without getting copies of data on a regular basis. It’s sort of crazy to hand over critical data to a third party for management and not insisting that you get copies of it on a regular basis. Even if your relationship is strong, people leave companies or they stop working for you (as we see in Rob Ford’s case). Have you been getting data, models, code, documents from your vendors on a regular basis? You should.
- Using data collected for a specific reason for another reason. Allegedly this data was collected by Ford in fulfilling his duties as city councillor. I’m not sure whether that means it can be used for fundraising and vote elicitation. Sounds off to me. I wonder if all those people who called Ford asking for help with their trash collection and dead raccoon needs knew they were being added to a campaign database.
- Waiting until he needed the data to ask for it. It appears that the Ford brothers waited until it was time to campaign to play “who has the data”. It would be entirely possible (maybe even legally or ethically required) for the outsourcer to destroy all copies of the data when their work ended and the data was given back to Ford.
- Getting copies of data and losing them. It’s reported that the data was provided to Rob Ford’s brother, Councillor Doug Ford. But it appears he lost the data. That’s not good. Where are those DVDs now? Again, this indicates that private and sensitive data probably wasn’t treated with the respect it deserves.
As data professionals, I believe it’s our job to ensure that all data is properly managed and protected. That means monitoring paper and digital data, ensuring that good data management practices are followed, and ensuring that these practices are followed even when we outsource these activities. Please, go find out if anyone in your organization is doing a better job than Rob Ford is. You might be shocked at what you find.
Today is Valentine’s Day in many parts of the the world. That means either you are looking forward to a happy day full of fun and a night full of …fun… or you are are planning on catching up with Frank Underwood on Netflix. Both sound great to me.
Last year I wrote about 5 Naughty and Nice Ways to Love Your Data. This year I’m going to focus on ways you can romance your data for a stronger, more lasting relationship. So I’m assuming in the past you’ve followed my advice and have long since left the honeymoon phase of your data coffee dates. But where are you now? Are you starting to feel like maybe you need some more passion with your bits and bytes? I’m here to help.
1. Tell your data you love it. Often.
Heck, even show it you love it. Maybe one of the reasons your data has let itself go is that you haven’t told it how much you love it. Do you even remember the things you used to say to woo your data when you first met? Do you have actively managed data models: conceptual, logical, and physical? Do you give your database objects great names? Do you keep good metadata about this data? Do you follow data model-driven development? If you did all these in your early years of your relationship, are you still doing all that now? Are you doing all this in a modern way, not just the way you did it in 1980? Do you just talk a good game, but fail when it comes to actively showing it love?
Some day, when I’m awfully low,
When the query is slow,
I will feel a glow just charting you
And the way you look tonight.
You’re lovely, with your axes so true
And your bars so blue
There is nothing for me but to report you,
And the way you look tonight.
With each crow’s foot your normalization grows,
Tearing my pages apart
And that CHAR that wraps your text,
Touches my foolish heart.
Yes you’re lovely, never, ever refactor
Keep that structured charm.
Won’t you never change it?
‘Cause I love you
Just the way you look tonight.
2. Stop with the games.
We’ve all seen it in personal relationships. One person makes everything a game. Do you store your data in one format, say ZIPCodes as INTEGERS, but have to pad out all those missing leading zeros every time you have to deal with North Eastern United States postal codes? Stop doing that. Do you pretend that doing something faster is always better than doing it good enough? Forget perfect. Good enough. Do you tell management you have data models but all you really do is reverse engineer them? It’s all games.
I don’t know, I don’t know if numbers are REAL
Been a LONG night and something ain’t right
You won’t SHOWPLAN, you won’t SHOWPLAN how you feel
No DATETIME ever seems right
To talk about the reasons why CAST and I fight
It’s DATETIME to end the TIMESTAMP
Put an end to this game before it’s too late
Data games, it’s you and me baby
Data games, and I can’t take it anymore
Data games, I don’t wanna play the…
3. Know where your data lives.
Do you have an active inventory of what data resides where? No? How can you romance data you don’t know about? If a server walked out the door of your organization, how long would it take you to figure out what was on it? If a user had a legal need to access all the data the company held about a customer, would you be able to tell them? If you really wanted a happy strong relationship with your data, you’d know. Yes, it’s a lot of data to track where you data is. That’s why they invented tools that do this. And why data professionals are expected to use them.
Data is bigger
It’s bigger than the drives and they are not PB
The servers it is spread to
The bits in your drives
Oh no, I’ve duplicated too much
I set it up
That’s me in the ETL
That’s me in the database
Losing my governance
Trying to keep up with it all
And I don’t know if I can do it
Oh no, I’ve deployed too much
I haven’t documented enough
I thought that I heard you laughing
I thought that I heard you coughing
I think, I thought, I saw you cry
4. Stop faking it.
Yeah, sometimes little white lies are good for a relationship (BTW, You DO Look Beautiful!). But the big ones? Nope, never. The paranoia about NULLs often leads to a lot of lying. Do you pretend that NULLs don’t exist by giving them various fake values like 999999 or N/A, UNKNOWN, WHO KNOWS or __ ? Does every developer get to choose their own NULL Imposter Text? Are your aggregates all a huge lie due to all those zeros and 1980s dates you use to lie to your database? Stop it. It’s not helping that your queries are 2 ms faster when the data is one big lie.
Late at night a big database gets slower
I guess every normal form has its price
And it breaks her data to think her love is
Only given to a user with queries as fragile as ice
So it tells me it all adds up just fine
To aggregate the sales numbers for every town
But only the dev knows where those NULL have been killed
And it’s is headed for the cheatin’ UNKNOWN town
You can’t hide your lyin’ nines
And your N/A is a thin disguise
I thought by now you’d realize
There ain’t no way to hide your lyin underlines….
5. Protect it.
Do you l et just anyone throw code at your data without ensuring it’s treated right? Do you participate in security and privacy reviews of application code? You have those, right? Do you have metadata that describes the privacy and sensitive data requirements for each data element? Do you ensure that things like SQL injection tests happen for every application?
Oh where, oh where can my data be?
The dev took her away from me.
She’s gone to pastebin, so I’m gonna be sad,
So I can see my data, by now I’m so mad.
We were out on a date in my modelling tool,
I had been too much a fool.
There in the database, all laid out,
a data was there, the database queried by a lout.
The dev allowed the inject, the data failed to be right.
I’ll never forget, the sound that night–
the screamin users, the bustin app,
the painful scream that I– heard crash.
Oh where, oh where can my data be?
The dev took her away from me.
She’s gone to pastebin, so I’m gonna be sad,
So I can see my data when my new job is had.
Keep saying it. Keep doing it.
There’s so much more you can do to revitalize your relationship with data. But if you do these, your data will keep on loving you back. I promise. Remember, you data wants to love you back. It’s up to you to make sure it’s still there in morning.
In mid-January I came across a link to a story about a new book by Random House called Barbie I Can Be…A Computer Engineer. As you know, I travel with a Computer Engineer Barbie (@data_model) and Venus Barbie (@venusbarbie) in my work advocating that girls take more STEM courses. So let’s say I have a strong interest in making sure my wonder girl Barbie has a great book.
But the story said that the book actually put Barbie in not so great place. So I bought the book and read it. And it made me cringe. I read it a few times and decided it needed to be fixed. Or in Computer Engineering terms, it need to be refactored.
So that’s what I’ve done. In this review of Barbie I Can Be…A Computer Engineer, I will point out the parts that set a lousy role model for girls and offer suggestions on how it can be refactored to make it better. Just like in software refactoring, I’m not going to change the functionality of the book, but I’m going to improve the
code words to leave it better.
And to make it easy for you to fix you copy, I’ve included a Refactoring Computer Engineer Barbie PDF. You are welcome.
Synopsis (SPOILER ALERT!)
Barbie is working on a design for a new puppy computer game when her laptop catches a virus. Luckily, she wears a heart USB drive around her neck and has backups of her files. So she uses her little sister’s (Skipper) laptop to try to retrieve the files. Oh, CURSORS! she has infected Skipper’s laptop, too. She promises to make it all right and rushes off to school to ask her computer teacher (who is a female!) how to fix it. Her teacher gives her some tips and Barbie heads to the library to get get both her data and Skipper’s data back. She gets two friends to help and they get it done. Skipper, with her restored data, makes an excellent presentation in her class where she says that Barbie is the person she most admires. Cue tears. Barbie presents her game in computer class. She does such a wonderful job, her teacher even gives her extra credit.
Well that sounds Awesome! Isn’t it?
Sounds like a great story with good female leadership, doesn’t it? Female teacher, Barbie and friends fix the problem, Skipper and Barbie give great presentations. We need more great females to speak, right? Well, just like in database design, the Devil is in the details.
Unfortunately, some of the details really make it look like Barbie is more of a Booth Babe than a Computer Engineer. This is making the IT community cringe. Twitter is blowing up with campaigns to get the book removed from shelves or to get Random House to fix it. Well, I’m going to save Random-House the effort by
fixing refactoring it for them. It’s one thing to raise the issue, but as a designer-architect-project manager-methodologist-computer engineer, I just want to FIX it.
Let’s start with the first troublesome passage:
Computer Engineer Barbie Laughs and is Needy
"I’m designing a game that shows kids how computers work", explains Barbie. "You can make a robot puppy do cute tricks by matching up a color blocks!"
"Your robot puppy is so sweet," says Skipper. "Can I play your game?"
"I’m only creating the design ideas," Barbie says, laughing. "I’ll need Steven’s and Brian’s help to turn it into a real game."
That last line is a problem. First, saying “I’m only” makes it look like design work is some how lesser than building. I know there are some techs out there that would agree with that, but it’s still not true. In fact, in technical professions, the designer / architect is the senior position on the project. Secondly, she is laughing this line, as if it is hilarious to think that Barbie can build something. Finally, Steven and Brian are recurring characters throughout the I Can Be… book series. They are friends and friends help each other. But this passage seems to reinforce a position that boys build, girls draw.
So I’ve refactored this passage by changing out that line with this one:
"Not yet," explains Barbie. "I need to finish the design then work with Steven and Brian to turn it into a game."
See how that says basically the same thing, but it doesn’t devalue Barbie’s design work? It also reinforces the more realistic situation that teams work together to make a product. Barbie doesn’t “need help”; she is part of a team to get it done.
Steve and Brian Will Get It Done Faster
After class Barbie meets with Steven and Brian in the library.
"Hi guys!" says Barbie. "I tried to send you my designs but I ended up crashing my laptop and Skipper’s, too. I need to get back to lost files and repair both of our laptops."
"It will go faster if Brian and I help," offers Steven.
This last line could be interpreted that Steven and Brian, not Barbie, can get this done faster. I realize this is just one interpretation and the intention could be that if everyone works together, we can get it done faster. We know in software engineering this may or may not be true – in form of the Mythical Man Month. But in general, three people fixing two laptops might make this all go faster – debugging, troubleshooting, copying files and those sorts of things typically do turn out better with more people at the desk.
But I’m still concerned about the fact that the less generous interpretation could be that boys can fix things; girls just come to them with their problems. So I’ve refactored this to say:
"We can all work on this together; it will be faster," says Steven.
The work continues with this on the next page:
"I got Skipper’s assignment from the hard drive!" exclaimed Steven.
"Fantastic!" says Barbie. "And her other files as well?"
"I got everything," says Steven. "Now let’s retrieve the files from your hard drive. Both laptops will be good is new in no time!"
It’s here where the dialogue really makes it look like Steven did all the work and Barbie waited anxiously for the results of his work. So I’ve refactored these to show Barbie being more engaged in the process. Not just the Holder of the Compact Disc.
"We’ve got Skipper’s assignment from the hard drive!" exclaimed Steven.
"Fantastic!" says Barbie. "Let me get her other files as well!"
"Great! Now we’ve got everything," says Steven.
See how Barbie has a more engaged role here? No confusion about her fixing this problem, too.
One More Thing…
One of the key things that an engineer should do when disasters happen is to ensure that it never happens again. One of the steps missing from this story is making sure Barbie and Skipper’s laptops are safe from future viruses. So I’ve added a new line to a passage:
The next morning Barbie gives her sister a big surprise. Skipper turns on your laptop – and it works!
"My lost assignment! cries Skipper. "You are just too cool, Barbie. You fixed my computer and saved my homework!
"I set up new security software on both laptops to make sure this doesn’t happen again," exclaims Barbie.
Skipper gives Barbie a huge hug.
You can’t just retrieve the files; you have to ensure those pesky viruses don’t come back.
How Do We Fix the Book, Though?
I fixed my copy by refactoring the printed pages. You can do that, too. I’m sharing the Refactoring Computer Engineer Barbie PDF I created with the refactored dialogue. Just print it on sticker paper and cut out the revised sections to update your copy of the book. You might also want to head over to read that open letter to Random House, too.
I love my Technical Barbies and I want girls (and their parents) to have great role models in real life, not just with
dolls action figures. So books like this need the Best Practices in their writing. I hope you do, too.
I have another post coming about the computer security parts of this story. But for now, go fix your copy of this book. Don’t leave it sitting around in production, waiting for someone to read it when it’s wrong. Love your Data and Love your @Data_Model.
Update: It appears that this chart and other data visualizations have been removed from the website and report. I’m hoping that means that the authors will be refactoring them with improved graphics. Meanwhile, I’m going to leave my post below as is. There are good lessons and tips to be shared.
I know. I hear you. It’s still January and we might just have a winner, one that will be impossible to beat during the next 12 months. Incredible. As you may recall, in late 2011 I awarded Stupidest Bar Chart to a doozy from Klout. That bar chart was confusing, but not in the way this one is. First, put down your beverage of choice. Then take a look at this:
Yeah. That…chart. It’s kind of like a horizontal stacked bar chart. I don’t understand anything about it, though. This chart comes from an infographic at Deloitte.com on Analysis Trends for 2014.
Maybe zooming in might help?
Nope, doesn’t make it any clearer. In fact, it’s just as crazy, but bigger. Call it Big Crazy DataTM.
Here are the issues and questions I have about it:
- What do the colours mean? If this were a stacked bar chart, the grey and blue areas would indicate different data. It appears that only some sections have data. But I’m not sure.
- What is the scale? Normally a bar chart would have an axis that indicates some measure and all the bars would be graphed against that axis. This has no axis.
- Why do some bars have signed numbers and one have a range? Why are some numbers unsigned? Even some delta numbers are unsigned.
- What do the relative sizes of the sections mean? In one bar we see a blue section labeled 285, but it’s larger than a section labeled 425-475.
- Where numbers appear, do they describe the section they are on or the section next to the number? I’m not sure
- What does the relative position of the blue section mean? I’m not sure.
- Why are some of the labels in light grey and some in dark grey? I’m not sure
- What are the units of measurement for these numbers? Are some percentages? Units of 1000s? 100,000s? Are they of people? Positions? Something else? I’m not sure.
- Do the endnotes there explain the numbers? No, they are just citations for reference materials used to create the report.
Maybe the chart has an explanation inside the full document, Analytics Trends 2014: (And why some may not materialize)… No, same chart, no text that directly explains any of the numbers. To add some irony to this, the report itself is about Analytics and even covers trends in visualizations.
A Picture is Worth A Thousand Words, Unfortunately.
The report has something to say about data visualizations used in data analytics:
There’s no question that visualization has become a critical capability for organizations of virtually every shape and size. Easy-to-use software makes complex data accessible and understandable for almost any business user. From discovery and visual exploration to pattern and relationship identification, today’s visualization tools easily affirm the adage that a picture is worth a thousand words. Or, in this case, numbers.
This is especially true with big data, where visualization may even be a necessary capability for driving insights. That’s why visually oriented tools are rising in prominence for many big data applications. Users get to understand, explore, share, and apply data efficiently and collaboratively—often without the need for analytics professionals. And that’s where the risk comes in. In their eagerness to dive into data, users may choose polished graphics over thorough data preparation and normalization and rigorous analysis—glossing over important insights and analysis opportunities and potentially producing erroneous results. [emphasis mine]
Keep reading the report from that section. The irony burns.
What’s Going on with this Bar Chart?
I’d bet that the Analytics professionals at Deloitte know much better than this. The webpage and report for Analytics trends is beautiful to look at. I’m guessing that a graphics designer has taken these numbers and created a beautiful, yet meaningless graphic with numbers. And just as the report predicts, people who don’t understand how to best use visualizations can gloss over important insights and analysis opportunities and potentially produce erroneous results. This report has some great points. And it’s pretty. Very, very pretty. But the distraction of bad visualizations makes difficult for me to actually see the points the authors are trying to make.
My guess is also that this data, as a set, had no business being put together in one chart. I’m not sure, but they don’t seem to have the same measures or even be the same type of data. So putting them in one chart won’t help. This was a page in a report needing a graphic, so someone made one.
Jamie Calder ( @jamiecalder) helped me “see” the story this chart is trying to tell: think of it as a math equation. That might get you there. But it’s still not an appropriate use of a bar chart. And Josh Fennessy ( @joshuafennessy) has pointed out that this isn’t supposed to be a bar chart at all. It’s supposed to be a waterfall chart. But it’s dressed up as a bar chart, so I’m going to still leave as a contender for Worst Bar Chart of 2014. Let’s just call it a self-nominated chart. Martin Ribunal has found what is most likely the original chart from which this chart was most likely
copied inspired by and has listed that in comments below.
What Have We Learned About Data Visualizations?
- The best data analysis can be invalidated with bad data visualizations.
- If you develop content, insist that you say in the final published work. I know this is difficult in large corporate entities, but it’s important to ensuring that your goals are met.
- The more accessible we make self-serve BI and data visualization tools available, the more responsibility we have to educate, train, and mentor those using these tools.
- Show your visualizations to other people. Ask them what they see. Ask them if they are confused, what conclusions they might have and what questions they still have.
- Choose the right chart type to fit your data. Then use that chart correctly.
- If you needs a graphic image, don’t mimic a recognized chart type.
- If you add a chart to a document, you should actual reference it in the text in the way that helps the reader understand it.
- If your chart has numbers, you have to say what those are number of, including some sort of unit of measure. And your graphics should correctly portray their relative size.
- If a chart leaves viewers saying “I’m not sure” more than once, it’s not working.
- Loving your data means loving how it is presented, too.
What Would You Ask?
What other questions do you have about this…graphic.? How would you improve it?
I can’t bring myself to call it a bar chart any more. But it’s still dressed as a bar chart, so it fits the nomination category. If you find a bar chart or any other data visualization to nominate, let me know. I wouldn’t want something worse than this one to go unrecognized.
Remember this the next time you see some statistics. Or a report. Or anything that appears on Wikipedia.
My friend and debate
foe victim Thomas LaRock ( blog | @sqlrockstar) sent you some last minute gift advice for organizations who need your help. That got me thinking, so I made a list, too. I checked it three times because I’m thorough like that. I’ve based this list on observations and data I’ve collected over the last year. Much like you do. Except I didn’t use that creepy shelf elf guy.
Just in case you didn’t collect enough data, or it got lost in some hard drive failure:
For Twitter: A way to restore those 20k Tweets of mine they lost a few years ago. Oh, wait…probably better that they aren’t part of the firehose any more.
For Data Modeling tool vendors: A copy of my multi-volume set of enhancement requests for supporting new database and datastore types. It’s nearly 2014; it’s time we data architects were able to do our job regardless of the target technology.
For the NoSQL Community: A love of data so strong that they can tolerate the mere mention of relational database solutions when appropriate.
For Microsoft: A brilliant new name for Windows Azure SQL Database. I can’t keep saying that in my presentations. Even WASD is difficult.
For Oracle, IBM, and Sybase users: A community as active and helpful on social media as those in the SQL community.
For United Airlines: A set of laminated copies for each employee of your Star Alliance agreement to remind you that 125k flyers do indeed get perks on your airline. First World Problems, I know.
For Star Alliance airlines, including Air Canada: A "Wifi in the Sky for Dummies" book. Tell them to get cracking.
For NASA: That other half a penny.
For Justin Bieber: US citizenship and a ranch in SoCal. And a plane ticket. No monkeys.
For Rob Ford: NULL. Not even coal.
For my readers and their customers: A year in which their data is much loved, in the right place, at the right time, with the right granularity and the right integrity. Or World Peace. Probably easier.
(Kitty is my Starbucks name. I figure Santa knows that already.)
Subscribe via E-mail
- February 2015
- January 2015
- December 2014
- November 2014
- October 2014
- August 2014
- July 2014
- June 2014
- May 2014
- April 2014
- March 2014
- February 2014
- January 2014
- December 2013
- November 2013
- October 2013
- September 2013
- August 2013
- July 2013
- June 2013
- May 2013
- April 2013
- March 2013
- February 2013
- January 2013
- December 2012
- November 2012
- October 2012
- September 2012
- August 2012
- July 2012
- June 2012
- May 2012
- April 2012
- March 2012
- February 2012
- January 2012
- December 2011
- November 2011
- October 2011
- September 2011
- August 2011
- July 2011
- June 2011
- May 2011
- April 2011
- March 2011
- February 2011
- January 2011
- December 2010
- November 2010
- September 2010
- August 2010
- July 2010