Your company wants to “do data science” ** or “become a data company”. These are definitely noble and well intentioned goals. Not trying to be negative, but I’m going to bet that your company probably isn’t ready for data science. And that’s perfectly OK.
For the past several years, I’ve led data initiatives for companies that want to leap into being “data companies” that “do data science”. In my experience, the vast majority of companies I’ve seen are nowhere near ready to fully leverage the power of data science. Somehow, these companies have been told that they need to embrace data science in all of its glory. But they haven’t been given the guidance on how to succeed.
Thankfully, incorporating data science into your company is an achievable goal. In this long-winded, high-level post, I’ll pick up from my prior thoughts about how I think companies embarking on data science should take a step back and get some perspective. Future posts will explore these points in more depth. Successful data science comes down to getting back to basics with data.
Get Back To Basics With Data
A lot of my recent work has involved helping companies get back to basics with their data. In several engagements, I was originally brought in to “do data science work”, but it became immediately clear that a lot of foundational work needed to occur before moving on to more sophisticated efforts. Get back to basics. In most cases, the initial task of doing data science was put on hold in order to focus on basic stuff like culture, data quality, and data architecture.
The key ingredients to succeeding with data – data science or otherwise – are a culture that supports data-driven thinking, and decision making using high quality, trusted, and readily available data.
Of course, it’s not enough for me to pontificate about how companies are doing data wrong. I’m going to make some suggestions for succeeding with data in general, and data science specifically.
As you will notice, I place more weight on people, versus technology. In my experience, technology and data science are the far easier parts. Without the right people aligned across a supportive culture, very few things succeed – data or otherwise.
Ingredients For Data (Science) Success
Here are some key points (in order of importance) to determine if you’re ready for data science, along with some suggestions if you’re facing a gap in any of these areas.
Executive support is a mission critical factor for success; I consider it the #1 factor for data success. Support must come from the very top, preferably the CEO. And support can’t just be superficial. The exec(s) must really understand WHY data science will improve the company, as well as the hard work, time, and investment needed to get there. And they must dedicate resources to fulfill this vision.
Executives set the tone and example for the rest of the organization. I’ve seen this top-down approach work because the executive has committed her personal capital toward seeing data succeed. She will ensure efforts succeed and remove any obstacles, such as political silos and other land mines that can sink major company initiatives.
Get executive support early. Keep the executive informed of successes and roadblocks. In turn, the executives will make data transformation a top company priority.
Data driven culture
The excellent book “Winning With Data” emphasizes how critical it is to build a data driven culture. When people can make data-driven decisions, they begin to have high demands upon data. They become data addicted. This feedback loop of data addiction and data satisfaction feeds on itself, to the benefit of the organization.
Contrast this with gut-driven organizations I’ve seen, where success with data happens mostly by accident. Usually people making gut-driven decisions are either intentionally ignorant of data, or they’re just plain ignorant. Either way, this is not an environment where data or good ideas survive. The gut-driven approach will only work until it doesn’t.
In this hyper-competitive, data-driven economy, gut-driven companies need to either become data-driven, or quickly risk becoming ever slower and dumber versus their data-driven competitors. I don’t think any executive or employee consciously wishes this fate for their company. Yet gut-driven behavior continues to manifest itself, likely because it’s an easier default short-term.
A good litmus test to determine if your company is data-driven is to observe if people are making decisions based on data versus anecdotal evidence. If you see an overwhelming amount of the latter, then your company likely is not data-driven.
If your company is not data-driven, then you need to do an honest evaluation of whether a data-driven culture is possible to create within the company. If change is possible, get executive sponsorship to create the change. However, if the culture resists change (the level of executive support will give you a good idea), there is a very good chance the company cannot become data-driven. In this case, you need to make an honest individual decision about your own future with this company. Company culture is everything, and it can easily make or break the success of new initiatives and your career.
It goes without saying that data is the raw ingredient for data science. Yet, I’ve seen many data initiatives get sidelined by data-related issues. Here are a few things to be aware of.
- Data sources. Are the data sources high quality? An easy way to find out is by asking if people trust the data. If they don’t, they’ll likely give you a laundry list of specifics. Start by analyzing these issues to verify if there are in fact data issues. If so, I highly suggest figuring out ways to solve the problem of how bad data enters the source systems. It’s far easier (and common sense) to get good data by enforcing high quality data entry or data ingestion habits at the source.
- Data inventory and systems. How does your company generate data? Is your data fragmented? Is the data structured or unstructured? How frequently is the data generated and made available? Does your data come from a transactional system, or do you have a reporting system as well? Take inventory of how your data moves through your organization. A data steward system can help with this by introspecting your various data sources and showing you how your data is interrelated. You need a good understanding of how the data is related across the organization.
- Data definitions and metrics. Do people in your company have various definitions for the same thing? I remember a humorous (and macabre) incident where executives at a former employer met to define “what is a customer?”. The meeting was supposed to last an hour. Wrangling ensued – “A customer is someone who has purchased from us in the last 90 days” … “A customer is someone who has ever bought from us”. Ultimately, it took several weeks for the executives to settle on having multiple definitions of a customer, as it was impossible to arrive at a single answer. As you can see, having proper data definitions is incredibly important and insanely underrated. Resolve data definition issues early and often.
- Current reporting systems. What reports are people using? Are the reports mainly Excel-based? If not, what reporting system do you use? Are the reports correct? How frequently is the data available for reporting? All of these questions will help you assess the analytical capability and maturity of your company.
- Types of analytics. If your company is using reports, are these reports descriptive (what happened in the past), or predictive (what may happen in the future)? What’s your sophistication level? Do your analytics involve simple arithmetic (adding numbers, dividing others), or do they involve rigorous advanced statistical methods? These questions will frame where your company is along the continuum of data science sophistication.
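To make the “what is a customer?” problem concrete, here is a minimal Python sketch of the two competing definitions from the anecdote above. The purchase log, the names, and the reporting date are all invented for illustration; the point is only that the same data yields different customer counts depending on which definition a report silently uses.

```python
from datetime import date, timedelta

# Hypothetical purchase log: (customer_id, purchase_date). Invented data.
purchases = [
    ("alice", date(2024, 1, 5)),
    ("bob", date(2024, 5, 20)),
    ("carol", date(2023, 3, 14)),
    ("alice", date(2024, 6, 1)),
]


def customers_ever(purchases):
    """Definition A: anyone who has ever bought from us."""
    return {customer for customer, _ in purchases}


def customers_recent(purchases, as_of, window_days=90):
    """Definition B: anyone who purchased in the last `window_days` days."""
    cutoff = as_of - timedelta(days=window_days)
    return {
        customer
        for customer, purchased_on in purchases
        if cutoff <= purchased_on <= as_of
    }


as_of = date(2024, 6, 30)  # assumed reporting date
ever = customers_ever(purchases)
recent = customers_recent(purchases, as_of)

# The two definitions disagree on the same data.
print(f"ever: {len(ever)} customers, last 90 days: {len(recent)} customers")
```

Writing each definition down as a named, documented function is one way to keep multiple definitions from being mixed up in reports.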
Once you’ve assessed the state of your data, you’ll have a good idea of what technologies you’ll need.
If your data is less than ideal, you’ll want to invest in cleaning up existing data, as well as ensure that the quality of new data isn’t compromised. Tools that help with data quality, master data management (MDM), data governance, and data stewardship are tremendously helpful. Clean data is essential for successful data science, so you want to spend a lot of time and effort to make sure you’re producing the highest quality data possible.
If your company has clean data, you will want to consider your data science goals. Do you want to produce better descriptive analysis? Do you want a real time AI to categorize images for your consumer mobile app? Are you planning to make batch predictive models to identify which customers are likely to churn? The data science possibilities are endless, and the specific technologies can differ.
In a general machine learning framework, you will want a data pipeline that can feed data to a predictive model, which will output predictions. This pipeline can be capable of handling ingestion of batch data, real time data, or both. Your predictive model will need to be created, deployed, re-trained, and re-deployed.
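As a rough illustration of that loop (ingest, train, predict, re-train), here is a toy sketch in plain Python. Everything in it is an assumption made for the example: the `ThresholdChurnModel` class, the `ingest` function, and the data are hypothetical, and the “model” is just a midpoint between two averages rather than a real learning algorithm. A production pipeline would look very different, but the shape of the cycle is the same.

```python
from dataclasses import dataclass
from statistics import mean


def ingest(raw_records):
    """Ingestion step: drop records with missing values before modeling."""
    return [(days, churned) for days, churned in raw_records
            if days is not None and churned is not None]


@dataclass
class ThresholdChurnModel:
    """Toy churn model: predicts churn when days since last purchase
    exceeds a threshold learned from historical data."""
    threshold: float = 0.0

    def train(self, rows):
        # rows: (days_since_last_purchase, churned) pairs
        churned = [days for days, did_churn in rows if did_churn]
        retained = [days for days, did_churn in rows if not did_churn]
        # Midpoint between the class means; a stand-in for real model fitting.
        self.threshold = (mean(churned) + mean(retained)) / 2
        return self

    def predict(self, days):
        return days > self.threshold


# Create and "deploy" the first version of the model.
history = [(10, False), (15, False), (None, True), (90, True), (120, True)]
model = ThresholdChurnModel().train(ingest(history))
before = model.predict(45)

# New data arrives, so the model is re-trained and re-deployed.
history += [(40, True), (50, True)]
model.train(ingest(history))
after = model.predict(45)

print(f"churn prediction for 45 days: before={before}, after={after}")
```

The same customer gets a different prediction once the model has seen newer data, which is why re-training and re-deployment belong in the pipeline from the start.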
This leads to the question of how much you want to rely on managing your own technology stack versus 3rd party managed solutions. Will you want to architect and maintain this data pipeline in-house, in the cloud, or use a comprehensive 3rd party solution such as Google Cloud Dataflow? Do you want to handcraft this model, or do you prefer offloading this task to an automated system like Google Cloud AI, Data Robot, or AWS SageMaker?
Again, specifics will vary depending on your use case and goals. Your resources and talent will be the constraints on these decisions. My personal opinion is that unless something is a core competency or differentiator for your company, it’s better to use a managed solution. This allows your data team to focus on leveraging technology to be domain experts, rather than being accidental technology experts who have little time to devote to adding value to your business.
Finally, you’ll want to make data consumable by the organization. I suggest making data as self-serve as possible. There are two paradigms in analytics – ask any question of the data, and get answers to certain questions. Do both. Give the company access to a common data repository where both savvy and non-savvy users can ask whatever they wish from the data.
Make your data easily consumable. If you’re doing real-time analytics or predictions, have a consumption layer for this data, including real time pings on mobile devices, alerts on BI dashboards, emails, Slack alerts, or anything else that makes sense for your organization. For data that is less dynamic, business users will typically consume reports in BI dashboards (mobile and desktop) and spreadsheets.
Side note – make your technology decisions flexible enough that you can avoid lock-in with a particular technology. These days, data technology changes extremely quickly. Best practice – separate your data storage from the compute and intelligence layers.
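One way to picture separating storage from compute is to have the analysis code depend only on a small read interface, so the storage behind it can be swapped out. The sketch below is a minimal illustration in Python; `RecordStore`, `InMemoryStore`, `CsvTextStore`, and `revenue_by_region` are hypothetical names made up for this example, and the CSV text stands in for whatever file, warehouse, or object store you actually use.

```python
import csv
import io
from typing import Iterable, Protocol


class RecordStore(Protocol):
    """Storage layer contract: anything that can yield rows as dicts."""

    def read(self) -> Iterable[dict]: ...


class InMemoryStore:
    """Rows held in a Python list."""

    def __init__(self, rows):
        self._rows = list(rows)

    def read(self):
        return iter(self._rows)


class CsvTextStore:
    """Rows parsed from CSV text; a stand-in for a file or external system."""

    def __init__(self, text):
        self._text = text

    def read(self):
        for row in csv.DictReader(io.StringIO(self._text)):
            yield {"region": row["region"], "amount": float(row["amount"])}


def revenue_by_region(store: RecordStore) -> dict:
    """Compute layer: knows nothing about where the rows come from."""
    totals = {}
    for row in store.read():
        totals[row["region"]] = totals.get(row["region"], 0.0) + row["amount"]
    return totals


memory_store = InMemoryStore([
    {"region": "west", "amount": 10.5},
    {"region": "east", "amount": 4.0},
    {"region": "west", "amount": 2.0},
])
csv_store = CsvTextStore("region,amount\nwest,10.5\neast,4\nwest,2\n")

# Same computation, two different storage backends.
print(revenue_by_region(memory_store))
print(revenue_by_region(csv_store))
```

Because `revenue_by_region` only relies on `read()`, replacing the storage technology later means writing a new store class, not rewriting the analysis.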
Now that you’ve assessed your data and technology needs, you will need people to make your data initiative a reality. Good data talent is scarce; great data talent is even scarcer.
There are plenty of great articles on building a data science team. Here’s a good read on the cast of characters you’ll need for a data science team.
I’ll go through some key things I’ve experienced when I built data teams.
Manager. A good team needs a good manager. This person should be good at directing a team of independent and smart people, and let them do what they’re good at. I personally don’t consider myself a great manager, in that I don’t like to handle day-to-day team issues. But I do have a knack for finding people who can self-manage. At that point, I simply make sure the team is moving in the right direction and get out of their way.
I’ve seen the opposite approach – micromanaging and getting in the way of the team – be a disaster. Always remember, smart talent is on lease.
Data scientist. This gets tricky. For a good explanation of the types of data scientists to hire, have a listen to Ziff.ai’s podcast on Hiring Type 1 or Type 2 Data Scientists. It’s definitely worth a listen.
If you’re using a 3rd party data science tool like AWS SageMaker, Data Robot, Google Cloud AI, or similar, then I suggest hiring for domain expertise, versus someone who is expert at tuning models. As data science automation continues to improve, the dark arts of model tuning will not be as important for most businesses. Much of the model tuning will be automated.
The more common paradigm will be using a data scientist’s ability to combine their domain expertise and use AI as a complementary tool.
But, if you have a need to hire a data scientist for deep algorithm work, the race for top talent is insane. And if you hire the top performers, remember their time is on lease. With great certainty, they will leave your company.
One other thing to note – I see a lot of “data science” job postings that look suspiciously like data analyst positions. In a lot of cases – especially if your data science needs aren’t earth-shatteringly sophisticated – you can probably get away with hiring someone to do analysis. This analyst likely already has the requisite skills to easily transition into a “data scientist” role.
Data engineers. I’ll argue that a lot of the hard work in data science is actually the blue collar work of data engineering – data pipelining, data munging, etc. It’s not as if your AI models simply write themselves and output predictions from thin air. Models need data. And data engineering is the area where you hear of data scientists spending 80%+ of their time.
Data engineers help move, transform, clean, and productionize data. They move your data science initiatives from being toys in Jupyter notebooks to full-fledged data pipelines that move high quality data from point A to Z. Without a solid data engineering team, you have data toys.
In house or outside talent. You need to consider whether you have data talent in house, or if people on your team can be trained. Thankfully, if you need to train in house talent, there are plenty of excellent resources available. Online learning platforms such as Coursera, Udacity, edX, DataCamp, and many more provide a good place for leveling up data skills.
If you need to hire outside talent, I suggest taking an honest look at whether your company has the reputation and environment that will let smart people do their best work. What does your Glassdoor rating say about your company? What about your company’s LinkedIn? Are you attending data and technology meetups and giving back to the community? What do your employees tell you about their workplace experiences? Is your company data-driven? Good candidates look at these data points very carefully.
Top data talent has nearly unlimited high-paying options. And these are precisely the people you need for your data efforts to succeed. Don’t skimp on talent, as hiring weak talent will only ensure that you’re unable to attract top talent. Top talent and weak talent are like oil and water; top talent will feel underwhelmed, and weak talent will likely try to sabotage the stronger talent. Instead, level up your organization’s reputation and data-driven culture (if possible) so top talent – and their peers – will flock to work there.
You’ll need to take inventory of your level of executive support, culture, data, technology, and talent. Figure out how these relate to your data science goals. It’s very likely that your long term goals are a year or more in the future. You’re going to need to make a roadmap for how you achieve the people and data goals.
Get key stakeholders – executives, consumers of data, producers of data – in a room for a roadmap discussion. This should be a physical meeting if possible. It will take most of the day. But this meeting is critical. Make liberal use of a whiteboard and have a candid conversation about what you need, when you need it, and how you will get there.
Make the roadmap visible to the entire company. Announce your plan, timeline, and key stakeholders.
Keep the roadmap in a place visible to the organization. It could be as simple as printing out the roadmap and posting it in a high traffic area in your office. Or create a page in Confluence or a similar Wiki, where you keep a current version of the roadmap.
As the roadmap evolves – and it will evolve – communicate these changes to the organization and update your visible roadmap.
Getting data science right is difficult. But the hard work and lumps are worth it. Get the key ingredients in place – executive support, culture, data, technology, and talent. By taking the long view and treating data as a transformative investment in your company’s future, you’ll succeed with data science.
** As a reminder, “Data science, also known as data-driven science, is an interdisciplinary field of scientific methods, processes, and systems to extract knowledge or insights from data in various forms, either structured or unstructured, similar to data mining.” – Wikipedia https://en.m.wikipedia.org/wiki/Data_science
The operating word is data. What I’ve seen is that despite having the brainpower and technology, a lot of companies have a weak culture around data, as well as questionable data.