Busted: 10 Big Data Myths Exploded
If a little bit of data is good, then a lot of data must be great, right? That's like saying if a cool breeze feels nice on a warm summer day, then a tornado will make you feel ecstatic.
Perhaps a better analogy for big data is a high-spirited champion racehorse: With the proper training and a talented jockey up, the thoroughbred can set course records, but minus the training and rider, the powerful animal would never make it into the starting gate.
To ensure your organization's big data plans stay on track, you need to dispel these 10 common misconceptions about the technology.
1. Big data simply means 'lots of data': At its core, big data describes how structured or unstructured data combine with social media analytics, IoT data, and other external sources to tell a "bigger story." That story may be a macro description of an organization's operation or a big-picture view that can't be captured using traditional analytic methods. A simple measure of the volume of data involved is insignificant from an intelligence-gathering perspective.
2. Big data needs to be clean as a whistle: In the world of business analytics, there is no such thing as "too fast." Conversely, in the IT world, there is no such thing as "garbage in, gold out." Just how clean is your data? One way to find out is to run your analytics app, which can identify weaknesses in your data collections. Once those weaknesses are addressed, run the analytics again to highlight the "cleaned up" areas.
3. All the human analysts will be replaced by machine algorithms: The recommendations of data scientists are not always implemented by the business managers on the front lines. Industry executive Arijit Sengupta states in a TechRepublic article that the proposals are often more difficult to put in place than the scientists project. However, relying too much on machine-learning algorithms can be just as challenging. Sengupta says machine algorithms tell you what to do, but they don't explain why you're doing it. That makes it difficult to integrate analytics with the rest of the company's strategic planning.
4. Data lakes are a thing: According to Toyota Research Institute data scientist Jim Adler, the huge storage repositories housing massive amounts of structured and unstructured data that some IT managers envision simply don't exist. Organizations don't indiscriminately dump all their data into one shared pool. The data is "carefully curated" in a department silo that encourages "focused expertise," Adler states. This is the only way to deliver the transparency and accountability required for compliance and other governance needs.
5. Algorithms are infallible prognosticators: Not long ago, there was a great deal of hype about the Google Flu Trends project, which claimed to predict the location of influenza outbreaks faster and more accurately than the U.S. Centers for Disease Control and other health information services. As the New Yorker's Michele Nijhuis writes in a June 3, 2017, article, it was thought that people's searches for flu-related terms would accurately predict the regions with impending outbreaks. In fact, simply charting local temperatures turned out to be a more accurate forecasting method.
Google's flu-prediction algorithm fell into a common big data trap: it made meaningless correlations, such as connecting high school basketball games and flu outbreaks because both occur during the winter. When data mining is operating on a massive set of data, it is more likely to encounter relationships among information that is statistically significant, yet entirely pointless. An example is linking the divorce rate in Maine with the U.S. per capita consumption of margarine: there is indeed a "statistically significant" relationship between the two numbers, despite the lack of any real-world significance.
6. You can't run big data apps on virtualized infrastructure: When "big data" first appeared on people's radar screens about 10 years ago, it was synonymous with Apache Hadoop. As VMware's Justin Murray writes in a May 12, 2017, article on Inside Big Data, the term now encompasses a range of technologies, from NoSQL (MongoDB, Apache Cassandra) to Apache Spark.
Critics previously questioned the performance of Hadoop on virtual machines, but Murray points out that Hadoop scales on VMs with performance comparable to bare metal, and it utilizes cluster resources more efficiently. Murray also blows up the misconception that the basic features of VMs require a storage area network (SAN). In fact, vendors frequently recommend direct attached storage, which offers better performance and lower costs.
7. Machine learning is synonymous with artificial intelligence: The gap between an algorithm that recognizes patterns in massive amounts of data and one that is able to formulate a logical conclusion based on the data patterns is more like a chasm. ITProPortal's Vineet Jain writes in a May 26, 2017, article that machine learning uses statistical interpretation to generate predictive models. This is the technology behind the algorithms that predict what a person is likely to buy based on past purchases, or what music they may like based on their listening history.
As clever as these algorithms may be, they are a far cry from achieving the goal of artificial intelligence, which is to duplicate human decision-making processes. Statistics-based predictions lack the reasoning, judgment, and imagination of humans. In this sense, machine learning may be considered a necessary precursor of true AI. Even the most sophisticated AI systems to date, such as IBM's Watson, can't provide the insights into big data that human data scientists deliver.
8. Most big data projects meet at least half their goals: IT managers know that no data-analysis project is 100-percent successful. When the projects involve big data, the success rates plummet, as shown by the results of a recent survey by NewVantage Partners (pdf). While 95 percent of the business leaders surveyed said their companies had engaged in a big data project over the past five years, only 48.4 percent of the projects had achieved "measurable results."
In fact, big data projects rarely get past the pilot stage, according to the results of Gartner research released in October 2016. The Gartner survey found that only 15 percent of big data implementations are ever deployed to production, which is relatively unchanged from the 14 percent success rate reported in the previous year's survey.
9. The rise of big data will reduce demand for data engineers: If a goal of your organization's big data initiatives is to minimize the need for data scientists, you may be in for an unpleasant surprise. The 2017 Robert Half Technology Salary Guide indicates that annual salaries for data engineers have jumped to an average between $130,000 and $196,000, while salaries for data scientists are now between $116,000 and $163,500 on average, and salaries for business intelligence analysts currently average from $118,000 to $138,750.
10. Employees and line managers will embrace big data with open arms: The NewVantage Partners survey found that 85.5 percent of the companies participating are committed to creating a "data-driven culture." However, the overall success rate of new data initiatives is only 37.1 percent. The three obstacles cited most often by these companies are insufficient organizational alignment (42.6 percent), lack of middle management adoption and understanding (41 percent), and business resistance or lack of understanding (41 percent).
The future may belong to big data, but realizing the technology's benefits will require a great deal of good old-fashioned hard work -- of the human variety.