Top 5 Challenges Faced By Data Scientists

The biggest challenges in data science are what happens around it: finding it, accessing it, shaping it. This article from Tilores explores some top problems in data science.

The truth, the whole truth, and nothing but the truth. Ah, if only solving the issues of data science could be as simple as swearing an oath in court. 

It’s perhaps the most groan-inducing part of any data scientist’s life: how little time they spend actually doing science on data. But while many people accept it’s a bore to gather data from all its sources and silos, fewer realise the problem is plural: misconceptions about what a data scientist’s role rightly covers; the way getting the right answers depends on asking the right questions; the way the shape and form a dataset arrives in influences the outcome.

Data is contextual, after all – and its inbuilt biases are a clear and present danger if not addressed.

So here’s a brief summary of some major challenges facing those who wrangle and mangle data analytics today. If you’re not among them, it may help you understand why they’re constantly short of sleep and missing clumps of hair. And if “Data Scientist” is your job description, sharing this article may help others understand what you’re up against. 

Here’s what we think are the biggest challenges in data analytics.

Challenge #1: digging it up in the first place 

There’s a reason it’s called “data science” and not something less difficult-sounding, like “data sport” or “data fun”. (Although we’re not against those options at Tilores.) The scientific method is all about finding fundamental truths in our messy reality … and the research to uncover those truths is often arduous. That’s science: doing it is painful.

Marie Curie won the Nobel – twice – but died of the radiation exposure her research involved. Carl Scheele noted the toxicity of chemicals – by, er, tasting them. Isaac Newton slid a bodkin behind his own eye to research optics. And you can guess what happened to early testers of the diving bell and parachute.

Data is everywhere in your business, but that doesn’t make it easier to work with. In fact, the No. 1 problem of data analytics is just getting your hands on it. A large percentage of any DS’s life is spent tracking down who owns what and getting permission to use it, stymied by multiple layers of bureaucracy: privacy legislation, company T&Cs, and the worries of individual data owners within the organisation. And that’s before you even know whether the data you’re looking for exists in any form.

This is the biggest data problem of all. We can’t – sorry – provide a complete solution. But if you read on, you might find an idea that helps.

Challenge #2: the ballooning role of the DS

Because few people know precisely what your data scientist does, there’s a perception that she does everything to do with data. Including down-and-dirty data preparation tasks like formatting cells on that giant spreadsheet owned by Ivan in Accounts.

Obviously it’s not optimal to use a high-skilled person (and often a high-paid one) to do basic data cleansing tasks – but sometimes it’s only the data scientist who knows with any certainty how to clean that data. So the job often lands on the DS’s desk with a thump.

Data prep and cleansing is hard labour, repetitive and unpleasant. And that’s not what you pay the average data scientist’s £47k salary for. Here’s a teaser: what if you could sidestep the problem by doing data analytics on the data as-is, in its original format? Again: read a bit further. 

Challenge #3: seeing the data in context

Data is rarely a simple set of rows and columns; it has layers. This is a third major headache for data scientists: understanding where that data came from, the purpose it was collected for, and what might be missing from it that absolutely shouldn’t be.

(Remember the story of WWII aircraft coming back with bullet holes in consistently the same places, so managers beefed up the armour plating on those parts? It sounds sensible. But it misses a key bias in the dataset: it’s only about aircraft that made it home. There’s no data on where lost aircraft were being hit by hails of lead – which would’ve led to far more accurate decision making.)

Context, of course, involves looking at the Big Picture: all the conditions and conventions shaping that data. And that’s a further clue to how to deal with it. But before that, two more challenges. 

Challenge #4: contradictory data, conflicting sources

You’ve got what looks like the perfect dataset of customer information: up-to-date, complete, no missing fields or records. And then you get bushwhacked by a dataset containing the same list of names – but with very different records. Instead of a Single Version of the Truth, you’ve got Multiple Versions of what might be True.

Do they have equal value? Are they equally accurate, but in different ways? Are the records for a given name or company even the same person or organisation? Between total accuracy and total uselessness is a vast landscape of not-quite-wrong. And it’s another source of sleepless nights for those who analyse data. 

Continuing our ongoing tease: what if you could use that not-quite-wrongness to your advantage, instead of it being an obstacle? Yes, that’s where this blog is going. 

Challenge #5: being conscientious about confidence

Data scientists are cautious people, not prone to making unsupportable claims. Scientifically, this is the right approach. But business units are often tone-deaf to anything less than strident certainty, and rarely like to be told their cherished strategies are built on shaky foundations.

This is our last challenge for data scientists: making their quietly-reasonable voices heard amid the hubbub of humblebragging that makes up much of the average business. It wouldn’t be so bad if that uncertainty could be quantified. But often getting to that 95% CI is a huge task in itself.
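Quantifying that uncertainty needn’t always be a huge task, though. As a minimal sketch – with entirely made-up numbers standing in for a real business metric – here’s a percentile bootstrap: resample your data with replacement many times, compute the mean of each resample, and read the 95% confidence interval straight off the resulting distribution.

```python
import random
import statistics

random.seed(42)

# Hypothetical metric: conversion rates observed across 30 customer segments.
sample = [random.gauss(0.12, 0.03) for _ in range(30)]

def bootstrap_ci(data, n_resamples=5000, alpha=0.05):
    """Percentile bootstrap CI for the mean: resample with replacement,
    collect the resampled means, and take the alpha/2 quantiles."""
    means = sorted(
        statistics.fmean(random.choices(data, k=len(data)))
        for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

lo, hi = bootstrap_ci(sample)
print(f"mean = {statistics.fmean(sample):.3f}, 95% CI = ({lo:.3f}, {hi:.3f})")
```

The appeal for a time-pressed data scientist is that the bootstrap makes almost no assumptions about the metric’s distribution – the hard part remains getting trustworthy data into `sample` in the first place.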

We’ve been teasing you too long. Here’s the Tilores answer to many challenges facing today’s data scientist. It’s called entity resolution.

Entity resolution in a nutshell

ER takes the downsides of data – existing in silos, the need to prep and clean, its missing context and contradictions, and the trouble of supporting your findings – and turns them into upsides. It does this by looking at all your data, in its original form, with all its duplications and shades of grey … and seeing what links it.

Perhaps the Engineering team has a database of suppliers, Product Design has another, and Purchasing has its own, too. Each in a silo, each with a different set of fields per record. But by looking at all three at once, entity resolution sees the connections between disparate records. Engineering’s “Arc Metals” shares a postcode with “Arcolan Metal Fabrications” from Purchasing, so it concludes they’re the same company, or “entity”. Product Design’s “John from Arcolan” then adds a person to that organisation, building a single view of the data all departments can share.
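The matching step above can be sketched in a few lines of Python. This is an illustration only – the record layouts, thresholds, and matching rules are ours, not how Tilores actually implements entity resolution – but it shows the core idea: link records that share an identifier (here, a postcode), then attach related records by fuzzy name match, without reformatting any silo’s data.

```python
from difflib import SequenceMatcher

# Hypothetical supplier records from three departmental silos,
# each with its own field names -- used as-is, no reformatting.
engineering = [{"name": "Arc Metals", "postcode": "SW1A 1AA"}]
purchasing = [{"supplier": "Arcolan Metal Fabrications", "post_code": "SW1A 1AA"}]
product_design = [{"contact": "John", "company": "Arcolan Metal Fabrications"}]

def name_similarity(a, b):
    """Crude fuzzy match on lower-cased names (0.0 to 1.0)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def resolve(eng, pur):
    """Link records that share a postcode into a single 'entity'."""
    entities = []
    for e in eng:
        for p in pur:
            if e["postcode"] == p["post_code"]:
                entities.append({
                    "names": {e["name"], p["supplier"]},
                    "postcode": e["postcode"],
                })
    return entities

entities = resolve(engineering, purchasing)

# Attach Product Design's contacts to matching entities by company name.
for person in product_design:
    for ent in entities:
        if any(name_similarity(person["company"], n) > 0.8 for n in ent["names"]):
            ent.setdefault("people", []).append(person["contact"])
```

Real entity resolution replaces the exact postcode rule and the crude string ratio with far more robust matching, but the shape of the problem – finding links between records that were never designed to join – is the same.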

Most importantly, entity resolution (the Tilores flavour of it, anyway) does this in real time with low computational overhead. That answers several of the issues above, since it connects data without the need to transform it, reformat it, or in some cases even move it.

That’s entity resolution. It won’t answer all the headaches of data science and data analytics. But it can stop them turning into blinding migraines. If that sounds of interest, check out https://tilores.io/.
