Earlier studies by Yi He focused predominantly on statistical issues. Nowadays, he is more concerned with data science, acknowledging that, at the end of the day, he is more of a data scientist than a mathematician. So what does he see as the main difference? 'I base my ideas and theories on real-world situations and events. The research done by mathematicians is at a much higher level of abstraction,' the ASE lecturer explains.
Challenging traditional ways of thinking and working
He breaks into a smile when he talks about discussions with colleagues addressing the way data scientists think and work. 'Many data scientists use very clear and understandable mathematical solutions to formulate answers to concrete financial questions. But when you look closely at these mathematical solutions, you notice that they’re actually rather simplistic ‒ to the point of not giving you a very reliable answer. So I cast doubt on such solutions. To me, data science is all about using data in the best possible way to answer questions. And I believe it means moving off the beaten track. It really feels great to be doing this kind of research.'
One of the things he is examining is what he refers to as the 'big data problem'. He goes on to clarify: 'We often collect a lot of personal data about a great many people in a very short time. But the world today looks different from the world yesterday. When we use yesterday’s data to say something about today or tomorrow, we create illusions. The answers might well be very appealing and easy to understand but they’re wrong because the data used was already out of date and the tools that processed the data just assume it’s correct without making sense of it.'
Predictions and averages
The lecturer continues: 'We very often work with averages since these are simple and clear. But each average has a small margin of error. If you reduce big data sets to a large number of averages, then the error margins end up being huge. And if you then use these ‘wrong’ averages to make predictions, the margin of error keeps growing and we create answers that are even more wrong. These are the illusions I referred to earlier.'
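The growth of the error margin can be illustrated with a small calculation. This is only a sketch under stated assumptions, not the lecturer's own method: it assumes the averages are independent, that each is computed from n observations with standard deviation sigma, and that they are simply added together. The function names are hypothetical.

```python
import math


def standard_error(sigma, n):
    """Standard error of one average of n observations with standard deviation sigma."""
    return sigma / math.sqrt(n)


def combined_error(sigma, n, k):
    """Standard error of the sum of k independent averages (an illustrative assumption)."""
    return math.sqrt(k) * standard_error(sigma, n)


if __name__ == "__main__":
    # One average of 100 observations versus 400 such averages added together.
    print(combined_error(1.0, 100, 1))
    print(combined_error(1.0, 100, 400))
```

With sigma = 1 and n = 100, a single average has an error of 0.1. Adding up 400 such averages gives an error of 2.0, twenty times larger, even though each individual error looked negligible.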
This is why he argues that data science is in need of new ways of working. 'But what the right method might be depends on the data set and the question to be answered.' He illustrates with an example: 'Econometrics often makes use of weights to determine, let’s say, the best investment strategies. These weights lead you to invest more in some products than others. But the weights themselves are based on previous data analyses. And these also have minimal ‒ but relevant ‒ error margins. As a result, weighting sometimes fails to give an optimal investment strategy. Even worse, it sometimes turns out to be financially wiser to spread your money equally across all investment products!'
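The point about weights can be sketched with a toy simulation. It is an illustration under strong assumptions, not the analysis described in the article: the investment products are uncorrelated, all have the same true volatility, and the weights are set inversely proportional to each product's estimated variance. All names and parameter values below are hypothetical.

```python
import random
import statistics


def inverse_variance_weights(samples):
    """Weights proportional to 1 / estimated variance, normalised to sum to 1."""
    inverse = [1.0 / statistics.variance(s) for s in samples]
    total = sum(inverse)
    return [x / total for x in inverse]


def true_portfolio_variance(weights, true_variances):
    """Variance of a weighted sum of uncorrelated products."""
    return sum(w * w * v for w, v in zip(weights, true_variances))


def compare(n_products=10, n_obs=24, sigma=0.05, seed=0):
    """Return the true variance of the estimated-weight and equal-weight portfolios."""
    rng = random.Random(seed)
    # Hypothetical return history: every product has the same true volatility,
    # so any difference between estimated variances is pure estimation error.
    history = [[rng.gauss(0.0, sigma) for _ in range(n_obs)]
               for _ in range(n_products)]
    estimated = inverse_variance_weights(history)
    equal = [1.0 / n_products] * n_products
    true_variances = [sigma ** 2] * n_products
    return (true_portfolio_variance(estimated, true_variances),
            true_portfolio_variance(equal, true_variances))


if __name__ == "__main__":
    estimated_risk, equal_risk = compare()
    print(estimated_risk, equal_risk)
```

In this setup equal weighting is the best possible choice, so the weights estimated from noisy history can only match it or do worse. When the products genuinely differ, estimated weights can win, which is exactly the question of when to use which method.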
In other cases, a weighting methodology does prove to be a shrewd strategy. When should a data scientist use a particular method? That’s what he is examining at present.
Step one: develop a new mathematical theory
He acknowledges that all predictions can be expected to have margins of error. But in some cases, more complex methods work better while in others 'simplicity is king'. Which method should be applied to which data set? 'That’s the issue we’re now trying to resolve.' He is doing so with a new mathematical theory he is currently developing. 'It’s not, after all, simply a question of how, but just as much one of why. We need to understand why complex data models give us the best predictions in some situations but not others.' The lecturer concludes: 'It will save us a lot of time ‒ and spare us many illusions.'