# Beware of weak links in large datasets

One of the defining characteristics of the last few years is the increasing availability of statistical information on many aspects of individual decisions and participation in economic and financial life. This has been a boon to statisticians: since the advent of statistical tools, the premise of data-based inference, model estimation and forecasting has been that more data (more evidence) is better than less.

Firms that harness the potential of large datasets are better placed to evaluate the relevance of their policy decisions, the risks they incur and the consequences of those risks.

Yet matters are not so simple. An argument showing that increasing data availability does not necessarily lead to better statistical decision making can be traced to the work of Benoît Mandelbrot (e.g. Mandelbrot, 1967), who introduced the odd patterns that have come to be called fractals. Their applications span all fields of human activity. Fractals are much more complex objects than were known before, and Mandelbrot (and the literature since) has shown how to handle them.

In the context of data (variables) recorded over time, these fractals may take the form of what is called self-similarity or long memory. This is a feature of variables that depend heavily on their own past, more so than “standard” variables. Unfortunately, this dependence over time invalidates the standard use of the Central Limit Theorem, on which much of statistical inference relies, and hence makes estimation, inference and forecasting more difficult.
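To make "more dependent on the past than standard variables" concrete, the sketch below compares the theoretical autocorrelations of a textbook long memory process, fractionally integrated noise with memory parameter `d`, with those of a standard AR(1) process that has the same lag-1 autocorrelation. The choice of process, the value `d = 0.4` and the function names are assumptions made for illustration only; they are not taken from the papers cited in this article. The long memory autocorrelations decay so slowly that their sum keeps growing with the number of lags, which is the reason the usual Central Limit Theorem arguments break down.

```python
import numpy as np


def fractional_noise_acf(d: float, max_lag: int) -> np.ndarray:
    """Theoretical autocorrelations of fractionally integrated noise.

    Uses the recursion rho(k) = rho(k-1) * (k - 1 + d) / (k - d) with
    rho(0) = 1. Long memory corresponds to 0 < d < 0.5 (illustrative choice).
    """
    rho = np.ones(max_lag + 1)
    for k in range(1, max_lag + 1):
        rho[k] = rho[k - 1] * (k - 1 + d) / (k - d)
    return rho


def ar1_acf(phi: float, max_lag: int) -> np.ndarray:
    """Theoretical autocorrelations of a stationary AR(1): rho(k) = phi**k."""
    return phi ** np.arange(max_lag + 1)


d = 0.4          # assumed memory parameter, chosen for illustration
max_lag = 1000

rho_long = fractional_noise_acf(d, max_lag)
phi = rho_long[1]                    # match the lag-1 autocorrelation, d / (1 - d)
rho_short = ar1_acf(phi, max_lag)

# Hyperbolic decay (long memory) versus geometric decay (short memory).
for k in (1, 10, 100, 1000):
    print(f"lag {k:>4}: long memory {rho_long[k]:.4f}   AR(1) {rho_short[k]:.2e}")

# The AR(1) autocorrelations sum to a finite number, roughly 1 / (1 - phi);
# the long memory ones keep accumulating as more lags are added.
print("sum of autocorrelations up to lag 1000:")
print(f"  long memory: {rho_long.sum():.1f}")
print(f"  AR(1):       {rho_short.sum():.1f}")
```

Both series look equally persistent from one period to the next, yet after 100 periods the AR(1) has effectively forgotten its past while the long memory process has not.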

Two well-known sources of such long memory are called the Joseph and the Noah effects. The Joseph effect originates from the low-frequency variations in the flooding levels of the Nile river: in the Bible, Joseph interpreted Pharaoh’s dream of seven fat cows followed by seven lean cows as indicating that the cows represented the produce of the crops in the Nile Valley (Hurst, 1954, was the first to show that the Nile indeed presented long memory features). The Noah effect is that a rare event of large magnitude (e.g. the Deluge) can have effects that are felt over a very long period of time. Long memory is a very pervasive phenomenon, and a quick search on Google Scholar returns over 70,000 articles.

As long memory is a feature that relates to dependence over time, it is never considered when working with cross-sectional data, i.e. samples of observations recorded across statistical units (firms, agents, locations, ...) rather than over time. Yet, in Chevillon, Hecq and Laurent (2018, Journal of Econometrics), we prove an odd and interesting result: there is a link between dependence between individuals (cross-section dependence) and dependence over time (in the form of long memory) for a single individual object. Specifically, we show that long memory (for one variable over time) can arise as a consequence of very small dependence (at one point in time) between a large number of individual variables. This proves in particular that there exist situations in which the Joseph and Noah effects are two sides of the same coin, depending on how the modeler approaches the problem at hand.

Why is this important? For several reasons, all of which concern researchers and practitioners working with large datasets, or with small datasets that may be linked to many other data. This result shows how radically different big data is from small data and why new techniques have to be developed for large datasets. Modelers have to use specific tools whenever their data are possibly related, even very weakly, to a very large number of other variables.

REFERENCE:

Chevillon, Hecq & Laurent (2018). Generating Univariate Fractional Integration within a Large VAR(1). Journal of Econometrics, 204(1), pp. 54-65.