Structural Modeling - From Correlation to Causality
"The difference between a data newbie and a scientist is that the scientist comes to conclusions with extreme reluctance."
- Cassie Kozyrkov, Chief Decision Scientist, Google
I spend a ton of my time observing data science students, professionals, professors, and researchers. In most situations, I see data scientists running their datasets through many models, searching for the elusive one that delivers the best fit. Unsurprisingly, this "throw everything and see what sticks" method leads to erroneous recommendations. Very little thought is given to the business situation, risk, and economic theory behind the chosen modeling approaches or recommendations. Additionally, recommendations are made based on correlations when the business problem demands a causal explanation.
This focus on models that offer causal explanations based on econometric theory is what distinguishes a decision scientist from a newbie data scientist. The decision scientist goes that extra step to ensure the theoretical validity of the model in the context of the specific problem in question.
In this post, I'll introduce you to structural modeling - an important tool in the decision scientist's toolkit.
What are Structural Models?
Structural models differ from purely statistical models in that they require a structural relationship between the predictor and the response variable, that is, a relationship grounded in established scientific constructs. For example, demand curves in economics establish a relationship between price and quantity demanded.
Why are Structural Models Important in Decision Sciences?
Decision sciences differs from standard machine learning or data science problems in that the analysis or model feeds a business decision with real costs and benefits. When recommending a decision, it is important that a causal relationship (not just a correlation) exists between the predictor and the response variable. A causal model ensures two things:
The causal model ensures greater confidence that an action taken will yield the desired result
The causal measurements ensure that the decision scientist takes the right step in verifying the validity of the model in the given business conditions
Decision sciences draws on theories from other fields, such as social science, econometrics, and marketing, to provide theoretical foundations for the statistical models being developed.
An example of structural modeling
For the sake of argument, let's take a simple decision science problem that seeks to understand how to improve the sales of a product.
Product Sales = a0 + a1*price + a2*marketing + a3*salesforce + error
From a normal data science perspective, we would run a linear regression model and check that the coefficients are statistically significant. Let's say, for example, that the output of our model looks like this and all the coefficients are statistically significant (p-value < 0.005):
Product Sales = 1000 + 10*price + 7*marketing + 20*salesforce + error
We gather from this example that raising price by one unit would increase sales by 10 units. Similarly, raising marketing or sales force by one unit would raise product sales by 7 and 20 units, respectively, holding the other variables fixed.
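To make the reading of the coefficients concrete, here is a minimal sketch in Python. It assumes synthetic data that I generate from the same coefficients as the example equation; the value ranges, the noise level, and the use of NumPy's least-squares solver are my own choices for illustration, not part of any real dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Hypothetical, synthetic inputs; none of these numbers come from real data.
price = rng.uniform(5, 15, n)
marketing = rng.uniform(0, 100, n)
salesforce = rng.uniform(1, 20, n)

# Generate sales from the same coefficients as the example equation.
noise = rng.normal(0, 5, n)
sales = 1000 + 10 * price + 7 * marketing + 20 * salesforce + noise

# Design matrix with a leading column of ones for the intercept a0.
X = np.column_stack([np.ones(n), price, marketing, salesforce])
coef, *_ = np.linalg.lstsq(X, sales, rcond=None)
a0, a1, a2, a3 = coef

# Each slope is the change in predicted sales for a one-unit change
# in that predictor, holding the others fixed.
base = np.array([1.0, 10.0, 50.0, 10.0])
bumped = base.copy()
bumped[1] += 1.0  # raise price by one unit
delta = bumped @ coef - base @ coef
print(f"a1 = {a1:.2f}, change in predicted sales = {delta:.2f}")
```

The change in predicted sales equals the price coefficient itself: a fixed number of units added, not a multiplier. Note that the sketch only recovers the numbers we put in. It says nothing about whether those numbers make business sense.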
Let's admit it: all of us have, at some point, jumped in and said, "Let's raise the sales force numbers for maximum impact."
So why should we be reluctant here?
First, by utilizing a linear regression model, we assume a linear relationship between the predictor and the response variables. Is that valid for our business problem?
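As a rough illustration of the linearity question, the sketch below assumes a hypothetical case where marketing spend has diminishing returns (sales grow with the logarithm of spend). The data is synthetic and the functional form is my assumption. The point is that the theory about the business suggests the shape, and the data is then used to check it.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 400

# Hypothetical diminishing returns: sales grow with log(marketing spend).
marketing = rng.uniform(1, 100, n)
sales = 200 + 50 * np.log(marketing) + rng.normal(0, 5, n)

def r_squared(x, y):
    """R^2 of a simple least-squares line y ~ b0 + b1 * x."""
    X = np.column_stack([np.ones(len(x)), x])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    return 1.0 - resid.var() / y.var()

# Fit the same response against raw spend and against log spend.
r2_linear = r_squared(marketing, sales)
r2_log = r_squared(np.log(marketing), sales)
print(f"R^2 linear: {r2_linear:.3f}, R^2 log: {r2_log:.3f}")
```

A straight line in raw spend still fits reasonably well here, which is exactly why a decent fit alone should not settle the question of functional form.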
Second, how do we establish the causality of the model? Does an increase in sales force cause an increase in sales or is it just a correlation? Especially in the age of big data, random correlations can arise in the dataset.
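To see how easily random correlations arise, here is a small simulation. It is a sketch under my own assumptions: 1,000 predictors and one response, all drawn as independent noise, with a rough 5% significance cutoff for the correlation coefficient.

```python
import numpy as np

rng = np.random.default_rng(42)
n_obs, n_features = 100, 1000

# Response and predictors are independent noise: no true relationship exists.
y = rng.normal(size=n_obs)
X = rng.normal(size=(n_obs, n_features))

# Pearson correlation of every column of X with y.
Xc = (X - X.mean(axis=0)) / X.std(axis=0)
yc = (y - y.mean()) / y.std()
corr = Xc.T @ yc / n_obs

# |r| > 1.96 / sqrt(n) is roughly the 5% two-sided significance cutoff.
threshold = 1.96 / np.sqrt(n_obs)
n_spurious = int(np.sum(np.abs(corr) > threshold))
print(f"{n_spurious} of {n_features} random predictors look 'significant'")
```

With a 5% cutoff we should expect on the order of 50 of the 1,000 noise predictors to pass the test by chance alone. The wider the dataset, the more of these we will find.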
What about the error term? The regression model assumes that the errors are independently and identically distributed and uncorrelated with the predictors. Could there be business circumstances that lead us to believe the error might be correlated with any of the variables?
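Here is one way an error term can end up correlated with a predictor. The sketch assumes a hypothetical unobserved variable, product quality, that raises both price and sales. The true price effect is set to be negative, and every number is an assumption chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000

# Hypothetical unobserved driver, e.g. product quality. When it is left out
# of the model, its effect ends up in the error term.
quality = rng.normal(0, 1, n)

# Higher-quality products are priced higher, so price is correlated with quality.
price = 10 + 2 * quality + rng.normal(0, 0.5, n)

# Assumed true structure: price lowers sales (-5), quality raises sales (+30).
sales = 500 - 5 * price + 30 * quality + rng.normal(0, 5, n)

def ols(columns, y):
    """Least-squares coefficients with the intercept in position 0."""
    X = np.column_stack([np.ones(len(y)), *columns])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

naive = ols([price], sales)           # quality omitted
full = ols([price, quality], sales)   # quality included

print(f"price coefficient, quality omitted:  {naive[1]:.2f}")
print(f"price coefficient, quality included: {full[1]:.2f}")
```

With quality omitted, the price coefficient comes out positive even though the true effect in the simulation is negative. Including the missing variable brings the estimate back close to the assumed value of -5.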
Are there causal scientific theories that refute the model we have built? If so, how do we know our model is not an anomaly and is in fact a robust one that can stand the test of time?
As you can see, even in this simple model, we can easily raise questions that quickly make us realize that the solution does not end with a statistically significant model.
In this specific example, we see price and sales positively related: the price coefficient is +10. A simple look at demand models in economics would tell us that theory suggests the opposite, since higher prices should generally reduce the quantity demanded. Essentially, while the model is statistically significant, it is structurally invalid. Either there is something special about the use case we are dealing with (in that case, let's publish a paper), or we need to investigate the variables and the data and explore what dimension we are missing in the model.
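A structural sanity check like this can be written down explicitly. The sketch below is a hypothetical helper, not a standard library function: it compares the sign of each estimated coefficient against the sign that theory leads us to expect. The expected signs for marketing and sales force are my assumptions for the illustration.

```python
# Signs that theory leads us to expect (assumptions for this illustration):
# demand theory says price should lower sales; marketing and sales force
# effort are assumed to raise them.
expected_signs = {"price": -1, "marketing": +1, "salesforce": +1}

# Coefficients from the fitted example model in this post.
estimated = {"price": 10.0, "marketing": 7.0, "salesforce": 20.0}

def structural_violations(estimated, expected_signs):
    """Return the predictors whose estimated sign contradicts theory."""
    return [
        name
        for name, coef in estimated.items()
        if name in expected_signs and coef * expected_signs[name] < 0
    ]

violations = structural_violations(estimated, expected_signs)
print(violations)
```

For the example model, the only predictor flagged is price. A flag like this is a prompt to investigate the data and the model, not an automatic verdict that the model is wrong.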
Hopefully, this short description piques your interest in Decision Sciences and the wonderful world of structural modeling. This is the first post in my series on structural modeling.
In the next few posts, we will dive into case studies and see how structural modeling is implemented. As part of this series, we will also explore scientific papers and discuss them (in plain English) so you can grasp the theory and apply it to your own decision science problems. Follow me on LinkedIn for updates on this series.