Two Different Environments
In this issue, I am going to talk about a common blind spot that most training bootcamps do not really cover. I do not want to speculate on why they leave it out, but I hope to help my subscribers, who have put their faith in me by subscribing, understand it better.
Specifically, I want to discuss model building and implementation, and how inexperienced data scientists can fall through the gap between the two.
Model Training Environment
In the model training environment, the dataset will usually contain all the variables expected from the business process, i.e. you are seeing the end “product” of the business process of interest, and all the data that can be captured has been captured.
Why is that the case? Assuming a certain level of data maturity in the organization, all the data it captures should ultimately flow into a single data warehouse. From there, the relevant datasets flow to the different data marts, where the data scientist can extract them as a whole. Thus the data scientist has access to the full dataset.
From here on, the data scientist will start the model training phase, experimenting with the different dimensions and the hyper-parameters of the chosen modeling methods. However, they may be “blinded” to the fact that the dataset they have on hand to train the model is the sea, which many tributaries of water have flowed into.
Here is a summary look:
Model Implementation Environment
Let me explain further here. Assume that the Final Model contains three variables, X1, X2, and X3, and that the dataset containing these three variables is called Dataset A.
In the actual business environment, this might happen.
X1 is collected well before the point where the model is intended to be used for decision-making. Ok, no issues: X1 can feed into the model, since the model is deployed later in the business process.
X2, however, is collected only after the point in the business process where the model is deployed. What does this mean? Well, in the model training environment, X2 is available for sure. However, it is NOT available in the implementation environment to be fed into the model for decision-making. Alas, no matter how significant X2 is in the final model, it cannot be used at all. This means that a new model will have to be built, and all the validation effort and the model’s implementation approval will have to be gone through again! Urgh!
Now X3 is a composite variable, obtained by combining X4 and X5. Good news! Like X1, X4 and X5 are both collected before the point where the model is deployed. However, X5 sits on a different branch of the process, one that goes back to the original process. From a data pipeline perspective, X5 only becomes available after the model is deployed, which means X3 cannot be used either!
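To make the X1, X2, X3 walkthrough concrete, here is a minimal Python sketch of how one might check feature availability before finalizing a model. Everything in it is an assumption for illustration: the `Feature` class, the helper functions, and especially the step numbers, which are made up and simply encode "earlier" or "later" in the business process relative to a hypothetical decision point at step 3.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Feature:
    name: str
    # Step in the business process at which the value arrives in the data
    # pipeline. Left as None for composite variables.
    available_at: Optional[int] = None
    # Names of the variables a composite is built from (empty for raw ones).
    parents: Tuple[str, ...] = ()


def availability_step(name: str, catalog: Dict[str, Feature]) -> int:
    """Return the step at which a variable can actually be fed to the model."""
    feature = catalog[name]
    if feature.parents:
        # A composite is only ready once its slowest component has arrived.
        return max(availability_step(p, catalog) for p in feature.parents)
    if feature.available_at is None:
        raise ValueError(f"No availability step recorded for {name}")
    return feature.available_at


def unusable_features(
    model_features: List[str], catalog: Dict[str, Feature], decision_step: int
) -> List[str]:
    """List the model variables that arrive after the decision point."""
    return [
        name
        for name in model_features
        if availability_step(name, catalog) > decision_step
    ]


# Hypothetical step numbers: the model makes its decision at step 3.
DECISION_STEP = 3
CATALOG = {
    f.name: f
    for f in [
        Feature("X1", available_at=1),        # collected well before the decision
        Feature("X2", available_at=5),        # collected after the decision
        Feature("X4", available_at=2),        # collected before the decision
        Feature("X5", available_at=4),        # reaches the pipeline too late
        Feature("X3", parents=("X4", "X5")),  # composite of X4 and X5
    ]
}

blocked = unusable_features(["X1", "X2", "X3"], CATALOG, DECISION_STEP)
print(blocked)  # ['X2', 'X3']
```

Note that the check uses the step at which a value reaches the pipeline, not the step at which it is collected. That distinction is what rules out X5, and with it X3.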
Here is a summary:
The Two Environments Circumstance
The point I want to illustrate here is this. There are two environments that model builders need to pay attention to when training any models. In the model training environment, data is static and the dataset you see is like a sea. What you see is what you get.
However, in a model implementation environment, where you are deploying the model, the columns of data that seem static in the training environment are now the tributaries of water that flow into the sea. A model builder will have to consider how the data columns flow through the business process, and where in that process the model will be deployed for decision-making, before finalizing the model.
That Blindspot
This is the common blind spot that most fresh data scientists fall into. Again, I will not speculate on why it is not shared during the programs and bootcamps they attend.
As a model builder, be mindful that although we see the ‘sea’ of data during model training, in the actual business environment where we deploy the model, data comes in like tributaries, building into rivers and then the sea. The model builder will have to be mindful when selecting the columns of data for model training, or at least consider where in the business process the decision-making model should be deployed, so that it has maximum impact and all of its variables are available when the decision is calculated.
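The two options in the paragraph above can be sketched in a few lines of Python: either fix the deployment point and keep only the columns that have arrived by then, or fix the columns and find the earliest point where the model could run. This is only a rough sketch. The function names and the step numbers in `feature_steps` are hypothetical, and in practice this mapping would have to come from talking to the people who own the business process and the data pipeline.

```python
from typing import Dict, List

# Hypothetical mapping: variable name -> step in the business process at
# which its value reaches the data pipeline.
feature_steps: Dict[str, int] = {"X1": 1, "X2": 5, "X4": 2, "X5": 4}


def columns_usable_at(steps: Dict[str, int], decision_step: int) -> List[str]:
    """Option 1: the deployment point is fixed, so select the columns to fit it."""
    return sorted(name for name, step in steps.items() if step <= decision_step)


def earliest_deployment_step(steps: Dict[str, int], wanted: List[str]) -> int:
    """Option 2: the columns are fixed, so find the earliest workable decision point."""
    return max(steps[name] for name in wanted)


# Deploying at step 3 leaves only X1 and X4 as candidate columns.
candidates = columns_usable_at(feature_steps, decision_step=3)

# A model that needs X1, X4 and X5 cannot make decisions before step 4.
earliest = earliest_deployment_step(feature_steps, ["X1", "X4", "X5"])

print(candidates, earliest)
```

Running the first check before model training starts is far cheaper than discovering the problem after validation and approval.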
What are your thoughts? Do comment below! :)
Consider supporting my work by commenting, liking, sharing, subscribing, and making a “book” donation at buymecoffee! Thanks in advance!