Research has shown that weather strongly influences food sales: it affects people's emotional state and, in turn, their purchase decisions. As a result, many retail food chains have started using weather data to produce short-term sales forecasts, minimizing stock-outs and expired products while avoiding missed sales. In this work, we built a system to predict food sales for a supermarket chain in São Paulo, Brazil. We took into consideration the temporal granularity of the sales data, the input variables used to predict sales, and the representation of the sales output variable. We determined which machine learning algorithms are better suited to food sales prediction, used appropriate measures to evaluate their accuracy, and succeeded in predicting the sales of some weather-sensitive products such as beverages.
In today's highly competitive and constantly changing business environment, the accurate and timely estimation of future sales, also known as sales prediction or sales forecasting, can offer critical knowledge to companies involved in the manufacturing, wholesale, or retail of products. Short-term predictions mainly help in production planning and stock management, while long-term predictions help in business development decision making. In our specific case, we work with a fast-food chain in Brazil with 400 stores that has difficulty predicting its short-term production.
Sales prediction is particularly important for this company due to the short shelf life of many of its products, which leads to loss of income in both shortage and surplus situations: producing too much wastes products, while producing too little means lost sales opportunities. Correctly predicting how much of each item to produce each day is therefore essential.
Moreover, consumer demand for food fluctuates constantly due to factors such as price, promotions, changing consumer preferences, and weather. Sales prediction is typically done arbitrarily by managers, but skilled managers are hard to find and not always available. In our specific case the forecast is based on experience, yet it remains far from accurate: the average loss (producing too much or too little) hovers around 10%.
Here it is important to present the management perspective. In their view, the company currently relies too much on individual managers (who may get sick or take leave), and they would like a computer system that can play the role of a skilled manager. Over time, the expectation is to have a tool that frees the company from this human dependence. In addition, they believe the current level of error is high and could be reduced.
Therefore, from the management perspective, a system capable of predicting sales would be worth having even if, at the beginning, it did not perform better than the current process; equal performance would be acceptable. There was also an understanding that the system would improve its performance over time as more historical data is added to its reference database (the machine learning effect).
The problem, then, is how to build a model that predicts demand with accuracy equal or superior to the current process and improves over time. One way to build such a system would be to model the expert knowledge of skilled managers within a computer system. Alternatively, we could exploit the wealth of sales data and related information to automatically construct accurate sales prediction models via machine learning techniques. The latter is a much simpler process, is not biased by the particularities of a specific sales manager, and is dynamic, meaning it can adapt to changes in the data. Furthermore, it has the potential to surpass the prediction accuracy of a human expert, who is typically imperfect.
Nevertheless, we listened to the people currently in charge of making this forecast, who told us they believe demand is correlated with the following factors:
- Day of the month (payment days usually see higher demand)
- Day of the week (Fridays, Saturdays, Sundays, and holidays usually see high demand)
- Month (holiday months usually have higher sales; in Brazil, December, January, June, and July)
- Weather (temperature, rain, and sun affect what people eat)
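These calendar factors can be derived directly from the date. Below is a minimal sketch in Python (the actual pipeline was implemented in R; the column names are illustrative assumptions):

```python
import pandas as pd

# One row per day over the study period (January 2018 through January 2019).
sales = pd.DataFrame({"date": pd.date_range("2018-01-01", "2019-01-31", freq="D")})

sales["day_of_month"] = sales["date"].dt.day          # captures payday effects
sales["day_of_week"] = sales["date"].dt.dayofweek     # 0 = Monday ... 6 = Sunday
sales["is_high_demand_day"] = sales["day_of_week"] >= 4   # Fri, Sat, Sun
sales["month"] = sales["date"].dt.month
sales["is_holiday_month"] = sales["month"].isin([12, 1, 6, 7])  # Dec, Jan, Jun, Jul
```

Public holidays themselves would require a Brazilian holiday calendar, which is left out of this sketch.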
The company sells its food through several channels: 1) directly from its stores, 2) through a web delivery service, and 3) through a call center. In our study we do not differentiate between these channels; we simply count the total volume of each item sold each day.
It is worth mentioning that the insights provided by the people currently in charge of the process should be taken with a grain of salt, given that they know such a system would be built to replace them. All of these assumptions must therefore be checked against the hard data.
To build our model, we decided to use as a sample the sales in the city of São Paulo, which alone accounts for almost 40% of the total. This is a simplifying strategy: if the process works for this city, we can easily deploy it in the others.
Getting the data
Initially, we obtained from the company the sales by item type per day for thirteen months (January 2018 through January 2019), totaling 396 records.
Secondly, we obtained the weather station measurements for São Paulo for the whole of 2018 and January 2019. This is public information available online.
We then prepared the data by joining these two files on the date.
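The join can be sketched as follows (a Python sketch of what the R pipeline does; the real file layouts and column names are assumptions, with tiny inline frames standing in for the two files):

```python
import pandas as pd

# Illustrative stand-ins for the company sales file and the
# public weather-station file (column names are assumptions).
sales = pd.DataFrame({
    "date": pd.to_datetime(["2018-01-01", "2018-01-02"]),
    "item": ["beverage", "beverage"],
    "units_sold": [120, 95],
})
weather = pd.DataFrame({
    "date": pd.to_datetime(["2018-01-01", "2018-01-02"]),
    "temp_max": [29.5, 31.0],
    "rain_mm": [0.0, 12.4],
})

# A left join on date keeps every sales record even when a weather
# measurement is missing, so gaps surface as NaN for later cleaning.
merged = sales.merge(weather, on="date", how="left")
```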
Evaluating the data
Accuracy and completeness: A visual inspection showed that the daily sales data was basically correct, although some values seemed too high or too low. The weather measurements had completeness problems: several days were missing insolation, temperature, and humidity readings. In addition, on several days the recorded rainfall is zero, which is a problem because we cannot tell whether the measurement was not recorded or it simply did not rain on those days.
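A completeness check of this kind can be sketched as follows (Python sketch of the evaluation done in R; column names and the tiny illustrative frame are assumptions):

```python
import numpy as np
import pandas as pd

# Illustrative weather frame showing the two completeness problems:
# explicitly missing readings (NaN) and ambiguous zero rainfall.
weather = pd.DataFrame({
    "insolation": [5.2, np.nan, 0.0, 6.1],
    "temp_med":   [24.0, np.nan, 25.5, 23.8],
    "humidity":   [70.0, 68.0, np.nan, 72.0],
    "rain_mm":    [0.0, 3.5, 0.0, 0.0],
})

missing_per_column = weather.isna().sum()         # explicit gaps per column
zero_rain_days = int((weather["rain_mm"] == 0).sum())  # ambiguous zeros
```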
Cleaning the data
To deal with these inconsistencies, we defined the following strategies:
- Regarding the missing precipitation values, we looked up the average monthly rainfall (public information) and checked it against the sum of the daily values per month in our database. Through this process we confirmed that the zeros did represent days without rain (the data was correct).
- Records where the minimum, maximum, or mean temperature, or the humidity, equaled zero were filled with the mean of the respective parameter (only two samples fell into this scenario).
- For sun intensity (insolation), 196 of the 396 samples equaled zero. Since it is impossible for the sun not to appear on so many days, we assumed a data problem. We checked the average monthly sun intensity for the city of São Paulo (public information) and filled the gaps manually. We subsequently found that the mean of the recorded measurements matched this average, so it was possible to implement code that corrects the gaps automatically.
- Sales values that diverged too much from the typical were treated using a bell (normal) distribution: values whose frequency was smaller than 1% were eliminated from the sample. Note that we did not eliminate the whole row (each row holds the sales of all eight product types); we simply excluded that row when predicting the specific item.
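The mean-imputation and outlier strategies above can be sketched as follows (a Python sketch of what the R cleaning layer does; the column names, the use of the mean of non-zero readings, and the fixed normal quantile are assumptions):

```python
import pandas as pd

def clean_weather(df: pd.DataFrame) -> pd.DataFrame:
    """Replace zero temperature/humidity readings and zero insolation
    readings with the mean of the non-zero values in that column."""
    df = df.copy()
    for col in ["temp_min", "temp_max", "temp_med", "humidity", "insolation"]:
        mean = df.loc[df[col] != 0, col].mean()
        df.loc[df[col] == 0, col] = mean
    return df

def typical_sales_mask(values: pd.Series) -> pd.Series:
    """True for sales values inside the central part of a fitted bell
    curve; values in the ~1% frequency tails are excluded when
    predicting that specific item."""
    z = 2.326  # one-sided normal quantile for a 1% tail
    return (values - values.mean()).abs() <= z * values.std()
```

The mask is applied per product column, so a row with one outlying item still contributes to the predictions of the other seven items.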
The evaluation was important because it allowed us to create a cleaning layer in the R code that checks these factors (strategies 2 and 3) and adjusts them automatically. This matters because we assume new samples will be added to the training data as time goes by, and this new data will probably suffer from the same problems.