Software systems and computational methods
Methodology for assessing the risks of fulfilling government contracts using machine learning tools
Abstract: The subject of the research is the development of a software package for intelligent forecasting of the execution of government contracts using machine learning methods and the analysis of unstructured information. The object of the study is the process of control and decision-making in public procurement, including the selection of contractors, the execution of contracts and the assessment of the timing and cost of their implementation. Special attention is paid to the development and application of interpretable machine learning methods for three tasks: assessing the risk of choosing an unscrupulous contractor, assessing the risk of non-fulfillment of the contract on time, and forecasting the likely timing and cost of contract implementation. Machine learning methods, methods for analyzing unstructured information and interpretable methods were used in the work, and interpretable machine learning models were built for each of the three tasks. A unique data set was collected, comprising more than 83 thousand records with more than 190 features from several systems: the Unified Information System (UIS) Public Procurement Register, the Register of Unscrupulous Suppliers (RNP) EIS and the SPARK Information System. Automated data collection and updating systems that can be deployed on customers' servers have also been developed.
In the course of the study, software packages were developed for intelligent forecasting of the execution of government contracts. They allow a more accurate risk analysis using unstructured information analysis methods, machine learning models and interpretable methods. This makes it possible to increase the effectiveness of monitoring the implementation of government contracts and to reduce the likelihood of corruption and violations. The study demonstrates the importance and applicability of machine learning methods and models in the field of public contracts and provides new opportunities for improving control and decision-making processes in public procurement.
Keywords: government contracts, machine learning, intelligent systems, data visualization, support vector machine, risk assessment, forecasting, regression analysis, interpretable artificial intelligence, ensemble methods
This article is automatically translated.
The article was prepared as part of the state assignment of the Government of the Russian Federation to the Financial University for 2023
The issues that arise in contract activity vary in importance depending on the context and the specific situation. The most important aspects include: the definition of the goals and obligations of the parties; payment terms; guarantees and responsibilities of the parties; confidentiality and data protection; dispute resolution in case of disagreement between the parties; terms of modification and termination of the contract, which matter for flexibility and adaptation to changing circumstances; and compliance with legislation and regulation. The importance of these issues may vary depending on the type of contract, the industry and the place of contract activity. When preparing and concluding contracts, it is recommended to consider all of these issues carefully in order to ensure clarity, protect the interests of the parties and reduce potential risks.
A. J. Geller concludes that in order to reduce the risk of termination of contracts, it is necessary to improve the procurement system, introduce effective control and supervision mechanisms, and improve the qualifications of participants in procurement procedures. The article provides information about the reasons for termination of contracts in the system of state and municipal procurement of the Russian Federation, which can be used to assess and manage risks in the implementation of government contracts.
The work of M. M. Zolotukhina and N. A. Polovnikova studies the risks faced by enterprises when choosing suppliers and concluding contracts. It is noted that the wrong choice of suppliers and ill-considered contract terms can lead to serious financial and operational problems, including high prices, poor quality of goods or services, and non-compliance with delivery dates. The article presents various types of risks, such as financial risks, operational risks, risks associated with insufficient knowledge about suppliers and their reputation, and risks associated with changes in the global and local economy. To manage these risks, the authors propose analyzing the supplier's financial condition, monitoring and evaluating the supplier's reputation, using various types of contracts, and so on.
The article by Yu. B. Gendlina et al. is devoted to the study of risks associated with the conclusion of construction contracts with municipal customers. The authors pay attention to the impact of the municipal context on risks and on the features of their management, and offer recommendations and strategies for managing risks associated with construction contracts. They consider such tools as a detailed analysis of contract terms, regular monitoring and control of work performance, the use of insurance and guarantees, and reasonable change management and risk control.
In the study of M. Y. Aleynikova and D. A. Golovanov, models of improving the internal control system in the implementation of public procurement in the Russian Federation are considered. The article discusses various models of improving the internal control system used in Russia, offers solutions based on a process approach, a risk-based approach, the use of information technology and automation, as well as strengthening the control and independence of internal services. The article also presents an analysis of the advantages and disadvantages of each model, as well as the experience and results of their application in public procurement practice.
The article by D. A. Eliseeva and D. A. Romanov examines the use of machine learning to predict risks in public procurement. Various factors (financial stability of suppliers, reputation of companies, contract execution history and other parameters) are analyzed to identify possible risks and problems. A comparative analysis of machine learning and traditional statistical forecasting methods showed that machine learning achieves higher accuracy and efficiency of risk forecasting.
The study by Yu. V. Nemtsev and O. B. Mironets is devoted to the analysis of the development of the information technology market in the field of public procurement. The authors consider the risks arising during the interaction of market participants in the competitive environment of public procurement. They offer a systematic classification of risks and various methods for assessing the magnitude of these risks. In their work, they also study approaches and methods of risk management, which are important for management.
S. M. Lavlinsky, A. A. Panin and A. V. Plyasunov apply the Stackelberg game-theoretic model to the analysis of costs and risks of contract violations in the Russian mineral resource complex. Their article attempts to take into account the institutional features of the process of forming the investment climate. To do this, the concept of transaction costs is introduced into the model and a mechanism for insuring the risk of breach of contract is created. The model is described as a two-level mathematical programming problem, and effective algorithms based on metaheuristics have been developed to solve it.
The risks of non-fulfillment of government contracts can be reduced at the stage of pre-selection for participation in tenders, before government contracts are concluded. Modern online information and analytical systems are important tools for this purpose. V. V. Skobelev analyzes this problem and considers methods for verifying participants in procurement tenders. Machine learning methods and data mining are becoming important tools for working with public procurement [9, 10]. Yu. M. Beketnova analyzes contracts using machine learning methods in order to identify violations of non-targeted income, and A. Z. Mantskava uses a machine learning algorithm to classify completed public procurement.
A number of studies [13-16] are devoted to the study of civil law risks when concluding contracts for the needs of law enforcement agencies, minimizing risks and developing a system of measures to counter risks.
The study of risks in procurement and contracting is important in various sectors of the economy. Particular attention is paid to the selection of a supplier on a competitive basis [1, 17]. Financial stability, reputation, the quality of goods and services, and the reliability of deliveries are the key factors in choosing a supplier. Various articles consider the risks of fulfilling contracts in fields such as construction, information technology and the mineral resources complex [3, 18, 6, 8]. L. S. Travkina, P. V. Lisin and E. A. Mezhevikina investigate the risk of supplying products that do not meet the requirements of the contract, which also leads to its non-fulfillment. Various methods and approaches to risk management, as well as models for improving control and preventing violations of contract legislation, are increasingly being implemented in information systems for the public services sector in the form of project risk assessment tools.
The research reviewed above shows that the study of the risks of executing public contracts is highly relevant. A large number of papers are devoted to specific areas of the economy and raw materials industries, to financial and civil risks, to aspects related to suppliers and their reputation, and to the impact of changes in the global and local economy.
From the analysis, it is possible to identify some methods and tools that can be used for the analysis and management of risks in the execution of government contracts. Among them: risk management methods, detailed analysis of contract terms, regular monitoring and control of work performance, as well as the use of insurance and guarantees. However, there are not enough studies that systematically consider all risks and use automated intelligent systems to analyze them.
In this work, a study was conducted aimed at comprehensive analysis and risk management in the execution of government contracts. The research suggests methods and approaches based on the use of machine learning and data mining, which allow us to consider risks systematically and effectively. This study fills a gap in the existing literature and offers a new approach to risk management in this area.
Methodology for the development of an intelligent risk assessment complex
The methodology of automated intelligent control of the execution of government contracts is based on the following provisions:
1. In order to successfully build an intelligent automated system for monitoring government contracts, a large data set containing the most useful information about the subject of the contract, financial and legal information on supplier companies and other useful information is needed.
2. An intelligent automated control system for government contracts should be aimed at solving three tasks: assessment of the risk of choosing an unscrupulous contractor; assessment of the risk of non-fulfillment of the contract on time; assessment of the likely timing and cost of contract implementation.
3. An intelligent system aimed at assessing the risk of choosing an unscrupulous contractor, the risk of non-fulfillment of the contract on time and the likely timing and cost of implementing the contract should take into account the importance of features when evaluating and making decisions. The importance of a feature indicates how strongly that feature or factor affects the risk or the assessed characteristic. Taking feature importance into account allows the system to focus on the most informative factors and account for their impact on risks and the characteristics being evaluated. This can help reduce noise and improve the quality of estimates and forecasts, which in turn will improve decision-making when selecting contractors, executing contracts and estimating the timing and cost of their implementation.
4. After testing the results in assessing the risks of executing government contracts, an intelligent automated system for monitoring government contracts can be deployed on the servers of potential customers.
Let's consider each of the stages.
A data set for assessing the risks of executing government contracts
One of the first steps in the development of the methodology is the collection and structuring of data related to the execution of government contracts. The data will include information about contracts, contractors, work performed, deadlines and quality assessment. It is important that the data is structured and conveniently available for use.
As part of the study, a unique data set was collected, consisting of more than 83 thousand records with more than 190 features from the following systems: the Register of Public Procurement of the Unified Information System (UIS) (https://zakupki.gov.ru/); the Register of Unscrupulous Suppliers (RNP) EIS (https://zakupki.gov.ru/epz/dishonestsupplier/search/results.html); the SPARK Information System (https://spark-interfax.ru/).
The data set is described in more detail in the article "Methodology for assessing the importance of features in the analysis of the implementation of government contracts" in the journal "National Security".
Module for assessing the risk of non-fulfillment of the contract on time
This module solves a classification problem and predicts not only discrete responses ("fulfilled / not fulfilled") but also the probabilities of the corresponding risks, which makes the system more flexible. The module is developed in the Python programming language in the Jupyter Notebook environment. The scripts are saved in .ipynb format and can be integrated into Google Colab and Yandex Colab cloud services. The module supports data import, analysis and preprocessing, training and testing of classification models, construction of probability distributions and visualization of results, which supports interpretation of the results and offers a competitive advantage.
The adequacy and quality of the module were verified on data collected from three information systems, covering about 80 thousand government contracts with 192 features.
During the risk assessment process, the "Status" variable was identified as a target. It can take two values: "Contract executed" and "Contract terminated", which reduces the task to classification. The original dataset contains 83834 observations and 192 variables. Despite the high dimensionality and missing values in most observations, such a volume of data can contribute to a qualitative assessment of the probability of termination of the contract.
The distribution of contracts in the dataset can be considered balanced, which is confirmed by Figure 1, where "1" corresponds to terminated contracts, and "0" – executed.
Figure 1 – Distribution of the target class [Compiled by the authors]
For efficient data processing, label encoding is applied to all non-integer features. With a large amount of data, this approach is preferable to one-hot encoding, which would add a separate column for every category. Encoding is performed for each corresponding column, and missing values are replaced with the code 100000, which guarantees a code distinct from any category code even in the worst case.
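The encoding step described above might be sketched as follows. This is only an illustration under stated assumptions: the helper `encode_labels`, the column names and the use of `pandas.factorize` are hypothetical choices, since the text does not specify the encoder, and numeric columns are left unchanged here.

```python
import numpy as np
import pandas as pd

MISSING_CODE = 100_000  # sentinel for missing values, as described in the text


def encode_labels(df: pd.DataFrame) -> pd.DataFrame:
    """Label-encode every non-numeric column; missing values get MISSING_CODE."""
    encoded = df.copy()
    for col in encoded.columns:
        if pd.api.types.is_numeric_dtype(encoded[col]):
            continue  # numeric features are left unchanged (assumption)
        # factorize assigns codes 0..N-1 in order of appearance, -1 to missing
        codes, _ = pd.factorize(encoded[col])
        encoded[col] = np.where(codes == -1, MISSING_CODE, codes)
    return encoded


# Hypothetical toy data, not the authors' dataset
demo = pd.DataFrame({
    "region": ["Moscow", None, "Kazan", "Moscow"],
    "days": [30, 45, 10, 60],
})
encoded_demo = encode_labels(demo)
```

Because category codes start at 0 and grow with the number of distinct values, a sentinel of 100000 stays distinct as long as no column has that many categories.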
To predict the probabilities of the "Contract terminated" status, a logistic regression model will be trained. The training and test samples were divided in a ratio of 85% to 15%.
After encoding, the values in most columns lie in the range from 0 to N, where N is the number of categories, while the numeric columns remain unchanged. To improve the effectiveness of the model, it is recommended to standardize the data, which brings all features to a comparable scale.
On the test data, the logistic regression model showed an accuracy of 97.01%. This is a fairly good result, since only 2.99% of new observations receive incorrect labels.
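The training procedure described above (85%/15% split, standardization, logistic regression) can be outlined roughly as follows. The data here are synthetic stand-ins generated with scikit-learn, not the authors' data set, and the pipeline layout and hyperparameters are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the encoded contract data (NOT the authors' dataset)
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# 85% / 15% split, as in the text
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.15, random_state=0, stratify=y
)

# The scaler is fitted on the training part only, to avoid leaking test data
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

accuracy = model.score(X_test, y_test)
# Probability of class 1 ("Contract terminated") for each test observation
p_terminated = model.predict_proba(X_test)[:, 1]
```

Wrapping the scaler and the classifier in one pipeline keeps the preprocessing identical at training and prediction time.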
Next, we investigate the probability distribution of contract violations in a test sample using a logistic regression estimate (see Figure 2).
Figure 2 – Probability distribution of contract risks by logistic regression [Compiled by the authors]
As Figure 2 shows, in most cases the model gives confident forecasts, with probabilities close to 0 or 1. The share of forecasts in the range from 0.1 to 0.9 is insignificant. This indicates the limitations of the logistic regression model in assessing risk probabilities. A similar lack of flexibility is observed in the "Decision Tree" model, whose accuracy is 97.89%, as illustrated in Figure 3.
Figure 3 – Probability distribution of contract risks on the decision tree [Compiled by the authors]
Finally, let's build a model based on the support vector machine. The accuracy metric on the test data for this model is 97.88%, which is slightly inferior to the indicator of the "Decision Tree" model. However, this method provides a more flexible risk assessment. The probability distribution of risks for this method is shown in Figure 4.
Figure 4 – Probability distribution of contract risks by the support vector machine [Compiled by the authors]
The histogram shows that the risk probabilities are not concentrated at the boundaries 0 and 1 and are generally unevenly distributed. At the same time, most of the contracts are classified as "Contract terminated" with a probability close to 0.5. Table 1 shows the results of comparing different classification models.
Table 1 – Results of comparison of different classification models
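The comparison of probability distributions discussed above can be reproduced in outline as follows. This is a sketch on synthetic data; the model settings are assumptions, and `SVC(probability=True)` is only one way of obtaining SVM probabilities (the text does not say which was used).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (NOT the authors' dataset)
X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.15, random_state=0
)

models = {
    "logistic_regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "decision_tree": DecisionTreeClassifier(random_state=0),
    # probability=True makes predict_proba available for the SVM
    "svm": make_pipeline(StandardScaler(), SVC(probability=True, random_state=0)),
}

bins = np.linspace(0.0, 1.0, 11)  # ten equal-width probability bins
histograms = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    p = model.predict_proba(X_test)[:, 1]  # probability of class 1
    counts, _ = np.histogram(p, bins=bins)
    histograms[name] = counts  # number of test observations per bin
```

An unrestricted decision tree has pure leaves, so its predicted probabilities fall only into the first and last bins, which is the lack of flexibility noted in the text.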
If the goal is detailed contract mining, the most appropriate model is the SVM, which produces a range of probabilities that corresponds to real conditions. Feature importance was therefore assessed for the SVM. The most and least important features were determined using the feature importances method.
From the analysis, we can see that an important contribution to the forecast is made by such features as the company's tax authority, the EBIT interest coverage ratio, the description of the purchase object, the concentration coefficient of borrowed capital. The impact of the analyzed factors on the performance of the state contract is associated with various aspects: the financial stability of the company, determined by the tax status; the coefficient of interest coverage on EBIT; the quality of the description of the purchased object; the coefficient of concentration of borrowed capital. It is possible to assess the impact of these elements on the risk of non-fulfillment of the state contract through financial analysis, audit, review of previous contracts, risk assessment system and modeling.
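One model-agnostic way to rank features for an SVM is permutation importance. The sketch below uses it as an assumption: the text does not state which importance method was applied, and the data and feature names are synthetic placeholders.

```python
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data with hypothetical feature names
X, y = make_classification(
    n_samples=600, n_features=8, n_informative=3, random_state=0
)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.15, random_state=0
)

svm = make_pipeline(StandardScaler(), SVC()).fit(X_train, y_train)

# Shuffle each feature in turn and measure the drop in test accuracy
result = permutation_importance(
    svm, X_test, y_test, n_repeats=10, random_state=0
)

# Features sorted from most to least important
ranking = sorted(
    zip(feature_names, result.importances_mean),
    key=lambda item: item[1],
    reverse=True,
)
```

The top and bottom of `ranking` correspond to the most and least influential features for this particular model and test sample.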
Module for estimating the likely timing and cost of the contract
Forecasting the cost and deadlines of government contracts is a regression task, for which a Python module was created, stored in the IPython Notebook format (.ipynb) and compatible with Google and Yandex cloud platforms. The module uses the Python libraries Pandas, Matplotlib and scikit-learn for processing and visualizing data and for training regression models. It provides data loading and processing, separation of the data set by cost and execution time, feature encoding, standardization, testing of regression models and visualization of the key factors affecting the timing and cost of contracts.
The adequacy and accuracy of the developed module were evaluated on data from three information systems, covering about 80 thousand contracts (191 features). The dependent variables were the cost of the contract and its term, calculated as the difference between the start and end dates. The models were evaluated with the key metrics R-squared, MSE and MAPE. Data preprocessing included type conversion, label encoding of categorical data, and standardization to optimize the algorithms. In the original dataset (83834 x 191), the two dates were reduced to a single variable representing the difference between them. Gaps in the data were filled with the column mean. All data were divided in a ratio of 85% to 15% for training and testing the models. However, significant overfitting of the models was observed. The results of forecasting the terms of the contract are presented in Table 2.
Table 2 – Analysis of contract performance regression models
The analysis of the presented information indicates the superiority of the decision tree in achieving optimal results, subject to standardization, without which the quality of models deteriorates. For additional improvement, it is recommended to conduct a deeper analysis of the feature space and the use of neural networks. These conclusions are consistent with the results obtained for contract value forecasting models, which are presented in table 3.
Table 3 – Analysis of contract value regression models
The results indicate the superiority of the decision tree in predicting the value of contracts, while the effectiveness of the other models leaves much to be desired. The regression model allows an initial assessment of the terms and cost of the contract, but requires additional adjustment. The models are overfitted: they show 100% results on the training sample, which suggests the need to apply regularization methods in the future.
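The overfitting noted above can be checked by comparing the same metrics on the training and test samples. The following sketch uses synthetic data; the helper `report` and the depth limits used as a simple form of regularization are illustrative assumptions, not the authors' settings.

```python
from sklearn.datasets import make_regression
from sklearn.metrics import (
    mean_absolute_percentage_error,
    mean_squared_error,
    r2_score,
)
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (NOT the authors' dataset)
X, y = make_regression(n_samples=1000, n_features=15, noise=20.0, random_state=0)
y = y - y.min() + 1.0  # shift the target to positive values so MAPE is defined

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.15, random_state=0
)


def report(model, X_part, y_part):
    """Return R-squared, MSE and MAPE of a fitted model on one sample."""
    pred = model.predict(X_part)
    return {
        "r2": r2_score(y_part, pred),
        "mse": mean_squared_error(y_part, pred),
        "mape": mean_absolute_percentage_error(y_part, pred),
    }


# An unrestricted tree memorizes the training sample
tree = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)
train_metrics = report(tree, X_train, y_train)
test_metrics = report(tree, X_test, y_test)

# Limiting depth and leaf size is one simple way to regularize a tree
pruned = DecisionTreeRegressor(
    max_depth=5, min_samples_leaf=10, random_state=0
).fit(X_train, y_train)
pruned_train_metrics = report(pruned, X_train, y_train)
```

A large gap between `train_metrics` and `test_metrics` is the symptom described in the text; the constrained tree no longer reaches a perfect score on the training sample.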
We also note that the data contains "outlier" contracts that last a very long time compared to others. This probably leads to high prediction errors.
The data on contract execution terms contain outliers and will be cleaned according to the condition:
Q1 - 1.5R < Days < Q3 + 1.5R,
where Q1 and Q3 are the first and third quartiles of the Days series, and R = Q3 - Q1 is the interquartile range.
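The filtering condition above can be written with pandas as in the following sketch. The function name and the toy values are hypothetical; only the column name `Days` and the 1.5R rule come from the text.

```python
import pandas as pd


def remove_duration_outliers(df: pd.DataFrame, column: str = "Days") -> pd.DataFrame:
    """Keep rows satisfying Q1 - 1.5R < value < Q3 + 1.5R."""
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    r = q3 - q1  # interquartile range
    mask = (df[column] > q1 - 1.5 * r) & (df[column] < q3 + 1.5 * r)
    return df[mask]


# Toy example: one contract lasts far longer than the rest
contracts = pd.DataFrame({"Days": [10, 20, 30, 40, 50, 60, 70, 2000]})
cleaned = remove_duration_outliers(contracts)
```

For these toy values Q1 = 27.5, Q3 = 62.5 and R = 35, so the admissible interval is (-25, 115) and only the 2000-day contract is dropped.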
Data on the cost of state contracts are distributed evenly (Fig. 5).
Figure 5 – Histogram of contract values [Compiled by the authors]
Given the effectiveness of decision trees, ensemble models based on them will be developed, including Extra Trees and Random Forest. After removing the outliers, the data will be divided into training and test samples in the proportion of 90% to 10%. The results of forecasting the terms of the contract for various ensembles are presented in Table 4.
Table 4 – Analysis of contract performance regression models
Table 4 shows that the Extra Trees and Random Forest models provide the most accurate forecasts. The average error of the Extra Trees ensemble does not exceed one month, which is a satisfactory result.
Table 5 shows the results of forecasting the value of the contract.
Table 5 – Results of forecasting the value of the contract
The ExtraTree and Random Forest models demonstrate optimal results in cost forecasting, with an average error of 4500-4600 rubles, which is considered acceptable.
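The ensemble comparison described above could be set up along the following lines. The data are synthetic and the hyperparameters are assumptions; mean absolute error is used here as a plain-language "average error", which may differ from the metric in the authors' tables.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor, RandomForestRegressor
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (NOT the authors' dataset)
X, y = make_regression(n_samples=800, n_features=12, noise=10.0, random_state=0)

# 90% / 10% split, as in the text
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.10, random_state=0
)

ensembles = {
    "extra_trees": ExtraTreesRegressor(n_estimators=100, random_state=0),
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=0),
}

scores = {}
for name, model in ensembles.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    scores[name] = {
        "mae": mean_absolute_error(y_test, pred),  # average absolute error
        "r2": r2_score(y_test, pred),
    }
```

The same loop can be run twice, once with the contract term and once with the contract cost as the target.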
The model shows that the timing of the contract is affected by the identification of the customer, the customer's requirements and the terms of the contract, the effectiveness of organizations in the procurement and control processes, the number of the auction notice, the description of the purchase object and the identification code of the purchase. These factors may directly or indirectly affect the forecast of the terms of the contract. For a more accurate forecast, other factors must also be taken into account.
We will conduct a detailed analysis of the factors that significantly affect the cost of the state contract:
1. The volume and cost of the supplied products or services, reflected in the number of completed records. A larger number of records usually indicates a larger volume of deliveries, which affects the total value of the contract. This factor can also affect the calculations related to delivery, storage and logistics.
2. The item code of the purchase object, which defines specific products or services for purchase. Different procurement objects may have different costs depending on the level of technology or specific skills required for their production or provision.
3. The total cost of the delivered products, which indicates the total cost of the supply of products or services and directly affects the total cost of the contract.
4. The amount in rubles of the object of purchase, reflecting the amount of costs associated with the object of purchase. A large amount indicates the significance and importance of the supplied product or service and can significantly affect the total cost of the contract.
All these factors directly or indirectly affect the cost parameters of the state contract. Taking them into account when predicting value contributes to the assessment and planning of costs by customers, as well as the formation of pricing by suppliers. For a more accurate forecast of the cost of the state contract, additional factors such as inflation, currency fluctuations, the cost of raw materials, etc. must be taken into account.
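A ranking of cost factors like the one above can be obtained from a tree ensemble through its impurity-based importances. In this sketch the column names (`n_records`, `item_code`, `noise`) and the synthetic relationship between them and the cost are invented purely for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Hypothetical features; the cost is built to depend mainly on n_records
data = pd.DataFrame({
    "n_records": rng.integers(1, 50, size=500),
    "item_code": rng.integers(0, 20, size=500),
    "noise": rng.normal(size=500),
})
cost = 1000.0 * data["n_records"] + rng.normal(scale=100.0, size=500)

forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(data, cost)

# Impurity-based importances sum to 1; larger means more influential
importances = pd.Series(
    forest.feature_importances_, index=data.columns
).sort_values(ascending=False)
```

Impurity-based importances can be biased toward high-cardinality features such as identifiers, so a ranking of this kind is best treated as a starting point for analysis.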
Software package for assessing the risks of executing government contracts
The software package for assessing the risks of executing government contracts is a web application that can be deployed both on local servers of potential customers and on the global network.
It consists of 3 main modules:
1. Evaluation of a potential partner. In this module, it is enough to enter the TIN (INN) of a potential partner to obtain the necessary information about it, including all tax, financial and legal components, in any convenient form (tables, graphs, diagrams, etc.).
2. Assessment of the probability of contract execution.
By filling out the proposed forms, the user can obtain the probability of contract execution.
This module can also output the features that are most important for the forecast, in any form of representation, so that the user can analyze them independently.
3. Estimation of the forecast of the cost and terms of the contract.
By filling out the proposed forms, the user can get information about the cost of the contract and the timing of its execution. In addition, the module can provide graphs, charts or statistical indicators that will help to understand the accuracy and reliability of the forecast.
As a result of the study, the following results were obtained:
A unique data set has been collected, consisting of more than 83 thousand records with more than 190 features from the following systems: the Register of Public Procurement of the Unified Information System (UIS) (https://zakupki.gov.ru/); the Register of Unscrupulous Suppliers (RNP) EIS (https://zakupki.gov.ru/epz/dishonestsupplier/search/results.html); the SPARK Information System (https://spark-interfax.ru/).
Automated data collection and updating systems have also been developed that can be deployed on the servers of potential customers.
Based on an analysis of interpretable machine learning methods for forecasting the execution of government contracts, of methods for analyzing large volumes of unstructured information, and of criteria for building machine learning models, interpretable machine learning models were built for the following tasks:
• assessment of the risk of choosing an unscrupulous contractor;
• assessment of the risk of non-fulfillment of the contract on time;
• assessment of the likely timing and cost of the contract implementation.
The models were tested and the learning metrics showed high scores.
Each of the algorithms has been developed taking into account the following criteria: reliability, comprehensibility and interpretability, universality, scalability and flexibility.
The intelligent system, aimed at assessing the risk of choosing an unscrupulous contractor, the risk of non-fulfillment of the contract on time and the likely timing and cost of implementing the contract, takes into account the importance of features in evaluating and making decisions. The importance of a feature indicates how strongly that feature or factor affects the risk or the assessed characteristic. Taking feature importance into account allows the system to focus on the most informative factors and account for their impact on risks and assessed characteristics. This can help reduce noise and improve the quality of estimates and forecasts, which in turn will improve decision-making when selecting contractors, executing contracts and estimating the timing and cost of their implementation.
As a result of the research, a software package was developed for intelligent forecasting of the execution of government contracts. This package allows a more accurate risk analysis using unstructured information analysis methods, machine learning models and interpretable methods. This makes it possible to increase the effectiveness of monitoring the implementation of government contracts and to reduce the likelihood of corruption and violations.
The study proves the importance and applicability of machine learning methods and models in the field of government contracts. The software package for intelligent forecasting of the execution of public contracts provides new opportunities to improve the processes of control and decision-making in the field of public procurement.