On the road to a blog on machine learning and data analysis for subsurface data
Machine learning algorithms are increasingly being used to understand subsurface data. The CSEG Digital Media Committee has approached me to start a blog initiative to showcase articles on data analysis and machine learning. Articles from groups in the subsurface industry and academia will be solicited and posted on the blog. The primary aim of the blog is to give a flavor of the types of models and data flows that have been explored. A secondary aim is to provide tutorials on various data analysis and machine learning methods. In this article, I provide a broad introduction to data analysis and machine learning and discuss their potential uses for subsurface data.
Over the last decade, the use of data analysis and machine learning has seen exponential growth, driven largely by the need to gain deeper insights into ever-increasing volumes of consumer data. Significant strides in computing optimization and scale, and hardware improvements in speed and memory, have worked in tandem with the increasing data size. In addition, open-source programming languages have had an enormous impact on this growth. Large-scale computing enabled by clusters, grids and the cloud can glean insights from datasets in a time span that was unthinkable a decade ago. Today, terabytes of data, and computers with tens of gigaflops of CPU (central processing unit) compute power and tens of gigabytes of RAM, are a common sight. The compute power of GPUs (graphics processing units) is on average about 10 times that of CPUs. Innovation in optimization algorithms that learn from data has, in my opinion, outpaced developments in hardware (namely flops and memory). Languages such as Python and R have led to a democratization of code that has resulted in better software. The open-source software movement today knows no borders and continues to grow. Major technology companies such as Google, Microsoft and Amazon have recognized the advantages of open source and are participating in this space. For example, TensorFlow, originally developed by Google, is available as an open-source API (application programming interface) for Python and contains modern machine learning methods. The data these methods operate on can be in tabular form, or can be time or depth series.
It is important to distinguish data analysis from machine learning, as the two terms are sometimes used interchangeably. For the purposes of this blog, I will define them as follows. Data analysis refers to using the information in existing data and drawing inferences from it. It entails mining, summarizing and extracting trends from the data at hand. For example, using current and previous production data, data analysis can show whether there was a net increase in oil production over time.
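As a minimal sketch of what such an analysis might look like in Python with pandas, consider the snippet below; the column name and the monthly production numbers are hypothetical, made up purely for illustration:

```python
import pandas as pd

# Hypothetical monthly oil production volumes; in practice these would
# come from a production database. Column name and values are made up.
production = pd.DataFrame({
    "month": pd.date_range("2018-01-01", periods=12, freq="MS"),
    "oil_bbl": [1020, 1015, 1040, 1060, 1055, 1080,
                1100, 1095, 1120, 1150, 1145, 1170],
})

# Summarize and extract a simple trend: did production increase over time?
net_change = production["oil_bbl"].iloc[-1] - production["oil_bbl"].iloc[0]
rolling_mean = production["oil_bbl"].rolling(window=3).mean()

print(f"Net change over the period: {net_change} bbl/month")
print(rolling_mean.tail())
```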
Machine learning is a more robust way not only to summarize data or extract trends, but to predict or classify new data. It requires understanding the data through a systematic data analysis framework. Exploratory data analysis (EDA) is an important step in machine learning: it is the process by which data is analyzed to understand and summarize its characteristics. This summary can help identify important variables and guide the choice of model. You will notice that the methods used in EDA sound like those used in data analysis. Next, the machine learning model is trained on a subset of the dataset called the training data. The parameters of the model are then tuned using a second partition of the data called the validation data; tuning, or parameter search, is necessary to avoid overfitting the model to the training data. Finally, the model is evaluated on a third partition called the test data, where its accuracy is measured to determine how well the model generalizes. The use of training, validation and test datasets, built with sound statistical sampling, makes machine learning robust in the sense that it can generalize to predict new data.

The two main types of machine learning are supervised and unsupervised. In supervised learning, both the input data (e.g., images of objects) and their corresponding labels (e.g., cat, apple) are known and fed to the model, which learns the parameters that describe the relation between input and output. In unsupervised learning, the data carry no labels, and relationships among the observations are learned from the data itself using some criterion. For example, when a distance-dependent separation exists between different groups, a distance measure can be used to cluster the data into separate groups. It is important to emphasize that the ideas used in machine learning, namely data partitioning, the types of models, parameter selection criteria, inference and prediction, have been known for a while; affordable data storage, large volumes of data and faster computation have only recently made machine learning a practical tool. A minimal sketch of the partitioning step appears below; after it, I will briefly delve into a few popular machine learning models.
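Here is one way the three-way split might be done in Python with scikit-learn. This is a sketch on synthetic data; the split fractions and random seeds are arbitrary choices:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic placeholder data: 1000 samples, 5 features, binary labels.
rng = np.random.default_rng(seed=0)
X = rng.normal(size=(1000, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# First carve out the test set, then split the remainder into
# training and validation partitions (60/20/20 overall).
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=42)

# The model is fit on X_train, its parameters are tuned against X_val,
# and its generalization is measured once on X_test.
print(X_train.shape, X_val.shape, X_test.shape)  # (600, 5) (200, 5) (200, 5)
```

Holding the test set out until the very end is what makes the final accuracy an honest estimate of how the model will behave on new data.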
Neural networks [1], convolutional neural networks (CNNs) [2], generative adversarial networks [3], collaborative filtering [4], gradient boosting [5] and regression [7], along with a myriad of other techniques, are examples of supervised learning; reinforcement learning [6] is a related paradigm in which a model learns from reward signals rather than from labeled examples. Unsupervised techniques such as K-Means clustering [8] play an important role when labeled data are not available. With so many models available, the questions arise of which model to choose and how complex it should be. The choice of model depends on the statistical characteristics of the data and on the problem's objective (classification or otherwise). In general, the complexity of the model depends on the complexity of the data, and complex models usually require larger datasets. Throughout this article, I refer to the algorithm, the type of architecture (for example, a CNN) and the loss function collectively as a model.
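To make the unsupervised case concrete, here is a minimal K-Means [8] sketch in Python on two synthetic, distance-separated groups; the group locations, sizes and seeds are arbitrary:

```python
import numpy as np
from sklearn.cluster import KMeans

# Two synthetic, distance-separated groups of 2-D points.
rng = np.random.default_rng(seed=1)
group_a = rng.normal(loc=(0.0, 0.0), scale=0.5, size=(100, 2))
group_b = rng.normal(loc=(5.0, 5.0), scale=0.5, size=(100, 2))
X = np.vstack([group_a, group_b])

# K-Means assigns each point to the nearest of k cluster centroids,
# using Euclidean distance as the separation criterion. No labels needed.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.cluster_centers_)   # approximately (0, 0) and (5, 5)
print(kmeans.labels_[:5], kmeans.labels_[-5:])
```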
Having provided context for machine learning and data analysis, I hope this blog informs the subsurface community and motivates more readers to contribute. May the contributions lead to better data integration and provide reliable predictive methods for subsurface data. May they also help answer the question of where machine learning is useful compared with traditional approaches. Which machine learning models, if any, can improve noise attenuation, facies classification, salt identification, oil and gas sweet-spot identification, production rate prediction and other ongoing research questions?
I will end this article with the caveat that not all machine learning approaches will work in all situations. The usefulness of machine learning for a given problem depends on the type of model the practitioner chooses. For example, to classify objects, one cannot fit a linear regression curve; one must instead consider some sort of network or another classification method. Similarly, the statistical properties of the data and the assumptions each model makes will govern the model's performance. I believe the human expert will play a key role in deciding how to precondition the data, which machine learning architecture to use, and how to present key insights from the data and its results in a way that is explainable to their peers. In particular, I believe that explaining why a given decision was made by a given learning method will be an area in which human experts play a key role. Understanding why an algorithm worked, or failed to work, will pave the way for a better understanding of the models.
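To make the regression-versus-classification point concrete, the following sketch contrasts a linear regression with a logistic regression classifier on synthetic labeled data; scikit-learn is used here purely for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Toy labeled data: one feature, two classes (0 and 1).
rng = np.random.default_rng(seed=2)
X = rng.normal(size=(200, 1))
y = (X[:, 0] > 0).astype(int)

# A linear regression produces unbounded real-valued outputs, not class
# labels; a logistic regression models class membership directly.
linreg = LinearRegression().fit(X, y)
logreg = LogisticRegression().fit(X, y)

print(linreg.predict([[2.0]]))        # an unbounded real number
print(logreg.predict_proba([[2.0]]))  # probabilities for classes 0 and 1
print(logreg.predict([[2.0]]))        # a class label
```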
To keep this post succinct, I have not been able to cite all the machine learning models used in practice. Please refer to the DoodleTrain courses by Russell [9] and by Chopra and Marfurt [10] for references on machine learning approaches for subsurface data. I would like to thank the contributors across all industries, and especially in the subsurface community, who have advanced data analysis and machine learning.
The first article that we will publish is from Gian Matharu, Wenlei Gao, Rongzhi Lin, Yi Guo, Minjun Park and Mauricio D. Sacchi of the University of Alberta. Their article is titled “Simultaneous source deblending using a deep residual network”. Deblending is the process of removing the contribution of an interfering source from the primary source's record. To reduce acquisition time and/or to increase source density, the interfering source is shot before the end time of the primary source's record. The authors use deep residual networks to remove the interference. Residual networks are a variant of CNNs in which a layer's input is added back to its output through skip connections, which yields faster convergence and better stability. I hope you will enjoy reading their work and the other works that will be posted on the blog, provide constructive feedback and pose your questions.
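To give a feel for the skip-connection idea, here is an illustrative residual block written with TensorFlow/Keras. This is only a sketch under arbitrary shapes and filter counts, not the authors' network:

```python
import tensorflow as tf
from tensorflow.keras import layers

def residual_block(x, filters=32):
    """A basic residual block: the block's input is added back to the
    output of its convolutions through a skip connection."""
    shortcut = x
    y = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    y = layers.Conv2D(filters, 3, padding="same")(y)
    y = layers.Add()([y, shortcut])        # the skip connection
    return layers.Activation("relu")(y)

# Toy image-to-image model; the input shape is illustrative only.
inputs = tf.keras.Input(shape=(64, 64, 1))
x = layers.Conv2D(32, 3, padding="same", activation="relu")(inputs)
x = residual_block(x)
x = residual_block(x)
outputs = layers.Conv2D(1, 3, padding="same")(x)
model = tf.keras.Model(inputs, outputs)
model.summary()
```

The addition of the shortcut lets gradients flow directly through the skip path, which underlies the faster convergence and stability mentioned above.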
References
1. McCulloch, W. and Pitts, W. (1943). A logical calculus of the ideas immanent in nervous activity. Bulletin of Mathematical Biophysics, 5(4), 115–133.
2. LeCun, Y., Boser, B., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W. and Jackel, L. D. (1989). Backpropagation applied to handwritten zip code recognition. Neural Computation, 1(4), 541–551.
3. Goodfellow, I. J., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A. and Bengio, Y. (2014). Generative adversarial networks. NIPS 2014.
4. Ricci, F., Rokach, L. and Shapira, B. (2011). Introduction to recommender systems handbook. In: Recommender Systems Handbook, Springer, pp. 1–35.
5. Freund, Y. and Schapire, R. (1995). A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences.
6. Kaelbling, L. P., Littman, M. L. and Moore, A. W. (1996). Reinforcement learning: A survey. Journal of Artificial Intelligence Research, 4, 237–285.
7. Yule, G. U. (1897). On the theory of correlation. Journal of the Royal Statistical Society, 60(4), 812–854.
8. MacQueen, J. B. (1967). Some methods for classification and analysis of multivariate observations. In: Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, Vol. 1, University of California Press, pp. 281–297.
9. Russell, B. (2019). Neural networks and machine learning applications in petroleum exploration. The CSEG DoodleTrain course.
10. Chopra, S. and Marfurt, K. (2019). Seismic facies classification using machine learning tools. The CSEG DoodleTrain course.
On behalf of the CSEG Digital Media Committee, I invite tutorial articles and case studies in data analysis and machine learning. Obtain the required approval from your organization to publish, and come forward to share your articles by sending them to the email address below. For further questions on how you can contribute to the blog, please reach out at mlcsegblog@gmail.com.
CSEG Digital Media Committee members: Jocelyn Frankow and Jason Schweigert.