About the Client
The client is a privately held exploration and production oil company that focuses on acquisitions, exploration, and development in the USA and the Netherlands.
Geologists and geophysicists use different methods to search for geological structures that may form oil reservoirs.
Currently, it costs around $85,000 per square mile for an oil and gas company to survey a field for oil. In total, companies spend at least $1M, and sometimes over $40M, before they see any results.
Our client had DNA samples of solids from 13 different areas covering more than 3,000 fields, with microelements and other characteristics recorded for each area. The client wanted to use this solids DNA to make oil field prediction easier.
Quantum found a way to predict oil fields by using the DNAs of solids, saving time and money for the client.
Our machine learning model predicts the location of an oil field with 70% accuracy. As the client collects more data about different areas, our R&D team will keep improving this result.
Data understanding and preparation
Our research began with an attempt to highlight the main features of solids DNA with an analytical approach. First, we had to determine the importance of each field in the data. For feature selection, we applied regression, statistical, and other methods to all of our mixed data to get relevant results. As we mixed the data in different ways, we created new datasets.
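One of the feature-selection passes described above can be sketched with an L1-regularized regression, which shrinks the coefficients of uninformative fields to zero. The data below is synthetic and the threshold is an illustrative assumption, not the client's actual dataset or settings:

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the solids-DNA measurements:
# only features 0 and 3 actually drive the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 3 * X[:, 0] - 2 * X[:, 3] + rng.normal(scale=0.1, size=200)

# L1 regularization zeroes out irrelevant coefficients,
# leaving only the informative fields selected.
lasso = Lasso(alpha=0.1).fit(StandardScaler().fit_transform(X), y)
selected = np.flatnonzero(np.abs(lasso.coef_) > 1e-6)
print(selected)
```

Running several such methods over the same table, each with its own notion of importance, is what produces the family of mixed datasets.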
Our team used the mixed datasets for ML model training. As usual, we started by trying different algorithms to find the best one, but on the first iteration the model couldn't distinguish between the 13 groups of data and achieved only 50% accuracy.
After we discussed the problem with the client, we reached a reasonable conclusion: the fact that the data had been collected in different regions and during different seasons was to blame. We split the datasets according to this principle, thus improving the result by 20%.
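Splitting along those lines reduces to a simple grouping operation. In the pandas sketch below, the column names and values are assumptions for illustration, not the client's actual schema:

```python
import pandas as pd

# Toy table; "region" and "season" are assumed column names.
df = pd.DataFrame({
    "region": ["A", "A", "B", "B"],
    "season": ["winter", "summer", "winter", "summer"],
    "microelement_1": [1.2, 0.7, 3.1, 2.4],
})

# One training dataset per (region, season) combination,
# so each model only sees samples collected under comparable conditions.
subsets = {
    key: grp.reset_index(drop=True)
    for key, grp in df.groupby(["region", "season"])
}
print(sorted(subsets))  # keys are (region, season) tuples
```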
To show more than just a thousand rows with thousands of fields, we decided to represent each field as a group of pixels that change their saturation depending on the importance of certain features. This method gave us a complete image we could analyze.
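The idea can be sketched with Matplotlib: each field becomes a small grid of pixels whose darkness tracks feature importance. The random importances and the 10x10 layout here are stand-ins for the real values:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
importances = rng.random(100)         # stand-in importances for 100 features
pixels = importances.reshape(10, 10)  # one field as a 10x10 block of pixels

fig, ax = plt.subplots()
# Higher importance -> darker (more saturated) pixel.
ax.imshow(pixels, cmap="Greys", vmin=0, vmax=1)
ax.set_axis_off()
fig.savefig("field_pixels.png", bbox_inches="tight")
```

Tiling one such block per field yields a single image in which patterns across fields become visible at a glance.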
We used the FloydHub platform for data processing and model training. Scikit-learn, SciPy, Matplotlib, and Seaborn were used for EDA and visualization.
We also developed an automated pipeline to find the best algorithm for selecting the most valuable features. While evaluating algorithms, we used a range of approaches, from correlation analysis and stacks of L1-regularized regressors to unsupervised approaches and dimensionality reduction methods. All results were saved as separate datasets.
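A minimal version of such a pipeline might loop over several selection strategies and keep each reduced dataset. The methods, sizes, and synthetic data below are illustrative assumptions, not the pipeline we shipped:

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectFromModel, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression

# Synthetic classification data standing in for the solids-DNA table.
X, y = make_classification(n_samples=300, n_features=30,
                           n_informative=5, random_state=0)

selectors = {
    "anova_top10": SelectKBest(f_classif, k=10),  # statistical filter
    "l1_logreg": SelectFromModel(                 # L1-regularized selector
        LogisticRegression(penalty="l1", solver="liblinear", C=0.5)),
    "pca_10": PCA(n_components=10),               # dimensionality reduction
}

# Each strategy yields its own dataset; in practice each one
# would be saved to disk for the model-selection stage.
datasets = {name: sel.fit_transform(X, y) for name, sel in selectors.items()}
for name, Xs in datasets.items():
    print(name, Xs.shape)
```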
Further model selection was also automated: we specified the necessary models and evaluated them against each dataset. The most “valuable” datasets were then selected to build more accurate models through fine parameter tuning and sophisticated architectures such as DNNs and gradient boosting.
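Evaluating every candidate model against every saved dataset can be done with cross-validation. The two toy datasets and two models below are placeholders for the real candidates:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=20, random_state=0)
# Stand-ins for the datasets produced by the feature-selection pipeline.
datasets = {"all_features": X, "first_ten": X[:, :10]}

models = {
    "logreg": LogisticRegression(max_iter=1000),
    "boosting": GradientBoostingClassifier(random_state=0),
}

# Score every (dataset, model) pair, then keep the best pair
# as the starting point for fine tuning.
scores = {
    (dname, mname): cross_val_score(model, Xd, y, cv=3).mean()
    for dname, Xd in datasets.items()
    for mname, model in models.items()
}
best_pair = max(scores, key=scores.get)
print(best_pair, round(scores[best_pair], 3))
```

Ranking datasets by their best cross-validated score is one simple way to decide which ones are "valuable" enough to invest tuning effort in.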
The libraries we used:
- pandas, NumPy for data manipulation
- Scikit-learn for data analysis and processing, feature selection, and clustering
- Scikit-feature for unsupervised feature selection
- SciPy for data analysis
- Matplotlib, Seaborn for visualization
- XGBoost for gradient boosting
- Keras for DNN
- hyperopt for parameter tuning
- imblearn for dealing with imbalanced data