ImpRefIC

Genotype Imputation Reference Panel Intelligent Customization

ImpRefIC is a software tool for genotype imputation in genomics research. It selects a customized reference panel for the target samples in order to achieve higher imputation accuracy. ImpRefIC reads Variant Call Format (VCF) files and converts genotypes into a numerical format suitable for machine learning. It classifies samples with Logistic Regression and handles imbalanced data by upsampling with RandomOverSampler. The results, including the predicted optimal reference population and the corresponding probabilities, are saved in a user-specified directory, and the total running time is reported at the end of the run. Note that a small number of consistent SNPs or limited chromosomal diversity may reduce prediction accuracy.
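To illustrate the upsampling idea mentioned above, here is a minimal sketch of random oversampling in plain Python. It is not ImpRefIC's code and not the imbalanced-learn implementation of RandomOverSampler; the function name `random_oversample` and the toy data are hypothetical, and the sketch only shows the general principle of duplicating minority-class samples until all classes are the same size.

```python
import random
from collections import Counter, defaultdict

def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen samples of the smaller classes until every
    class has as many samples as the largest class (illustration only)."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for x, y in zip(samples, labels):
        by_class[y].append(x)
    target = max(len(xs) for xs in by_class.values())
    out_x, out_y = [], []
    for y, xs in by_class.items():
        # Draw the missing samples with replacement from the class itself.
        extra = [rng.choice(xs) for _ in range(target - len(xs))]
        for x in xs + extra:
            out_x.append(x)
            out_y.append(y)
    return out_x, out_y

# Toy data: three samples of one breed, one sample of another (hypothetical).
X = [[0, 1, 2], [2, 2, 0], [1, 0, 0], [0, 0, 1]]
y = ["Duroc", "Duroc", "Duroc", "Meishan"]
X_res, y_res = random_oversample(X, y)
print(Counter(y_res))
```

After resampling, both classes in the toy data contain the same number of samples, so a classifier trained on them is less likely to favor the larger class.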

The current version of ImpRefIC is dedicated primarily to porcine research. We are working on extending ImpRefIC to support other species.

1.Installation

git lfs clone --recursive https://github.com/klzhang2022/ImpRefIC.git


Requirements

•Python 3 (https://www.python.org)

•Python modules and packages

import re
import sys
import numpy as np
import pandas as pd
import bz2
import gzip
import random
import time
from time import strftime, gmtime
from collections import defaultdict
import joblib
import metrics
from sklearn.metrics import *
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
import warnings

•git

# Download git
cd /usr/local/src/
wget https://www.kernel.org/pub/software/scm/git/git-2.38.0.tar.gz --no-check-certificate
tar -zxvf git-2.38.0.tar.gz
cd git-2.38.0
./configure
make
make install
echo 'export PATH=$PATH:/usr/local/git/bin' >> ~/.bashrc
source ~/.bashrc
git --version
# Set git's http.postBuffer to 52428800000 bytes (50000 MiB) for large transfers over HTTP
git config --global http.postBuffer 52428800000

•git-lfs

# Download git-lfs
cd /usr/local/src/
wget https://github.com/git-lfs/git-lfs/releases/download/v3.2.0/git-lfs-linux-amd64-v3.2.0.tar.gz
tar -zxvf git-lfs-linux-amd64-v3.2.0.tar.gz
cd git-lfs-3.2.0
./install.sh

2.Running

# Change to the ImpRefIC installation directory (ImpRefIC_path)
cd ImpRefIC_path

# Usage: python3 ImpRefIC.py <target VCF> <installation directory> <output path>
python3 ImpRefIC.py ./example/test.vcf.gz ./ ./example

# Arguments
# ./example/test.vcf.gz    compressed VCF file of the target sample
# ./                       ImpRefIC installation directory
# ./example                output path
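The command above takes three positional arguments: the target VCF, the installation directory, and the output path. As a hedged sketch for scripted runs, the helpers below assemble that command line and report missing inputs before a long run starts. The function names `build_command` and `missing_inputs` are hypothetical and not part of ImpRefIC.

```python
from pathlib import Path

def build_command(vcf, install_dir, out_dir, python="python3"):
    """Return the argument list for one ImpRefIC run, in the order used above:
    target VCF, installation directory, output path."""
    script = Path(install_dir) / "ImpRefIC.py"
    return [python, str(script), str(vcf), str(install_dir), str(out_dir)]

def missing_inputs(vcf, install_dir):
    """Return the required paths that do not exist (empty list if all exist)."""
    required = [Path(vcf), Path(install_dir) / "ImpRefIC.py"]
    return [str(p) for p in required if not p.exists()]

cmd = build_command("./example/test.vcf.gz", "./", "./example")
print(" ".join(cmd))
```

The list can then be passed to `subprocess.run(cmd, cwd=install_dir, check=True)`, assuming the relative paths are meant to be resolved from the installation directory as in the example above.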

3.Input file

Compressed VCF file of the target sample (for example, ./example/test.vcf.gz). The VCF format is specified at:

https://samtools.github.io/hts-specs/VCFv4.2.pdf
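As an illustration of what converting genotypes into a numerical format can look like, the sketch below reads a gzip-compressed VCF and encodes each GT call as the number of non-reference alleles (0, 1, or 2 for diploid calls). This encoding is an assumption; the source does not state which numeric encoding ImpRefIC uses. The function names and the demo VCF content are hypothetical.

```python
import gzip
import os
import tempfile

def encode_gt(gt):
    """Map a GT string such as '0/1' or '1|1' to a non-reference allele count.
    Calls with a missing allele ('./.') become None."""
    alleles = gt.replace("|", "/").split("/")
    if any(a == "." for a in alleles):
        return None
    return sum(1 for a in alleles if a != "0")

def read_vcf_genotypes(path):
    """Return (sample_ids, {(chrom, pos): [encoded genotype per sample]})."""
    samples, table = [], {}
    with gzip.open(path, "rt") as handle:
        for line in handle:
            if line.startswith("##") or not line.strip():
                continue  # skip meta-information and blank lines
            fields = line.rstrip("\n").split("\t")
            if line.startswith("#CHROM"):
                samples = fields[9:]  # sample IDs follow the FORMAT column
                continue
            gt_index = fields[8].split(":").index("GT")
            table[(fields[0], int(fields[1]))] = [
                encode_gt(col.split(":")[gt_index]) for col in fields[9:]
            ]
    return samples, table

# Tiny demo file (hypothetical content) written to a temporary directory.
demo = (
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tsample_1\tsample_2\n"
    "1\t1000\t.\tA\tG\t.\tPASS\t.\tGT\t0/0\t0/1\n"
    "1\t2000\t.\tC\tT\t.\tPASS\t.\tGT\t1|1\t./.\n"
)
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.vcf.gz")
    with gzip.open(path, "wt") as out:
        out.write(demo)
    samples, table = read_vcf_genotypes(path)
print(samples, table)
```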

4.Output file

4.1 ImpRefIC.out.population.proba

# probability matrix: one row per target sample, one column for each of the 64 breeds/lines

American_Yorkshire Canadian_Yorkshire Danish_Yorkshire Dutch_Yorkshire French_Yorkshire Unknown_Yorkshire_lines Landrace Duroc Berkshire Goettingen_Minipig Hampshire Iberian Mangalica Pietrain Angler_Sattleschwein British_Saddleback Bunte_Bentheimer Calabrese Casertana Chato_Murciano Cinta_Senese Gloucester_Old_Spot Large_Black Leicoma Linderodsvin Middle_White Nero_Siciliano Tamworth European_Wild_boar Yucatan_minipig Creole American_Wild_boar Bamei Baoshan Enshi_black Erhualian Hetao Jinhua Korean_black_pig Laiwu Meishan Min Neijiang Rongchang Tibetan Tongcheng Hubei_White Daweizi Jiangquhai Leping_Spotted Penzhou songliao_black_pig Taihu Wannan_Spotted Wujin Ya_nan Diannanxiaoer Luchuan Wuzhishan Bamaxiang MiniLEWE Xiang Asia_Wild_boar Hybrid
0.0000 0.0000 0.9999 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
0.0000 0.0000 1.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
0.0000 0.0000 0.9999 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
0.0000 0.0000 0.9999 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
0.0000 0.0000 0.9999 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
0.9988 0.0002 0.0002 0.0001 0.0001 0.0001 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0002
0.9962 0.0003 0.0002 0.0001 0.0002 0.0007 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0018
0.0253 0.0036 0.0564 0.0032 0.0043 0.0644 0.1547 0.0002 0.0017 0.0010 0.0018 0.0010 0.0027 0.0022 0.0051 0.0016 0.0026 0.0012 0.0015 0.0026 0.0009 0.0009 0.0010 0.0018 0.0029 0.0027 0.0077 0.0009 0.0036 0.0010 0.0020 0.0010 0.0006 0.0003 0.0013 0.0000 0.0012 0.0001 0.0013 0.0007 0.0002 0.0037 0.0001 0.0000 0.0009 0.0002 0.0259 0.0002 0.0003 0.0003 0.0022 0.0015 0.0013 0.0003 0.0005 0.0001 0.0002 0.0001 0.0003 0.0001 0.0006 0.0002 0.0006 0.5914
0.0134 0.0171 0.1671 0.0033 0.0057 0.0639 0.2500 0.0002 0.0019 0.0017 0.0012 0.0006 0.0016 0.0029 0.0054 0.0015 0.0012 0.0012 0.0016 0.0022 0.0013 0.0009 0.0013 0.0027 0.0023 0.0035 0.0030 0.0014 0.0031 0.0019 0.0022 0.0013 0.0012 0.0005 0.0042 0.0001 0.0018 0.0001 0.0011 0.0008 0.0005 0.0026 0.0001 0.0001 0.0013 0.0005 0.0641 0.0004 0.0004 0.0004 0.0038 0.0017 0.0022 0.0006 0.0005 0.0002 0.0004 0.0001 0.0009 0.0005 0.0012 0.0003 0.0003 0.3389
0.0209 0.0093 0.0347 0.0013 0.0028 0.0273 0.0732 0.0002 0.0021 0.0013 0.0012 0.0005 0.0017 0.0015 0.0026 0.0011 0.0023 0.0008 0.0011 0.0018 0.0010 0.0009 0.0006 0.0014 0.0019 0.0020 0.0031 0.0007 0.0014 0.0009 0.0016 0.0006 0.0003 0.0002 0.0017 0.0000 0.0011 0.0001 0.0007 0.0007 0.0002 0.0035 0.0001 0.0001 0.0007 0.0002 0.0247 0.0003 0.0003 0.0003 0.0021 0.0009 0.0018 0.0002 0.0004 0.0001 0.0002 0.0000 0.0002 0.0003 0.0006 0.0002 0.0004 0.7536
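The matrix above can be post-processed to find the most probable breed/line for each row. The following is a minimal sketch, assuming the file is whitespace-delimited with a header row of breed/line names; the function name `top_populations` is hypothetical, and the demo input is a four-column excerpt of the example above rather than the full 64-column file.

```python
import io

def top_populations(handle):
    """Yield (breed, probability) for the highest-probability column of each row."""
    breeds = handle.readline().split()
    for line in handle:
        if not line.strip():
            continue
        probs = [float(v) for v in line.split()]
        if len(probs) != len(breeds):
            raise ValueError("row length does not match the header")
        best = max(range(len(probs)), key=probs.__getitem__)
        yield breeds[best], probs[best]

# Excerpt of the example matrix: four of the 64 columns, three of the rows.
demo = io.StringIO(
    "American_Yorkshire Danish_Yorkshire Landrace Hybrid\n"
    "0.0000 0.9999 0.0000 0.0000\n"
    "0.9988 0.0002 0.0000 0.0002\n"
    "0.0253 0.0564 0.1547 0.5914\n"
)
results = list(top_populations(demo))
for breed, prob in results:
    print(breed, prob)
```

For a real run, `open("ImpRefIC.out.population.proba")` would replace the `io.StringIO` demo, under the same format assumption.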

4.2 ImpRefIC.out.population

# target sample prediction results: sample ID; breed/line most similar to each target sample sequence

Sample ID Breed/Line
sample_1 Danish_Yorkshire
sample_2 Danish_Yorkshire
sample_3 Danish_Yorkshire
sample_4 Danish_Yorkshire
sample_5 Danish_Yorkshire
sample_6 American_Yorkshire
sample_7 American_Yorkshire
sample_8 Hybrid
sample_9 Hybrid
sample_10 Hybrid

4.3 ImpRefIC.out.ref.population

# customized reference panel for target samples

Customized reference panel
Danish_Yorkshire
American_Yorkshire
Hybrid
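In the example, the customized reference panel lists each breed/line that appears in the per-sample predictions of 4.2 exactly once. The sketch below reproduces that relationship; it assumes the panel is simply the set of distinct predicted breeds/lines in order of first appearance, which matches the example output but is not confirmed as ImpRefIC's actual rule. The function name `customized_panel` is hypothetical.

```python
def customized_panel(predictions):
    """Return the distinct predicted breeds/lines, in order of first appearance."""
    return list(dict.fromkeys(breed for _, breed in predictions))

# Per-sample predictions copied from the ImpRefIC.out.population example.
predictions = [
    ("sample_1", "Danish_Yorkshire"),
    ("sample_2", "Danish_Yorkshire"),
    ("sample_3", "Danish_Yorkshire"),
    ("sample_4", "Danish_Yorkshire"),
    ("sample_5", "Danish_Yorkshire"),
    ("sample_6", "American_Yorkshire"),
    ("sample_7", "American_Yorkshire"),
    ("sample_8", "Hybrid"),
    ("sample_9", "Hybrid"),
    ("sample_10", "Hybrid"),
]
panel = customized_panel(predictions)
print("\n".join(panel))
```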

4.4 LogisticRegression.pkl

# the trained classification model


4.5 Output log

[INFO] Study samples: 10
[INFO] The chromosomes contained in the target VCF file: 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18
[INFO] Study snps: 44244
[INFO] Consistent snps: 10741
[INFO] SNPs for classification and prediction: 10741

[INFO] Successfully initialize a new model !
[INFO] Training the model ……
[INFO] Model training completed !

===================Confusion Matrix===================
Predicted 0 1 2 3 4 5 6 7 ... 56 57 58 59 60 61 62 63
Actual ...
0 34 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0
1 0 34 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0
2 0 0 33 0 0 0 0 0 ... 0 0 0 0 0 0 0 0
3 0 0 0 27 0 0 0 0 ... 0 0 0 0 0 0 0 0
4 0 0 0 0 30 0 0 0 ... 0 0 0 0 0 0 0 0
... .. .. .. .. .. .. .. .. ... .. .. .. .. .. .. .. ..
59 0 0 0 0 0 0 0 0 ... 0 0 0 33 0 0 0 0
60 0 0 0 0 0 0 0 0 ... 0 0 0 0 31 0 0 0
61 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 29 0 0
62 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 35 0
63 0 0 0 0 0 0 1 0 ... 0 0 0 0 0 0 0 34

[64 rows x 64 columns]
Accuracy = 0.9995
Precision = 0.9995
Recall = 0.9995
F1 = 0.9995
[INFO] Model has been saved to /test/LogisticRegression.pkl
[INFO] The model starts predicting the target file ……
[INFO] Prediction complete ! The predicted frequencies and predicted optimal reference population have been saved to /test
[INFO] Total time consumption is 00:36:27
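The accuracy, precision, recall, and F1 values in the log can all be derived from a confusion matrix of the kind printed above (rows are actual classes, columns are predicted classes). The sketch below shows one way to do this with macro averaging. The averaging scheme is an assumption, since the log does not say which one ImpRefIC uses; the function name `summarize` and the 3x3 demo matrix are hypothetical.

```python
def summarize(confusion):
    """Compute accuracy and macro-averaged precision, recall, and F1 from a
    square confusion matrix (rows = actual class, columns = predicted class)."""
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[k][k] for k in range(n))
    precisions, recalls, f1s = [], [], []
    for k in range(n):
        tp = confusion[k][k]
        predicted_k = sum(confusion[i][k] for i in range(n))  # column sum
        actual_k = sum(confusion[k])                          # row sum
        p = tp / predicted_k if predicted_k else 0.0
        r = tp / actual_k if actual_k else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(2 * p * r / (p + r) if p + r else 0.0)
    return {
        "accuracy": correct / total,
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }

# Small demo matrix with a single misclassification (class 2 predicted as 0).
demo = [
    [34, 0, 0],
    [0, 33, 0],
    [1, 0, 34],
]
scores = summarize(demo)
for name, value in scores.items():
    print(f"{name} = {value:.4f}")
```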

❋Note: too few consistent SNPs or insufficient chromosomal diversity can result in inaccurate predictions.❋