The PAF and SAM Formats Explained

1. PAF format

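PAF is the default output format of minimap2. Each record has at least 12 tab-separated columns:

| Col | Type | Description |
|---|---|---|
| 1 | string | Query sequence ID |
| 2 | int | Query sequence length |
| 3 | int | Query alignment start (0-based; closed) |
| 4 | int | Query alignment end (0-based; open) |
| 5 | char | Relative strand: "+" if query and target are on the same strand, "-" if on opposite strands |
| 6 | string | Target sequence ID |
| 7 | int | Target sequence length |
| 8 | int | Target alignment start |
| 9 | int | Target alignment end |
| 10 | int | Number of matching bases |
| 11 | int | Alignment block length (including gaps) |
| 12 | int | Mapping quality (0-255; 255 means missing) |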
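To make the column layout concrete, here is a minimal sketch that splits one PAF line into the 12 mandatory columns. The record itself and the names `PafRecord` and `parse_paf_line` are made up for illustration; they are not part of minimap2.

```python
from typing import NamedTuple


class PafRecord(NamedTuple):
    """The 12 mandatory PAF columns (field names are illustrative)."""
    query_name: str
    query_length: int
    query_start: int
    query_end: int
    strand: str
    target_name: str
    target_length: int
    target_start: int
    target_end: int
    matches: int
    block_length: int
    mapq: int


def parse_paf_line(line: str) -> PafRecord:
    """Split a tab-separated PAF line and convert the numeric columns."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) < 12:
        raise ValueError(f"expected at least 12 columns, got {len(cols)}")
    return PafRecord(
        cols[0], int(cols[1]), int(cols[2]), int(cols[3]), cols[4],
        cols[5], int(cols[6]), int(cols[7]), int(cols[8]),
        int(cols[9]), int(cols[10]), int(cols[11]),
    )


# Made-up record, for illustration only
example = "read1\t1000\t10\t990\t+\tchr1\t50000\t2000\t2985\t950\t990\t60\ttp:A:P"
rec = parse_paf_line(example)

# Start is closed and end is open, so the aligned length is end - start
aligned_query_bases = rec.query_end - rec.query_start
# Column 10 / column 11 gives a simple identity estimate for the block
identity = rec.matches / rec.block_length
```

Dividing column 10 by column 11 is a common quick estimate of alignment identity; anything after column 12 is left untouched here.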

Tag descriptions:

| Tag | Type | Description |
|---|---|---|
| tp | A | Type of alignment: P/primary, S/secondary and I,i/inversion |
| cm | i | Number of minimizers on the chain |
| s1 | i | Chaining score |
| s2 | i | Chaining score of the best secondary chain |
| NM | i | Total number of mismatches and gaps in the alignment |
| MD | Z | To generate the ref sequence in the alignment |
| AS | i | DP alignment score |
| SA | Z | List of other supplementary alignments |
| ms | i | DP score of the max scoring segment in the alignment |
| nn | i | Number of ambiguous bases in the alignment |
| ts | A | Transcript strand (splice mode only) |
| cg | Z | CIGAR string (only in PAF) |
| cs | Z | Difference string |
| dv | f | Approximate per-base sequence divergence |
| de | f | Gap-compressed per-base sequence divergence |
| rl | i | Length of query regions harboring repetitive seeds |
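The tags follow the TAG:TYPE:VALUE layout, where the type codes in the table are A (character), i (integer), f (float) and Z (string). A minimal sketch of converting them into a Python dictionary follows; the function name `parse_tags` and the sample values are made up for illustration.

```python
def parse_tags(fields):
    """Turn a list of TAG:TYPE:VALUE strings into a dict with typed values."""
    tags = {}
    for field in fields:
        # Split at most twice: the value itself may contain ':' (e.g. cs strings)
        tag, type_code, value = field.split(":", 2)
        if type_code == "i":
            tags[tag] = int(value)
        elif type_code == "f":
            tags[tag] = float(value)
        else:
            # 'A' (single character) and 'Z' (string) are kept as text
            tags[tag] = value
    return tags


# Made-up optional columns, for illustration only
example_fields = ["tp:A:P", "cm:i:120", "dv:f:0.0123", "cg:Z:980M", "cs:Z::6-ata:10+gtc"]
tags = parse_tags(example_fields)
```

Limiting the split to two separators matters for the `cs` tag, whose value uses ":" internally.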

2. SAM format

A SAM file has two parts, a header section and an alignment section. Fields in both are tab-separated.


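1. Header section: lines start with "@" and carry overall information about the alignment, such as the SAM format version, the reference sequences, and the software used.

2. Alignment section: the alignment results, one record per line, with 11 mandatory columns followed by an optional column.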

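Columns of the alignment section:

| Col | Field | Description |
|---|---|---|
| 1 | QNAME | Query sequence ID |
| 2 | FLAG | Bitwise FLAG (encodes the alignment type: pairing, strand, mate strand, etc.), e.g. 0, 99, 256, 2048 |
| 3 | RNAME | Target (reference) sequence ID |
| 4 | POS | Position on the reference sequence, counted from 1; 0 if unmapped |
| 5 | MAPQ | Mapping quality (the higher the value, the more reliable the mapped position). A value of 255 means the mapping quality is not available. |
| 6 | CIGAR | Compact string describing the alignment |
| 7 | MRNM | Name of the reference sequence the next segment (mate) maps to; called RNEXT in the current SAM specification |
| 8 | MPOS | Position of the next segment (mate); 0 if unavailable; called PNEXT in the current SAM specification |
| 9 | ISIZE | Insert size; called TLEN in the current SAM specification |
| 10 | SEQ | Read sequence on the same strand as the reference (if the read aligned to the reverse strand, this is its reverse complement) |
| 11 | QUAL | Base qualities of the read (ASCII code minus 33 gives the Phred base quality) |
| 12 | Optional fields | Extra information in the form TAG:TYPE:VALUE |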
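As a rough illustration of these columns, the sketch below splits one alignment line into the 11 mandatory fields plus optional fields, and decodes QUAL with the "ASCII minus 33" rule. The record and the function names are made up for illustration; for real work a library such as pysam is the usual choice.

```python
MANDATORY_FIELDS = [
    "QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
    "MRNM", "MPOS", "ISIZE", "SEQ", "QUAL",
]
INT_FIELDS = {"FLAG", "POS", "MAPQ", "MPOS", "ISIZE"}


def parse_sam_line(line):
    """Split one SAM alignment line into a dict of the 11 mandatory fields."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) < 11:
        raise ValueError(f"expected at least 11 columns, got {len(cols)}")
    record = {}
    for name, value in zip(MANDATORY_FIELDS, cols):
        record[name] = int(value) if name in INT_FIELDS else value
    # Everything after column 11 is an optional TAG:TYPE:VALUE field
    record["OPTIONAL"] = cols[11:]
    return record


def phred_scores(qual):
    """Convert a QUAL string to Phred scores (ASCII code minus 33)."""
    if qual == "*":  # '*' means the qualities were not stored
        return []
    return [ord(ch) - 33 for ch in qual]


# Made-up record, for illustration only
example = "read1\t0\tchr1\t101\t60\t4M\t*\t0\t0\tACGT\tII5!\tNM:i:0"
rec = parse_sam_line(example)
scores = phred_scores(rec["QUAL"])
```

Header lines (those starting with "@") would need to be skipped before calling `parse_sam_line`.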

1. The meaning of each FLAG value is given in the following table

[Image: table of FLAG meanings]
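Because FLAG is a bitwise field, a value such as 99 is the sum of several individual bits. The sketch below decodes a FLAG using the bit definitions from the SAM specification; the function name `decode_flag` and the short descriptions are illustrative wording.

```python
# Bit values as defined in the SAM specification
FLAG_BITS = {
    0x1: "read paired",
    0x2: "read mapped in proper pair",
    0x4: "read unmapped",
    0x8: "mate unmapped",
    0x10: "read reverse strand",
    0x20: "mate reverse strand",
    0x40: "first in pair",
    0x80: "second in pair",
    0x100: "secondary alignment",
    0x200: "fails quality checks",
    0x400: "PCR or optical duplicate",
    0x800: "supplementary alignment",
}


def decode_flag(flag):
    """Return the descriptions of all bits that are set in a SAM FLAG."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]


# 99 = 1 + 2 + 32 + 64
flag_99 = decode_flag(99)
```

With these definitions, 0 means a mapped, unpaired read on the forward strand, 256 marks a secondary alignment, and 2048 marks a supplementary alignment.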

2. CIGAR string, a compact expression of the alignment; an example is shown in the following figure

[Image: CIGAR string example]
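A CIGAR string is a sequence of length/operation pairs. As a minimal sketch, the code below splits a CIGAR into its operations and sums the bases consumed on the reference and on the query, following the operation definitions in the SAM specification. The sample string and the function names are made up for illustration.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
# Operations that advance along the reference / along the query
CONSUMES_REF = set("MDN=X")
CONSUMES_QUERY = set("MIS=X")


def parse_cigar(cigar):
    """Split a CIGAR string into (length, operation) pairs."""
    if cigar == "*":  # '*' means the CIGAR is unavailable
        return []
    if not re.fullmatch(r"(?:\d+[MIDNSHP=X])+", cigar):
        raise ValueError(f"malformed CIGAR: {cigar!r}")
    return [(int(length), op) for length, op in CIGAR_OP.findall(cigar)]


def cigar_lengths(cigar):
    """Return (reference bases covered, query bases consumed)."""
    ref_len = query_len = 0
    for length, op in parse_cigar(cigar):
        if op in CONSUMES_REF:
            ref_len += length
        if op in CONSUMES_QUERY:
            query_len += length
    return ref_len, query_len


# 5 soft-clipped, 90 aligned, 2 inserted, 3 deleted, 8 aligned
ops = parse_cigar("5S90M2I3D8M")
ref_len, query_len = cigar_lengths("5S90M2I3D8M")
```

For this example the alignment covers 90 + 3 + 8 = 101 reference bases, while the read contributes 5 + 90 + 2 + 8 = 105 bases.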

Parts of this content are adapted from: https://blog.sciencenet.cn/blog-994715-1341509.html

Installing Augustus

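Augustus is essential software for gene prediction, and building it from source is the hardest installation I have run into so far, bar none. After three days I finally got it working, more or less.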

1. Download the latest Augustus release and unpack it
wget -c https://github.com/Gaius-Augustus/Augustus/releases/download/v3.4.0/augustus-3.4.0.tar.gz
tar xzvf augustus-3.4.0.tar.gz
cd augustus-3.4.0/

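2. Install the dependencies as described in README.md. The package names listed there are for Debian/Ubuntu; on CentOS, install the corresponding packages.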

## Install dependencies

The following dependencies are required for AUGUSTUS:
- For gzip compressed input:
 (set ZIPINPUT = false in [common.mk](common.mk) if this feature is not required or the required libraries are not available)
  - libboost-iostreams-dev
  - zlib1g-dev
- For [comparative AUGUSTUS](docs/README-cgp.md) (multi-species, CGP):
  (set COMPGENEPRED = false in [common.mk](common.mk) if the libraries required by the CGP version are not available. Augustus can then only be run in single-genome mode, which is what most users need.)
  - libgsl-dev
  - libboost-all-dev
  - libsuitesparse-dev
  - liblpsolve55-dev
  - libsqlite3-dev (add SQLITE = false to [common.mk](common.mk) if this feature is not required or the required library is not available)
  - libmysql++-dev (add MYSQL = false to [common.mk](common.mk) if this feature is not required or the required library is not available)
- For compiling bam2hints and filterBam:
  - libbamtools-dev
- For compiling utrrnaseq:
  - libboost-all-dev (version must be >Boost_1_49_0)
- For compiling bam2wig:
  - Follow [these instructions](./auxprogs/bam2wig/README.md). Note that it shouldn't be a problem to compile AUGUSTUS without bam2wig. In practice, you can simply use `bamToWig.py` to accomplish the same task.
- For compiling homgenemapping
  (set BOOST = FALSE in [./auxprogs/homgenemapping/src/Makefile](./auxprogs/homgenemapping/src/Makefile) if the option --printHomologs is not required or the required libraries are not available)
  - libboost-all-dev

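3. Later on, make kept failing with "/usr/bin/ld: cannot find -lmysqlclient" even though mysql and mysql++ were already installed, so the only option was to "add MYSQL = false to common.mk". Then append the locations of the suitesparse and htslib headers and libraries to the end of common.mk. If you do not know where they are, use the find or locate command to search for suitesparse and htslib.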

MYSQL = false
INCLUDE_PATH_SUITESPARSE := -I/usr/include/suitesparse
LIBRARY_PATH_SUITESPARSE := -L/usr/lib64 -Wl,-rpath,/usr/lib64

INCLUDE_PATH_HTSLIB   := -I/usr/local/include/htslib
LIBRARY_PATH_HTSLIB   := -L/usr/local/lib -Wl,-rpath,/usr/local/lib

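4. On CentOS, bamtools has to be installed manually, and jsoncpp must be installed first. bamtools can be installed under the current user's home directory; I installed it in "~/local/bamtools". Remember this path, because it is needed again later.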

# Save the archive under the name that the tar step expects
wget -O bamtools-2.5.2.tar.gz https://github.com/pezmaster31/bamtools/archive/refs/tags/v2.5.2.tar.gz
tar xzvf bamtools-2.5.2.tar.gz
cd bamtools-2.5.2/
# BamTools also makes use of JsonCpp for certain serialization tasks (yum needs root privileges)
yum install jsoncpp-devel.x86_64
mkdir build
cd build/
mkdir -p ~/local/bamtools
cmake -DCMAKE_INSTALL_PREFIX="$HOME/local/bamtools" ..
make && make install

Edit the Makefiles under augustus-3.4.0/auxprogs/bam2hints and augustus-3.4.0/auxprogs/filterBam/src so that they point to the location of the bamtools library.

Now bamtools should be correctly installed. Next, we need to modify the Makefiles of bam2hints and filterBam so that they use our manually installed bamtools.

First, go to the “augustus-3.4.0/auxprogs/bam2hints” directory and make the following changes for the Makefile:

Add:
   BAMTOOLS = ~/local/bamtools

Replace:
   INCLUDES = /usr/include/bamtools
By:
   INCLUDES = $(BAMTOOLS)/include/bamtools

Replace:
   LIBS = -lbamtools -lz
By:
   LIBS = $(BAMTOOLS)/lib64/libbamtools.a -lz

Then, go to the “augustus-3.4.0/auxprogs/filterBam/src” directory and make the following changes for the Makefile:

Replace:
BAMTOOLS = /usr/include/bamtools
By:
BAMTOOLS = ~/local/bamtools

Replace:
INCLUDES = -I$(BAMTOOLS) -Iheaders -I./bamtools
By:
INCLUDES = -I$(BAMTOOLS)/include/bamtools -Iheaders -I./bamtools

Replace:
LIBS = -lbamtools -lz
By:
LIBS = $(BAMTOOLS)/lib64/libbamtools.a -lz

Now we are finally ready to compile Augustus. Go back to the "augustus-3.4.0" directory and run "make BAMTOOLS=~/local/bamtools", voilà!
make BAMTOOLS=~/local/bamtools

If you see "bam2wig.c:12:10: fatal error: bgzf.h: No such file or directory" and "bam2wig.c:18:17: fatal error: sam.h: No such file or directory", use the find or locate command to look up where "bgzf.h" and "sam.h" are, then change the paths of these two headers in "auxprogs/bam2wig/bam2wig.c", for example:

#include "/usr/local/include/htslib/bgzf.h"
#include "/usr/local/include/htslib/sam.h"
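If find or locate is not convenient, the same header search can be sketched in a few lines of Python. This is only an illustration: the function name `find_files` is made up, and the directories to search are an assumption that depends on where htslib was installed on your system.

```python
import os


def find_files(roots, names):
    """Walk each root directory and return sorted paths of files whose name matches."""
    wanted = set(names)
    hits = []
    for root in roots:
        # os.walk silently yields nothing for a directory that does not exist
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                if name in wanted:
                    hits.append(os.path.join(dirpath, name))
    return sorted(hits)
```

For example, `find_files(["/usr/include", "/usr/local/include"], ["bgzf.h", "sam.h"])` lists every matching header under those two directories; the paths it prints are the ones to put in the `#include` lines.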

Run again:

make BAMTOOLS=~/local/bamtools

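If no further errors appear, congratulations, the installation has succeeded. To make later use easier, add the programs you need to the PATH environment variable.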

export PATH=$PATH:~/soft/augustus-3.4.0/bin:~/soft/augustus-3.4.0/scripts

Try running it to check whether a suitable reference species is available:

augustus --species=help

If you run into other errors, you can refer to this blog post.