2017-09-03

メタゲノムデータを種レベルで検出し割合を計算するmOTUとfetch-MG

追記9/5；ソフト名や使い方を勘違いしておりましたので修正します。

　環境サンプル中の種の多様性を評価する手法として16S rRNA遺伝子を特異的に増幅する手法がよく知られているが、種によっては配列の異なるrRNA遺伝子を複数持つことがある。ここにPCR増幅のbiasもかかってくるため、16S rRNAだけでメタゲノムデータを評価すると、特に近縁種間の誤差が大きくなる可能性がある。

最近ではNGSを使った環境サンプルの全ゲノムシーケンスが選択肢として選べるようになった。しかしながら、メタゲノムシーケンスから菌叢の定量を行うと、遺伝子のコピー数の違いからbiasが非常に大きくなる。そのため、高精度な定量を行うには生物全てが保有しているようなユニバーサルでシングルコピーな（つまり滅多に水平伝達で複製しない）遺伝子セットを使ったメタゲノムデータの定量が必要になってくる。

これまでの研究で、シングルコピーでユニバーサルな遺伝子セットが見出されてきた（Ciccarelli et al., Science, 2006; Sorek et al., Science, 2007; von Mering et al., Science, 2007）。fetch-MGは先行研究で見出されたシングルコピーでユニバーサルな40遺伝子を使い、メタゲノムデータから菌の種類と割合を計算・出力する。

公式サイト

http://www.bork.embl.de/software/mOTU/index.html

チュートリアル

http://www.bork.embl.de/software/mOTU/tutorial.motu.standalone.html

インストール

mOTUとfetch-MGのダウンロードリンクもチュートリアルに存在している。

fetch-MG自体はmacで動作するが、bin/の実行ファイルがmacで動作しなかったのでcent OSにインストールした。

追記

condaを使う

#bioconda (link)
conda install -c bioconda -y motus

実行方法

fetch-MGを使い40のMGs（protein-coding phylogenetic marker genes）を抽出する。

./fetchMG.pl -m extraction -x bin example_dataset/example_data.faa

-t　Number of processors/threads to be used
-o　Output directory; default = 'output'
-h　Path to directory that contains hmm models; default = './lib'
-p　Set if nucleotide sequences for filename.faa is not available
-d　Fasta file with DNA sequences of the same genes; not neccesary if protein file and dna file have the same with .faa and .fna suffixes

入力はfaaファイル（ただし40のMGsが抽出できる前提で動作しているので、ゲノムの全タンパク質配列が揃っていないと正しく動作しない）。またdna配列（同じ名前で与える必要がある。e.g., example.fna、example.faa）があれば、該当するタンパク質の塩基配列も出力してくれる。詳細はexample_datasets/を参照。

作成中。

ここから下は下書きです。

ペアードエンドfastqを指定してラン（非圧縮かgz圧縮のfastqに対応）。

./motus.pl sample.1.fq sample.2.fq

--verbose Be more verbose while running the analysis
--processors=N This should be an integer and defines the number of processors that the script will use.
--length-cutoff=L The minimum size per read (after quality-based trimming), default: 45.
--identity-cutoff=I Minimum percentage identity in alignment (default: 97) --quality-cutoff=Q Basepair quality cutoff (default: 20)
--fastq-format The format of the input files. Must be one of 'auto' (the default), 'sanger', or 'illumina'. Note that new Illumina machines actually use the 'sanger' format. The auto-detection should generally work well.
--output-directory Where to place the final results file (by default it uses a directory named ``RESULTS``).

RESULULTSに解析結果が出力される。検出された菌の種名と割合が出力されている。

f:id:kazumaxneo:20170903161302j:plain

複数fastqを解析する場合、fastqをそれぞれのディレクトリに入れて、其のパスを記載したファイル（１行ずつ記載）を指定してランする。

./mOTUs.pl --sample-file input.txt

引用

Metagenomic species profiling using universal phylogenetic marker genes.

Sunagawa S1, Mende DR, Zeller G, Izquierdo-Carrasco F, Berger SA, Kultima JR, Coelho LP, Arumugam M, Tap J, Nielsen HB, Rasmussen S, Brunak S, Pedersen O, Guarner F, de Vos WM, Wang J, Li J, Doré J, Ehrlich SD, Stamatakis A, Bork P.

Nat Methods. 2013 Dec;10(12):1196-9. doi: 10.1038/nmeth.2693. Epub 2013 Oct 20.

ユニバーサルな遺伝子に関する先行研究

メタゲノムデータをbinningして出力可能なGUIアプリ VizBin

metagenome 結果の視覚化 (visualization) GUIツール 2015 Microbiome (Journal) Binning (metagenomics)

2019 7/5文章修正

VizBinはメタゲノムデータをレファレンスに依存せずにbinnigする手法。テトラヌクレオタイド頻度情報を使いアセンブルデータを分類する。最終的に２次元のPCAプロットとしてビジュアル化してくれる。どこからどこまでを１つの生物として抽出するかは、人間が目で見て決定する。論文にはThe human eye-brain system for cluster identificationと書かれている。複雑なメタゲノムでは使いにくいが、比較的な単純な（low complexitiesな）メタゲノムデータであれば目で見て分類は可能であり、面白いツールである。

アプリはJAVAで構築されている。

公式ページ

Download appから本体をダウンロード可能。

Tutorial

Tutorial · claczny/VizBin Wiki · GitHub

テストデータ

example dataset AMFJ01

実行方法

ビルド済みのjava実行ファイルが配布されている。これを叩いて起動する。

> java -jar VizBin-dist.jar

File to visualiseにcontigファイルを入力する。あらかじめメタゲノム用のアセンブラでcontigは作っておく必要がある。

f:id:kazumaxneo:20170902204406j:plain

計算用のthreadsを増やして計算を高速化することもできる。

　↓

startをクリック。抽出されたkmerの数が表示される。

f:id:kazumaxneo:20170902204541j:plain

　↓

しばらく待つ。

　↓

結果が表示された。

f:id:kazumaxneo:20170902205020j:plain

　↓

trackpad（２本指で上下にドラッグ）やマウスのホイールで拡大縮小が可能。

　↓

クリックして領域を囲むと、囲った領域のプロットを全て含むfastaが出力可能。

f:id:kazumaxneo:20170902205302j:plain

　↓

exportを選択。

f:id:kazumaxneo:20170903122240j:plain

　↓

fataが出力される。選択を間違えた時は右クリック->Selection->clear selectionでやり直しできる。

contigのサイズ、GCなども読み込める。exampleは以下のようになっている。

f:id:kazumaxneo:20170903123909j:plain

最初にヘッダー行が必要である。coverage、length、label、isMarkerなどの名前が認識に必要。contigの順番はfastaの順番と合致している必要がある。

入力する時はadvanxed modeに切り替える。show advanxed optionsをクリック。

f:id:kazumaxneo:20170903130235j:plain

　↓

aanotton fileの項目をクリック。

f:id:kazumaxneo:20170903130408j:plain

他にもいくつかのパラメータを変更可能である。詳細は公式マニュアルを参照（一番下に説明があります）。

引用

VizBin - an application for reference-independent visualization and human-augmented binning of metagenomic data

Cedric C Laczny, Tomasz Sternal, Valentin Plugaru, Piotr Gawron, Arash Atashpendar, Houry Hera Margossian, Sergio Coronado, Laurens van der Maaten, Nikos Vlassis, Paul Wilmes

Microbiome 2015 3:1

2017-09-01

複数のcontigをマージしアセンブリの連続性を改善する Mix

assembly contigのscaffolding 2013 BMC Bioinformatics genome assembly reconciliation

2019 6/11 追記

ゲノムアセンブリ構築の利点を得ることを妨げる課題の中には、未完成のアセンブリおよびその後の実験的な費用の両方がある。第一に、ゲノムデノボアセンブリのための多数のソフトウェアソリューションが利用可能であり、それぞれがその長所と短所を有し、それらの中からどのように選択するかに関する明確な指針がない。次に、これらのソリューションではドラフトアセンブリが作成され、多くの場合リソース集約型の仕上げフェーズが必要になる。

　本論文では、リファレンスゲノムに頼らずにコンティグ断片化を減らし、ゲノムの仕上げをスピードアップするという目標を持たずに、2つ以上のドラフトアセンブリをミックスするツールであるMixを開発することによってこれら2つの側面に対処する。提案したアルゴリズムは、頂点が連続体の末端を表し、辺がこれらの末端間の既存のアライメントを表す拡張グラフを構築する。これらのアライメントエッジはコンティグエクステンションに使用される。結果の出力アセンブリは、累積コンティグ長を最大化する拡張グラフ内の一連のパスに対応する。

　GAGE-B研究からの細菌NGSデータに対するMixの性能を評価し、そしてそれを新しくシーケンシングされたマイコプラズマゲノムに適用する。結果として得られる最終アセンブリは、全体的なアセンブリ品質の大幅な向上を示していた。特にMixは、de novoプロジェクトの場合のように、標準的なアセンブリ統計のみによって選択が導かれる場合でも、より良い全体的な品質結果を提供することで一貫していた。

Mixはバクテリア向けに設計された、複数のconitgをマージしてより長いcontigを作る方法論。うまく使えば、細分化されたcontigをからより長いcontigを作ることができる。

インストール

依存

python 2.7 (http://www.python.org/)
python-networkx 1.6-2 (http://networkx.lanl.gov/)
Biopython (http://biopython.org/wiki/Main_Page)
MUMmer 3.22 (http://mummer.sourceforge.net/).

MuMmerはbrewで、Biopythonとnetworkxはpipでそれぞれ導入できる（注　nucmer v４ではエラーを起こした）。

本体　Github

解凍してbin/のMix.pyにパスを通しておく。

実行方法

３つのcontigを合体する。

preprocessing.py -o contigs.fa Assembly1.fa Assembly2.fa Assembly3.fa

#catでも動く
cat Assembly1.fa Assembly2.fa Assembly3.fa > merged.fasta

３つの contig.faがコンカテネートして１つのfastaファイル（contigs.fa）ができる。

あとは以下の３つのコマンドを実行する。

nucmer --maxmatch -c 30 -l 30 -banded -prefix=alignments contigs.fa contigs.fa
show-coords -rcl alignments.delta > alignments.coords 
Mix.py -a alignments.coords -c contigs.fa -o output_dir/ -C 300 -A 200

Mixのflag一覧を貼っておく。



Options:

  -h, --help            show this help message and exit

  -a ALN, --aln=ALN     the file containing the alignments (.coords)

  -o OUT, --out=OUT     the output directory where the scaffolds will be

                        written (it must already exist)

  -c CTG, --ctg=CTG     the multi FASTA file containing all the contigs that

                        were used in the alignment

  -A ATH, --ath=ATH     minimum length of alignment (default:0)

  -C CTH, --cth=CTH     minimum length of contig (default:500)

  -d, --dot             write the graphs in dot format

  -g, --graph           write the graphs in cytoscape format

  -r, --restrict-to-aligned

                        If on, restrict output to aligned coords

output_dir/Mix_results_A200_C300/にいくつかのファイルができる。

user$ ls -lh output_dir/Mix_results_A200_C300/*

-rw-r--r-- 1 user staff 2.7M 9 1 22:41 output_dir/Mix_results_A200_C300/Mix_assembly.fasta

-rw-r--r-- 1 user staff 1.0M 9 1 22:41 output_dir/Mix_results_A200_C300/Mix_assembly_ctg.fasta

-rw-r--r-- 1 user staff 1.7M 9 1 22:41 output_dir/Mix_results_A200_C300/Mix_assembly_path.fasta

-rw-r--r-- 1 user staff 0B 9 1 22:41 output_dir/Mix_results_A200_C300/Mix_selected_isolated_contigs.csv

-rw-r--r--@ 1 user staff 433K 9 1 22:41 output_dir/Mix_results_A200_C300/all_alignments.csv

-rw-r--r--@ 1 user staff 28K 9 1 22:41 output_dir/Mix_results_A200_C300/all_contigs.csv

-rw-r--r-- 1 user staff 22K 9 1 22:41 output_dir/Mix_results_A200_C300/initial_assembly_graph.gml

-rw-r--r-- 1 user staff 26K 9 1 22:41 output_dir/Mix_results_A200_C300/reduced_assembly_graph.gml

Mix_assembly.fastaがmixされて長くなったcontig.faである。

テスト

edena、SOAPdenovo2、Abyssで作ったcontigをMixにかけてみる。

preprocessing.py -o contigs.fa edena.fa abyss_k77.fa SOAPdenovo_k51.fasta
nucmer --maxmatch -c 30 -l 30 -banded -prefix=alignments contigs.fa contigs.fa
show-coords -rcl alignments.delta > alignments.coords 
Mix.py -a alignments.coords -c contigs.fa -o output_dir/ -C 300 -A 200

quastで評価する。

quast edena.fa abyss_k77.fa SOAPdenovo_k51.fasta Mix_assembly.fasta -R bacteria.fa -o results -t 20

紫がMix.fa、赤がAbyss(k=77)

f:id:kazumaxneo:20170901230632j:plain

一番下がMix.fa。

f:id:kazumaxneo:20170901230652j:plain

contigの断片化が大きく解消されていることがわかる。一方で、図をよく見るとよくわからない部位で切れてもいる。原因は分からない。

ただし、アセンブルが止まるのはリピートだったり何かしら理由があるのが普通です。こういったツールを使うとアーティファクトができてくる恐れもあるので注意して使ってください。

quastは以下で紹介しています。

引用

Finishing bacterial genome assemblies with Mix

Hayssam Soueidan, Florence Maurier, Alexis Groppi, Pascal Sirand-Pugnet, Florence Tardy, Christine Citti, Virginie Dupuy and Macha Nikolski

BMC Bioinformatics 2013 14(Suppl 15):S16

de novoアセンブルしてバリアントをコールするDISCOVAR

small indel assembly structural variations (SV) low complexity 2014 Nature Genetics

DISCOVARは2014年にNature geneticsに載ったバリアントを検出する方法論。シーケンスデータをアセンブルして、バリアントをコールする。ヒトゲノムの構造変化は90%ほどは既存のツールで検出可能だが、残りの構造変化（low-complexity sequenceやsegmental duplications）の検出は困難とされる。DISCOVARはそうした難しい構造変化を検出するために構築されたと書かれている。ただし現在のバージョンではトリッキーな方法を使わないと50-Mb以上のゲノムをアセンブルすることはできない。どちらかといえばスモールゲノム向けのツールになっている。

バリアント検出の過程でアセンブルを行うため、アセンブルに用いることも可能である（DISCOVAR de novo）。

公式サイト

https://software.broadinstitute.org/software/discovar/blog/?page_id=98

マニュアル

https://docs.google.com/document/d/1U_o-Z0dJ0QKiJn86AV2o_YHiFzUtW9c57eh3tYjkINc/edit

入力はilluminaでシーケンスした250-bp以上でインサートサイズが450くらいのペアリードfastq。サンプルはPCR freeのプロトコルで調整されたデータが望ましいとされる。カバレッジも条件があり、60前後がベストと書かれている。条件を箇条書きしておく。

Illumina MiSeq or HiSeq 2500 genome sequencers
PCR-free library preparation
250 base paired end reads (or longer)
~450 base pair fragment size
~60x coverage

アセンブルできるサイズは50-Mbまでに制限されている。

インストール

依存

GCC
SAMtools
GraphViz
Picard

全てbrewで導入できる。

brewでインストールする。macでもbrewでインストールできるが、原因不明の理由で失敗する。ソースからビルドするためのマニュアルもリンクが消えていたので、cent OSにbrewを使いインストールした。

brew install DISCOVAR

実行方法

１、リファンレンスを準備。

PrepareDiscovarGenome REF=reference.fasta

faiファイルといくつか関連するファイルができる。

２、bamにマッピングされたペアリードを抽出し、de novoアセンブルを実行する。その後バリアントをコールする。

Discovar READS=bam OUT_HEAD=asssembly REGIONS=all REFERENCE=reference.fa TMP=temp

コンマセパレートで複数bamを入力できる。READS=filename1,filename2,...

アセンブルがが終わるとassembly~がいくつかできる。（アセンブルできるサイズは50-Mbまで）。REGIONS=all にするとかなりのメモリが要求される。カバレッジx100のゲノム（４メガ）でランすると、peak memory uisageは114GBに達した。

領域を指定するにはREGIONS=1:50000-150000などと記載する（chr1の50000-150000の領域を探索）。regionは複数指定できるようになっている（e.g., REGIONS=chr:start-end,chr:start-end,...）。

３、アセンブルのgraphを可視化する（オプション）

dot -Tps -o assembly.final.ps assembly.final.dot -v
gv assembly.final.ps

公式マニュアルでは、大きなゲノムを解析するために、少しづつ領域をオーバーラップしながらランする方法が書かれています。興味がある人は確認してみてください。アセンブルのgraphについての勉強にもなると思います。

de novo アセンブルによる構造変化検出ツールとして2015年にpublishされたfermikitなどがあります。

引用

Comprehensive variation discovery in single human genomes

Neil I Weisenfeld, Shuangye Yin, Ted Sharpe, Bayo Lau, Ryan Hegarty, Laurie Holmes, Brian Sogoloff, Diana Tabbaa, Louise Williams, Carsten Russ, Chad Nusbaum, Eric S Lander, Iain MacCallum & David B JaffeNeil I Weisenfeld, Shuangye Yin, Ted Sharpe, Bayo Lau, Ryan Hegarty, Laurie Holmes, Brian Sogoloff, Diana Tabbaa, Louise Williams, Carsten Russ, Chad Nusbaum, Eric S Lander, Iain MacCallum & David B Jaffe

Nature Genetics 46, 1350–1355 (2014) doi:10.1038/ng.3121

2017-09-01

バクテリアやアーキアの遺伝子を予測するProdigal（メタゲノムデータセットにも対応）

annotation bacteria 2010 BMC Bioinformatics metagenome ab initio gene prediction 高速なツール食中毒

2019 5/8 インストール追記

2021 7/13 help更新

2022/07/20 追記

2023/08/23 追記

ProdigalはDynamic Programmingの方法論により効率的にバクテリアやアーキアの遺伝子を探すツール。既存の方法は様々存在するが、本手法はまずインプットゲノムを分析してモデルを構築し、それから遺伝子を予測することで、false positiveを抑えtrue callを増やすよう工夫されている。

Prodicalはドラフトゲノムやメタゲノムを扱うことが可能である。動作が非常に高速なのも特徴で、E.coliゲノムだと10秒程度で解析することが可能である。2010年に論文が発表され、2014年時点で600以上引用されている。

本手法はイントロンを含む遺伝子の予測やアノテーションは行われない。また、挿入や欠損によるフレームシフトは考慮されないので、高精度な予測を行うためには、予測に使うゲノムのエラーは入念に取り除かれていなければならない。ウィルスでの動作は検証されていないが、動作はするとされる。最新版はProdigal2で、今後Prodigal3へのアップデートが予定されている。

公式サイト

http://prodigal.ornl.gov

マニュアル

https://github.com/hyattpd/prodigal/wiki

インストール

Github

brewでProdigal2が導入できる。condaでも導入できる。

brew install Prodigal

conda install -c bioconda -y prodigal

> prodigal -h

$ prodigal -h

Usage: prodigal [-a trans_file] [-c] [-d nuc_file] [-f output_type]

[-g tr_table] [-h] [-i input_file] [-m] [-n] [-o output_file]

[-p mode] [-q] [-s start_file] [-t training_file] [-v]

-a: Write protein translations to the selected file.

-c: Closed ends. Do not allow genes to run off edges.

-d: Write nucleotide sequences of genes to the selected file.

-f: Select output format (gbk, gff, or sco). Default is gbk.

-g: Specify a translation table to use (default 11).

-h: Print help menu and exit.

-i: Specify FASTA/Genbank input file (default reads from stdin).

-m: Treat runs of N as masked sequence; don't build genes across them.

-n: Bypass Shine-Dalgarno trainer and force a full motif scan.

-o: Specify output file (default writes to stdout).

-p: Select procedure (single or meta). Default is single.

-q: Run quietly (suppress normal stderr output).

-s: Write all potential genes (with scores) to the selected file.

-t: Write a training file (if none exists); otherwise, read and use

the specified training file.

-v: Print version number and exit.

ラン

ランには３つのモードがある。詳細は公式マニュアルの説明を確認する。

フィニッシュしたゲノム（一本になっておりNがないゲノム）や巨大ウィルスはノーマルモードでランする。ノーマルモードはまずゲノムを分析し、それからコード領域を予測する。ハイクオリティなモデル構築のため、ゲノムは100-kb (3') ~ 500-kb (5') の長さ以上である事が推奨されている。

ノーマルモード。

prodigal -i my.genome.fna -o gene.coords.gbk -a protein.faa -d nucleotide.fna

-i　Specify FASTA/Genbank input file (fasta推奨).
-o　Specify output file (default writes to stdout).
-a　Write protein translations to the selected file.
-d　Specify nucleotide sequences file.
-c　Closed ends. Do not allow partial genes at edges of sequence.
-f　Specify output format.

　　　　gbk: Genbank-like format (Default)

　　　　gff: GFF format

　　　　sqn: Sequin feature table format

　　　　sco: Simple coordinate output

出力はfaa、fna、genbank like、gtfなどから選択する。 genbankライクな出力は以下のようになっている。

f:id:kazumaxneo:20170901133923j:plain

アセンブルして作ったcontigの状態でもノーマルモードを利用できる。ノーマルモードでは、１本でも長いcontigがあれば、そこからモデルを構築し遺伝子を予測することが可能である。ただし、あまりにもバラバラだとエラーが出る。その場合は、近縁種の情報からモデルを構築してランできるトレーニングモードを使う。

ノーマルモードは同じプロファイルを入力に対して当てはめるので、メタゲノムのデータに対してそのまま使ってはならない。通常、メタゲノムアセンブルした全配列データはアノニマスモードでランする。アノニマスモードではプリセットのデータを使い解析される。100-kb以下のプラスミド、ファージ、ウィルスもアノニマスモード使用が推奨される。

アノニマスモード（Prodigal2ではmetagenome）。

prodigal -i metagenome.fna -o coords.gbk -a proteins.faa -p meta

-p　Select procedure (single or meta). Default is single.

ノーマルモードでランすれば自分の配列で訓練するので精度が上がる。ノーマルモードを使いたいなら、contigをbinningしてMAGs配列を個々に指定する。アノニマスモードはsmall plasmidやlow qualityなゲノムにも使えるとされる。

追記

20-kbなど短い配列もアノニマスモードで実行するとノーマルモードより精度が上が流傾向が見られる。特にどこを開始コドンをするかなどは影響を受けやすい。アノニマスモードを使いたくないなら、20-kbなど短い配列に元のゲノム配列をcatなどで結合してノーマルモードで実行し、十分な訓練が行えるように工夫する（あとでアノテーションから分離する。）。このtipsは、prokkaやDFASTでも同様に使える。

フィニッシュした近縁種のゲノム情報が利用できるなら、ドラフトゲノムの予測はトレーニングモードが利用できる。トレーニングモードは、近縁種のゲノムでトレーニングデータを作成し、それを元にゲノム予測を行う方法である。ドラフトゲノムのクオリティが不十分な時に使う。

トレーニングモード

１、ゲノム１を使い、トレーニングデータを出力する(-p trainは認識しないので使わない。よく似たゲノムを使わないと性能が出ない)。

prodigal -i genome1.fna -t genome1.trn

-t　Write a training file (if none exists); otherwise, read and use the specified training file.

２、ゲノム１のトレーニングデータを使い、ゲノム２の遺伝子を予測。

prodigal -i genome2.fna -t genome1.trn -o genome2.gbk -a genome2.faa

大抵のバクテリアやアーキアはgenetic code11を使うが、code11で遺伝子が短すぎる場合、自動で他のcodeに切り替えるように設計されている（詳細）。

-g　Specify a translation table to use (default 11).

-mをつけると、Nがある領域の遺伝子が予測されなくなる。

-m: Treat runs of N as masked sequence; don't build genes across them.

その他

GNU parallelsで並列化。singleモード。

mkdir gbk #.gff保存dir
mkdir Protein #.faa保存dir
#40 parallel　（問題なければ--dry-runを外す）
ls *.fna | parallel -j 40 --dry-run 'prodigal -i {} -o {.}.gbk -a {.}.faa -p single && mv {.}.gbk GFF/ && mv {.}.faa Protein/'

Pyrodigal : Cython bindings and Python interface to Prodigal

https://pyrodigal.readthedocs.io/en/stable/index.html

公式マニュアルには使い方やターゲットについて丁寧に説明されています。ぜひ一度確認してみてください（リンク）。

引用

Prodigal: prokaryotic gene recognition and translation initiation site identification

Doug HyattEmail author, Gwo-Liang Chen, Philip F LoCascio, Miriam L Land, Frank W Larimer and Loren J Hauser

BMC Bioinformatics 2010 11:119

tRNAやtmRNAをゲノムから素早く検出する ARAGORN

RNA seq annotation 2004 Nucleic Acids Research tRNA web tool

2019 2/15 Biocondaインストール追加、バッチモード追加

2019 3/10 タイトル修正

2019 5/50 インストール方法追記

ARAGORNは既存のtRNAとのホモロジーや二次構造などを手掛かりにゲノム中からtRNAやtmRNAを探すツール。

webサーバー

ARAGORN, tRNA (and tmRNA) detection

インストール

brewで導入できる。

#biocoonda
conda install -c bioconda -y ARAGORN

#homebrew
brew install ARAGORN

実行方法

aragorn genome.fa -o tRNAs_output

ポジションごとにtRNAが予測される。

f:id:kazumaxneo:20170831235702j:plain

最後にまとめが表示される。

f:id:kazumaxneo:20170831235757j:plain

例えばここで使ったゲノムではバクテリアのクロモソームから45 tRNAが見つかり、プラスミドからはtRNAは見つからなかった。tRNAのアンチコドンの頻度も表示されている。

genetic codeがスタンダードかどうかなどはフラグを立てて設定する。ARAGORN -hで確認できるフラグの一覧を貼っておく。

user$ ARAGORN -h

----------------------------

ARAGORN v1.2.36 Dean Laslett

----------------------------

Please reference the following papers if you use this

program as part of any published research.

Laslett, D. and Canback, B. (2004) ARAGORN, a

program for the detection of transfer RNA and transfer-messenger

RNA genes in nucleotide sequences

Nucleic Acids Research, 32;11-16

Laslett, D. and Canback, B. (2008) ARWEN: a

program to detect tRNA genes in metazoan mitochondrial

nucleotide sequences

Bioinformatics, 24(2); 172-175.

ARAGORN detects tRNA, mtRNA, and tmRNA genes.

Usage:

aragorn -v -s -d -c -l -a -w -j -ifro<min>,<max> -t -mt -m -tv -gc -seq -br -fasta -fo -o <outfile> <filename>

<filename> is assumed to contain one or more sequences

in FASTA format. Results of the search are printed to

STDOUT. All switches are optional and case-insensitive.

Unless -i is specified, tRNA genes containing introns

are not detected.

-m Search for tmRNA genes.

-t Search for tRNA genes.

By default, both are detected. If one of

-m or -t is specified, then the other

is not detected unless specified as well.

-mt Search for Metazoan mitochondrial tRNA

genes. -i switch ignored. Composite

Metazoan mitochondrial genetic code used.

-mtmam Search for Mammalian mitochondrial tRNA

genes. -i switch ignored. -tv switch set.

Mammalian mitochondrial genetic code used.

-mtx Same as -mt but low scoring tRNA genes are

not reported.

-gc<num> Use the GenBank transl_table = <num> genetic code.

-gcstd Use standard genetic code.

-gcmet Use composite Metazoan mitochondrial genetic code.

-gcvert Use Vertebrate mitochondrial genetic code.

-gcinvert Use Invertebrate mitochondrial genetic code.

-gcyeast Use Yeast mitochondrial genetic code.

-gcprot Use Mold/Protozoan/Coelenterate mitochondrial genetic code.

-gcciliate Use Ciliate genetic code.

-gcflatworm Use Echinoderm/Flatworm mitochondrial genetic code.

-gceuplot Use Euplotid genetic code.

-gcbact Use Bacterial/Plant Chloroplast genetic code.

-gcaltyeast Use alternative Yeast genetic code.

-gcascid Use Ascidian Mitochondrial genetic code.

-gcaltflat Use alternative Flatworm Mitochondrial genetic code.

-gcblep Use Blepharisma genetic code.

-gcchloroph Use Chlorophycean Mitochondrial genetic code.

-gctrem Use Trematode Mitochondrial genetic code.

-gcscen Use Scenedesmus obliquus Mitochondrial genetic code.

-gcthraust Use Thraustochytrium Mitochondrial genetic code.

Individual modifications can be appended using

,BBB=<aa> B = A,C,G, or T. <aa> is the three letter

code for an amino-acid. More than one modification

can be specified. eg -gcvert,aga=Trp,agg=Trp uses

the Vertebrate Mitochondrial code and the codons

AGA and AGG changed to Tryptophan.

-tv Do not search for mitochondrial TV replacement

loop tRNA genes. Only relevant if -mt used.

-i Search for tRNA genes with introns in

anticodon loop with maximum length 3000

bases. Minimum intron length is 0 bases.

Ignored if -m is specified.

-i<max> Search for tRNA genes with introns in

anticodon loop with maximum length <max>

bases. Minimum intron length is 0 bases.

Ignored if -m is specified.

-i<min>,<max> Search for tRNA genes with introns in

anticodon loop with maximum length <max>

bases, and minimum length <min> bases.

Ignored if -m is specified.

-io Same as -i, but allow tRNA genes with long

introns to overlap shorter tRNA genes.

-if Same as -i, but fix intron between positions

37 and 38 on C-loop (one base after anticodon).

-ifo Same as -if and -io combined.

-ir Same as -i, but search for tRNA genes with minimum intron

length 0 bases, and only report tRNA genes with minimum

intron length <min> bases.

-c Assume that each sequence has a circular

topology. Search wraps around each end.

Default setting.

-l Assume that each sequence has a linear

topology. Search does not wrap.

-d Double. Search both strands of each

sequence. Default setting.

-s or -s+ Single. Do not search the complementary

(antisense) strand of each sequence.

-sc or -s- Single complementary. Do not search the sense

strand of each sequence.

-ss Use the stricter canonical 1-2 bp spacer1 and

1 bp spacer2. Ignored if -mt set. Default is to

allow 3 bp spacer1 and 0-2 bp spacer2, which may

degrade selectivity.

-ps Lower scoring thresholds to 95% of default levels.

-ps<num> Change scoring thresholds to <num> percent of default levels.

-rp Flag possible pseudogenes (score < 100 or tRNA anticodon

loop <> 7 bases long). Note that genes with score < 100

will not be detected or flagged if scoring thresholds are not

also changed to below 100% (see -ps switch).

-seq Print out primary sequence.

-br Show secondary structure of tRNA gene primary

sequence with round brackets.

-fasta Print out primary sequence in fasta format.

-fo Print out primary sequence in fasta format only

(no secondary structure).

-fon Same as -fo, with sequence and gene numbering in header.

-fos Same as -fo, with no spaces in header.

-fons Same as -fo, with sequence and gene numbering, but no spaces.

-j Display 4-base sequence on 3' end of astem

regardless of predicted amino-acyl acceptor

length.

-jr Allow some divergence of 3' amino-acyl acceptor

sequence from NCCA.

-jr4 Allow some divergence of 3' amino-acyl acceptor

sequence from NCCA, and display 4 bases.

-v Verbose. Prints out search progress

to STDERR.

-a Print out tRNA domain for tmRNA genes

-o <outfile> print output into <outfile>. If <outfile>

exists, it is overwritten.

By default, output goes to STDOUT.

-w Print out genes in batch mode.

For tRNA genes, output is in the form:

Sequence name

N genes found

1 tRNA-<species> [locus 1] <Apos> (nnn)

i(<intron position>,<intron length>)

N tRNA-<species> [Locus N] <Apos> (nnn)

i(<intron position>,<intron length>)

N is the number of genes found

<species> is the tRNA iso-acceptor species

<Apos> is the tRNA anticodon relative position

(nnn) is the tRNA anticodon base triplet

i means the tRNA gene has a C-loop intron

For tmRNA genes, output is in the form:

n tmRNA(p) [Locus n] <tag offset>,<tag end offset>

p means the tmRNA gene is permuted

リストを作るならバッチモードでランする

ARAGORN genome.fa -o output -w

>NODE_1

38 genes found

1 tRNA-Phe c[27015,27088] 35 (gaa)

2 tRNA-Val c[27566,27640] 35 (tac)

3 tRNA-Arg c[41028,41100] 35 (gcg)

4 tRNA-Asn [46340,46415] 35 (gtt)

5 tRNA-Met [46592,46666] 36 (cat)

6 tRNA-Arg [145784,145858] 35 (ccg)

7 tRNA-Arg c[255331,255424] 35 (tcg)

8 tRNA-Tyr [289892,289966] 35 (gta)

9 tRNA-Gln [469893,469965] 35 (ttg)

10 tRNA-Ser [534698,534785] 28 (gct)

11 tRNA-Ser [585265,585349] 36 (cga)

12 tRNA-Gly c[699904,699975] 34 (gcc)

13 tRNA-Val c[752912,752986] 36 (gac)

14 tRNA-Val c[802618,802696] 37 (cac)

15 tRNA-Ala [841719,841790] 33 (ggc)

16 tRNA-Thr [882862,882935] 35 (cgt)

17 tRNA-Gly c[920091,920169] 38 (gcc)

18 tRNA-Arg c[921771,921847] 37 (tcg)

19 tRNA-Thr c[922371,922444] 35 (ggt)

20 tRNA-His c[922446,922519] 35 (gtg)

21 tRNA-Glu c[923590,923666] 37 (ttc)

22 tRNA-Cys c[923671,923744] 35 (gca)

23 tRNA-Asn c[923783,923858] 36 (gtt)

24 tRNA-Lys c[923936,924011] 36 (ttt)

25 tRNA-Tyr c[924144,924216] 35 (gta)

26 tRNA-Ala c[924221,924294] 34 (cgc)

27 tRNA-Met c[924317,924391] 37 (cat)

28 tRNA-Leu c[924394,924476] 34 (cag)

29 tRNA-Leu c[924509,924590] 35 (gag)

30 tRNA-Leu c[924593,924676] 36 (taa)

31 tRNA-Trp c[924678,924748] 33 (cca)

32 tRNA-Ser c[924751,924838] 38 (gga)

33 tRNA-Ser c[924839,924922] 36 (tga)

34 tRNA-Ser c[924957,925046] 36 (gct)

35 tRNA-Pro c[925049,925121] 35 (tgg)

36 tRNA-Asp c[925322,925396] 36 (gtc)

37 tRNA-Gln c[925408,925483] 37 (ttg)

38 tRNA-Arg c[927017,927091] 36 (tcg)

引用

ARAGORN, a program to detect tRNA genes and tmRNA genes in nucleotide sequences

Dean Laslett Bjorn Canback

Nucleic Acids Research, Volume 32, Issue 1, 1 January 2004, Pages 11–16,

2017-08-31

NGSのスモールユーティリティツール Ngs crumbs

fasta/fastqの操作

2020 7/26 構成を修正

Ngs crumbsはfastqデータの様々な処理ができるツール群。本体が１つあるわけではなく、たくさんのユーティリィスクリプトが集まったツールセットとなっている。論文にはなっていないが、いくつかの論文でデータ処理に利用されている。マニュアルが乏しいが、使えそうなコマンドに絞って紹介する。

公式サイト

https://bioinf.comav.upv.es/seq_crumbs/

インストール

依存

seq_crumbs depends on Python 2.7. Biopython is a recommended dependency. The installation manual is located in the doc/install.rst document.

sudo pip install biopython
sudo pip install toolz

Github

リリースからダウンロードして解凍する。

tar -xvzf seq_crumbs-0.1.tar.gz
cd seq_crumbs-0.1
sudo python setup.py install

ラン

ペアリードからインターレースのfastqを作る。

interleave_pairs R1.fq R2.fq > merged.fq

-z　Compress the output in gzip format
-Z　Compress the output in bgzf format
-B　Compress the output in bzip2 format

インターレースからペアリードに分離する。

interleave_pairs merged.fq -o R1.fq R2.fq

-o　Sequence output file (default STDOUT)

合計リード数とトータルサイズ（bp）。

count_seqs input.fq

-t　IN_FORMAT Format of the input files (default: guess)

ペアじゃないリードを除く。

pair_matcher interelace.fq -o output.fq -p orphan.fq

--low_memory　If the binary uses all memory and does not finish, use this option (default False)
--limit　Maximum number of reads in memory (default: 1000000)
-u　Paired reads are unordered and not just interleaved
-p　ORPHAN Output orphan file (required)

-uフラグをつけると順番が揃っていないペアにも対応可能（未確認）。

ランダムに10リードを取り出す。

sample_seqs input.fq -o random.fq

-n　NUM_SEQS Number of sequences to print (default: 10)

ランするたびにランダムな部位が取り出される。

fastqのフォーマットを変更する（fasta、fastq、fastq-illumina）。

 convert_format input.fastq -f fasta -o output.fa

-f　{fasta,fastq,fastq-illumina} Output file format

duplicateリードを除いたりクオリティフィルタリングできるツールもあるが、詳細が書いてない。ここでは似たことができるBBtoolsをお勧めします。BBtoolsはアルゴリズムが明記されており、マルチスレッド対応で高速に動作します。

引用

seq_crumbs — Bioinformatics at COMAV 0.1 documentation

macでインフォマティクス

HTS (NGS) 関連のインフォマティクス情報についてまとめています。

メタゲノムデータを種レベルで検出し割合を計算するmOTUとfetch-MG

メタゲノムデータをbinningして出力可能なGUIアプリ VizBin

複数のcontigをマージしアセンブリの連続性を改善する Mix

de novoアセンブルしてバリアントをコールするDISCOVAR

バクテリアやアーキアの遺伝子を予測するProdigal（メタゲノムデータセットにも対応）

tRNAやtmRNAをゲノムから素早く検出する ARAGORN

NGSのスモールユーティリティツール Ngs crumbs