2024-04-25

細菌の保存されたタンパク質の割合を計算するためのNextflowパイプライン POCP-nf

　シーケンス技術の進歩により、細菌ゲノムは飛躍的に増加しており、確実な分類法が必要とされている。Qin et al. (2014)によって最初に提案されたPercentage Of Conserved Proteins (POCP)は、原核生物の属境界を評価するための貴重な指標である。ここでは、分類学的研究における再現性と使いやすさを高めることを目的として、POCPを自動計算するための計算パイプラインを紹介する。POCP-nfパイプラインは、DIAMONDを使用し、BLASTPと同程度の感度でタンパク質のアラインメントを高速化する。パイプラインはNextflowで実装され、CondaとDockerをサポートし、GitHubでhttps://github.com/hoelzer/pocpで公開されている。オープンソースのコードは、様々な原核生物のゲノムやタンパク質のデータセットに簡単に適応できる。詳細なドキュメントと使い方はリポジトリにある。

レポジトリより

シーケンス技術の進歩により、細菌ゲノムは飛躍的に増加しており、確実な分類法が必要とされている。Qin, Xie et al. 2014によって最初に提案されたPOCP（Percentage Of Conserved Proteins）は、原核生物の属境界を評価するための貴重な指標である。原核生物の属は、すべてのペアワイズPOCP値が50％より高い種のグループとして定義できる。ここでは、POCPを自動計算するための計算パイプラインを紹介し、分類学的研究における再現性と使いやすさの向上を目指す。

このPOCPのためのパイプラインは、入力として、ProkkaやBaktaで提供されているような、ゲノムごとに1つのアミノ酸配列FASTAファイル、またはゲノムFASTAファイルを使用する。パイプラインは、DIAMONDのblastpモードを使用して、すべてのタンパク質配列間のall-vs-allペアワイズアラインメントを計算し、Qin, Xie et al. 2014のオリジナルの式に従ってPOCP計算にこの情報を使用する。1対全体の比較も可能である。

インストール

依存

nextflow
For installing the dependencies (such as Prokka and DIAMOND), you can choose between conda, mamba, docker or singularity. Author recommend using docker.

Github

https://github.com/hoelzer/pocp

nextflow pull hoelzer/pocp

> nextflow run hoelzer/pocp -r 2.3.0 --help
N E X T F L O W ~ version 22.10.4
Launching `https://github.com/hoelzer/pocp` [disturbed_jones] DSL2 - revision: 092166e822 [2.3.0]

Profile: standard

Current User: kazu
Nextflow-version: 22.10.4
Starting time: 09-12-2022 09:58 UTC
Workdir location:
/home/kazu/work

____________________________________________________________________________________________

P.O.C.P - calculate percentage of conserved proteins.

A prokaryotic genus can be defined as a group of species with all pairwise POCP values higher than 50%.

Usage example:
nextflow run hoelzer/pocp -r 2.3.0 --genomes '*.fasta'
or
nextflow run hoelzer/pocp -r 2.3.0 --proteins '*.faa'

Use the following commands to check for latest pipeline versions:

nextflow pull hoelzer/pocp
nextflow info hoelzer/pocp

Input
All-vs-all comparisons (default):
--genomes '*.fasta' -> one genome per file
or
--proteins '*.faa' -> one protein multi-FASTA per file
..change above input to csv: --list

Perform one-vs-all comparison against the additionally defined genome or protein FASTA (optional):
--genome genome.fasta -> one genome FASTA
or
--protein proteins.faa -> one protein multi-FASTA

General Options:
--gcode Genetic code for Prokka annotation [default: 0]
--cores Max cores per process for local use [default: 32]
--max_cores Max cores (in total) for local use [default: 128]
--memory Max memory for local use [default: 4 GB]
--output Name of the result folder [default: results]

Special Options (Danger Zone!):
ATTENTION: changing these parameters will lead to different POCP values.
If you have good reasons to do that, you must report the changed parameters together with the used pipeline version.

--evalue Evalue for DIAMOND protein search [default: 1e-5]
--seqidentity Sequence identity for DIAMOD alignments [default: 0.4]
--alnlength Alignment length for DIAMOND hits [default: 0.5]
--blastp Use BLASTP instead of DIAMOND for protein alignment (slower, as in the original 2014 publication) [default: false]

Nextflow options:
-with-report rep.html cpu / ram usage (may cause errors)
-with-dag chart.html generates a flowchart for the process tree
-with-timeline time.html timeline (may cause errors)
-resume resume a previous calculation w/o recalculating everything (needs the same run command and work dir!)

Caching:
--condaCacheDir Location for storing the conda environments [default: conda]
--singularityCacheDir Location for storing the Singularity images [default: conda]
-w Working directory for all intermediate results [default: work]

Execution/Engine profiles:
The pipeline supports profiles to run via different Executers and Engines e.g.: -profile local,conda

Executer (choose one):
local
slurm

Engines (choose one):
conda
mamba
docker
singularity

Per default: -profile local,conda is executed.

テストラン

#genome with local and docker
nextflow run hoelzer/pocp -r 2.3.0 --genomes $HOME'/.nextflow/assets/hoelzer/pocp/example/*.fasta' -profile local,docker

#protein with SLURM execution and conda
nextflow run hoelzer/pocp -r 2.3.0 --proteins $HOME'/.nextflow/assets/hoelzer/pocp/example/*.faa' -profile slurm,conda

デフォルトでは作業ディレクトリworkに中間ファイルが保存され、出力はresultsに保存される。

出力例

> ls -lth

> column pocp-matrix.tsv

レポジトリより

このパイプラインは、DIAMONDをblastpモードで使用して生物種間のオルソログタンパク質を同定する。オリジナルのPOCP論文ではBLASTPを用いてアラインメントを計算している。しかしDIAMONDは高速であり、大きな入力データセットに対してPOCP値を計算する際に有利であるだけでなく、特にパイプラインのデフォルトで有効になっている--ultra-sensitiveモードを使用した場合、BLASTPの感度を達成できる（Buchfink (2021)）。
異なるアライメントプログラムを比較した別の研究では、デフォルト設定以外の感度オプションを選択した場合、DIAMONDがスピード、感度、品質の最適な妥協点を提供することがわかった（Hernández-Salmerón and Moreno-Hagelsieb (2020)）。そこで、POCP-nfのアライメント計算では、より現代的なソリューションとしてBLASTPの代わりにDIAMONDを使う。

引用

POCP-nf: an automatic Nextflow pipeline for calculating the percentage of conserved proteins in bacterial taxonomy

Martin Hölzer

Bioinformatics, Volume 40, Issue 4, April 2024

DIAMONDの--ultra-sensitiveはランタイムがデフォルトよりかなり増えますが、その設定でもDIAMONDはBLASTPと比べて半分の時間で計算が終わると書かれていますね。

2024-04-24

メタゲノムアセンブリの高精度なbin refinementツール Binette

2024 Preprint Binning (metagenomics) metagenome

2024/04/24 誤字修正

　メタゲノム解析は、ショットガンシーケンスによる微生物群集とその個々のメンバーの研究を可能にする。メタゲノム解析に不可欠な段階は、メタゲノムアセンブリゲノム（MAG）の回収である。メタゲノム解析では、シーケンスリードをコンティグにアセンブルし、それを共通の特徴に基づいてビンにグループ化し、MAGを生成する。メタゲノムデータセットからより多くの、より質の高いMAGを得るための有用なアプローチは、複数のビニング法を適用し、それらをbin refinementと呼ばれるプロセスで組み合わせることである。本著者らは、metaWRAPのbin refinementモジュールにインスパイアされたbin refinementツールであるBinetteを紹介する。Binetteは、入力されたビンアセンブリから基本的な集合演算を用いて新しいハイブリッドビンを作成することでこれを実現する。その後、CheckM2を用いてビンの品質を評価し、可能な限り最良のビンを選択する。

レポジトリより

Binetteは、複数のビニングツールの出力から高品質のMAGを構築する、高速かつ高精度なビニング精密化ツールである。入力ビンセットから、Binetteは新しいハイブリッドビンを構築する。ビンはコンティグの集合と見なすことができる。少なくとも2つのビンがオーバーラップしている場合、つまり少なくとも1つのコンティグを共有している場合、Binetteは基本的なセット操作を利用して新しいビンを作成する。

交差ビン：オーバーラップしているビンで共有されているコンティグから構成される。
差分ビン：あるビンにのみ存在し、他のビンには存在しないコンティグを含む。
Unionビン：ユニオンビンには、オーバーラップしているビンに含まれるすべてのコンティグが含まれる。

その後、CheckM2を用いてビンの品質を評価し、最終的に最良のビン配列を選択する。BinetteはmetaWRAP bin-refinementツールからインスパイアされているが、そのツールの問題点をすべて効果的に解決している。

スピードの向上: Binetteは洗練プロセスのスピードを大幅に向上させた。Binetteは、CheckM2の最初のステップであるProdigalやDiamondの実行を、全てのコンティグに対して一度だけ実行することにより、これを実現している。これらの中間結果は、任意のビンの品質を評価するために利用され、冗長な計算を排除し、精密化プロセスを高速化する。
入力ビンセットの制限なし： Binetteは、その前身とは異なり、入力ビンセットの数に制約を受けない。複数のビンセットを同時に処理することができます。

インストール

ubuntu20でテストした（way1の手順に従った）。

Github

#way1 conda environnement
git clone https://github.com/genotoul-bioinfo/Binette
cd Binette
mamba env create -n binette -f binette.yaml
conda activate binette
pip install .

#way2 conda(link)
mamba install -c conda-forge -c bioconda -c defaults binette -y
conda activate binette

#way3 PyPI(link)
pip install Binette
#他の依存
pip install pyfastx pyrodigal tqdm pandas checkm2

> binette -h

usage: binette [-h] (-d BIN_DIRS [BIN_DIRS ...] | -b CONTIG2BIN_TABLES [CONTIG2BIN_TABLES ...]) -c CONTIGS [-m MIN_COMPLETENESS] [-t THREADS] [-o OUTDIR] [-w CONTAMINATION_WEIGHT] [--checkm2_db CHECKM2_DB] [--low_mem] [-v] [--debug] [--resume]

[--version]

Binette version=1.0.0

optional arguments:

-h, --help show this help message and exit

Input Arguments:

-d BIN_DIRS [BIN_DIRS ...], --bin_dirs BIN_DIRS [BIN_DIRS ...]

List of bin folders containing each bin in a fasta file. (default: None)

-b CONTIG2BIN_TABLES [CONTIG2BIN_TABLES ...], --contig2bin_tables CONTIG2BIN_TABLES [CONTIG2BIN_TABLES ...]

List of contig2bin table with two columns separated with a tabulation: contig, bin (default: None)

-c CONTIGS, --contigs CONTIGS

Contigs in fasta format. (default: None)

Other Arguments:

-m MIN_COMPLETENESS, --min_completeness MIN_COMPLETENESS

Minimum completeness required for final bin selections. (default: 40)

-t THREADS, --threads THREADS

Number of threads to use. (default: 1)

-o OUTDIR, --outdir OUTDIR

Output directory. (default: results)

-w CONTAMINATION_WEIGHT, --contamination_weight CONTAMINATION_WEIGHT

Bin are scored as follow: completeness - weight * contamination. A low contamination_weight favor complete bins over low contaminated bins. (default: 2)

--checkm2_db CHECKM2_DB

Provide a path for the CheckM2 diamond database. By default the database set via <checkm2 database> is used. (default: None)

--low_mem Use low mem mode when running diamond (default: False)

-v, --verbose increase output verbosity (default: False)

--debug Activate debug mode (default: False)

--resume Activate resume mode. Binette will examine the 'temporary_files' directory within the output directory and reuse any existing files if possible. (default: False)

--version show program's version number and exit

データベース

checkM2のデータベースが必要

checkm2 database --download --path <checkm2/database/>

実行方法

Binetteは2種類のbin情報入力フォーマットをサポートしている。

A - Contig2binテーブル：各コンティグと対応するbinの関係を確立するcontig2binテーブルを使ってbinセットを提供する。このフォーマットは”--contig2bin_tables”オプションを使って指定する。

（レポジトリより転載）

ランするには、binのTSVファイルとメタゲノムアセンブリのfastaを指定する。

binette --contig2bin_tables bin_set1.tsv bin_set2.tsv -c assembly.fasta -t 20 -o outdir

-b, --contig2bin_tables List of contig2bin table with two columns separated with a tabulation: contig, bin (default: None)
-c Contigs in fasta format. (default: None)
-t Number of threads to use. (default: 1)
-o Output directory. (default: results)

B - binディレクトリの指定：各binが別々のFASTAファイルで保存されている場合。このフォーマットでは、-bin_dirs引数を指定する。

ランするには、binディレクトリとメタゲノムアセンブリのfastaを指定する。

binette --bin_dirs bin_set1/ bin_set2/ --contigs assembly.fasta -t 20 -o outdir

-d, --bin_dirs List of bin folders containing each bin in a fasta file. (default: None)

出力例

outdir/

final_bins/

選択されたすべてのビンがfasta形式で格納されている。

final_quality_reports.tsv

フォーマットについてはレポジトリで説明されています。

その他（論文とレポジトリより）

出力のtemporary_filesディレクトリには中間ファイルが格納される。--resumeオプションを選択すると、temporary_filesを利用して迅速に結果を出力できる（パラメータ変更時など）。
CAMI IIチャレンジの準備のためにリリースされた模擬マウス腸内メタゲノムデータを使用してビニング性能が評価されている。DAS ToolとBinetteはいずれも平均ビンの完全性と純度の統計量両方において同等かつ優れた結果を示したが、Binetteは高品質ゲノムの回収においてDAS Toolをより上回っていた。

引用

Binette: a fast and accurate bin refinement tool to construct high quality Metagenome Assembled Genomes.

Jean Mainguy, Claire Hoede

bioRxiv, Posted April 22, 2024.

複雑な反復配列を迅速にインタラクティブなドットプロットで可視化する ModDotPlot

2024 Preprint dot plot web tool 結果の視覚化 (visualization) 高速なツール centromere tandem repeats

　ゲノムの反復配列を分析する一般的な方法は、ドットプロットによって可視化された配列類似性マトリックスを作成することである。StainedGlassのような革新的なアプローチは、ドットプロットを配列同一性のヒートマップとしてレンダリングすることにより、この古典的な可視化を改良し、研究者がセントロメアやゲノムの他のヘテロクロマティック領域内のマルチメガベースタンデムリピート配列をより良く可視化できるようにした。しかし、ヒートマップの類似度推定を行うには、高い計算オーバーヘッドが必要であり、精度が低下する可能性がある。本研究では、アライメントを必要としないインタラクティブなドットプロットビューアModDotPlotを紹介する。ModDotPlotは、k-merベースの封じ込めインデックスにより平均的なヌクレオチドの同一性を近似することで、StainedGlassよりも何桁も速く正確なプロットを作成する。本著者らは、シロイヌナズナの128Mbpの全ゲノムをラップトップで5分以内に可視化できる階層的モディマイザースキームを使用することにより、これを達成した。ModDotPlotはPythonで実装され、染色体全体のリアルタイム対話的ナビゲーションをサポートするグラフィカルユーザーインターフェースを持つ。ModDotPlotはhttps://github.com/marbl/ModDotPlotから利用できる。

I'll be presenting Mod.Plot, a fast and fully interactive dot plot viewer, today at the @RECOMBseq poster session! https://t.co/lJ8wPFA5DW
— Alex Sweeten (@alexsweeten) 2023年4月14日

レポジトリより

ModDotPlotは、StainedGlassに似た新しいドットプロット可視化ツールである。ModDotPlotは、modimizers を利用して、ゲノムのペアワイズコンビネーション間のContainment Indexを計算し、その平均ヌクレオチド同一性を高速に近似する。これにより、これらのプロットを作成するのに必要な計算時間が大幅に短縮され、リアルタイムで表示することを達成している。

インストール

M1 macでテストした（Rosetta2エミュレーション下）。

Github

https://github.com/marbl/ModDotPlot

git clone https://github.com/marbl/ModDotPlot.git
cd ModDotPlot
python -m venv venv
source venv/bin/activate
python -m pip install .

> moddotplot -h

__ __ _ _____ _ _____ _ _

| \/ | | | | __ \ | | | __ \| | | |

| \ / | ___ __| | | | | | ___ | |_ | |__) | | ___ | |_

| |\/| |/ _ \ / _` | | | | |/ _ \| __| | ___/| |/ _ \| __|

| | | | (_) | (_| | | |__| | (_) | |_ | | | | (_) | |_

|_| |_|\___/ \__,_| |_____/ \___/ \__| |_| |_|\___/ \__|

v0.8.1

usage: moddotplot [-h] {interactive,static} ...

ModDotPlot: Visualization of Complex Repeat Structures

positional arguments:

{interactive,static} Choose mode: interactive or static

interactive Interactive mode commands

static Static mode commands

options:

-h, --help show this help message and exit

> moddotplot interactive -h

__ __ _ _____ _ _____ _ _

| \/ | | | | __ \ | | | __ \| | | |

| \ / | ___ __| | | | | | ___ | |_ | |__) | | ___ | |_

| |\/| |/ _ \ / _` | | | | |/ _ \| __| | ___/| |/ _ \| __|

| | | | (_) | (_| | | |__| | (_) | |_ | | | | (_) | |_

|_| |_|\___/ \__,_| |_____/ \___/ \__| |_| |_|\___/ \__|

v0.8.1

usage: moddotplot interactive [-h] (-f FASTA [FASTA ...] | -l LOAD) [-k KMER] [-m MODIMIZER] [-r RESOLUTION] [-w WINDOW] [-id IDENTITY] [-d DELTA] [-o OUTPUT_DIR] [--compare | --compare-only] [-s] [--port PORT] [--ambiguous] [-q] [--no-plot]

options:

-h, --help show this help message and exit

-f FASTA [FASTA ...], --fasta FASTA [FASTA ...]

Path to input fasta file(s).

-l LOAD, --load LOAD Load previously computed hierarchical matrices.

-k KMER, --kmer KMER k-mer length.

-m MODIMIZER, --modimizer MODIMIZER

Modimizer sketch size. A lower value will reduce the number of modimizers, but will increase performance. Must be less than window length `-w`.

-r RESOLUTION, --resolution RESOLUTION

Dotplot resolution, or the number of intervals to compare against.

-w WINDOW, --window WINDOW

Window size, or the length in genomic coordinates of each interval. Default is set to (genome length)/(resolution)

-id IDENTITY, --identity IDENTITY

Identity cutoff threshold.

-d DELTA, --delta DELTA

Fraction of neighboring partition to include in identity estimation. Must be between 0 and 1, use > 0.5 is not recommended.

-o OUTPUT_DIR, --output-dir OUTPUT_DIR

Directory name for saving matrices and coordinate logs. Defaults to working directory.

--compare Create a dotplot for each pairwise combination of input sequences (in addition to self-identity plots).

--compare-only Create a dotplot for each pairwise combination of input sequences (skips self-identity plots).

-s, --save Save hierarchical matrices to file.

--port PORT Port number for launching interactive mode on localhost. Only used in interactive mode.

--ambiguous Preserve diagonal when handling strings of ambiguous homopolymers (eg. long runs of N's).

-q, --quick Launch a quick, non-interactive version of interactive mode.

--no-plot Prevent launching dash after saving. Must be used in combination with --save.

実行方法

インタラクティブモードとStaticモードがある。

1、インタラクティブモード

モードの指定後、fastaファイルを指定する。

#インタラクティブ
moddotplot interactive -f input.fasta

interactive Interactive mode commands
static Static mode commands
-f Fasta files to input. Multifasta files are accepted. Interactive mode will only support a maximum of two sequences at a time.
-o Name of output directory for bed file & plots. Default is current working directory.
--port Port to display ModDotPlot on. Default is 8050, this can be changed to any accepted port.

実行すると、計算が終わるまでしばらく時間がかかる。計算が終わるとマシンのlocalhost上でDashアプリケーションが起動する。ウェブブラウザを開き、http://127.0.0.1:にアクセスする。ここではhttp://127.0.0.1:8050/にアクセスする

（Dash が使用するデフォルトのポート番号は 8050で、 --portオプションを使ってカスタマイズできる）。

plotly のプロットは、ズーム（虫眼鏡）とパン（手）のアイコンを使って拡大と移動ができる。

リセットするには、ダブルクリックするかホームボタンをクリックする。

同一性（Identity）の閾値は右側のスライダーで変更できる。厳しくした。

２，Staticモード

moddotplot static -f input.fasta

計算が終わると染色体ごとにdot plotが出力される。

３，2つの配列間の比較

（マニュアルより）ModDotPlotは、入力配列のペア毎の組み合わせについて、a対bスタイルのドットプロットを作成することができる（注；interactiveモードでは最大2つの配列まで可能）。 self-identity plotの作成をスキップしたい場合は、-compare-onlyを使用する。

#a対bスタイルのドットプロットを作成
moddotplot interactive -f chr14_segment.fa chr21_segment.fa --compare

#a対bスタイルのドットプロットを作成、self-identity plotの作成はスキップ
moddotplot interactive -f chr14_segment.fa chr21_segment.fa --compare-only

--compare If set when 2 or more sequences are input into ModDotPlot, this will show an a vs. b style plot, in addition to a self-identity plot. Note that interactive mode currently only supports a maximum of two sequences. If more than two sequences are input, only the first two will be shown.
--compare-only If set when 2 or more sequences are input into ModDotPlot, this will show an a vs. b style plot, without showing self-identity plots

インタラクティブモードで --compare を実行すると、ドロップダウンメニューが表示され、自己同一性プロットとペアワイズプロットを切り替えることができる。

４，サーバーで実行した時

（マニュアルより）HPC環境でインタラクティブモードを実行するには、ポートフォワーディングを使用する。まずリモートのサーバーにアクセスして、通常通りModDotPlotを実行する。次にローカルマシンでポート転送トンネルをセットアップする。

moddotplot interactive -f in.fasta --port 8050

ssh -N -f -L <LOCAL_PORT_NUMBER>:127.0.0.1:8050 HPC@LOGIN.CREDENTIALS
#例
ssh  -N -f -L 8050:127.0.0.1:8050 kazuamx@xxx.xxx.x.xxx

--port Port to display ModDotPlot on. Default is 8050, this can be changed to any accepted port.

これでhttp://127.0.0.1:にアクセスしてModDotPlotの結果を確認できる。

例

chromeでアクセスした。

その他

ModDotPlotをstaticプロットで実行する場合（特に極端にカスタマイズされたプロットを作成する場合）、設定ファイルを使用することが推奨される。JSON形式の設定ファイルはコマンドライン引数と同じ構文を受け付ける。
論文にStainedGlassと比較したベンチマークがある。CPU時間はStainedGlassより2桁以上短く、メモリ使用量もStainedGlassより1/10以下になっている。

引用

ModDotPlot—Rapid and interactive visualization of complex repeats

Alexander P. Sweeten, Michael C. Schatz, Adam M. Phillippy

bioRixv, Posted April 19, 2024.

昔一度紹介してますが、プレプリントが出て使い方も変わっているので再度紹介しました。

（ヒトとマウス）仮説生成のためにクエリに最も類似した遺伝子発現シグネチャーを検索する RummaGEO

2024 Preprint GEO human whole genome mouse RNA seq web tool GPT

　Gene Expression Omnibus (GEO)は、トランスクリプトミクスやその他のオミックスデータセットのための主要なオープンな生物医学研究リポジトリである。現在、世界中の多くの生物医学研究ラボによって収集された数万件の研究から得られた数百万件の遺伝子発現サンプルが含まれている。GEOリポジトリのユーザーは、関連するデータセットを見つけるために、研究を記述するメタデータを検索できるが、現在のところ、データレベルでGEOのグローバル検索を容易にする方法やリソースはない。この欠点に対処するため、本著者らはRummaGEOを開発した。RummaGEOは、GEOに寄託されたヒトおよびマウスのRNA-seq研究の大規模コレクションを遺伝子発現シグネチャーを検索できるウェブサーバーアプリケーションである。検索エンジンを開発するために、本著者らはARCHS4から入手可能な一様にアラインメントされたGEO研究からサンプル条件をオフラインで自動同定した。次に、これらの研究から遺伝子セットを抽出するために、差次的発現シグネチャーを計算した。現在、RummaGEOには、23,395件のGEO研究から抽出された135,264のヒトと158,062のマウスの遺伝子セットが含まれている。次に、RummaGEOデータベースの内容を解析し、統計的パターンを同定し、様々なグローバル解析を行った。RummaGEOデータベースの内容は、シグネチャー検索、PubMed検索、メタデータ検索機能を備えたウェブサーバー型検索エンジンとして提供されている。全体として、RummaGEOは生物医学研究コミュニティにとって、将来の多くの研究のための仮説生成を可能にする、これまでにないリソースを提供している。RummaGEO検索エンジンはhttps://rummageo.com/で利用できる。

manual

https://rummageo.com/usermanual

webサービス

https://rummageo.com/にアクセスする。主要な３つの検索機能がある。

23,395のGEO研究からのシグネチャーを計算することによって、クエリに一致する最も類似した遺伝子セットを見つける。

1，Gene set検索

遺伝子セット検索ページでは、クエリー遺伝子セットにマッチする遺伝子セットをRummaGEOデータベースから検索することができる。

使用するには1行１遺伝子の形式で遺伝子シンボルのリストを入力する。画像ではヒトのexampleを指定（100遺伝子）。

入力した遺伝子シンボルに基づき、ヒト遺伝子セットコレクションかマウス遺伝子セットとして認識される。

RummaGEOデータベースに含まれる遺伝子セットとクエリ遺伝子セットとの類似度はFisherの正確検定で測定される。有意に重複する遺伝子セットは、付随するメタデータとともにユーザーに返される。

出力例

1番上を見てみる。

左端から、GEOのIDと研究のタイトル、利用可能な場合はPubMed ID（PMID）が表示されている。GEOのIDはGEOにリンクしている。PMIDからはPubMedへジャンプできる。その右には条件１と２のタイトル、誘導か抑制か、解析プラットフォーム、日付、遺伝子セットのサイズ、クエリとの重複数、オッズ、P値、調整後P値、などが並ぶ。

それぞれクリックすると詳細を確認できる。Gene set sizeの数値をクリックした。

表示された遺伝子セットはクリップボードにコピーしたり、ENRICHR（紹介）にダイレクトサブミットする事もできる。さらにこの遺伝子セットをクエリにして再びRummaGEOを実行する事もできる。

Overlapの数値をクリックした。上と同じようにウィンドウ上に遺伝子リストが表示される。

さらに結果を絞り込むには、上の検索バーに特定のキーワードを入れて検索する。macrophagesを入れると最初の29,005ヒットから 3,582ヒットにまで減った。

結果は、検索ウィンドウの右側のボタンからタブ区切り形式でダウンロードできる。

上のタブのCommon terms in Matching Gene setsでは、遺伝子セット検索結果で返されたシグネチャーから最初の5,000のユニークなGSEについて、Fisherの正確検定で計算されたエンリッチメント解析された結果が表示される。

表の上のボタンから、Tissue/cellに切り換えた。macrophagesが最も多い。

グラフの下の表には、エンリッチされたカテゴリーが表示される。

nameをクリックすると、Matching Gene Setsタブに戻り、選択したカテゴリーの遺伝子のセットを見ることができる。

Enrichr Termsタブを選択すると、遺伝子セット検索から返された上位500シグネチャーの中で最もよく出現するEnrichr語彙が表示される。Enrichr Terms は、選択したライブラリ（ChEA 2022、KEGG 2021 Human、WikiPathway 2023 Human、GO Biological Process 2023、MGI Mammalian Phenotype Level 4 2021、Human Phenotype Ontology、GWAS Catalog 2023）のすべての RummaGEO シグネチャーに対して事前に計算されている。

検索結果のタブ（Matching Gene Setsタブ）に戻る。表の右端には、遺伝子セットとRummaGEOシグネチャーのオーバーラップに関する仮説生成のための機能が用意されている。仮説生成ボタンをクリックする。

仮説生成のための説明を入力する必要がある。

RummaGEOは、この説明文と、マッチするRummaGEO遺伝子セットの研究アブストラクト、および複数のEnrichrライブラリ（WikiPathway 2023 Human, GWAS Catalog 2023, GO Biological Process 2023, MGI Mammalian Phenotype Level 4 2021）から重複する遺伝子から上位3つの有意に濃縮された語彙を取得する。このプロンプトは、さらに大規模言語モデル（LLM）に対して、提供されたすべての説明と遺伝子セットのコンテキスト、およびEnrichrからの高度に濃縮された語彙を参照するよう指示する。その後、仮説が解析され、エンリッチメント統計がユーザーの記入した文中に挿入される（マニュアルより）。

２，PubMed 検索
PubMed Search ページでは、PubMed API クエリを用いた PubMed 検索に基づいて RummaGEO の遺伝子セットを検索できる。キーワードを入力して検索する。クエリのワードから関連するGEO研究の論文（最大上位5000件）がサーチされる。結果には、PubMed APIから返された論文と論文数、RummaGEOデータベース内の関連遺伝子セット数が返される。

exampleの”mice aging”で検索した。

出力例

結果は論文ごとにグループ化されている。

補足；画面上に、”クエリにマッチする論文が5,000件以上あるため、クエリから返された最初の5,000件の論文から、遺伝子セットに関連する54件の論文に関連する849件の遺伝子セットのみ表示されました。より良い結果を得るためには、検索条件を絞り込んでください。”と出ている。

その論文の条件ごとの誘導/抑制された遺伝子リストを見るには、右端の矢印をクリックする。

展開すると２つのGEOが折り畳まれていた。この研究では１つGEOのみ含まれ、それが抑制と誘導に分かれて表示されていた。右端のVIEWをクリックすると、上の画像のように含まれる遺伝子リストを表示できる（UPの17遺伝子）。

３，Metadata検索

このタブでは、データベースに含まれるGEO研究のメタデータを直接検索することができる。

出力例

引用

RummaGEO: Automatic Mining of Human and Mouse Gene Sets from GEO

Giacomo B. Marino, Daniel J. B. Clarke, Eden Z. Deng, Avi Ma’ayan

bioRxiv, Posted April 13, 2024.

2024-04-20

FastQCの高速な代替 Falco

F1000Research 2019 trimming / preprocessing 高速なツール

　品質管理はシーケンスデータ解析において不可欠な最初のステップであり、品質管理のためのソフトウェアツールはほとんどのシーケンスセンターで標準的なパイプラインに深く浸透している。関連する計算は簡単だが、多くの環境では品質管理に必要な総計算量はかなりのものであり、最適化が必要である。Falcoは、FastQCのエミュレーションであり、同等の結果を生成しながら平均3倍速く動作する。FastQCと比較して、Falcoは実行に必要なメモリも少なく、HTMLレポートのより柔軟な視覚化を提供する。

インストール

ubuntu20とmacでテストした。

依存

zlib is required to read gzip compressed FASTQ files
htslib is required to process bam files

Github

https://github.com/smithlabcode/falco

#conda
mamba install -c bioconda falco -y

#source
git clone https://github.com/smithlabcode/falco.git
cd falco/
./autogen.sh
./configure CXXFLAGS="-O3 -Wall" --enable-hts
make HAVE_HTSLIB=1 all
make HAVE_HTSLIB=1 install

> falco -h

Usage: falco [OPTIONS] <seqfile1> <seqfile2> ...

Options:

-h, --help Print this help file and exit

-v, --version Print the version of the program and exit

-o, --outdir Create all output files in the specified

output directory. FALCO-SPECIFIC: If the

directory does not exists, the program will

create it. If this option is not set then

the output file for each sequence file is

created in the same directory as the

sequence file which was processed.

--casava [IGNORED BY FALCO] Files come from raw

casava output. Files in the same sample

group (differing only by the group number)

will be analysed as a set rather than

individually. Sequences with the filter flag

set in the header will be excluded from the

analysis. Files must have the same names

given to them by casava (including being

gzipped and ending with .gz) otherwise they

won't be grouped together correctly.

--nano [IGNORED BY FALCO] Files come from nanopore

sequences and are in fast5 format. In this

mode you can pass in directories to process

and the program will take in all fast5 files

within those directories and produce a

single output file from the sequences found

in all files.

--nofilter [IGNORED BY FALCO] If running with --casava

then don't remove read flagged by casava as

poor quality when performing the QC

analysis.

--extract [ALWAYS ON IN FALCO] If set then the zipped

output file will be uncompressed in the same

directory after it has been created. By

default this option will be set if fastqc is

run in non-interactive mode.

-j, --java [IGNORED BY FALCO] Provides the full path to

the java binary you want to use to launch

fastqc. If not supplied then java is assumed

to be in your path.

--noextract [IGNORED BY FALCO] Do not uncompress the

output file after creating it. You should

set this option if you do not wish to

uncompress the output when running in

non-interactive mode.

--nogroup Disable grouping of bases for reads >50bp.

All reports will show data for every base in

the read. WARNING: When using this option,

your plots may end up a ridiculous size. You

have been warned!

--min_length [NOT YET IMPLEMENTED IN FALCO] Sets an

artificial lower limit on the length of the

sequence to be shown in the report. As long

as you set this to a value greater or equal

to your longest read length then this will

be the sequence length used to create your

read groups. This can be useful for making

directly comaparable statistics from

datasets with somewhat variable read

lengths.

-f, --format Bypasses the normal sequence file format

detection and forces the program to use the

specified format. Validformats are bam, sam,

bam_mapped, sam_mapped, fastq, fq, fastq.gz

or fq.gz.

-t, --threads [NOT YET IMPLEMENTED IN FALCO] Specifies the

number of files which can be processed

simultaneously. Each thread will be

allocated 250MB of memory so you shouldn't

run more threads than your available memory

will cope with, and not more than 6 threads

on a 32 bit machine [1]

-c, --contaminants Specifies a non-default file which contains

the list of contaminants to screen

overrepresented sequences against. The file

must contain sets of named contaminants in

the form name[tab]sequence. Lines prefixed

with a hash will be ignored. Default:

/home/kazu/Documents/falco/Configuration/contaminant_list.txt

-a, --adapters Specifies a non-default file which contains

the list of adapter sequences which will be

explicity searched against the library. The

file must contain sets of named adapters in

the form name[tab]sequence. Lines prefixed

with a hash will be ignored. Default:

/home/kazu/Documents/falco/Configuration/adapter_list.txt

-l, --limits Specifies a non-default file which contains

a set of criteria which will be used to

determine the warn/error limits for the

various modules. This file can also be used

to selectively remove some modules from the

output all together. The format needs to

mirror the default limits.txt file found in

the Configuration folder. Default:

/home/kazu/Documents/falco/Configuration/limits.txt

-k, --kmers [IGNORED BY FALCO AND ALWAYS SET TO 7]

Specifies the length of Kmer to look for in

the Kmer content module. Specified Kmer

length must be between 2 and 10. Default

length is 7 if not specified.

-q, --quiet Supress all progress messages on stdout and

only report errors.

-d, --dir [IGNORED: FALCO DOES NOT CREATE TMP FILES]

Selects a directory to be used for temporary

files written when generating report images.

Defaults to system temp directory if not

specified.

-s, -subsample [Falco only] makes falco faster (but

possibly less accurate) by only processing

reads that are multiple of this value (using

0-based indexing to number reads). [1]

-b, -bisulfite [Falco only] reads are whole genome

bisulfite sequencing, and more Ts and fewer

Cs are therefore expected and will be

accounted for in base content.

-r, -reverse-complement [Falco only] The input is a

reverse-complement. All modules will be

tested by swapping A/T and C/G

-skip-data [Falco only] Do not create FastQC data text

file.

-skip-report [Falco only] Do not create FastQC report

HTML file.

-skip-summary [Falco only] Do not create FastQC summary

file

-D, -data-filename [Falco only] Specify filename for FastQC

data output (TXT). If not specified, it will

be called fastq_data.txt in either the input

file's directory or the one specified in the

--output flag. Only available when running

falco with a single input.

-R, -report-filename [Falco only] Specify filename for FastQC

report output (HTML). If not specified, it

will be called fastq_report.html in either

the input file's directory or the one

specified in the --output flag. Only

available when running falco with a single

input.

-S, -summary-filename [Falco only] Specify filename for the short

summary output (TXT). If not specified, it

will be called fastq_report.html in either

the input file's directory or the one

specified in the --output flag. Only

available when running falco with a single

input.

-K, -add-call [Falco only] add the command call call to

FastQC data output and FastQC report HTML

(this may break the parse of fastqc_data.txt

in programs that are very strict about the

FastQC output format).

Help options:

-?, -help print this help message

-about print about message

実行方法

fastqファイルを指定する（fastq, fq, fastq.gz or fq.gz）。

#１つのfastq
falco input.fq.gz

#複数fastq、#出力指定、4スレッド
falco -t 4 -o outdir sample*.fq.gz

-o Create all output files in the specified output directory. FALCO-SPECIFIC: If the directory does not exists, the program will create it. If this option is not set then the output file for each sequence file is created in the same directory as the sequence file which was processed.
-t Specifies the number of files which can be processed simultaneously. Each thread will be allocated 250MB of memory so you shouldn't run more threads than your available memory will cope with, and not more than 6 threads on a 32 bit machine [1]
-c Specifies a non-default file which contains the list of contaminants to screen overrepresented sequences against. The file must contain sets of named contaminants in the form name[tab]sequence. Lines prefixed with a hash will be ignored. Default: Configuration/contaminant_list.txt
-a Specifies a non-default file which contains the list of adapter sequences which will be explicity searched against the library. The file must contain sets of named adapters in the form name[tab]sequence. Lines prefixed with a hash will be ignored. Default: Configuration/adapter_list.txt

sam、bam (binary sam)のリードの分析にも対応している（htslibが必要）。

falco -t 4 -o outdir input.bam

-f Bypasses the normal sequence file format detection and forces the program to use the specified format. Valid formats are bam, sam, bam_mapped, sam_mapped, fastq, fq, fastq.gz or fq.gz.

出力

fastqc_data.txtはQCメトリクスの要約を含むテキストファイル
fastqc_report.htmlは、テキストサマリーのプロットを示すHTMLレポート
summary.txt：各モジュールのpass/warn/failの結果を記述したタブ区切りファイル

レポート

ロングリードの場合（デフォルト設定でONTを調べるとqualityはfailとなった）

（以下省略）

その他

fastQC互換なので、出力はそのままmultiqcで認識する。

引用
Falco: high-speed FastQC emulation for quality control of sequencing data

Guilherme de Sena Brandine 1, Andrew D Smith

F1000Res. 2019 Nov 7:8:1874

2024-04-19

KEGG KOデータベースでKO IDの機能的情報を取得する

KEGG KEGG pathway tips

タイトルの通りです。KO (KEGG Orthology) のリストから情報を取得するには、KO (KEGG ORTHOLOGY) Databaseのトップページにアクセスするのが手っ取り早いです。

https://www.genome.jp/kegg/ko.htmlにアクセスする。

KO IDを入力する。手持ちのKO IDのタイトル名をリスト形式で取得したいなら、下のメニューからGet titleを選択する。

Filterを押すと改行が除去されてスペース区切りになる。Get entryボタンは反応しない。

表示された。

上の画面に戻る。

Map pathwayを選択すると、KO IDのKEGG pathwayごとのアサイン数が表示される。

Briteタブに切り替えると、KEGG Briteの分類でのKO IDのアサイン数が表示される。

KEGG BRITE は分子間相互作用と反応に限定されたKEGG PATHWAY と比較して、以下のような多くの異なるタイプの関係を組み込んでいる：
1. 遺伝子とタンパク質
2. 化合物と反応
3. 薬物
4. 疾患
5. 生物とウイルス

（KEGG BRITEより）

KEGG moduleタブではKEGG module（Mで始まるID）ごとのKOアサインを確認できる。

show module listをクリックすると、moduleそれぞれにアサインされたKOも表示される。

トップに戻る。

Orthologue tableをクリックするとKEGG organismそれぞれでのKOアサインを確認できるが、リストが多いとかなり時間がかかる。

補足

KEGG mapperでは、KO IDの他、KEGG compound（低分子、生体高分子、および生物学的システムに関連する他の化学物質のデータベース、Cで始まる）、RCLASS（主に酵素反応のデータベース、Rで始まる）のIDから、アサインされるKO mofuleやKEGG pathway、KEGG BRITE、KEGG BRITE tableを検索できる。

https://www.genome.jp/kegg/mapper/search.html

複数のIDタイプがあっても同時に検索できる。画像ではKO IDとKEGG compound ID（Cで始まる）が入力されている。

なお、KEGG mapperは、KAAS（紹介）やKofamKOALA（紹介）などのKO アサイン結果を入力して、アサインされていKEGG pathway情報などの一覧を取得するのが代表的な使い方かと思います。特に目新しい情報ではないのですが、以前説明して助かったと言われていた方がいたので簡単に紹介しました。

引用

KEGG as a reference resource for gene and protein annotation
Minoru Kanehisa, Yoko Sato, Masayuki Kawashima, Miho Furumichi, Mao Tanabe

Nucleic Acids Res. 2016 Jan 4;44(D1):D457-62

2024-04-19

NCBI SRAで検索する時のtips

NCBI tips SRA

NCBI SRAでは公開されているシークエンシングデータを検索し、必要であればダウンロードできる。

metagenomeと検索してみると4,566,384件ヒットした（2024年4月実行）。

metagenomeと検索したが、16Sがタイトルに含まれるシークエンシングデータがトップヒットしている。meta16Sはメタゲノムでないが、metagenomeの定義が曖昧であるために、このように関係ないデータがヒットする事がある。

このような時はNOTで絞り込む。”metagenome NOT 16S”と検索したところ、2,656,920ヒットに減ったが、今度はampliconがトップヒットしている。

ampliconも除外する。”metagenome NOT 16S NOT amplicon"と検索すると850,797ヒットにまで減った。

論理演算子として、NOT以外にANDやORが利用できる。土壌のメタゲノムに関心があるとして、soilを加える。87,578ヒットまで減った。

NCBI SRA Run Selectorでさらに絞り込むには、2万以下（数値は間違っているかもしれません）のヒットまで減らす必要があるので、もう少し絞り込む。

左のメニューから、DNA、fastqを選択した。メタゲノムアセンブリ後のbinningにはロングリードアセンブリが有効で、さらに距離のある近縁な株間の配列を区別して組み立てるにはエラーの少ないpacbioが最適なので、ここではpacbioに限定した。314ヒットになった。ロングリードのメタゲノムは希少であることが分かる。

数が減ったのでRun selecterのリンクが出現している。

メタデータも確認したいのでRun selecterにジャンプし、アクセッションIDとそのメタデータテキストを取得する。

補足

検索ウィンドウ下のadvancedでは、Bioproject、Biosample、生物名、出版年、登録者名などで絞り込む事もできます。

Run selecterは以前に簡単に紹介しています。

https://kazumaxneo.hatenablog.com/entry/2022/09/20/233140

参考

https://www.lib.m.u-tokyo.ac.jp/manual/pubmedmanual.pdf

macでインフォマティクス

HTS (NGS) 関連のインフォマティクス情報についてまとめています。

細菌の保存されたタンパク質の割合を計算するためのNextflowパイプライン POCP-nf

メタゲノムアセンブリの高精度なbin refinementツール Binette

複雑な反復配列を迅速にインタラクティブなドットプロットで可視化する ModDotPlot

（ヒトとマウス）仮説生成のためにクエリに最も類似した遺伝子発現シグネチャーを検索する RummaGEO

FastQCの高速な代替 Falco

KEGG KOデータベースでKO IDの機能的情報を取得する

NCBI SRAで検索する時のtips