find_circ Tutorial

Posted: 2019-12-02 23:32:47

1. Installing find_circ

# find_circ runs on a 64-bit system with Python 2.7 and needs the numpy and pysam Python modules. It relies on bowtie2 and samtools to map reads to the genome.

wget https://github.com/marvin-jens/find_circ/archive/v1.2.tar.gz

 

tar -xzvf v1.2.tar.gz

 

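2. Downloading the reference genome

# Download a reference genome from UCSC with fetch_ucsc.py. Choose one genome build (hg19, hg38, mm9 or mm10) and one file type (ref, kg, ens or fa); fa fetches the genome FASTA, for example: fetch_ucsc.py hg38 fa hg38.fa. Note that fetch_ucsc.py is not part of the find_circ tarball; it appears to be the helper script distributed with CIRCexplorer2.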

fetch_ucsc.py hg19/hg38/mm9/mm10 ref/kg/ens/fa out

 

3. Building the bowtie2 index of the reference genome

bowtie2-build hg38.fa hg38

 

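4. Aligning the RNA-Seq reads to the genome (paired-end mode)

### bowtie2 options ###

-p: number of threads; --very-sensitive: preset that trades speed for sensitivity (bowtie2 still reports only the best alignment it finds for each read); --score-min=C,-15,0: minimum alignment score, here a constant -15 regardless of read length; --mm: memory-map the index so that several bowtie2 processes can share it.

### samtools view options ###

-h: include the header in the output; -b: output BAM; -u: output uncompressed BAM; -S: input is SAM (ignored by current samtools, which detects the input format automatically; kept for compatibility with old versions).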

bowtie2 -p 16 --very-sensitive --score-min=C,-15,0 --mm -x /path/to/bowtie2_index -q -1 reads1.fq -2 reads2.fq  | samtools view -hbuS - | samtools sort - -o output.bam
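
The --score-min value is a small function of read length. As a minimal sketch (not bowtie2's own code; the helper name bowtie2_min_score is made up for illustration), the following evaluates a specification of the form type,constant,coefficient as described in the bowtie2 manual, which shows why C,-15,0 means a fixed threshold of -15:

```python
import math


def bowtie2_min_score(spec, read_len):
    """Evaluate a bowtie2-style function spec, e.g. 'C,-15,0', for one read length.

    Hypothetical helper for illustration only; bowtie2 does this internally.
    """
    kind, constant, coeff = spec.split(",")
    constant, coeff = float(constant), float(coeff)
    if kind == "C":  # constant: the read length plays no role
        return constant
    if kind == "L":  # linear in read length
        return constant + coeff * read_len
    if kind == "S":  # square root of read length
        return constant + coeff * math.sqrt(read_len)
    if kind == "G":  # natural logarithm of read length
        return constant + coeff * math.log(read_len)
    raise ValueError("unknown function type: %r" % kind)


# Compare the setting used in this tutorial with bowtie2's end-to-end default.
for length in (50, 100, 150):
    print(length,
          bowtie2_min_score("C,-15,0", length),
          bowtie2_min_score("L,-0.6,-0.6", length))
```

L,-0.6,-0.6 is bowtie2's default in end-to-end mode, so for typical read lengths the constant -15 used here is the stricter of the two thresholds.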
 
5. Extracting the unmapped reads and taking a short 20 bp sequence (anchor) from each end

samtools view -hf 4 output.bam | samtools view -Sb - > unmapped.bam

python unmapped2anchors.py unmapped.bam | gzip > anchors.qfa.gz
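
To make the idea of anchors concrete, here is a minimal sketch of cutting 20 nt from each end of a read. It is not the code of unmapped2anchors.py: the function name, the _A/_B name suffixes and the rule of skipping reads shorter than two anchors are assumptions made for this illustration.

```python
def split_anchors(name, seq, qual, anchor_len=20):
    """Return the two end anchors of one read as (name, seq, qual) tuples.

    Assumption of this sketch: reads shorter than two anchors are skipped
    (None is returned), because the two anchors would overlap.
    """
    if len(seq) < 2 * anchor_len:
        return None
    left = (name + "_A", seq[:anchor_len], qual[:anchor_len])
    right = (name + "_B", seq[-anchor_len:], qual[-anchor_len:])
    return left, right


# Invented 50 nt read: 20 nt left anchor, 10 nt middle, 20 nt right anchor.
read = "ACGTACGTACGTACGTACGT" + "NNNNNNNNNN" + "TTTTGGGGCCCCAAAATTTT"
result = split_anchors("read1", read, "I" * len(read))
for anchor_name, anchor_seq, anchor_qual in result:
    print(anchor_name, anchor_seq)
```

If the two anchors of one read map to the genome in reversed order, the read may span a back-splice junction, which is what step 6 looks for.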

 

6. Finding candidate circRNAs from the way the anchors align to the genome

 

bowtie2 -p 16 --reorder --mm -M20 --score-min=C,-15,0 -q -x /path/to/bowtie2_index -U anchors.qfa.gz | python find_circ.py -G /path/to/hg38.fa -p prefix -s find_circ.sites.log > find_circ.sites.bed 2> find_circ.sites.reads

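### Filter the results with the following rules

1. Keep circular RNAs by selecting the keyword CIRCULAR

2. Remove circRNAs on the mitochondrial chromosome (chrM)

3. Keep circRNAs with at least 2 reads supporting the junction (column 5, n_reads)

4. Remove circRNAs whose breakpoint is ambiguous (keep UNAMBIGUOUS_BP), and keep only those whose anchors align uniquely (ANCHOR_UNIQUE)

5. Filter out circRNAs longer than 100 kb; the 100 kb refers to the genomic span, i.e. the end coordinate of the circRNA minus its start coordinate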

 

grep CIRCULAR find_circ.sites.bed | grep -v chrM | gawk '$5>=2' | grep UNAMBIGUOUS_BP | grep ANCHOR_UNIQUE | $path/maxlength.py 100000 > find_circ.candidates.bed
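
The same filter can be written as one Python predicate, which is easier to adapt than a chain of greps. This is only a sketch of the rules listed above, not the code of maxlength.py: the function names are invented, the rows are made-up test data, and it assumes the keywords sit in column 18 and that "no longer than 100 kb" means end - start <= 100000.

```python
def keep_candidate(bed_line, min_reads=2, max_len=100000):
    """Return True if one find_circ BED row passes the five filtering rules."""
    fields = bed_line.rstrip("\n").split("\t")
    if bed_line.startswith("#") or len(fields) < 18:
        return False  # header line or incomplete row
    chrom, start, end = fields[0], int(fields[1]), int(fields[2])
    n_reads, category = int(fields[4]), fields[17]
    return (
        "CIRCULAR" in category              # rule 1
        and chrom != "chrM"                 # rule 2
        and n_reads >= min_reads            # rule 3
        and "UNAMBIGUOUS_BP" in category    # rule 4
        and "ANCHOR_UNIQUE" in category     # rule 4 (unique anchors)
        and end - start <= max_len          # rule 5
    )


def make_row(chrom, start, end, n_reads, category):
    """Build an 18-column row with placeholder values (test data only)."""
    cols = [chrom, str(start), str(end), "circ_x", str(n_reads), "+"]
    cols += ["0"] * 11 + [category]
    return "\t".join(cols)


keywords = "CIRCULAR,UNAMBIGUOUS_BP,ANCHOR_UNIQUE"
good = make_row("chr1", 1000, 5000, 3, keywords)
mito = make_row("chrM", 1000, 5000, 3, keywords)
few_reads = make_row("chr1", 1000, 5000, 1, keywords)
too_long = make_row("chr2", 0, 200000, 3, keywords)

kept = [row for row in (good, mito, few_reads, too_long) if keep_candidate(row)]
print(len(kept))
```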
 

7. Analyzing multiple samples

# With several samples, run find_circ.py on each sample separately, then merge the individual results
 

./merge_bed.py sample1.bed sample2.bed [...] > combined.bed
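
Conceptually, merging means grouping junctions with identical coordinates across samples. The sketch below illustrates that idea only; it does not reproduce merge_bed.py, whose exact output may differ. The function name, the choice of key (chrom, start, end, strand) and the summing of read counts are assumptions of this example, and the rows are invented.

```python
from collections import OrderedDict


def merge_junctions(samples):
    """Group junctions from several samples by coordinates.

    samples: dict mapping a sample name to a list of tab-separated BED rows.
    Returns an ordered dict keyed by (chrom, start, end, strand).
    """
    merged = OrderedDict()
    for sample_name in sorted(samples):
        for line in samples[sample_name]:
            if not line.strip() or line.startswith("#"):
                continue  # skip blank lines and headers
            f = line.rstrip("\n").split("\t")
            key = (f[0], int(f[1]), int(f[2]), f[5])
            entry = merged.setdefault(key, {"n_reads": 0, "samples": []})
            entry["n_reads"] += int(f[4])  # total supporting reads
            entry["samples"].append(sample_name)
    return merged


# Invented rows with only the six standard BED columns.
s1 = ["chr1\t100\t900\tcirc_1\t3\t+", "chr2\t50\t400\tcirc_2\t2\t-"]
s2 = ["chr1\t100\t900\tcirc_7\t5\t+"]
combined = merge_junctions({"sample1": s1, "sample2": s2})
for key, entry in combined.items():
    print(key, entry["n_reads"], ",".join(entry["samples"]))
```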

 

8. Output file format

# The first six columns follow the standard BED format; the remaining 12 columns carry information about the junction

 

column  name             description
1       chrom            chromosome/contig name
2       start            left splice site (zero-based)
3       end              right splice site (zero-based). (Always: end > start. 5' 3' depends on strand)
4       name             (provisional) running number/name assigned to junction
5       n_reads          number of reads supporting the junction (BED 'score')
6       strand           genomic strand (+ or -)
7       n_uniq           number of distinct read sequences supporting the junction
8       uniq_bridges     number of reads with both anchors aligning uniquely
9       best_qual_left   alignment score margin of the best anchor alignment supporting the left splice junction (max=2 * anchor_length)
10      best_qual_right  same for the right splice site
11      tissues          comma-separated, alphabetically sorted list of tissues/samples with this junction
12      tiss_counts      comma-separated list of corresponding read-counts
13      edits            number of mismatches in the anchor extension process
14      anchor_overlap   number of nucleotides the breakpoint resides within one anchor
15      breakpoints      number of alternative ways to break the read with flanking GT/AG
16      signal           flanking dinucleotide splice signal (normally GT/AG)
17      strandmatch      'MATCH', 'MISMATCH' or 'NA' for non-stranded analysis
18      category         list of keywords describing the junction. Useful for quick grep filtering
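
For downstream scripts it can help to read each row into a named record instead of indexing by position. The following is a minimal sketch under stated assumptions: the field names are taken from the column list above, the Junction type and parse_junction function are invented here, the choice of which columns to convert to integers is a guess, and the example row contains made-up values.

```python
from collections import namedtuple

COLUMNS = [
    "chrom", "start", "end", "name", "n_reads", "strand",
    "n_uniq", "uniq_bridges", "best_qual_left", "best_qual_right",
    "tissues", "tiss_counts", "edits", "anchor_overlap",
    "breakpoints", "signal", "strandmatch", "category",
]
# Assumption: these columns always hold plain integers.
INT_COLUMNS = {
    "start", "end", "n_reads", "n_uniq", "uniq_bridges",
    "best_qual_left", "best_qual_right", "edits",
    "anchor_overlap", "breakpoints",
}
Junction = namedtuple("Junction", COLUMNS)


def parse_junction(line):
    """Turn one tab-separated find_circ row into a Junction record."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(COLUMNS):
        raise ValueError("expected %d columns, got %d" % (len(COLUMNS), len(values)))
    converted = [int(v) if c in INT_COLUMNS else v for c, v in zip(COLUMNS, values)]
    return Junction(*converted)


# Made-up example row, for illustration only.
example = "\t".join([
    "chr1", "1000", "5000", "circ_000001", "3", "+",
    "2", "2", "40", "40", "sample1", "3", "0", "1", "1",
    "GTAG", "NA", "CIRCULAR,UNAMBIGUOUS_BP,ANCHOR_UNIQUE",
])
j = parse_junction(example)
print(j.chrom, j.end - j.start, j.n_reads, j.category)
```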

 

2019-12-02
22:20:44


Original post: https://www.cnblogs.com/yanjiamin/p/11973687.html
