Mismatch K-tuple
A mismatch K-tuple is a tuple of k consecutive residues (a k-mer) in which $m$ ($m<k$) mismatches are allowed.
Formulation 1
StackTADB: a stacking-based ensemble learning model for predicting the boundaries of topologically associating domains (TADs) accurately in fruit flies
Feature representation:
$$
f^{Mis}_k=\{\{A^k\}_1,\{A^{k-1}T^1\}_1,\{A^{k-2}T^2\}_1,\ldots,\{C^k\}_1\}
$$
which, taking $k=4$ as an example, is equivalent to:
$$
\{\{AAAA\}_1,\{AAAT\}_1,\ldots,\{GGGG\}_1,\ldots,\{CCCC\}_1\}
$$
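Variable description:
$k$: the window size of the K-mers
$\{A^k\}_1$: one mismatch is allowed in $\{A^k\}$
Feature vector dimension: $4^k$
**Other operations:** optionally keep only the N features with the highest counts (e.g., N = 600) as the final feature set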
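Formulation 1 can be sketched in Python as follows. This is a minimal illustration, not the StackTADB implementation: the function names `mismatch_counts` and `top_n` are hypothetical, the alphabet order `ATGC` is inferred from the order of the listing above, and "one mismatch allowed" is assumed to mean a Hamming distance of at most `m` (pass `exact=True` for the exactly-`m` reading). The top-N step here ranks counts from a single sequence; ranking over counts aggregated across a dataset is an equally plausible reading.

```python
from itertools import product

ALPHABET = "ATGC"  # assumed order, inferred from A^k, A^{k-1}T, ..., C^k


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def mismatch_counts(seq, k, m=1, exact=False):
    """For each of the 4^k k-mers, count the windows of seq that match it
    with at most m mismatches (or exactly m mismatches if exact=True)."""
    kmers = ["".join(p) for p in product(ALPHABET, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    for start in range(len(seq) - k + 1):
        window = seq[start:start + k]
        for kmer in kmers:
            d = hamming(window, kmer)
            if (d == m) if exact else (d <= m):
                counts[kmer] += 1
    return counts


def top_n(counts, n):
    """Return the n k-mers with the highest counts (ties keep k-mer order)."""
    return sorted(counts, key=counts.get, reverse=True)[:n]
```

For example, `mismatch_counts("AAAT", 2)["AA"]` is 3: the windows `AA`, `AA`, and `AT` are all within one mismatch of `AA`. The brute-force comparison against all $4^k$ k-mers keeps the sketch short but is only practical for small $k$.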
Formulation 2
2021-04_bio_iEnhancer-XG:interpretable sequence-based enhancers and their strength predictor
Feature representation:
$$
f^{mis}_{k,m}(x)=\left(\sum_{j=0}^{m}c_{1,j},\sum_{j=0}^{m}c_{2,j},\ldots,\sum_{j=0}^{m}c_{4^k,j}\right)
$$
Variable description:
$c_{i,j}$ is the number of occurrences in $x$ of the $i$-th K-mer type with exactly $j$ mismatched positions, where $i=1,2,3,\ldots,4^k$ and $j=0,1,\ldots,m$.
$x$ is the sequence.
$\sum_{j=0}^{m}c_{1,j}$: $j$ runs from 0 to $m$, so this term totals the occurrences of the k-mer from zero mismatches up to $m$ mismatches. $c_{1,0}$ counts the occurrences of the first k-mer type with no mismatch, and $c_{1,1}$ counts its occurrences when one mismatched base is tolerated.
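The summation above can be sketched as a two-step computation: build the table of $c_{i,j}$, then sum each row over $j$. This is an illustrative sketch rather than the iEnhancer-XG code; the names `exact_mismatch_table` and `summed_mismatch_features` are hypothetical, the k-mer order `ATGC` is an assumption, and $c_{i,j}$ is taken to mean "exactly $j$ mismatches".

```python
from itertools import product

ALPHABET = "ATGC"  # assumed k-mer ordering


def exact_mismatch_table(seq, k, m):
    """Return (kmers, table) where table[i][j] is c_{i,j}: the number of
    windows of seq that differ from the i-th k-mer in exactly j positions."""
    kmers = ["".join(p) for p in product(ALPHABET, repeat=k)]
    table = [[0] * (m + 1) for _ in kmers]
    for start in range(len(seq) - k + 1):
        window = seq[start:start + k]
        for i, kmer in enumerate(kmers):
            j = sum(a != b for a, b in zip(window, kmer))
            if j <= m:
                table[i][j] += 1
    return kmers, table


def summed_mismatch_features(seq, k, m):
    """f^{mis}_{k,m}(x): for each k-mer, sum c_{i,j} over j = 0..m."""
    _, table = exact_mismatch_table(seq, k, m)
    return [sum(row) for row in table]
```

With `seq="AAAT"`, `k=2`, and `m=1`, the row for `AA` is `[2, 1]` (two exact matches, one window with a single mismatch), so its summed feature is 3.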
demo
When $k=4$, the resulting feature vector has $4^4=256$ dimensions.
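Summary
There are generally two approaches: count the occurrences for $m$ mismatches on its own, or sum the counts over 0 to $m$ mismatches.
Extension: the counts for 0 to $m$ mismatches could also be concatenated instead of summed.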
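The concatenation idea can be sketched as below. It is only one possible reading of this extension, not a method taken from either paper: the function name `concatenated_mismatch_features` is hypothetical, the alphabet order is an assumption, and the layout (all $m+1$ counts of one k-mer placed side by side) is a design choice made for this illustration.

```python
from collections import Counter
from itertools import product


def concatenated_mismatch_features(seq, k, m, alphabet="ATGC"):
    """For each k-mer, keep the counts for exactly 0, 1, ..., m mismatches
    as separate entries, giving a vector of length 4^k * (m + 1)."""
    # Count each distinct window once, then weight by its multiplicity.
    windows = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    features = []
    for kmer in map("".join, product(alphabet, repeat=k)):
        per_j = [0] * (m + 1)
        for window, n in windows.items():
            j = sum(a != b for a, b in zip(kmer, window))
            if j <= m:
                per_j[j] += n
        features.extend(per_j)
    return features
```

Compared with summing, this keeps the distinction between exact and near matches at the cost of a vector that is $(m+1)$ times longer: 32 entries instead of 16 for `k=2`, `m=1`.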