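This is another need that comes up in real research work. It looks simple, but it takes a bit of thought. Once you have the method down, you can replace any token in source code that satisfies a given condition.
This post follows on from the previous one: https://blog.csdn.net/qysh123/article/details/110849387, with a small improvement. In that post, because of the XML namespace, I fell back on fuzzy matching. Here, following another blogger's approach: https://blog.csdn.net/weixin_45069542/article/details/90229654, we resolve the namespace directly. As before, take the following Java code as the example (assume it is saved as Example1.java):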
  public static long toLong(byte[] a) {
    long x = 0;
    for (int i = 0; i < 8; i++) {
      int j = (7 - i) << 3;
      x |= ((0xFFL << j) & ((long) a[i] << j));
    }
    return x;//test
  }
Suppose we want to replace the parameter a in this code with FPARAM. How do we do that? If we use javalang, as described in the previous post:
import javalang

# Read the Java source and print every lexical token
with open("Example1.java", "r") as this_file:
    file_content = this_file.read()
tokens = list(javalang.tokenizer.tokenize(file_content))
for each_token in tokens:
    print(each_token)
we get the following result:
Modifier "public" line 1, position 3
Modifier "static" line 1, position 10
BasicType "long" line 1, position 17
Identifier "toLong" line 1, position 22
Separator "(" line 1, position 28
BasicType "byte" line 1, position 29
Separator "[" line 1, position 33
Separator "]" line 1, position 34
Identifier "a" line 1, position 36
Separator ")" line 1, position 37
Separator "{" line 1, position 39
BasicType "long" line 2, position 5
Identifier "x" line 2, position 10
Operator "=" line 2, position 12
DecimalInteger "0" line 2, position 14
Separator ";" line 2, position 15
Keyword "for" line 3, position 5
Separator "(" line 3, position 9
BasicType "int" line 3, position 10
Identifier "i" line 3, position 14
Operator "=" line 3, position 16
DecimalInteger "0" line 3, position 18
Separator ";" line 3, position 19
Identifier "i" line 3, position 21
Operator "<" line 3, position 23
DecimalInteger "8" line 3, position 25
Separator ";" line 3, position 26
Identifier "i" line 3, position 28
Operator "++" line 3, position 29
Separator ")" line 3, position 31
Separator "{" line 3, position 33
BasicType "int" line 4, position 7
Identifier "j" line 4, position 11
Operator "=" line 4, position 13
Separator "(" line 4, position 15
DecimalInteger "7" line 4, position 16
Operator "-" line 4, position 18
Identifier "i" line 4, position 20
Separator ")" line 4, position 21
Operator "<<" line 4, position 23
DecimalInteger "3" line 4, position 26
Separator ";" line 4, position 27
Identifier "x" line 5, position 7
Operator "|=" line 5, position 9
Separator "(" line 5, position 12
Separator "(" line 5, position 13
HexInteger "0xFFL" line 5, position 14
Operator "<<" line 5, position 20
Identifier "j" line 5, position 23
Separator ")" line 5, position 24
Operator "&" line 5, position 26
Separator "(" line 5, position 28
Separator "(" line 5, position 29
BasicType "long" line 5, position 30
Separator ")" line 5, position 34
Identifier "a" line 5, position 36
Separator "[" line 5, position 37
Identifier "i" line 5, position 38
Separator "]" line 5, position 39
Operator "<<" line 5, position 41
Identifier "j" line 5, position 44
Separator ")" line 5, position 45
Separator ")" line 5, position 46
Separator ";" line 5, position 47
Separator "}" line 6, position 5
Keyword "return" line 7, position 5
Identifier "x" line 7, position 12
Separator ";" line 7, position 13
Separator "}" line 8, position 3
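As you can see, a and x are both of type Identifier, so at this point our analysis program cannot tell which one is a method parameter (incidentally, Understand gives the same granularity).
So in this case we still have to use srcML.
If we run it directly on the command line: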
srcml.exe Example1.java > test.xml
the resulting XML file is:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<unit xmlns="http://www.srcML.org/srcML/src" revision="0.9.5" language="Java" filename="Example1.java"> <function><specifier>public</specifier> <specifier>static</specifier> <type><name>long</name></type> <name>toLong</name><parameter_list>(<parameter><decl><type><name><name>byte</name><index>[]</index></name></type> <name>a</name></decl></parameter>)</parameter_list> <block>{<decl_stmt><decl><type><name>long</name></type> <name>x</name> <init>= <expr><literal type="number">0</literal></expr></init></decl>;</decl_stmt><for>for <control>(<init><decl><type><name>int</name></type> <name>i</name> <init>= <expr><literal type="number">0</literal></expr></init></decl>;</init> <condition><expr><name>i</name> <operator>&lt;</operator> <literal type="number">8</literal></expr>;</condition> <incr><expr><name>i</name><operator>++</operator></expr></incr>)</control> <block>{<decl_stmt><decl><type><name>int</name></type> <name>j</name> <init>= <expr><operator>(</operator><literal type="number">7</literal> <operator>-</operator> <name>i</name><operator>)</operator> <operator>&lt;&lt;</operator> <literal type="number">3</literal></expr></init></decl>;</decl_stmt><expr_stmt><expr><name>x</name> <operator>|=</operator> <operator>(</operator><operator>(</operator><literal type="number">0xFFL</literal> <operator>&lt;&lt;</operator> <name>j</name><operator>)</operator> <operator>&amp;</operator> <operator>(</operator><operator>(</operator><name>long</name><operator>)</operator> <name><name>a</name><index>[<expr><name>i</name></expr>]</index></name> <operator>&lt;&lt;</operator> <name>j</name><operator>)</operator><operator>)</operator></expr>;</expr_stmt>}</block></for><return>return <expr><name>x</name></expr>;</return><comment type="line">//test</comment>}</block></function></unit>
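Now we need to look closely at how to locate a with XPath. The way to inspect the document is much like what we did with scrapy earlier: https://blog.csdn.net/qysh123/article/details/106655644
In fact, the path /function/parameter_list/parameter/decl/name leads straight to a. To match along that path, we need the namespace-aware form of XPath matching; for details see: https://blog.csdn.net/weixin_45069542/article/details/90229654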
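To see why the namespace matters, here is a minimal sketch. It is not the lxml code used in this post: it assumes Python's standard-library xml.etree.ElementTree instead, and the XML string is a hand-shortened version of the srcML output above (the prefix src is an arbitrary choice). The un-prefixed path matches nothing, while the prefixed one finds a.

```python
import xml.etree.ElementTree as ET

# Hand-shortened srcML covering only the method signature (illustration only).
SRCML = (
    '<unit xmlns="http://www.srcML.org/srcML/src" language="Java">'
    '<function><type><name>long</name></type> <name>toLong</name>'
    '<parameter_list>(<parameter><decl><type><name>byte</name></type> '
    '<name>a</name></decl></parameter>)</parameter_list>'
    '<block>{}</block></function></unit>'
)

# Map an arbitrary prefix to the srcML namespace URI.
NS = {"src": "http://www.srcML.org/srcML/src"}

root = ET.fromstring(SRCML)

# Without a prefix the path matches nothing, because every element
# actually lives in the srcML namespace.
plain = root.findall("./function/parameter_list/parameter/decl/name")

# With the prefix on every step, the same path finds the parameter name.
qualified = root.findall(
    "./src:function/src:parameter_list/src:parameter/src:decl/src:name", NS
)

print(len(plain))                   # 0
print([n.text for n in qualified])  # ['a']
```

Note that the prefix has to be repeated on every step of the path; leaving it off a single step makes that step match nothing.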
The code is below. Simple, isn't it:
import subprocess
from lxml import etree

# srcML puts every element in this namespace, so each XPath step needs the prefix
NS = {'x': 'http://www.srcML.org/srcML/src'}

output = subprocess.run(['srcml.exe', 'Example1.java'], capture_output=True, check=False)
root = etree.fromstring(output.stdout)

for func in root.xpath('//x:function', namespaces=NS):
    func_name = func.xpath('./x:name/text()', namespaces=NS)[0]

    # Flatten the function into a token list: every non-empty text node, stripped
    content = func.xpath('.//text()')
    content = [str(v).strip() for v in content]
    content = list(filter(None, content))
    print(content)

    # Collect the names of the formal parameters
    parameter_list = []
    for parameter in func.xpath('./x:parameter_list/x:parameter', namespaces=NS):
        for name in parameter.xpath('./x:decl/x:name/text()', namespaces=NS):
            parameter_list.append(name)

    # Replace every token that matches a parameter name
    normalized_list = []
    for each_token in content:
        if each_token in parameter_list:
            normalized_list.append('FPARAM')
        else:
            normalized_list.append(each_token)
    print(normalized_list)
The two print calls output, respectively:
['public', 'static', 'long', 'toLong', '(', 'byte', '[]', 'a', ')', '{', 'long', 'x', '=', '0', ';', 'for', '(', 'int', 'i', '=', '0', ';', 'i', '<', '8', ';', 'i', '++', ')', '{', 'int', 'j', '=', '(', '7', '-', 'i', ')', '<<', '3', ';', 'x', '|=', '(', '(', '0xFFL', '<<', 'j', ')', '&', '(', '(', 'long', ')', 'a', '[', 'i', ']', '<<', 'j', ')', ')', ';', '}', 'return', 'x', ';', '//test', '}']
['public', 'static', 'long', 'toLong', '(', 'byte', '[]', 'FPARAM', ')', '{', 'long', 'x', '=', '0', ';', 'for', '(', 'int', 'i', '=', '0', ';', 'i', '<', '8', ';', 'i', '++', ')', '{', 'int', 'j', '=', '(', '7', '-', 'i', ')', '<<', '3', ';', 'x', '|=', '(', '(', '0xFFL', '<<', 'j', ')', '&', '(', '(', 'long', ')', 'FPARAM', '[', 'i', ']', '<<', 'j', ')', ')', ';', '}', 'return', 'x', ';', '//test', '}']
Honestly, this is quite convenient. With this approach we can match tokens that satisfy essentially any condition in source code (srcML supports C, C++, C#, and Java).
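As a rough sketch of how the same idea extends to other kinds of tokens, the snippet below also replaces local variable names. Assumptions: it uses the standard-library xml.etree.ElementTree rather than lxml, the srcML string is hand-written in the style of the output above for a tiny method (long f(byte[] a) { long x = a[0]; return x; }) rather than produced by srcml.exe, and the helper normalize and the placeholder LVAR are hypothetical names introduced here for illustration.

```python
import xml.etree.ElementTree as ET

NS = {"src": "http://www.srcML.org/srcML/src"}

# Hand-written srcML-style markup for:
#   long f(byte[] a) { long x = a[0]; return x; }
SRCML = (
    '<unit xmlns="http://www.srcML.org/srcML/src" language="Java">'
    '<function><type><name>long</name></type> <name>f</name>'
    '<parameter_list>(<parameter><decl><type><name><name>byte</name>'
    '<index>[]</index></name></type> <name>a</name></decl></parameter>)'
    '</parameter_list> <block>{<decl_stmt><decl><type><name>long</name></type> '
    '<name>x</name> <init>= <expr><name><name>a</name><index>[<expr>'
    '<literal type="number">0</literal></expr>]</index></name></expr></init>'
    '</decl>;</decl_stmt><return>return <expr><name>x</name></expr>;</return>}'
    '</block></function></unit>'
)

# Each rule pairs a namespaced path (relative to a function) with a placeholder.
RULES = [
    ("./src:parameter_list/src:parameter/src:decl/src:name", "FPARAM"),
    (".//src:decl_stmt/src:decl/src:name", "LVAR"),
]


def normalize(func, rules):
    """Return the function's tokens with rule-selected names replaced.

    Matching is by name only, as in the post, so it does not handle
    shadowing or unrelated tokens that happen to share a name.
    """
    mapping = {}
    for xpath, placeholder in rules:
        for node in func.findall(xpath, NS):
            if node.text:  # skip names that are nested elements, not plain text
                mapping.setdefault(node.text, placeholder)
    tokens = [t.strip() for t in func.itertext()]
    return [mapping.get(t, t) for t in tokens if t]


root = ET.fromstring(SRCML)
result = normalize(root.find("src:function", NS), RULES)
print(result)
```

Adding another kind of replacement then only means adding one more (path, placeholder) pair to RULES.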