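For the data format, see "Data Format" in the HanLP documentation.
For background on dependency parsing, see the book 《自然语言处理入门》 (Introduction to Natural Language Processing).
In short, the first number in each pair is the index of the head word (the parent node) in the tok array plus 1, and a value of 0 denotes root. When you look up the head word in the tok array, remember to subtract 1, or else insert root at the beginning of the tok array. The second element is the dependency relation label; for the Chinese meaning of each label, see "Stanford Dependencies Chinese" in the HanLP documentation.
The data you are looking at is wrong: the sentence has 31 tokens in total, so a head index cannot reach 38. It must be the analysis result of some other sentence. The correct data is: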
{
"tok/fine": [
"2019年",
"4月",
"1日",
"上午",
",",
"在",
"长江口",
"绿华山",
"南锚地",
"水域",
",",
"一",
"艘",
"货船",
"厨房",
"失火",
",",
"火势",
"无法",
"控制",
",",
"船",
"上",
"共",
"17",
"名",
"船员",
"随",
"船",
"遇险",
"。"
],
"dep": [
[4, "nn"],
[4, "nn"],
[4, "nn"],
[16, "tmod"],
[16, "punct"],
[16, "prep"],
[10, "nn"],
[10, "nn"],
[10, "nn"],
[6, "pobj"],
[16, "punct"],
[13, "nummod"],
[14, "clf"],
[15, "nn"],
[16, "nsubj"],
[0, "root"],
[16, "punct"],
[20, "nsubj"],
[20, "advmod"],
[16, "conj"],
[16, "punct"],
[23, "lobj"],
[27, "dep"],
[26, "advmod"],
[26, "nummod"],
[27, "clf"],
[30, "nsubj"],
[30, "prep"],
[28, "pobj"],
[16, "conj"],
[16, "punct"]
]
}
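To make the indexing rule concrete, here is a minimal sketch that works directly on the arrays above, without calling HanLP. It first checks that every head lies between 0 and len(tok), which is the check that rules out a head of 38 for a 31-token sentence, and then looks up each head word by prepending a placeholder to the token list. The helper name resolve_heads and the "ROOT" placeholder are my own choices for illustration, not part of the HanLP API.

```python
# Arrays copied from the "tok/fine" and "dep" fields of the result above.
tok = ["2019年", "4月", "1日", "上午", ",", "在", "长江口", "绿华山", "南锚地", "水域",
       ",", "一", "艘", "货船", "厨房", "失火", ",", "火势", "无法", "控制",
       ",", "船", "上", "共", "17", "名", "船员", "随", "船", "遇险", "。"]
dep = [[4, "nn"], [4, "nn"], [4, "nn"], [16, "tmod"], [16, "punct"], [16, "prep"],
       [10, "nn"], [10, "nn"], [10, "nn"], [6, "pobj"], [16, "punct"],
       [13, "nummod"], [14, "clf"], [15, "nn"], [16, "nsubj"], [0, "root"],
       [16, "punct"], [20, "nsubj"], [20, "advmod"], [16, "conj"], [16, "punct"],
       [23, "lobj"], [27, "dep"], [26, "advmod"], [26, "nummod"], [27, "clf"],
       [30, "nsubj"], [30, "prep"], [28, "pobj"], [16, "conj"], [16, "punct"]]


def resolve_heads(tok, dep):
    """Return (token, relation, head_word) triples.

    Heads are 1-based positions in tok; 0 means root.
    (Hypothetical helper, not a HanLP function.)
    """
    if len(tok) != len(dep):
        raise ValueError("tok and dep must have the same length")
    n = len(tok)
    for i, (head, _) in enumerate(dep, start=1):
        if not 0 <= head <= n:
            raise ValueError(f"token {i}: head {head} is outside 0..{n}")
    # Prepend a placeholder so that a head value indexes the list directly,
    # which is equivalent to subtracting 1 when indexing tok itself.
    padded = ["ROOT"] + tok
    return [(word, rel, padded[head]) for word, (head, rel) in zip(tok, dep)]


triples = resolve_heads(tok, dep)
for word, rel, head_word in triples:
    print(word, rel, head_word)
```

If a dep array contains a head larger than the number of tokens, resolve_heads raises ValueError, which is a quick way to spot that tok and dep come from different sentences.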
You can call doc.to_conll().to_markdown() to convert it to markdown:
| ID | FORM | LEMMA | UPOS | XPOS | FEATS | HEAD | DEPREL | DEPS | MISC |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 2019年 | _ | _ | _ | _ | 4 | nn | _ | _ |
| 2 | 4月 | _ | _ | _ | _ | 4 | nn | _ | _ |
| 3 | 1日 | _ | _ | _ | _ | 4 | nn | _ | _ |
| 4 | 上午 | _ | _ | _ | _ | 16 | tmod | _ | _ |
| 5 | , | _ | _ | _ | _ | 16 | punct | _ | _ |
| 6 | 在 | _ | _ | _ | _ | 16 | prep | _ | _ |
| 7 | 长江口 | _ | _ | _ | _ | 10 | nn | _ | _ |
| 8 | 绿华山 | _ | _ | _ | _ | 10 | nn | _ | _ |
| 9 | 南锚地 | _ | _ | _ | _ | 10 | nn | _ | _ |
| 10 | 水域 | _ | _ | _ | _ | 6 | pobj | _ | _ |
| 11 | , | _ | _ | _ | _ | 16 | punct | _ | _ |
| 12 | 一 | _ | _ | _ | _ | 13 | nummod | _ | _ |
| 13 | 艘 | _ | _ | _ | _ | 14 | clf | _ | _ |
| 14 | 货船 | _ | _ | _ | _ | 15 | nn | _ | _ |
| 15 | 厨房 | _ | _ | _ | _ | 16 | nsubj | _ | _ |
| 16 | 失火 | _ | _ | _ | _ | 0 | root | _ | _ |
| 17 | , | _ | _ | _ | _ | 16 | punct | _ | _ |
| 18 | 火势 | _ | _ | _ | _ | 20 | nsubj | _ | _ |
| 19 | 无法 | _ | _ | _ | _ | 20 | advmod | _ | _ |
| 20 | 控制 | _ | _ | _ | _ | 16 | conj | _ | _ |
| 21 | , | _ | _ | _ | _ | 16 | punct | _ | _ |
| 22 | 船 | _ | _ | _ | _ | 23 | lobj | _ | _ |
| 23 | 上 | _ | _ | _ | _ | 27 | dep | _ | _ |
| 24 | 共 | _ | _ | _ | _ | 26 | advmod | _ | _ |
| 25 | 17 | _ | _ | _ | _ | 26 | nummod | _ | _ |
| 26 | 名 | _ | _ | _ | _ | 27 | clf | _ | _ |
| 27 | 船员 | _ | _ | _ | _ | 30 | nsubj | _ | _ |
| 28 | 随 | _ | _ | _ | _ | 30 | prep | _ | _ |
| 29 | 船 | _ | _ | _ | _ | 28 | pobj | _ | _ |
| 30 | 遇险 | _ | _ | _ | _ | 16 | conj | _ | _ |
| 31 | 。 | _ | _ | _ | _ | 16 | punct | _ | _ |
Clearly, the children of the root 失火 (ID 16) are the words whose head == 16, such as "上午", "," and so on. Full code:
from hanlp_restful import HanLPClient
HanLP = HanLPClient('https://www.hanlp.com/api',
                    language='zh')  # leave auth unset for anonymous access; 'zh' for Chinese, 'mul' for multilingual
doc = HanLP.parse('2019年4月1日上午,在长江口绿华山南锚地水域,一艘货船厨房失火,火势无法控制,船上共17名船员随船遇险。', tasks='dep')
doc = doc.squeeze()
print(doc)
print(doc.to_conll().to_markdown())
# doc.pretty_print()
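If you want the children of the root (or of any other word) programmatically rather than by reading the table, one possible approach is to group tokens by their head. The sketch below works on plain lists copied from the result above, so it runs without network access; children_by_head is a hypothetical helper name of my own, not a HanLP function.

```python
from collections import defaultdict

# Arrays copied from the "tok/fine" and "dep" fields of the result above.
tok = ["2019年", "4月", "1日", "上午", ",", "在", "长江口", "绿华山", "南锚地", "水域",
       ",", "一", "艘", "货船", "厨房", "失火", ",", "火势", "无法", "控制",
       ",", "船", "上", "共", "17", "名", "船员", "随", "船", "遇险", "。"]
dep = [[4, "nn"], [4, "nn"], [4, "nn"], [16, "tmod"], [16, "punct"], [16, "prep"],
       [10, "nn"], [10, "nn"], [10, "nn"], [6, "pobj"], [16, "punct"],
       [13, "nummod"], [14, "clf"], [15, "nn"], [16, "nsubj"], [0, "root"],
       [16, "punct"], [20, "nsubj"], [20, "advmod"], [16, "conj"], [16, "punct"],
       [23, "lobj"], [27, "dep"], [26, "advmod"], [26, "nummod"], [27, "clf"],
       [30, "nsubj"], [30, "prep"], [28, "pobj"], [16, "conj"], [16, "punct"]]


def children_by_head(tok, dep):
    """Map each head ID to a list of (child_id, child_word, relation).

    IDs are 1-based, matching the ID and HEAD columns of the CoNLL table;
    key 0 holds the root word itself.
    """
    children = defaultdict(list)
    for child_id, (head, rel) in enumerate(dep, start=1):
        children[head].append((child_id, tok[child_id - 1], rel))
    return children


children = children_by_head(tok, dep)
root_id, root_word, _ = children[0][0]  # the single token whose head is 0
print("root:", root_id, root_word)
for child_id, word, rel in children[root_id]:
    print(child_id, word, rel)
```

Grouping once and then indexing by head ID avoids rescanning the dep array for every word when you need the children of several nodes.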