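Project principle:
The experiment relies on simple co-occurrence relationships: Python code extracts a network of character relationships from plain text, and Gephi is used to visualize the resulting network. The basic principle of co-occurrence networks is introduced below.
Basic principle of co-occurrence networks:
Entity co-occurrence is an extraction method based on statistical information. Closely related characters tend to appear together across many consecutive passages of a text. By identifying the entities (person names) that occur in the text and computing how often, and at what rate, different entities appear together, we can set a threshold and treat two entities as related when their co-occurrence exceeds it.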
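As a minimal sketch of this idea, assume each paragraph has already been reduced to a list of names. The function names `cooccurrence_counts` and `related_pairs`, the `threshold` parameter, and the toy data are illustrative and do not come from the original script. Note that this sketch counts each unordered pair once per paragraph, which is a simplification of the counting used in the full script.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(paragraphs):
    """Count how many paragraphs each unordered pair of names shares."""
    counts = Counter()
    for names in paragraphs:
        # A sorted set counts each pair once per paragraph and maps
        # (a, b) and (b, a) to the same key.
        for a, b in combinations(sorted(set(names)), 2):
            counts[(a, b)] += 1
    return counts


def related_pairs(counts, threshold):
    """Keep only pairs whose count is strictly greater than the threshold."""
    return {pair: n for pair, n in counts.items() if n > threshold}


# Hypothetical toy data: each inner list holds the names found in one paragraph.
paragraphs = [
    ["A", "B"],
    ["A", "B", "C"],
    ["B", "A"],
    ["C"],
]
counts = cooccurrence_counts(paragraphs)
print(related_pairs(counts, threshold=1))  # {('A', 'B'): 3}
```

Raising the threshold removes incidental pairings, at the cost of dropping minor characters who appear only a few times.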
Preparation:
- Environment: Windows, Python 3.6
- Module: jieba
- Software: Gephi
- Name dictionary
- Chinese script of "Train to Busan" (《釜山行》)
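The name dictionary is the file passed to `jieba.load_userdict`. jieba's user dictionary format is one entry per line: the word, an optional frequency, and an optional part-of-speech tag, separated by spaces, in a UTF-8 file. The sketch below writes such a file. The helper `write_name_dict`, the frequency value 10, the output path, and the three sample names are assumptions for illustration; the real dictionary should list the characters of the script being analyzed.

```python
import os
import tempfile


def write_name_dict(names, path, freq=10, tag="nr"):
    """Write one 'word freq tag' line per name, UTF-8 encoded."""
    with open(path, "w", encoding="utf-8") as f:
        for name in names:
            f.write(f"{name} {freq} {tag}\n")


# Illustrative names only (hypothetical dictionary content).
example_names = ["石宇", "秀安", "尚华"]
dict_path = os.path.join(tempfile.gettempdir(), "example_name_dict.txt")
write_name_dict(example_names, dict_path)

with open(dict_path, encoding="utf-8") as f:
    print(f.read())
```

Tagging every entry as `nr` (person name) matters here because the script keeps only words whose part-of-speech flag is `nr`.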
Code:
# -*- coding: utf-8 -*-
import codecs

import jieba
import jieba.posseg as pseg

names = {}          # name dictionary: name -> number of occurrences
relationships = {}  # relationship dictionary: name -> {other name -> co-occurrence count}
lineNames = []      # names found in each paragraph

# count names
jieba.load_userdict("D:\\ResearchContent\\Exercise_Programm\\PythonExercise\\Python\\dict.txt")  # load the name dictionary
with codecs.open("D:\\ResearchContent\\Exercise_Programm\\PythonExercise\\Python\\fushan.txt", "r", "utf8") as f:
    for line in f.readlines():
        poss = pseg.cut(line)  # segment the line and get each word's part of speech
        lineNames.append([])   # add a name list for the newly read paragraph
        for w in poss:
            if w.flag != "nr" or len(w.word) < 2:
                continue  # a word shorter than 2 characters, or not tagged nr, is not treated as a name
            lineNames[-1].append(w.word)  # add the person to the current paragraph
            if names.get(w.word) is None:
                names[w.word] = 0
                relationships[w.word] = {}
            names[w.word] += 1  # increment this person's occurrence count

# explore relationships
for line in lineNames:  # for each paragraph
    for name1 in line:
        for name2 in line:  # every pair of people in the paragraph
            if name1 == name2:
                continue
            if relationships[name1].get(name2) is None:
                # create the entry if the two have not co-occurred yet
                relationships[name1][name2] = 1
            else:
                # increment the pair's co-occurrence count
                relationships[name1][name2] = relationships[name1][name2] + 1

# output
with codecs.open("busan_node.txt", "w", "gbk") as f:
    f.write("Id Label Weight\r\n")
    for name, times in names.items():
        f.write(name + " " + name + " " + str(times) + "\r\n")

with codecs.open("busan_edge.txt", "w", "gbk") as f:
    f.write("Source Target Weight\r\n")
    for name, edges in relationships.items():
        for v, w in edges.items():
            if w > 3:  # keep only pairs that co-occur more than 3 times
                f.write(name + " " + v + " " + str(w) + "\r\n")
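Because the nested loops record every pair in both directions, the edge file lists each relationship twice (A to B and B to A, with the same weight). If the graph is going to be treated as undirected, one option is to collapse the two directions into a single edge before writing the file. The sketch below does this and writes comma-separated output with Python's `csv` module. The helpers `undirected_edges` and `write_edges_csv`, the `min_weight` parameter, and the toy input are assumptions for illustration, not part of the original script.

```python
import csv
import io


def undirected_edges(relationships, min_weight=3):
    """Collapse symmetric directed counts into one edge per unordered pair."""
    edges = {}
    for source, targets in relationships.items():
        for target, weight in targets.items():
            if weight > min_weight:
                key = tuple(sorted((source, target)))
                # Both directions carry the same count, so keep one value.
                edges[key] = max(edges.get(key, 0), weight)
    return edges


def write_edges_csv(edges, stream):
    """Write a header row followed by one row per edge."""
    writer = csv.writer(stream)
    writer.writerow(["Source", "Target", "Weight"])
    for (a, b), weight in sorted(edges.items()):
        writer.writerow([a, b, weight])


# Hypothetical toy input in the same shape as the relationships dictionary.
relationships = {
    "A": {"B": 5, "C": 2},
    "B": {"A": 5},
    "C": {"A": 2},
}
buffer = io.StringIO()
write_edges_csv(undirected_edges(relationships), buffer)
print(buffer.getvalue())
```

Writing to a real file instead of `io.StringIO` only requires passing a file opened with `newline=""` as `stream`.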
References:
A brief introduction to co-occurrence networks (in English)
Python Chinese word segmentation: jieba
Explanation of import ... as