
Fraud User Risk Identification Based on GraphSAGE

Graph Technology

Graph analysis and mining with Neo4j, networkx, DGL, and Python

【1】Dijkstra's shortest-path algorithm

【2】A networkx-based model for identifying hidden group relationships

【3】Analyzing and mining guarantee-community patterns with Neo4j

【4】Finding all paths from a target node to every other node in a DAG with Python

【5】Paths between any two nodes in a directed graph

【6】An introduction to graph basics

【7】A quick start with knowledge graphs

Contents

  • Graph Technology
  • Preface
  • I. Competition Description
    • 1. Data Description
    • 2. Prediction Task
    • 3. Data Download
  • II. Code Walkthrough
    • 1. Data Processing
    • 2. Converting to Graph Data
    • 3. Model Training and Prediction

Preface

I took part in the 7th FinVolution (信也) Cup and am writing down how GraphSAGE and Neo4j can be used for graph analysis and mining. Anti-fraud is a perennial theme in finance: in internet consumer-credit business, digital anti-fraud technology is already widely deployed with good results, and this includes graph neural networks, which have developed rapidly in recent years and are being applied in more and more domains. The competition host, FinVolution Group (信也科技), is a fintech group that uses big data, artificial intelligence, and other advanced technologies to bridge internet-credit users and financial institutions. Against the backdrop of internet-scale intelligent risk control, the competition asks participants to look at users from the perspective of how they connect to and influence one another, and to explore scalable, efficient graph neural network solutions for risk-control anti-fraud, so that fraudulent users can be identified more effectively.

I. Competition Description

1. Data Description

The preliminary and final rounds each provide a desensitized, fully connected, directed dynamic social graph, sampled from FinVolution data in different business periods. In these graphs, a node represents a registered FinVolution user, and a directed edge from node A to node B means that user A listed user B as an emergency contact. Edges come in several types, corresponding to different emergency-contact categories, and each edge carries a creation date. In both graphs the creation dates are desensitized into positive integers starting from 1, with a time unit of one day. Both dynamic graphs have been sampled and contain a small amount of noise.
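A quick way to sanity-check the downloaded archive (assuming it is placed under data/ as in the code later in this post) is to list the arrays it contains:

import numpy as np

item = np.load('data/phase1_gdata.npz')
# expected keys: x, y, edge_index, edge_type, edge_timestamp, train_mask, test_mask
for key in item.files:
    print(key, item[key].shape)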

2. Prediction Task

The prediction task is node classification: identify fraudulent users. Although the graph contains four node classes, the task only requires separating fraudulent users (Class 1) from normal users (Class 0); these two classes are called foreground nodes. The other two classes (Class 2 and Class 3), although far more numerous, are unrelated to whether a user is fraudulent, so they are excluded from the prediction task; they are called background nodes.
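Only the foreground classes are scored. A minimal sketch of picking them out of the label array y, assuming background users carry the labels 2 and 3 as described above:

import numpy as np

y = np.load('data/phase1_gdata.npz')['y']
# keep normal (0) and fraudulent (1) users; Class 2 and Class 3 are background nodes
foreground_idx = np.where((y == 0) | (y == 1))[0]
print('foreground nodes:', len(foreground_idx), 'of', len(y))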

Unlike conventional structured data, graph algorithms can improve predictive performance by modelling the complex relationships between entities. Besides the social relationships among foreground nodes, the competition also provides a large number of background nodes. Participants are expected to fully exploit the connections and influence among all user types and propose scalable, efficient graph neural network models that pick out the fraudulent users hiding among normal ones.

3. Data Download

Data download

II. Code Walkthrough

  1. The phase1_gdata.npz data file needs to be downloaded;
  2. The Neo4j version used is 4.4.8:

Version: 4.4.8

Edition: Community

Name: neo4j

  3. When importing data into Neo4j, create an index first to speed up the load (see the sketch after this list);
  4. Neo4j's graph algorithms are used through the GDS (Graph Data Science) library.
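For point 3, a minimal sketch of creating the index with py2neo before running the CSV loads (Neo4j 4.x syntax; the index name cust_object_key is an arbitrary choice, and the older CREATE INDEX ON :Cust(object_key) form, shown commented out in the code below, also works but is deprecated):

from py2neo import Graph

graph = Graph('http://localhost:7474', user='neo4j', password='****')
# index the node key that the LOAD CSV MERGE / MATCH clauses look up
graph.run("CREATE INDEX cust_object_key IF NOT EXISTS FOR (c:Cust) ON (c.object_key)")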

1. Data Processing

#import packages
import warnings
warnings.filterwarnings("ignore")
import pandas as pd
import numpy as np
import os
import string
import tqdm
import time
import datetime
import torch
import torch as th
#create a directory for the CSV files that will be loaded into neo4j
neo4jpath = 'neo4j'
if not os.path.exists(neo4jpath):
    os.mkdir(neo4jpath)
path = 'data'
dataname = 'phase1_gdata.npz'
pathname = os.path.join(path, dataname)
item = np.load(pathname)

#split the labelled nodes into training (60%) and validation (40%); the test mask is provided separately
train_mask_t = item['train_mask']
np.random.shuffle(train_mask_t)
train_mask = train_mask_t[:int(len(train_mask_t)/10*6)]
valid_mask = train_mask_t[int(len(train_mask_t)/10*6):]
test_mask = item['test_mask']
train_idx = torch.tensor(train_mask, dtype=torch.int64)
val_idx = torch.tensor(valid_mask, dtype=torch.int64)
test_idx = torch.tensor(test_mask, dtype=torch.int64)

#keys in the raw npz archive
nums = ['x', 'y', 'edge_index', 'edge_type', 'edge_timestamp', 'train_mask', 'test_mask']
#add column names
words = string.ascii_uppercase[0:17]
words = [i for i in words]
x = pd.DataFrame(item['x'])
x.columns = words

#add a primary key
x['object_key'] = [i for i in range(len(x))]
#add the label
x['type'] = item['y']

#node data
node = x.copy()
#relationship (edge) data
rela = pd.DataFrame(item['edge_index'])
rela.columns = ['start', 'target']
rela['edge_type'] = [int(i) for i in item['edge_type']]
rela['edge_timestamp'] = [int(i) for i in item['edge_timestamp']]
node.to_csv('neo4j/node.csv', encoding='utf-8', index=False)
rela.to_csv('neo4j/rela.csv', encoding='utf-8', index=False)

from py2neo import Graph, Node, Relationship
def connect_graph():
    # HTTP endpoint of the local Neo4j instance (not the /browser/ UI URL)
    graph = Graph('http://localhost:7474', user = 'neo4j', password = '****')
    return graph
graph = connect_graph()
def create_graph():
    #delete existing data (optional)
    #graph.run("match p = ()-[r]->() delete p")
    #graph.run("match (n) delete n")
    #drop the index
    #graph.run("DROP INDEX ON:Cust(object_key)")
    #create the index on object_key
    #graph.run("CREATE INDEX ON:Cust(object_key)")
    #load the node data
    rootpath = os.getcwd()
    load_node_path = os.path.join(rootpath, 'neo4j', 'node.csv')

    starttime = datetime.datetime.now()
    graph.run("USING PERIODIC COMMIT 1000 LOAD CSV WITH HEADERS FROM 'file://%s' AS line MERGE (p:Cust{object_key:line.object_key, type:line.type, A: line.A, B: line.B, C: line.C, D: line.D, E: line.E, F: line.F, G: line.G, H: line.H, I: line.I, J: line.J, K: line.K, L: line.L, M: line.M, N: line.N, O: line.O, P: line.P, Q: line.Q}) ON CREATE SET p.object_key=line.object_key, p.type = line.type ON MATCH SET p.type = line.type WITH p RETURN count(p)" % load_node_path)
    endtime = datetime.datetime.now()
    print('elapsed %ss'%((endtime - starttime).seconds))
    print("%s INFO : finished loading %s." % (time.ctime(), load_node_path))
    #load the relationship data
    load_rela_path = os.path.join(rootpath, 'neo4j', 'rela.csv')
    starttime = datetime.datetime.now()
    graph.run("USING PERIODIC COMMIT 1000 LOAD CSV WITH HEADERS FROM 'file://%s' AS line match (s:Cust{object_key:line.start}),(t:Cust{object_key:line.target}) MERGE (s)-[r:rela{edge_type:toInteger(line.edge_type), edge_timestamp:toInteger(line.edge_timestamp)}]->(t) ON CREATE SET r.edge_type = toInteger(line.edge_type), r.edge_timestamp = toInteger(line.edge_timestamp) ON MATCH SET r.edge_timestamp = toInteger(line.edge_timestamp)" % load_rela_path)
    endtime = datetime.datetime.now()
    print('elapsed %ss'%((endtime - starttime).seconds))
    print("%s INFO : finished loading %s." % (time.ctime(), load_rela_path))
create_graph()

#use GDS to compute graph features on the nodes; first project an in-memory graph
starttime = datetime.datetime.now()
project_result = pd.DataFrame(graph.run("CALL gds.graph.project('myGraph','Cust',{rela: {orientation: 'REVERSE', properties: ['edge_type']}})").data())
endtime = datetime.datetime.now()
print('elapsed %ss'%((endtime - starttime).seconds))
#indegree
starttime = datetime.datetime.now()
indegree = pd.DataFrame(graph.run("CALL gds.degree.stream('myGraph') YIELD nodeId, score RETURN gds.util.asNode(nodeId).object_key as object_key, score AS indegree").data())
endtime = datetime.datetime.now()
print('indegree elapsed %ss'%((endtime - starttime).seconds))
#weightedFollowers
starttime = datetime.datetime.now()
weightedFollowers = pd.DataFrame(graph.run("CALL gds.degree.stream('myGraph', { relationshipWeightProperty: 'edge_type' }) YIELD nodeId, score RETURN gds.util.asNode(nodeId).object_key AS object_key, score AS weightedFollowers").data())
endtime = datetime.datetime.now()
print('weightedFollowers elapsed %ss'%((endtime - starttime).seconds))
#outdegree
starttime = datetime.datetime.now()
outdegree = pd.DataFrame(graph.run("CALL gds.degree.stream('myGraph', { orientation: 'REVERSE' }) YIELD nodeId, score RETURN gds.util.asNode(nodeId).object_key AS object_key, score AS outdegree").data())
endtime = datetime.datetime.now()
print('outdegree elapsed %ss'%((endtime - starttime).seconds))

#pagerank
starttime = datetime.datetime.now()
pagerank = pd.DataFrame(graph.run("CALL gds.pageRank.stream('myGraph') YIELD nodeId, score RETURN gds.util.asNode(nodeId).object_key AS object_key, score as pagerank").data())
endtime = datetime.datetime.now()
print('pagerank elapsed %ss'%((endtime - starttime).seconds))

#articleRank
starttime = datetime.datetime.now()
articleRank = pd.DataFrame(graph.run("CALL gds.articleRank.stream('myGraph') YIELD nodeId, score RETURN gds.util.asNode(nodeId).object_key AS object_key, score as articleRank").data())
endtime = datetime.datetime.now()
print('articleRank elapsed %ss'%((endtime - starttime).seconds))

#louvain
starttime = datetime.datetime.now()
louvain = pd.DataFrame(graph.run("CALL gds.louvain.stream('myGraph') YIELD nodeId, communityId, intermediateCommunityIds RETURN gds.util.asNode(nodeId).object_key as object_key, communityId").data())
endtime = datetime.datetime.now()
print('louvain elapsed %ss'%((endtime - starttime).seconds))

louvain['counts'] = 1
louvain_count = louvain.groupby(['communityId']).count().reset_index()[['communityId', 'counts']]
louvain_count.head()
louvain_count = pd.merge(louvain_count, louvain[['object_key', 'communityId']], how = 'left', on = 'communityId')
louvain_count['object_key'] = [int(i) for i in louvain_count['object_key']]
node_louvain = pd.merge(node, louvain_count, how = 'left', on = 'object_key')

indegree['object_key'] = [int(i) for i in indegree['object_key']]
weightedFollowers['object_key'] = [int(i) for i in weightedFollowers['object_key']]
outdegree['object_key'] = [int(i) for i in outdegree['object_key']]
pagerank['object_key'] = [int(i) for i in pagerank['object_key']]
articleRank['object_key'] = [int(i) for i in articleRank['object_key']]

node_louvain = pd.merge(node_louvain, indegree, how = 'left', on = 'object_key')
node_louvain = pd.merge(node_louvain, weightedFollowers, how = 'left', on = 'object_key')
node_louvain = pd.merge(node_louvain, outdegree, how = 'left', on = 'object_key')
node_louvain = pd.merge(node_louvain, pagerank, how = 'left', on = 'object_key')
node_louvain = pd.merge(node_louvain, articleRank, how = 'left', on = 'object_key')
node_louvain.fillna(0, inplace = True)
node_louvain.to_csv('neo4j/node_louvain_tmp.csv', encoding='utf-8', index=False)
th.save(train_idx, 'data/train_idx.txt')
#train_idx = th.load('data/train_idx.txt')
th.save(val_idx, 'data/val_idx.txt')
#val_idx = th.load('data/val_idx.txt')
th.save(test_idx, 'data/test_idx.txt')
#test_idx = th.load('data/test_idx.txt')
print('train_idx------:',len(train_idx))
print('val_idx--------:',len(val_idx))
print('test_idx-------:',len(test_idx))
           

2. Converting to Graph Data

Because the project uses the DGL deep-learning framework, the relationship data produced above has to be converted into a graph object that DGL can read.
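A toy sketch of the DGL API the script relies on: dgl.graph builds a graph from a (source_ids, destination_ids) pair, and node and edge features are attached through g.ndata and g.edata (the names feat and w here are arbitrary):

import dgl
import torch

src = torch.tensor([0, 1, 2])
dst = torch.tensor([1, 2, 0])
toy = dgl.graph((src, dst))
toy.ndata['feat'] = torch.randn(toy.num_nodes(), 4)   # one 4-dim feature vector per node
toy.edata['w'] = torch.ones(toy.num_edges(), 1)       # one feature per edge
print(toy)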

# 1. import modules
import warnings
warnings.filterwarnings('ignore')
import numpy as np
import pandas as pd
import json
import torch as th
import dgl
from dgl import save_graphs, load_graphs
from sklearn.metrics import roc_curve, auc
import torch
import torch.nn as nn
import torch.nn.functional as F
import random
import time
import os
from matplotlib import pyplot as plt
from matplotlib import font_manager

# 2. read the relationship data
init_rela = pd.read_csv('neo4j/rela.csv', encoding='utf-8')
init_rela.head()

#date features: derive days, months, years from the edge timestamp
import math
init_rela['days'] = abs(init_rela['edge_timestamp'] -365)
init_rela['months'] = [int(i) for i in init_rela['days']/30]
init_rela['years'] = [int(i) for i in init_rela['months']/12]
#add the reverse of every edge so the relation becomes bidirectional
init_rela = pd.concat([init_rela, init_rela.rename(columns = {'start': 'target', 'target': 'start'})]).reset_index()


# 3. read the node data
init_node = pd.read_csv('neo4j/node_louvain_tmp.csv', encoding='utf-8')
print('initial node count:', len(init_node))
init_node.head()
init_node.fillna(0, inplace = True)

# cap feature K: values above 30 are treated as 0
def fun(x):
    if x <= 30:
        return x
    else:
        return 0
init_node['K'] = init_node['K'].apply(lambda x: fun(x))

# replace the -1 placeholder in each numeric feature with the mean of that
# column's non-missing values (K was capped above; G is left unchanged)
for index in ['C', 'D', 'E', 'F', 'H', 'I', 'J', 'L', 'M', 'N', 'O', 'P', 'Q']:
    exchange = init_node.loc[init_node[index] != -1][index].mean()
    init_node[index] = init_node[index].apply(lambda x: exchange if x == -1 else x)

# one-hot encode the categorical features A and B
for col in ['A', 'B']:
    test = pd.get_dummies(init_node[col])
    test.columns = [col + '_' + str(x) for x in test.columns]
    init_node = pd.concat([init_node, test], axis = 1)

edge = (init_rela['start'], init_rela['target'])
g = dgl.graph(edge)

g.ndata['label'] = torch.Tensor(init_node['type'])

print(g)

#min-max normalization of the numeric and graph features
df = init_node[['C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N',
       'O', 'P', 'Q', 'counts',
       'indegree', 'weightedFollowers', 'outdegree', 'pagerank', 'articleRank'
       ]]
df = (df-df.min())/(df.max()-df.min())
init_node = pd.concat([init_node[['A_-1.0', 'A_0.0', 'A_1.0', 'B_-1.0', 'B_0.0', 'B_1.0',
       'B_2.0', 'B_3.0', 'B_4.0', 'B_5.0', 'B_6.0', 'B_7.0', 'B_8.0']], df],axis = 1)

g.ndata['features'] = torch.Tensor(np.float32(np.array(init_node[['A_-1.0', 'A_0.0', 'A_1.0', 'B_-1.0', 'B_0.0', 'B_1.0', 'B_2.0',
       'B_3.0', 'B_4.0', 'B_5.0', 'B_6.0', 'B_7.0', 'B_8.0', 'C', 'D', 'E',
       'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'counts',
       'indegree', 'weightedFollowers', 'outdegree', 'pagerank', 'articleRank'
       ]])))

test = init_rela[['edge_type']]
test = pd.get_dummies(test['edge_type'])
test.columns = ['edge_type_' +str(x) for x in test.columns ]
init_rela = pd.concat([init_rela, test],axis = 1)

g.edata['weight'] = torch.Tensor(np.float32(np.array(init_rela[['edge_timestamp', 'edge_type_1', 'edge_type_2', 'edge_type_3',
       'edge_type_4', 'edge_type_5', 'edge_type_6', 'edge_type_7',
       'edge_type_8', 'edge_type_9', 'edge_type_10', 'edge_type_11']])))

print(g)

# save the graph to disk
save_graphs('graph.bin', g)
           

3. Model Training and Prediction
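The script below imports GraphSage from a local model.py that is not included in this post. Purely as an illustration, here is a minimal sketch of a module with a compatible call signature, model(blocks, features, edge_weight, device) returning (logits, embeddings), built on dgl.nn.SAGEConv; it ignores the per-edge weights, which the actual competition model presumably makes use of:

import torch
import torch.nn as nn
import torch.nn.functional as F
import dgl.nn as dglnn

class GraphSage(nn.Module):
    # hidden_feats and the three-layer depth (matching the [25, 15, 10] sampler) are assumptions
    def __init__(self, in_feats, out_feats, edge_feats, hidden_feats=128):
        super().__init__()
        self.layers = nn.ModuleList([
            dglnn.SAGEConv(in_feats, hidden_feats, 'mean'),
            dglnn.SAGEConv(hidden_feats, hidden_feats, 'mean'),
            dglnn.SAGEConv(hidden_feats, hidden_feats, 'mean'),
        ])
        self.classify = nn.Linear(hidden_feats, out_feats)

    def forward(self, blocks, x, edge_weight, device):
        h = x
        for layer, block in zip(self.layers, blocks):
            h = F.relu(layer(block, h))   # edge_weight is not used in this sketch
        return self.classify(h), h        # (logits, node embeddings)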

from model import *
import numpy as np
from sklearn.feature_extraction import DictVectorizer
import warnings
warnings.filterwarnings("ignore")
from torch import nn
import torch.nn.functional as F
from sklearn import metrics
from sklearn import preprocessing
import torch.utils.data as Data
import sklearn.model_selection as ms
import dgl.nn as dglnn
import dgl
import torch as th
import sklearn.metrics as skm
import dgl.function as fn
import torch 
from dgl.data.utils import generate_mask_tensor
import torch.optim as optim
from sklearn.metrics import recall_score
import time
import sys
from sklearn.preprocessing import MinMaxScaler
import pickle
from dgl.data import DGLDataset, save_graphs, load_graphs
import os

device = 0
path_graph_bin = 'graph.bin'
graph,_ = load_graphs(path_graph_bin)
g = graph[0]
num_nodes = g.num_nodes()

train_idx = th.load('data/train_idx.txt')
valid_idx = th.load('data/val_idx.txt')
test_idx = th.load('data/test_idx.txt')

device = 'cpu'
train_idx = train_idx.to(device)
valid_idx = valid_idx.to(device)
#add more data to the training set by moving part of the validation split into it
train_idx = torch.cat([train_idx,valid_idx[:300000]],dim=0)
valid_idx = valid_idx[300000:]
test_idx = test_idx.to(device)
g = g.to('cpu')


g.ndata['label'] = th.LongTensor(g.ndata['label'].numpy())

# set up the neighbor sampler (fan-outs for the three GraphSAGE layers)
sampler = dgl.dataloading.MultiLayerNeighborSampler([25, 15, 10])
in_feats= g.ndata['features'].size()[1]
edge_feats = g.edata['weight'].size()[1]
# train / validation / test dataloaders
batch_size = 1024
train_dataloader = dgl.dataloading.NodeDataLoader(g,train_idx,sampler,batch_size = batch_size,device=device,shuffle = True,drop_last = False,num_workers = 0)
test_dataloader=dgl.dataloading.NodeDataLoader(g,test_idx,sampler,batch_size =  batch_size,device=device,shuffle = False,drop_last = False,num_workers = 0)
valid_dataloader=dgl.dataloading.NodeDataLoader(g,valid_idx,sampler,batch_size =  batch_size,device=device,shuffle = False,drop_last = False,num_workers = 0)
model = GraphSage(in_feats = in_feats,out_feats = 2,edge_feats=edge_feats)
model.to(device)
opt = optim.Adam(model.parameters(),lr = 0.0001)
#opt = optim.SGD(model.parameters(), lr=0.1)
scheduler = optim.lr_scheduler.StepLR(opt, step_size = 15, gamma = 0.5,last_epoch = -1,verbose = False)
#loss_fuc = nn.CrossEntropyLoss(weight =th.FloatTensor([0.03,1]).to(device))
loss_fuc = nn.CrossEntropyLoss(weight =th.FloatTensor([0.011,1]).to(device))
# start training (only 2 epochs here; for a full run increase the range, e.g. range(601) to keep the epoch-600 checkpoint)
for epoch in range(2):
    model.train()
    for step,(input_nodes,seeds,blocks) in enumerate(train_dataloader):
        batch_inputs =blocks[0].srcdata['features'].to(device)
        batch_labels = g.ndata['label'][seeds].to(device)
        edge_weight = [block.edata['weight'].to(device) for block in blocks]
        batch_pred,embeddings = model(blocks,batch_inputs,edge_weight,device)
        #print()
        loss = loss_fuc(batch_pred, batch_labels)
        opt.zero_grad()
        loss.backward()
        opt.step()
        # every 100 steps, print the loss and the current learning rate
        if step%100==0:
            print(loss.data)#log.info("Epoch: {} Step: {} Loss: {}".format(epoch,step,loss.data))
            print("current lr : {}".format(opt.param_groups[0]['lr']))

    scheduler.step()
    # save a checkpoint every 10 epochs
    if epoch%10 == 0:
        checkpoint = {"model_state_dict": model.state_dict(),"optimizer_state_dict": opt.state_dict(),"epoch": epoch}
        path_checkpoint = "checkpoint_epoch_{}.pkl".format(epoch)
        th.save(checkpoint,path_checkpoint)
        th.save(model, 'model.pkl')      
    # evaluate on the validation set
    out_list = []
    pred_list = []
    label_list = []
    model.eval()
    with torch.no_grad():
        for step,(input_nodes,seeds,blocks) in enumerate(valid_dataloader):
            batch_inputs =blocks[0].srcdata['features'].to(device)
            batch_labels = g.ndata['label'][seeds].to(device)
            edge_weight = [block.edata['weight'].to(device) for block in blocks]
            batch_pred,embeddings = model(blocks,batch_inputs,edge_weight,device)
            out_list+=batch_pred.detach().cpu().numpy().tolist()
            pred_list+=batch_pred.argmax(1).detach().cpu().numpy().tolist()
            label_list+=batch_labels.detach().cpu().numpy().tolist()
        auc = metrics.roc_auc_score(np.array(label_list),F.softmax(th.Tensor(out_list), dim=1)[:,1].data.numpy())
        acc = np.mean(np.array(pred_list)==np.array(label_list))
        rs= recall_score(label_list,pred_list)
        print("Epoch: {} testAcc: {} testAuc: {} recall_score: {} pred_num: {} label_num: {}".format(epoch,acc,auc,rs,sum(pred_list),sum(label_list)))
             
#prediction on the test set
out_list = []
pred_list = []
label_list = []
index = []
embed = []
model.eval()
with torch.no_grad():
    for step,(input_nodes,seeds,blocks) in enumerate(test_dataloader):
        print('-'*10, step)
        batch_inputs =blocks[0].srcdata['features'].to(device)
        batch_labels = g.ndata['label'][seeds].to(device)
        edge_weight = [block.edata['weight'].to(device) for block in blocks]
        batch_pred,embeddings = model(blocks,batch_inputs,edge_weight,device)
        index+=seeds.detach().cpu().numpy().tolist()
        embeddings = embeddings.detach().cpu().numpy().round(4).tolist()
        embed+=embeddings
        out_list+=batch_pred.detach().cpu().numpy().tolist()
        pred_list+=batch_pred.argmax(1).detach().cpu().numpy().tolist()
        label_list+=batch_labels.detach().cpu().numpy().tolist()
preds = np.array(out_list)
np.save('preds.npy', preds)
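preds.npy stores the raw two-class logits for the test nodes, in the order collected in index. As a follow-up sketch (an assumed post-processing step, not part of the original pipeline), the logits can be turned into fraud probabilities with a softmax:

import numpy as np
import torch as th
import torch.nn.functional as F

preds = np.load('preds.npy')
# column 1 is the probability of the fraud class (Class 1)
probs = F.softmax(th.tensor(preds), dim=1)[:, 1].numpy()
print(probs[:10])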