NLP Sentiment Analysis: A Detailed Walkthrough Using Douban Movie Reviews

Scraping Douban movie reviews with a crawler + classifying review sentiment with deep learning

Posted by Valuebai on September 9, 2019

Reference materials

  • https://github.com/aakaking/Sentiment-Analysis
  • https://nbviewer.jupyter.org/github/Computing-Intelligence/assignment2-Project2/blob/master/Assignment2%E4%B8%8EProject2%E7%9A%84%E5%AE%9E%E9%AA%8C%E6%8C%87%E5%AF%BC-%E8%8E%AB%E6%99%BA%E6%96%87.ipynb
  • https://nbviewer.jupyter.org/github/Computing-Intelligence/assignment2-Project2/blob/master/Assignment2%E4%B8%8EProject2%E7%9A%84%E5%AE%9E%E9%AA%8C%E6%8C%87%E5%AF%BC.ipynb

Corpus: Douban movie reviews

  1. The file my code uses: ./input/movie_comments.csv
    • Download link:
  2. Write your own crawler to collect Douban movie reviews

    See method 1: https://github.com/Computing-Intelligence/assignment_get_douban_comment
    See method 2: https://github.com/Computing-Intelligence/assignment2-Project2/blob/master/DM/Douban_Crawler.py
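
If you prefer to build the corpus yourself, the snippet below is a minimal sketch of such a crawler. It assumes Douban's public short-comment pages at https://movie.douban.com/subject/<id>/comments?start=<n>&limit=20 and the CSS classes .comment-item, .short and allstar* for the rating; the real page layout and anti-crawling rules may differ, so treat it as an illustration rather than the exact method used in the repos above.

import csv
import time

import requests
from bs4 import BeautifulSoup

HEADERS = {'User-Agent': 'Mozilla/5.0'}  # Douban rejects requests without a browser-like UA


def fetch_comments(movie_id, pages=5):
    """Fetch up to pages*20 short comments for one movie (illustrative selectors)."""
    rows = []
    for page in range(pages):
        url = 'https://movie.douban.com/subject/{}/comments?start={}&limit=20'.format(movie_id, page * 20)
        soup = BeautifulSoup(requests.get(url, headers=HEADERS, timeout=10).text, 'html.parser')
        for item in soup.select('.comment-item'):
            text_tag = item.select_one('.short')
            star_cls = [c for span in item.find_all('span')
                        for c in (span.get('class') or []) if c.startswith('allstar')]
            if text_tag is None or not star_cls:
                continue  # skip entries without text or without a star rating
            rows.append({'comment': text_tag.get_text(strip=True),
                         'star': int(star_cls[0].replace('allstar', '')) // 10})  # e.g. allstar40 -> 4
        time.sleep(2)  # be polite: Douban rate-limits aggressive crawlers
    return rows


if __name__ == '__main__':
    comments = fetch_comments('1292052')  # hypothetical subject id
    with open('./input/movie_comments.csv', 'w', newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(f, fieldnames=['comment', 'star'])
        writer.writeheader()
        writer.writerows(comments)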

Data Cleaning / EDA

To be added.

To be incorporated: https://www.csuldw.com/2019/09/28/2019-09-28-comment-sentiment-analysis/
Reference 1 (no comments in the code): https://github.com/aakaking/Sentiment-Analysis/blob/master/data_processing.py
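
While that write-up is pending, a minimal first look at the data (assuming movie_comments.csv has the same comment and star columns used in the script further down) might be:

import pandas as pd

df = pd.read_csv('./input/movie_comments.csv', encoding='utf-8')

print(df.shape)                    # number of rows and columns
print(df['star'].value_counts())   # class balance across the 1-5 star ratings
print(df['comment'].isna().sum())  # missing comments

# rough comment-length distribution (in characters), useful when picking max_length later
print(df['comment'].astype(str).str.len().describe())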

First Optimization Log

  1. Problem at hand: find a baseline for sentiment analysis on Douban movie reviews; run someone else's code first to get a feel for the task
  2. Planned optimizations: none
  3. Expected result: none
  4. Actual result: final Accuracy: 86.334438 (1586s 23ms/step - loss: 0.0861 - acc: 0.9669 - val_loss: 0.5078 - val_acc: 0.8689)
  5. Analysis:
    • The data cleaning is overly simple; I need to document that process in detail
    • The model was only trained for epochs=5; I need to check how different epoch counts affect the result (see the sketch after this list)
    • The data-cleaning step should be split out into its own .py file, saved as clean_***
    • The model-training step should be split out into its own .py file
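
On the epoch question in item 5, one option is simply to rerun the script below with different epochs values and compare val_acc; another is to let validation loss pick the stopping point. A minimal sketch of the latter, reusing model, train_x and train_y from the full script below (restore_best_weights needs Keras >= 2.2.3):

from keras.callbacks import EarlyStopping

# Stop once val_loss has not improved for 2 consecutive epochs instead of guessing a fixed count;
# restore_best_weights rolls the model back to the best epoch seen so far.
early_stop = EarlyStopping(monitor='val_loss', patience=2, restore_best_weights=True)
history = model.fit(train_x, train_y, validation_split=0.2,
                    epochs=30, callbacks=[early_stop], verbose=1)

The full baseline script for this first pass follows.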


#!/usr/bin/env python
# -*- coding: UTF-8 -*-
'''=================================================
@IDE    :PyCharm
@Author :Valuebai
@Date   :2019/11/26 22:49
@Desc   :
=================================================='''

import pandas as pd
import numpy as np
import jieba
from collections import Counter

# Read the locally saved Douban crawler data, pull out the comment text and star rating,
# and split the comments into negative and positive lists (3-star comments are treated as neutral and dropped)
df_all = pd.read_csv('./input/movie_comments.csv', encoding='utf-8')
comment_negative_ls = []
comment_positive_ls = []
for k in range(len(df_all['comment'])):  # iterate over all comments and bucket them by rating
    if df_all.loc[k, 'star'] in (1, 2):
        comment_negative_ls.append(df_all.loc[k, 'comment'])
    if df_all.loc[k, 'star'] in (4, 5):
        comment_positive_ls.append(df_all.loc[k, 'comment'])


def cut(string):
    return list(jieba.cut(str(string)))


def wash(str_original):
    # characters to strip during cleaning
    ls_delete = ['\u3000', '\\n', ',', '。', ':', ':', ',', '“', '”', '(', ')', '《', '》', '!', '!', '、', '(', ')', '·']
    for ch in ls_delete:
        str_original = str(str_original).replace(ch, ' ')  # replace with a space so punctuation becomes a hard token boundary and does not confuse the tokenizer later
    return str_original


# Tokenize and build labels
comment_NP_ls = comment_negative_ls + comment_positive_ls
comment_NP_ls_cut = [' '.join(cut(s)) for s in [wash(t) for t in comment_NP_ls]]
comment_NP_cut_sum = ' '.join(comment_NP_ls_cut)  # all tokenized comments joined into one string
counts = Counter(comment_NP_cut_sum.split())  # token frequencies over the whole corpus
vocab_size = len(counts)  # number of distinct tokens
labels = [0] * len(comment_negative_ls) + [1] * len(comment_positive_ls)  # 0 = negative, 1 = positive

# one_hot encoding of each tokenized comment (Keras hashes every token to an integer in [1, vocab_size))
from keras.preprocessing.text import one_hot

encoded_docs = [one_hot(d, vocab_size) for d in comment_NP_ls_cut]

# Maximum comment length in tokens: longer comments are truncated, shorter ones are zero-padded. The average comment length is about 30 tokens.
max_length = 30
vector_size = 300
from keras.preprocessing.sequence import pad_sequences

padded_docs = pad_sequences(encoded_docs, maxlen=max_length, padding='post', truncating='post')

# Zip features and labels together, shuffle them jointly, then unzip
zipped_docs_labels = list(zip(padded_docs, labels))
import random

random.shuffle(zipped_docs_labels)
padded_docs, labels = zip(*zipped_docs_labels)
padded_docs = np.array(padded_docs)
labels = np.array(labels)  # numpy array so Keras can consume the slices directly

# Prepare training and test sets
train_ratio = 0.9  # fraction of all samples used for training
train_amount = int(train_ratio * len(labels))  # len(labels) is the total number of samples
train_x = padded_docs[:train_amount]
train_y = labels[:train_amount]
test_x = padded_docs[train_amount:]
test_y = labels[train_amount:]

# Build the model
from keras.models import Sequential
from keras.layers.embeddings import Embedding
from keras.layers.core import Flatten, Dense
from keras.layers import Conv1D, LSTM, Dropout, Bidirectional
from keras.layers.convolutional import MaxPooling1D
from keras import regularizers

model = Sequential()
model.add(Embedding(vocab_size, vector_size, input_length=max_length))
model.add(Conv1D(filters=30, kernel_size=3, padding='same', activation='relu'))
model.add(Dropout(0.5))
model.add(MaxPooling1D(pool_size=2))
model.add(Bidirectional(LSTM(50, return_sequences=True)))
# model.add(Bidirectional(LSTM(50)))
model.add(Dropout(0.5))
model.add(Flatten())
# model.add(Dense(1, activation='softmax', kernel_regularizer=regularizers.l2(0.01),activity_regularizer=regularizers.l2(0.01)))
model.add(Dense(1, activation='sigmoid'))

# Compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])
print(model.summary())  # print the model architecture

# Fit the model
history = model.fit(train_x, train_y, validation_split=0.2, epochs=5, verbose=1)

import matplotlib.pyplot as plt

plt.plot(history.history['acc'])
plt.plot(history.history['loss'])
plt.title('training accuracy and loss')
plt.ylabel('value')
plt.xlabel('epoch')
plt.legend(['accuracy', 'loss'])
plt.show()

# Evaluate accuracy on the held-out test set
loss, accuracy = model.evaluate(test_x, test_y, verbose=1)
print('Accuracy: %f' % (accuracy * 100))
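
To score a new comment with the trained model, the same wash + jieba + one_hot + padding pipeline has to be applied first. A minimal sketch, reusing cut, wash, vocab_size, max_length and model from the script above (note that Keras one_hot hashes with Python's hash, so the encoding is only consistent within the same process run):

def predict_sentiment(text):
    """Return the model's positive-sentiment probability for one raw comment."""
    tokens = ' '.join(cut(wash(text)))  # same cleaning and jieba tokenization as training
    encoded = one_hot(tokens, vocab_size)  # same hashing trick as training
    padded = pad_sequences([encoded], maxlen=max_length, padding='post', truncating='post')
    return float(model.predict(padded)[0][0])  # sigmoid output in [0, 1]


print(predict_sentiment('这部电影太好看了,强烈推荐!'))  # expect a score close to 1
print(predict_sentiment('剧情拖沓,看得昏昏欲睡。'))  # expect a score close to 0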

[Environment] win10_x64 + python3.6.5

[Me] https://github.com/Valuebai/
