funNLP各种语料集合
- https://github.com/fighting41love/funNLP
中文停用词
# 加载停用词表
stopwords1 = [line.rstrip() for line in open(os.path.join(stop_words_path, '中文停用词库.txt'), 'r', encoding='utf-8')]
stopwords2 = [line.rstrip() for line in open(os.path.join(stop_words_path, '哈工大停用词表.txt'), 'r', encoding='utf-8')]
stopwords3 = [line.rstrip() for line in open(os.path.join(stop_words_path, '四川大学机器智能实验室停用词库.txt'), 'r', encoding='utf-8')]
stopwords = stopwords1 + stopwords2 + stopwords3
- https://github.com/jwx0539/ML_CheckComments/tree/master/ML_CheckComments/stop_words
豆瓣电影评论
-
【下载地址】https://www.kaggle.com/utmhikari/doubanmovieshortcomments
-
Kaggle提供的数据集。数据集(DMSC.csv)为CSV文件。该数据集包含了28部电影的200万条影评。可以用于文本分类,聚类,情感分析等。
【参考】 1、出处:地址