NLP-语料数据集

记录收藏各大语料数据集

Posted by Valuebai on September 1, 2019

funNLP各种语料集合

  • https://github.com/fighting41love/funNLP

中文停用词

# 加载停用词表
stopwords1 = [line.rstrip() for line in open(os.path.join(stop_words_path, '中文停用词库.txt'), 'r', encoding='utf-8')]
stopwords2 = [line.rstrip() for line in open(os.path.join(stop_words_path, '哈工大停用词表.txt'), 'r', encoding='utf-8')]
stopwords3 = [line.rstrip() for line in open(os.path.join(stop_words_path, '四川大学机器智能实验室停用词库.txt'), 'r', encoding='utf-8')]
stopwords = stopwords1 + stopwords2 + stopwords3
  • https://github.com/jwx0539/ML_CheckComments/tree/master/ML_CheckComments/stop_words

豆瓣电影评论

  • 【下载地址】https://www.kaggle.com/utmhikari/doubanmovieshortcomments

  • Kaggle提供的数据集。数据集(DMSC.csv)为CSV文件。该数据集包含了28部电影的200万条影评。可以用于文本分类,聚类,情感分析等。

【参考】 1、出处:地址