
Wikipedia Word2Vec: an Apache Spark Word2Vec model trained on 200K Wikipedia pages

Size: 132.74M


Tags: NLP, Business, Earth and Nature, Text Mining Classification


README.md

I used Apache Spark to extract more than 6 million phrases from 200,000 English Wikipedia pages. The pipeline for cleaning the text, extracting keywords, and training the Word2Vec model was:

Merging each page's title with its text
Sentence detection (spark-nlp)
Tokenizer (spark-nlp)
Normalizer (spark-nlp)
POS Tagger (spark-nlp)
Chunking with grammar rules to detect both uni-grams and multi-grams (spark-nlp)
Stop words remover (Spark ML)
Training and transforming with the Word2Vec model (Spark ML)
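
The last two steps above use plain Spark ML, so they can be sketched without the Spark NLP stages. The following is a minimal sketch in spark-shell style, not the author's actual code: the input column name "phrases", the toy rows, and the small vector size are assumptions made for illustration, while "filteredPhrases" and "word2vec" are the column names used in the model details.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{StopWordsRemover, Word2Vec}
import org.apache.spark.sql.SparkSession

// Reuses the active session in spark-shell, or starts a local one otherwise.
val spark = SparkSession.builder()
  .appName("phrase-pipeline-sketch")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Toy stand-in (assumption) for the phrase arrays produced by the chunking step.
val phrases = Seq(
  Seq("climate change", "is", "a", "global", "issue"),
  Seq("global warming", "and", "climate change", "are", "related")
).toDF("phrases")

// Drop English stop words from each phrase array.
val remover = new StopWordsRemover()
  .setInputCol("phrases")
  .setOutputCol("filteredPhrases")

// Tiny settings so the toy data trains; the real model uses the values listed under Content.
val word2Vec = new Word2Vec()
  .setInputCol("filteredPhrases")
  .setOutputCol("word2vec")
  .setVectorSize(8)
  .setMinCount(1)

// Fit both stages as one pipeline, then add the averaged vector column.
val model = new Pipeline().setStages(Array(remover, word2Vec)).fit(phrases)
model.transform(phrases).select("filteredPhrases", "word2vec").show(false)
```

Because Word2Vec is the last stage of the fitted pipeline, it can be retrieved with `stages.last`, which is the same access pattern shown in the Usage section.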

Content

Word2Vec model details:

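import org.apache.spark.ml.feature.Word2Vec

val word2Vec = new Word2Vec()
  .setInputCol("filteredPhrases")  // phrases left after stop-word removal
  .setOutputCol("word2vec")
  .setVectorSize(300)              // dimensionality of each phrase vector
  .setMinCount(10)                 // ignore phrases seen fewer than 10 times
  .setMaxIter(1)                   // a single training iteration
  .setNumPartitions(1)             // train on one partition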

Usage

You can simply download this model and load it into your Apache Spark ML pipeline:

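import org.apache.spark.ml.PipelineModel
import org.apache.spark.ml.feature.Word2VecModel

// Load the saved pipeline; the Word2Vec model is its last stage
val pipeLineWord2VecModel = PipelineModel.read.load("/tmp/multivac_nlp_ml_200k")
val word2VecModel = pipeLineWord2VecModel.stages.last.asInstanceOf[Word2VecModel]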

word2VecModel.findSynonyms("climate change", 10).show(false)
+--------------------------+------------------+
|word                      |similarity        |
+--------------------------+------------------+
|global warming            |0.7534363269805908|
|intergovernmental panel   |0.7303586602210999|
|sustainable development   |0.714561939239502 |
|greenhouse gas emissions  |0.6958430409431458|
|food security             |0.6919037103652954|
|development policy        |0.6879498958587646|
|environmental policy      |0.6868311166763306|
|energy security           |0.681218147277832 |
|multinational corporations|0.6769515872001648|
|tax policy                |0.671006977558136 |
+--------------------------+------------------+

word2VecModel.findSynonyms("football", 10).show(false)
+--------------------------+------------------+
|word                      |similarity        |
+--------------------------+------------------+
|football team             |0.7648624181747437|
|football soccer           |0.7647290229797363|
|field hockey              |0.745803952217102 |
|football teams            |0.7442964911460876|
|soccer                    |0.7377723455429077|
|professional football     |0.7375280261039734|
|youth academy             |0.7372391819953918|
|national basketball league|0.7333077788352966|
|coach                     |0.7324917912483215|
|league championships      |0.7308306694030762|
+--------------------------+------------------+

word2VecModel.findSynonyms("cancer", 10).show(false)
+-----------------------+------------------+
|word                   |similarity        |
+-----------------------+------------------+
|climate change         |0.7534365057945251|
|literature review      |0.7533518075942993|
|minimize               |0.7510043382644653|
|categorization         |0.7404615879058838|
|health effects         |0.7371178269386292|
|genetic information    |0.7362238168716431|
|scientific basis       |0.7347298860549927|
|intergovernmental panel|0.734147846698761 |
|recent study           |0.7333264350891113|
|food security          |0.7322153449058533|
+-----------------------+------------------+

word2VecModel.findSynonyms("london", 10).show(false)
+----------------------+------------------+
|word                  |similarity        |
+----------------------+------------------+
|edinburgh             |0.6135260462760925|
|glasgow               |0.5734920501708984|
|bristol               |0.5710445642471313|
|edinburgh scotland    |0.5306239724159241|
|kensington            |0.5289728045463562|
|islington             |0.5218709707260132|
|clapham               |0.5164309144020081|
|leicester             |0.5161707401275635|
|cambridge             |0.5141464471817017|
|royal scottish academy|0.508998453617096 |
+----------------------+------------------+
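
The similarity column returned by findSynonyms is the cosine similarity between the query vector and each candidate's vector. As a rough illustration of what that number measures, here is a small plain-Scala sketch; the function name `cosine` and the three toy vectors are hypothetical and are not taken from the model.

```scala
// Cosine similarity between two dense vectors of equal length.
def cosine(a: Array[Double], b: Array[Double]): Double = {
  require(a.length == b.length, "vectors must have the same length")
  val dot = a.zip(b).map { case (x, y) => x * y }.sum
  val normA = math.sqrt(a.map(x => x * x).sum)
  val normB = math.sqrt(b.map(x => x * x).sum)
  // Define the similarity with a zero vector as 0 to avoid dividing by zero.
  if (normA == 0.0 || normB == 0.0) 0.0 else dot / (normA * normB)
}

// Toy 3-dimensional vectors (hypothetical values).
val u = Array(1.0, 2.0, 3.0)
val v = Array(2.0, 4.0, 6.0)   // same direction as u, so similarity is close to 1
val w = Array(-3.0, 0.0, 1.0)  // orthogonal to u, so similarity is close to 0

println(cosine(u, v))
println(cosine(u, w))
```

Values near 1 mean two phrases point in almost the same direction in the embedding space, which is why the scores in the tables above are best read as a relative ranking rather than as probabilities.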

Environment

Cloudera CDH 5.15.1
Apache Spark 2.3.1
Spark NLP 1.6.2
Ubuntu 16.04.x
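
To reproduce a similar setup outside Cloudera, the library versions above could be pinned in an sbt build. This is only a sketch under assumptions: the Scala version (2.11.12) and the "provided" scope are not stated in the README, and the artifact coordinates should be checked against the repositories you use.

```scala
// build.sbt (sketch; library versions taken from the Environment list)
scalaVersion := "2.11.12" // assumption: Spark 2.3.x is built against Scala 2.11

libraryDependencies ++= Seq(
  // Spark ML, marked "provided" on the assumption that the cluster supplies it
  "org.apache.spark" %% "spark-mllib" % "2.3.1" % "provided",
  // Spark NLP, used for the sentence detection, tokenizer, normalizer, POS and chunking stages
  "com.johnsnowlabs.nlp" %% "spark-nlp" % "1.6.2"
)
```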

Acknowledgements

This work was carried out using the infrastructure of ISC-PIF/CNRS (UPS3611) and the Multivac Platform.
