{"id":263,"date":"2020-04-29T19:26:11","date_gmt":"2020-04-29T10:26:11","guid":{"rendered":"http:\/\/localhost:8000\/?p=263"},"modified":"2021-01-16T19:29:41","modified_gmt":"2021-01-16T10:29:41","slug":"pyspark-samples","status":"publish","type":"post","link":"http:\/\/localhost:8000\/2020\/04\/pyspark-samples.html","title":{"rendered":"PySpark\u306e\u5b9f\u88c5\u30b5\u30f3\u30d7\u30eb\u3068\u5b9f\u884c\u65b9\u6cd5"},"content":{"rendered":"
\u6700\u8fd1Spark\u3092\u89e6\u308b\u6a5f\u4f1a\u304c\u3042\u3063\u3066\u3001\u5c11\u3057\u3060\u3051\u52c9\u5f37\u3057\u305f\u306e\u3067\u30e1\u30e2\u304c\u3066\u3089\u6b8b\u3057\u3066\u304a\u304d\u307e\u3059\u3002<\/p>\n
\u753b\u50cf\u306f\u3001\u3053\u3061\u3089<\/a>\u304b\u3089\u304a\u501f\u308a\u3057\u307e\u3057\u305f\u3002 \u81ea\u5206\u306e\u7406\u89e3\u3067\u306f\u3001\u4ee5\u4e0b\u306e\u3088\u3046\u306a\u51e6\u7406\u306e\u6d41\u308c\u306b\u306a\u3063\u3066\u3044\u308b\u307f\u305f\u3044\u3067\u3059\u3002 \u8a66\u9a13\u306e\u7d50\u679c\u304b\u3089\u300c\u5404\u6559\u79d1\u306e\u5e73\u5747\u70b9\u300d\u3068\u300c4\u6559\u79d1\u5408\u8a08\u306e\u5e73\u5747\u70b9\u300d\u3092\u7b97\u51fa\u3059\u308b\u3068\u3044\u3046\u6bd4\u8f03\u7684\u30b7\u30f3\u30d7\u30eb\u306a\u51e6\u7406\u3092\u8003\u3048\u3066\u307f\u3088\u3046\u3068\u601d\u3044\u307e\u3059\u3002<\/p>\n \u307e\u305a\u3001SparkContext\u3092\u751f\u6210\u3057\u3066\u3044\u307e\u3059\u3002<\/p>\n \u524d\u8ff0\u306e\u69cb\u6210\u8981\u7d20\u306e\u3068\u3053\u308d\u3067\u8aac\u660e\u3057\u305f\u3084\u3064\u3067\u3059\u3002DAG Scheduler\u3084Task Scheduler\u306e\u5f79\u5272\u3092\u679c\u305f\u3057\u3066\u304f\u308c\u307e\u3059\u3002<\/p>\n \u6b21\u306b\u3001RDD\u3092\u751f\u6210\u3057\u3066\u3044\u307e\u3059\u3002<\/p>\n RDD\u306f\u5fa9\u5143\u529b\u306e\u3042\u308b\u5206\u6563\u30c7\u30fc\u30bf\u30bb\u30c3\u30c8\u3067\u3001\u305f\u3060List\u3092\u5206\u5272\u3057\u305f\u3060\u3051\u3067\u306a\u304f\u3001\u3044\u308d\u3093\u306a\u7279\u5fb4\u304c\u3042\u308a\u307e\u3059\u3002<\/p>\n \u306a\u3069\u306e\u7279\u5fb4\u3092\u6301\u3063\u3066\u3044\u307e\u3059\u3002<\/p>\n \u6700\u5f8c\u306b\u3001RDD\u306b\u5bfe\u3057\u3066\u4e00\u9023\u306e\u5909\u63db\u30bf\u30b9\u30af\u884c\u3044\u3001\u7d50\u679c\u3092\u53d6\u5f97\u3057\u3066\u3044\u307e\u3059\u3002<\/p>\n \u5177\u4f53\u7684\u306b\u306f\u3001\u4ee5\u4e0b\u306e\u3088\u3046\u306a\u6d41\u308c\u3067\u51e6\u7406\u3092\u884c\u306a\u3063\u3066\u3044\u307e\u3059\u3002<\/p>\n Stage\u69cb\u6210\u306f\u3053\u3093\u306a\u611f\u3058\u3067\u3059\u3002<\/p>\n [f:id:rinoguchi:20200429125214p:plain:w500]<\/p>\n \u3053\u308c\u304b\u30893\u3064\u306e\u65b9\u6cd5\u3067\u5148\u307b\u3069\u5b9f\u88c5\u3057\u305f\u5206\u6563\u51e6\u7406\u30d7\u30ed\u30b0\u30e9\u30e0\uff08main.py\uff09\u3092\u5b9f\u884c\u3057\u3066\u307f\u305f\u3044\u3068\u3082\u601d\u3044\u307e\u3059\u3002\u30d7\u30ed\u30b0\u30e9\u30e0\u306f\u3069\u306e\u65b9\u6cd5\u3067\u3082\u5909\u66f4\u306e\u5fc5\u8981\u306f\u3042\u308a\u307e\u305b\u3093\u3002<\/p>\n \u3053\u3061\u3089<\/a>\u306e\u300cDataproc\u5468\u308a\u306e\u74b0\u5883\u6e96\u5099\u300d\u306e\u9805\u3092\u53c2\u7167\u3057\u3001Dataproc\u3092\u6709\u52b9\u5316\u3057\u3066\u30b5\u30fc\u30d3\u30b9\u30a2\u30ab\u30a6\u30f3\u30c8\u3092\u4f5c\u6210\u3057\u5fc5\u8981\u306a\u6a29\u9650\u3092\u4e0e\u3048\u307e\u3059\u3002<\/p>\n \u500b\u4eba\u7684\u306b\u306f\u3001<\/p>\n \u3068\u3044\u3046\u6d41\u308c\u3067\u958b\u767a\u3092\u884c\u3044\u307e\u3057\u305f\u304c\u3001\u3053\u308c\u306f\u7d50\u69cb\u3057\u3063\u304f\u308a\u304d\u307e\u3057\u305f\u3002<\/p>\n \u6700\u8fd1Spark\u3092\u89e6\u308b\u6a5f\u4f1a\u304c\u3042\u3063\u3066\u3001\u5c11\u3057\u3060\u3051\u52c9\u5f37\u3057\u305f\u306e\u3067\u30e1\u30e2\u304c\u3066\u3089\u6b8b\u3057\u3066\u304a\u304d\u307e\u3059\u3002 Spark\u306e\u5206\u6563\u51e6\u7406\u306e\u4ed5\u7d44\u307f Spark\u3068\u306f \u9ad8\u901f\u3067\u6c4e\u7528\u7684\u306a\u5206\u6563\u51e6\u7406\u30b7\u30b9\u30c6\u30e0 \u5206\u6563\u30c7\u30fc\u30bf\uff08RDD\uff09\u3092DISK\u3092\u4ecb\u3055\u305a\u306b\u30e1\u30e2\u30ea\u4e0a\u306b\u6301\u3064\u306e\u3067\u3001Hadoop\u306e100\u500d\u3050\u3089\u3044\u9ad8\u901f Java, Scala, Python, R\u306a\u3069\u306eAPI\u3092\u63d0\u4f9b Spark SQL, MLlib, GraphX, Spark Streaming\u306a\u3069\u306e\u30ea\u30c3\u30c1\u306a\u30c4\u30fc\u30eb\u3092\u63d0\u4f9b \u5206\u6563\u51e6\u7406\u30b7\u30b9\u30c6\u30e0\u306e\u69cb\u6210\u8981\u7d20 \u753b\u50cf\u306f\u3001\u3053\u3061\u3089\u304b\u3089\u304a\u501f\u308a\u3057\u307e\u3057\u305f\u3002 Driver Program Master Node\u3067\u5b9f\u884c\u3055\u308c\u308b\u8d77\u70b9\u3068\u306a\u308b\u30d7\u30ed\u30b0\u30e9\u30e0 SparkContext\u3092\u4f5c\u6210\u3057\u3001RDD\u3092\u751f\u6210\u3057\u3066\u3001Task\u3092\u5b9f\u884c\u3057\u3066\u3044\u304f SparkContext Spark\u306e\u8272\u3005\u306a\u6a5f\u80fd\u3078\u306e\u30a8\u30f3\u30c8\u30ea\u30fc\u30dd\u30a4\u30f3\u30c8 ClusterManager\u3092\u901a\u3058\u3066\u30af\u30e9\u30b9\u30bf\u30fc\u3092\u64cd\u4f5c\u3059\u308b DAG Sche <\/span>Continue Reading<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[7,24],"tags":[],"_links":{"self":[{"href":"http:\/\/localhost:8000\/wp-json\/wp\/v2\/posts\/263"}],"collection":[{"href":"http:\/\/localhost:8000\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/localhost:8000\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/localhost:8000\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"http:\/\/localhost:8000\/wp-json\/wp\/v2\/comments?post=263"}],"version-history":[{"count":1,"href":"http:\/\/localhost:8000\/wp-json\/wp\/v2\/posts\/263\/revisions"}],"predecessor-version":[{"id":266,"href":"http:\/\/localhost:8000\/wp-json\/wp\/v2\/posts\/263\/revisions\/266"}],"wp:attachment":[{"href":"http:\/\/localhost:8000\/wp-json\/wp\/v2\/media?parent=263"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/localhost:8000\/wp-json\/wp\/v2\/categories?post=263"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/localhost:8000\/wp-json\/wp\/v2\/tags?post=263"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}
\n<\/p>\n\n
\n
\n
\n
\n
\u51e6\u7406\u306e\u6d41\u308c<\/h3>\n
\n<\/p>\n\u5206\u6563\u51e6\u7406\u306e\u5b9f\u88c5\u30b5\u30f3\u30d7\u30eb<\/h2>\n
\u51e6\u7406\u5185\u5bb9<\/h3>\n
\u30bd\u30fc\u30b9\u30b3\u30fc\u30c9<\/h3>\n
from pyspark import SparkContext, RDD\nfrom typing import List, Dict\n# input\u30c7\u30fc\u30bf\uff08\u8a66\u9a13\u306e\u7d50\u679c\uff09\ninput_data: List[Dict[str, int]] = [\n {'\u56fd\u8a9e': 86, '\u7b97\u6570': 57, '\u7406\u79d1': 45, '\u793e\u4f1a': 100},\n {'\u56fd\u8a9e': 67, '\u7b97\u6570': 12, '\u7406\u79d1': 43, '\u793e\u4f1a': 54},\n {'\u56fd\u8a9e': 98, '\u7b97\u6570': 98, '\u7406\u79d1': 78, '\u793e\u4f1a': 69},\n]\n# SparkContext, RDD\u4f5c\u6210\nsc: SparkContext = SparkContext(appName='spark_sample')\nrdd: RDD = sc.parallelize(input_data)\n# \u5404\u6559\u79d1\u304a\u3088\u3073\u5408\u8a08\u70b9\u306e\u5e73\u5747\u70b9\u3092\u8a08\u7b97\noutput: Dict[str, float] = rdd\\\n .map(lambda x: x.update(\u5408\u8a08=sum(x.values())) or x)\\\n .flatMap(lambda x: x.items())\\\n .groupByKey()\\\n .map(lambda x: (x[0], round(sum(x[1]) \/ len(x[1]), 2)))\\\n .collect()<\/code><\/pre>\n
\u89e3\u8aac<\/h3>\n
sc: SparkContext = SparkContext(appName='spark_sample')<\/code><\/pre>\n
rdd: RDD = sc.parallelize(input_data)<\/code><\/pre>\n
\n
collect<\/code>\u3084
foreach<\/code>\u306a\u3069\u306eAction\u3068\u547c\u3070\u308c\u308bTask\u304c\u5b9f\u884c\u3055\u308c\u308b\u307e\u3067\u51e6\u7406\u3055\u308c\u306a\u3044\u3002Action\u304c\u5b9f\u884c\u3055\u308c\u305f\u30bf\u30a4\u30df\u30f3\u30b0\u3067\u5fc5\u8981\u306aTask\u3060\u3051\u5b9f\u884c\u3055\u308c\u308b\uff08=job\u304csubmit\u3055\u308c\u308b\uff09<\/li>\n<\/ul>\n
output: Dict[str, float] = rdd\\\n .map(lambda x: x.update(\u5408\u8a08=sum(x.values())) or x)\\\n .flatMap(lambda x: x.items())\\\n .groupByKey()\\\n .map(lambda x: (x[0], round(sum(x[1]) \/ len(x[1]), 2)))\\\n .collect()<\/code><\/pre>\n
\n
{ \u2018\u56fd\u8a9e\u2019: 86, \u2018\u7b97\u6570\u2019: 57, \u2026, \u2018\u5408\u8a08\u2019: 288 }<\/code><\/li>\n
[(\u2018\u56fd\u8a9e\u2019, 86), (\u2018\u7b97\u6570\u2019, 57), \u2026 , (\u2018\u5408\u8a08\u2019, 288)]<\/code><\/li>\n
(\u2018\u56fd\u8a9e\u2019, [86, 67, 98])<\/code><\/li>\n
(\u2018\u56fd\u8a9e\u2019, 83.67)<\/code><\/li>\n
[(\u2018\u56fd\u8a9e\u2019, 83.67), (\u2018\u7b97\u6570\u2019, 55.67), \u2026, (\u2018\u5408\u8a08\u2019, 269.0)]<\/code><\/li>\n<\/ol>\n
collect<\/code>\u306a\u3069\u306eAction Task\u306e\u4ed6\u306b\u3001
groupByKey<\/code>\u306e\u3088\u3046\u306aShuffle Task\u3067\u3082Stage\u304c\u5206\u5272\u3055\u308c\u307e\u3059\u3002
ByKey<\/code>\u3067\u7d42\u308f\u308b\u3088\u3046\u306aTask\u304cShuffle Task\u3067\u3059\u3002\u4e00\u3064\u306eStage\u306e\u4e00\u9023\u306eTask\u306f\u4e00\u3064\u306eNode\u3067\u4e00\u62ec\u3067\u51e6\u7406\u3055\u308c\u307e\u3059\u3002<\/p>\n
\u5206\u6563\u51e6\u7406\u306e\u5b9f\u884c<\/h2>\n
pyspark\u30e9\u30a4\u30d6\u30e9\u30ea\u3092\u4f7f\u3063\u3066\u5b9f\u884c<\/h3>\n
\u5b9f\u884c\u65b9\u6cd5<\/h4>\n
# \u5fc5\u8981\u306a\u30e9\u30a4\u30d6\u30e9\u30ea\u3092\u30a4\u30f3\u30b9\u30c8\u30fc\u30eb\npip install pyspark\npip install py4j\n\n# \u5b9f\u884c\npython main.py<\/code><\/pre>\n
\u7279\u5fb4<\/h4>\n
\n
Spark Standalone Mode\u3067\u5b9f\u884c<\/h3>\n
\u5b9f\u884c\u65b9\u6cd5<\/h4>\n
# apache-spark\u3092\u30a4\u30f3\u30b9\u30c8\u30fc\u30eb\nbrew install apache-spark\nexport SPARK_HOME=\/usr\/local\/Cellar\/apache-spark\/2.4.5\/libexec\n\n# Master Node\u3092\u7acb\u3061\u4e0a\u3052\u308b\n# \u7acb\u3061\u4e0a\u3052\u305f\u5f8c\u306b\u3001http:\/\/localhost:8080\u306b\u30a2\u30af\u30bb\u30b9\u3057\u3066\u3001Master-URL\u3092\u78ba\u8a8d\u3059\u308b\n# \u4f8b) spark:\/\/hogeMacBook-Pro:7077\n${SPARK_HOME}\/sbin\/start-master.sh\n\n# Worker Node\u3092\u7acb\u3061\u4e0a\u3052\u308b\n${SPARK_HOME}\/sbin\/start-slave.sh -c 2 spark:\/\/InoguchinoMacBook-Pro:7077\n\n# job\u3092submit\u3059\u308b\nspark-submit main.py --master spark:\/\/InoguchinoMacBook-Pro:7077<\/code><\/pre>\n
\u7279\u5fb4<\/h4>\n
\n
\n
\n
concurrent.futures<\/code>\u3068\u304b\u3067\u3082\u3067\u304d\u308b\u306e\u3067<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n
Google Cloud Dataproc\u3067\u5b9f\u884c<\/h3>\n
Dataproc\u306e\u74b0\u5883\u6e96\u5099<\/h4>\n
\u5b9f\u884c\u65b9\u6cd5<\/h4>\n
# \u30b5\u30fc\u30d3\u30b9\u30a2\u30ab\u30a6\u30f3\u30c8\u3092\u6709\u52b9\u5316\ngcloud auth activate-service-account --key-file {key_file_json_path} --project {project_name}\n\n# \u30af\u30e9\u30b9\u30bf\u7acb\u3061\u4e0a\u3052\ngcloud dataproc clusters create spark-sample-cluster\\\n --bucket={bucket_name} --region=asia-east1 --image-version=1.4\\ \n --master-machine-type=n1-standard-1 --worker-machine-type=n1-standard-1\\\n --num-workers=2 --worker-boot-disk-size=256 --max-idle=1h\n\n# job\u3092submit\ngcloud dataproc jobs submit pyspark\\\n --cluster=spark-sample-cluster --region=asia-east1\\\n gs:\/\/hoge_dataproc-test\/dataproc\/src\/simple\/main.py<\/code><\/pre>\n
\u7279\u5fb4<\/h4>\n
\n
\u307e\u3068\u3081<\/h2>\n
\n
\u53c2\u8003URL<\/h2>\n
\n
\n
\n