{"id":359,"date":"2019-08-06T22:22:06","date_gmt":"2019-08-06T13:22:06","guid":{"rendered":"http:\/\/localhost:8000\/?p=359"},"modified":"2021-01-17T22:38:09","modified_gmt":"2021-01-17T13:38:09","slug":"dataproc-pyspark","status":"publish","type":"post","link":"http:\/\/localhost:8000\/2019\/08\/dataproc-pyspark.html","title":{"rendered":"DataProc + PySpark Tips"},"content":{"rendered":"
I expect to run into cases where I want to set things such as DB connection details from environment variables, so I looked into how to do it.
\nAll of the methods below can be used only inside the driver program<\/b>. In other words, to use a value on the worker nodes (remote), you need a workaround such as storing it in a variable first and referencing that variable from the application code, so take care.<\/p>\nMethod 1: Run your own script via initialization-actions<\/h3>\n
This is a very primitive approach: run your own script via initialization-actions and set the environment variable inside it. With this approach you have to write the variable's value into a file and upload that file to GCS, so honestly I do not think it is usable.<\/p>\n
First, at cluster creation time, Implement the driver program (main.py)<\/p>\n Upload the implemented files to GCS<\/p>\n Create the cluster<\/p>\n Run the job<\/p>\n This one is not an environment variable, but Unfortunately, there does not seem to be a way to import a Python file from another folder.<\/p>\n With a folder layout like this, Running this on DataProc, as shown below, This is ultimately because So<\/p>\n writing the import this way avoids the error both on a local PC and on the cluster.<\/p>\n","protected":false},"excerpt":{"rendered":" Using environment variables I expect to run into cases where I want to set things such as DB connection details from environment variables, so I looked into how to do it. 
All of the methods below can be used only inside the driver program. In other words, to use a value on the worker nodes (remote), you need a workaround such as storing it in a variable first and referencing that variable from the application code, so take care. Method 1: Run your own script via initialization-actions This is a very primitive approach: run your own script via initialization-actions and set the environment variable inside it. With this approach you have to write the variable's value into a file and upload that file to GCS, so honestly I do not think it is usable. First, implement the script (initialize.sh) that will be specified with the initialization-actions option at cluster creation time echo "exp <\/span>Continue 
Reading<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[7,24],"tags":[],"_links":{"self":[{"href":"http:\/\/localhost:8000\/wp-json\/wp\/v2\/posts\/359"}],"collection":[{"href":"http:\/\/localhost:8000\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/localhost:8000\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/localhost:8000\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"http:\/\/localhost:8000\/wp-json\/wp\/v2\/comments?post=359"}],"version-history":[{"count":1,"href":"http:\/\/localhost:8000\/wp-json\/wp\/v2\/posts\/359\/revisions"}],"predecessor-version":[{"id":360,"href":"http:\/\/localhost:8000\/wp-json\/wp\/v2\/posts\/359\/revisions\/360"}],"wp:attachment":[{"href":"http:\/\/localhost:8000\/wp-json\/wp\/v2\/media?parent=359"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/localhost:8000\/wp-json\/wp\/v2\/categories?post=359"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/localhost:8000\/wp-json\/wp\/v2\/tags?post=359"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}initialization-actions<\/code> option: implement the script (initialize.sh) that it will point to<\/p>\n
echo "export HOGE=xxx" | tee -a initialize.sh<\/code><\/pre>\n<\/li>\n
import os\nimport pyspark\nhoge = os.getenv('HOGE')<\/code><\/pre>\n<\/li>\n
gsutil cp initialize.sh gs:\/\/hoge\/\ngsutil cp main.py gs:\/\/hoge\/<\/code><\/pre>\n<\/li>\n
\n
--initialization-actions<\/code> is given the
initialize.sh<\/code> created above\n
gcloud dataproc clusters create hoge-gluster --initialization-actions='gs:\/\/hoge\/initialize.sh'<\/code><\/pre>\n<\/li>\n<\/ul>\n<\/li>\n
gcloud dataproc jobs submit pyspark --cluster=hoge-gluster gs:\/\/hoge\/main.py<\/code><\/pre>\n<\/li>\n<\/ol>\n
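The driver programs in this post read the variable with `os.getenv('HOGE')`, which silently returns `None` when the variable did not reach the driver. A minimal sketch of a stricter reader follows; `require_env` is a hypothetical helper name, not part of PySpark or DataProc, and the optional `environ` argument exists only so the function can be exercised without touching the real environment.

```python
import os


def require_env(name, environ=None):
    """Return the value of environment variable `name`, or raise if it is unset or empty."""
    env = os.environ if environ is None else environ
    value = env.get(name)
    if not value:
        # Fail at startup instead of passing None on to later code.
        raise RuntimeError(
            "environment variable %s is not set on the driver" % name
        )
    return value
```

In `main.py`, `hoge = require_env('HOGE')` could then replace `hoge = os.getenv('HOGE')`, so a missing setting shows up as a clear error in the job output.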
Method 2: Set environment variables via spark-env.sh (recommended)<\/h3>\n
--properties<\/code> accepts entries such as
spark-env:HOGE=XXX<\/code>; when one is given,
spark-env.sh<\/code> picks it up and sets the environment variable
HOGE=XXX<\/code>. I think this is the most standard approach.<\/p>\n
\n
import pyspark\nimport os\nsc = pyspark.SparkContext()\nhoge = os.getenv('HOGE')\n:<\/code><\/pre>\n<\/li>\n
gsutil cp main.py gs:\/\/hoge\/<\/code><\/pre>\n<\/li>\n
\n
--properties<\/code> option: use the
spark-env<\/code> prefix to set environment variables\n
gcloud dataproc clusters create hoge-gluster --properties="spark-env:HOGE=xxx,spark-env:FUGA=yyy"<\/code><\/pre>\n<\/li>\n<\/ul>\n<\/li>\n
gcloud dataproc jobs submit pyspark --cluster=hoge-gluster gs:\/\/hoge\/main.py<\/code><\/pre>\n<\/li>\n<\/ol>\n
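As noted at the top of the post, the value is only available in the driver, so it has to be stored in a variable and referenced from application code before it can be used on worker nodes. One way this might look is sketched below: the driver reads `HOGE` once and the value travels to the executors inside a closure. `make_tagger` and `run` are hypothetical names chosen for illustration, and `run` assumes it is handed an existing `pyspark.SparkContext`.

```python
import os


def make_tagger(hoge):
    """Build a function for rdd.map(); `hoge` travels to the workers inside the closure."""
    def tag(record):
        # Runs on the executors and uses the captured value, not os.getenv().
        return (hoge, record)
    return tag


def run(sc):
    """`sc` is an existing pyspark.SparkContext created by the driver program."""
    hoge = os.getenv("HOGE", "unset")  # read once, on the driver
    return sc.parallelize([1, 2, 3]).map(make_tagger(hoge)).collect()
```

In `main.py` this could be called as `run(pyspark.SparkContext())`. Capturing a plain string in a closure is enough for small values; for large objects a broadcast variable may be the better fit.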
Method 3: Add settings to spark-defaults.conf<\/h3>\n
--properties<\/code> also accepts entries such as
spark:HOGE=XXX<\/code>; when one is given, it is written to
spark-defaults.conf<\/code> and can be read from the program as a configuration value.<\/p>\n
\n
import pyspark\nsc = pyspark.SparkContext()\nhoge = sc.getConf().get('HOGE')<\/code><\/pre>\n<\/li>\n
gsutil cp main.py gs:\/\/hoge\/<\/code><\/pre>\n<\/li>\n
\n
--properties<\/code> option: use the spark prefix to add the settings to Spark's configuration\n
gcloud dataproc clusters create hoge-gluster --properties="spark:HOGE=xxx,spark:FUGA=yyy"<\/code><\/pre>\n<\/li>\n<\/ul>\n<\/li>\n
gcloud dataproc jobs submit pyspark --cluster=hoge-gluster gs:\/\/hoge\/main.py<\/code><\/pre>\n<\/li>\n<\/ol>\n
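Methods 2 and 3 both pass a comma-separated list of `prefix:KEY=value` entries to `--properties`. When the settings already live in a Python dict, the flag value could be assembled as in the sketch below. `build_properties` is a hypothetical helper, not a gcloud or PySpark API, and it rejects values containing commas because they would collide with the entry separator.

```python
def build_properties(prefix, settings):
    """Format a dict as a --properties value, e.g. 'spark-env:HOGE=xxx,spark-env:FUGA=yyy'."""
    parts = []
    for key, value in settings.items():
        value = str(value)
        if "," in value:
            # Commas separate entries in the flag value, so this value would be misparsed.
            raise ValueError("value for %s contains a comma: %r" % (key, value))
        parts.append("%s:%s=%s" % (prefix, key, value))
    return ",".join(parts)
```

For example, `build_properties("spark-env", {"HOGE": "xxx", "FUGA": "yyy"})` produces the same string used in the Method 2 cluster-creation command.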
Importing Python files from another folder<\/h2>\n
hoge.py\nmodels\n\u2517fuga.py<\/code><\/pre>\n
hoge.py<\/code> is where I tried importing
fuga.py<\/code>, as follows.<\/p>\n
from models import fuga<\/code><\/pre>\n
ImportError: No module named models<\/code> is the error that occurs.<\/p>\n
gcloud dataproc jobs submit pyspark \\\n --cluster=hoge-cluster \\\n --files='gs:\/\/hoge-bucket\/models\/fuga.py' \\\n gs:\/\/hoge-bucket\/hoge.py\n\n> Traceback (most recent call last):\n> File "\/tmp\/987eqbkewqvf0uh9qet43\/hoge.py", line 9, in <module>\n> from models import fuga\n> ImportError: No module named models<\/code><\/pre>\n
--files<\/code> places the given files in the same temporary folder as the entry-point file (hoge.py), regardless of the original folder layout, and runs them from there.<\/p>\n
try:\n    from models import fuga\nexcept ImportError:  # also catches ModuleNotFoundError, which exists only on Python 3.6+\n    import fuga<\/code><\/pre>\n
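When several modules need the same fallback, the try/except pattern above can be wrapped in a small helper. The sketch below uses `importlib.import_module` from the standard library; `import_first` is a hypothetical name introduced here for illustration.

```python
import importlib


def import_first(*names):
    """Return the first module in `names` that can be imported."""
    last_error = ImportError("no module names given")
    for name in names:
        try:
            return importlib.import_module(name)
        except ImportError as error:
            # Remember the failure and try the next candidate name.
            last_error = error
    raise last_error
```

In `hoge.py` this could be used as `fuga = import_first("models.fuga", "fuga")`, which resolves to `models.fuga` locally and to the flattened `fuga` on the cluster. One caveat: the fallback also triggers when `models.fuga` exists but fails on an import of its own, which can hide the original error.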