{"id":359,"date":"2019-08-06T22:22:06","date_gmt":"2019-08-06T13:22:06","guid":{"rendered":"http:\/\/localhost:8000\/?p=359"},"modified":"2021-01-17T22:38:09","modified_gmt":"2021-01-17T13:38:09","slug":"dataproc-pyspark","status":"publish","type":"post","link":"http:\/\/localhost:8000\/2019\/08\/dataproc-pyspark.html","title":{"rendered":"DataProc\uff0bPySpark\u306eTips"},"content":{"rendered":"

## Using environment variables

I expect to run into cases where I want to pull settings such as DB connection info from environment variables, so I looked into the options. **Note that in every method below, the values are only available inside the driver program.** In other words, to use them on a worker node (i.e., remotely), you have to capture them in a variable first and reference that variable from your application code, as in the sketch below.
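A minimal sketch of that capture pattern (reusing the post's `HOGE` example; the toy RDD is just for illustration):

```python
import os
import pyspark

sc = pyspark.SparkContext()

# Read the environment variable on the driver and capture it in a local
# variable; Spark serializes the closure below, shipping this value to
# the workers along with it.
hoge = os.getenv('HOGE')

# Calling os.getenv('HOGE') inside the lambda would run on the workers,
# where the variable is not set, so we reference the captured value.
rdd = sc.parallelize(['a', 'b', 'c'])
print(rdd.map(lambda x: '{}-{}'.format(x, hoge)).collect())
```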

### Method 1: Run your own script via initialization-actions

This is a rather primitive approach: have initialization-actions run your own script, and set the environment variables inside it. Since the variable values have to be written into a file and uploaded to GCS, I honestly don't think this method is usable.

1. First, implement the script (initialize.sh) to pass to the `initialization-actions` option at cluster-creation time (see the caveat and sketch after this list):

    ```
    echo "export HOGE=xxx" | tee -a initialize.sh
    ```

2. Implement the driver program (main.py):

    ```python
    import os
    import pyspark

    hoge = os.getenv('HOGE')
    ```

3. Upload the implemented files to GCS:

    ```
    gsutil cp initialize.sh gs://hoge/
    gsutil cp main.py gs://hoge/
    ```

4. Create the cluster, passing the initialize.sh created above to `--initialization-actions`:

    ```
    gcloud dataproc clusters create hoge-gluster --initialization-actions='gs://hoge/initialize.sh'
    ```

5. Submit the job:

    ```
    gcloud dataproc jobs submit pyspark --cluster=hoge-gluster gs://hoge/main.py
    ```
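One caveat worth spelling out for this method: a bare `export` executed inside an initialization action only affects that script's own shell and is gone by the time the Spark processes start. In practice, the script has to append the export to a file that those later processes source. A minimal sketch of such an initialize.sh, assuming Dataproc's default Spark configuration directory `/etc/spark/conf` (worth verifying on your image version):

```bash
#!/bin/bash
# Runs on each node while the cluster is being created.
# Persist the variable by appending it to spark-env.sh, which Spark
# sources at startup; `export HOGE=xxx` on its own would only last for
# the lifetime of this script's shell.
echo "export HOGE=xxx" >> /etc/spark/conf/spark-env.sh
```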

### Method 2: Set environment variables via spark-env.sh (recommended)

If you specify `spark-env:HOGE=XXX` via `--properties`, spark-env.sh reads that entry and sets the environment variable `HOGE=XXX` for you. I think this is the most standard approach.

1. Implement the driver program (main.py):

    ```python
    import os
    import pyspark

    sc = pyspark.SparkContext()
    hoge = os.getenv('HOGE')
    # ...
    ```

2. Upload the implemented file to GCS:

    ```
    gsutil cp main.py gs://hoge/
    ```

3. Create the cluster, setting the environment variables in the `--properties` option with the `spark-env` prefix:

    ```
    gcloud dataproc clusters create hoge-gluster --properties="spark-env:HOGE=xxx,spark-env:FUGA=yyy"
    ```

4. Submit the job:

    ```
    gcloud dataproc jobs submit pyspark --cluster=hoge-gluster gs://hoge/main.py
    ```
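To confirm what Dataproc actually wrote, the `spark-env:` entries should end up in spark-env.sh on each node. A quick check over SSH (assuming the default master node name `hoge-gluster-m` and the `/etc/spark/conf` path) might look like:

```
gcloud compute ssh hoge-gluster-m --command='grep HOGE /etc/spark/conf/spark-env.sh'
```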

### Method 3: Add settings to spark-defaults.conf

These are not environment variables, but if you specify `spark:HOGE=XXX` via `--properties`, the entry is written into `spark-defaults.conf` and can be read from inside the program as ordinary Spark configuration.
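One caveat from Spark's side (my observation, not part of the original steps): Spark generally loads only keys that start with `spark.` from spark-defaults.conf and warns about the rest, so if a bare key like `HOGE` does not show up via `getConf()`, a `spark.`-prefixed key is worth trying:

```python
# Assuming the cluster was created with a spark.-prefixed key, e.g.:
#   gcloud dataproc clusters create hoge-gluster --properties="spark:spark.hoge=xxx"
import pyspark

sc = pyspark.SparkContext()
hoge = sc.getConf().get('spark.hoge')
```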

1. Implement the driver program (main.py):

    ```python
    import pyspark

    sc = pyspark.SparkContext()
    hoge = sc.getConf().get('HOGE')
    ```

2. Upload the implemented file to GCS:

    ```
    gsutil cp main.py gs://hoge/
    ```

3. Create the cluster