{"id":209,"date":"2020-08-13T10:01:53","date_gmt":"2020-08-13T01:01:53","guid":{"rendered":"http:\/\/localhost:8000\/?p=209"},"modified":"2021-01-16T10:07:31","modified_gmt":"2021-01-16T01:07:31","slug":"scrapy-manual","status":"publish","type":"post","link":"http:\/\/localhost:8000\/2020\/08\/scrapy-manual.html","title":{"rendered":"Scrapy\u306e\u4f7f\u3044\u65b9"},"content":{"rendered":"
Scrapy<\/a> \u306f\u3001Web\u30b5\u30a4\u30c8\u3092\u30af\u30ed\u30fc\u30eb\u3057\u3001\u30da\u30fc\u30b8\u304b\u3089\u69cb\u9020\u5316\u30c7\u30fc\u30bf\u3092\u62bd\u51fa\u3059\u308b\u305f\u3081\u306b\u4f7f\u7528\u3055\u308c\u308bWeb\u30b9\u30af\u30ec\u30a4\u30d4\u30f3\u30b0\u30d5\u30ec\u30fc\u30e0\u30ef\u30fc\u30af\u3067\u3059\u3002Scrapy\u306b\u95a2\u3057\u3066\u306f\u308f\u304b\u308a\u3084\u3059\u3044\u8a18\u4e8b\u304c\u305f\u304f\u3055\u3093\u3042\u308b\u306e\u3067\u3001\u3053\u3053\u3067\u306f\u5b9f\u88c5\u30b5\u30f3\u30d7\u30eb\u3092\u7d39\u4ecb\u3057\u307e\u304f\u308b\u30b9\u30bf\u30f3\u30b9\u306b\u3057\u3088\u3046\u3068\u601d\u3044\u307e\u3059\u3002 \u3053\u3061\u3089<\/a>\u306b\u3057\u305f\u304c\u3063\u3066\u3001\u30c1\u30e5\u30fc\u30c8\u30ea\u30a2\u30eb\u3092\u8a66\u3057\u3066\u307f\u307e\u3059\u3002<\/p>\n \u3092\u5b9f\u884c\u3059\u308b\u3068 \u30af\u30ed\u30fc\u30ea\u30f3\u30b0\u3092\u5b9f\u884c\u3057\u307e\u3059\u3002<\/p>\n \u4f55\u3082\u8003\u3048\u305a\u30c1\u30e5\u30fc\u30c8\u30ea\u30a2\u30eb\u3092\u5b9f\u884c\u3057\u3066\u307f\u307e\u3057\u305f\u304c\u3001Scrapy\u306b\u306f\u3044\u304f\u3064\u304b\u306e\u767b\u5834\u4eba\u7269\u304c\u3044\u307e\u3059\u3002<\/p>\n \u4e0a\u306e\u30c1\u30e5\u30fc\u30c8\u30ea\u30a2\u30eb\u3067\u3082\u5b9f\u88c5\u3057\u307e\u3057\u305f\u304c\u3001\u3069\u306e\u3088\u3046\u306b\u30b9\u30af\u30ec\u30a4\u30d4\u30f3\u30b0\u3059\u308b\u304b\u3092\u5b9a\u7fa9\u3059\u308b\u4e2d\u5fc3\u7684\u306a\u30af\u30e9\u30b9\u3067\u3059\u3002Spider\u30af\u30e9\u30b9\u3060\u3051\u3067\u30b9\u30af\u30ec\u30a4\u30d4\u30f3\u30b0\u3092\u5b8c\u7d50\u3055\u305b\u308b\u3053\u3068\u3082\u3067\u304d\u307e\u3059\u3002<\/p>\n Spider\u30af\u30e9\u30b9\u3067\u306f\u3001\u4ee5\u4e0b\u306e\u6d41\u308c\u3067\u51e6\u7406\u304c\u5b9f\u884c\u3055\u308c\u307e\u3059<\/p>\n Item 
Pipelineは、取得したResponseに対してなんらかのアクションを行いたい場合に利用するものです。利用しなくても全然構いません。 Pipelineクラスは 個人の見解ですが、Pipelineを利用する場合、Webリクエストに関してはSpiderがその責務を負い、結果のフィルターや加工に関してはPipelineが責務を負う、というような責任分担が重要だと思います。 ItemsはPipelineにデータを渡すための入れものです。 上のチュートリアルのサンプルコードにもItemは出てきていないのですが、Item Pipelineを使わないのであればItemの定義は特に必要ありません。<\/p>\n scrapyはコマンドで実行するのが基本のようですが、pythonスクリプトから実行することもできます。 ポイントは以下の通りです。<\/p>\n あとは以下のように実行します。<\/p>\n チュートリアルコードと同様に 
\u3053\u306e\u3088\u3046\u306b\u3001\u7c21\u5358\u306bpython\u30b9\u30af\u30ea\u30d7\u30c8\u304b\u3089Scrapy\u3092\u5b9f\u884c\u3059\u308b\u3053\u3068\u304c\u3067\u304d\u307e\u3059\u3057\u3001\u8a2d\u5b9a\u3082Dict\u3067\u6e21\u305b\u308b\u3057\u3001\u5bfe\u8c61URL\u3082\u5f15\u6570\u3067\u6e21\u305b\u308b\u306e\u3067\u3001\u52d5\u7684\u306b\u30b9\u30af\u30ec\u30a4\u30d4\u30f3\u30b0\u8a2d\u5b9a\u30fb\u30b9\u30af\u30ec\u30a4\u30d4\u30f3\u30b0\u5bfe\u8c61\u3092\u5909\u66f4\u3059\u308b\u3053\u3068\u3082\u53ef\u80fd\u3067\u3059\u3002<\/p>\n \u500b\u4eba\u7684\u306b\u306f\u30b3\u30de\u30f3\u30c9\u30e9\u30a4\u30f3\u5b9f\u884c\u3088\u308a\u3053\u3061\u3089\u306e\u65b9\u304c\u67d4\u8edf\u6027\u304c\u9ad8\u304f\u597d\u304d\u3067\u3059\u3002\u306a\u306e\u3067\u3001\u3053\u308c\u4ee5\u964d\u306e\u5b9f\u88c5\u30b5\u30f3\u30d7\u30eb\u306f Response\u30aa\u30d6\u30b8\u30a7\u30af\u30c8\u306e\u30b9\u30c6\u30fc\u30bf\u30b9\u3068BODY\u30b5\u30a4\u30ba\u3067\u30d5\u30a3\u30eb\u30bf\u30fc\u3057\u3066\u3001OK\u3060\u3063\u305f\u3089\u30d5\u30a1\u30a4\u30eb\u306b\u51fa\u529b\u3059\u308b\u3068\u3044\u3046Pipeline\u3092\u4f5c\u3063\u3066\u307f\u307e\u3057\u305f\u3002<\/p>\n \u307e\u305a\u306f\u3001 \u6b21\u306b\u3001 \u5f8c\u306f\u3001 \u30dd\u30a4\u30f3\u30c8\u306f\u4ee5\u4e0b\u3067\u3059\u3002<\/p>\n \u3042\u3068\u306f\u4ee5\u4e0b\u306e\u3088\u3046\u306b\u5b9f\u884c\u3057\u307e\u3059\u3002<\/p>\n BODY\u30b5\u30a4\u30ba\u306e\u30d5\u30a3\u30eb\u30bf\u30fc\u306e\u7d50\u679c\u30013\u3064\u306eURL\u306e\u3046\u3061 Item Pipeline\u306f\u3053\u3093\u306a\u611f\u3058\u3067\u5229\u7528\u3057\u307e\u3059\u3002 
\u3042\u308b\u30da\u30fc\u30b8\u3092\u30b9\u30af\u30ec\u30a4\u30d4\u30f3\u30b0\u3057\u3066Item\u3092\u53d6\u5f97\u3057\u3064\u3064\u3001\u30da\u30fc\u30b8\u5185\u306e\u30ea\u30f3\u30af\u3092\u53d6\u5f97\u3057\u3066\u6b21\u306e\u30ea\u30af\u30a8\u30b9\u30c8\u3092\u884c\u3046\u3088\u3046\u306a\u3053\u3068\u3092\u3084\u308a\u305f\u3044\u30b1\u30fc\u30b9\u306f\u3042\u308b\u3068\u601d\u3044\u307e\u3059\u3002\u305d\u306e\u3088\u3046\u306a\u30b1\u30fc\u30b9\u3067\u306fSpider\u30af\u30e9\u30b9\u306e Item-Pipeline\u3092\u4f7f\u3046<\/a>\u306e\u30b1\u30fc\u30b9\u3067Response\u3092parse\u3059\u308b\u969b\u306b\u3001NEXT\u30dc\u30bf\u30f3\u306e\u30ea\u30f3\u30afURL\u3092\u53d6\u5f97\u3057\u3066\u6b21\u306eRequest\u3092\u4f5c\u308a\u3064\u3064\u3001\u8a72\u5f53\u30da\u30fc\u30b8\u306eResponse\u81ea\u4f53\u306fMyItem\u306b\u8a70\u3081\u3066Pipeline\u306b\u6d41\u3059\u3088\u3046\u306a\u611f\u3058\u306b\u3057\u3088\u3046\u3068\u601d\u3044\u307e\u3059\u3002<\/p>\n \u5b9f\u88c5\u306f\u3001Item-Pipeline\u3092\u4f7f\u3046<\/a>\u306e \u5b9f\u884c\u3059\u308b\u3068\u3001 Scrapy\u306b\u306f\u30ea\u30f3\u30af\u3092\u305f\u3069\u3063\u3066URL\u3092\u62bd\u51fa\u3059\u308b\u3053\u3068\u306b\u7279\u5316\u3057\u305f\u3001LinkExtractor\u30af\u30e9\u30b9\u304c\u7528\u610f\u3055\u308c\u3066\u3044\u307e\u3059\u3002\u3053\u308c\u3092\u4f7f\u3063\u3066\u3001\u30ea\u30f3\u30afURL\u3092\u30ed\u30b0\u306b\u51fa\u529b\u3059\u308b\u3060\u3051\u306e\u30b5\u30f3\u30d7\u30eb\u3092\u66f8\u3044\u3066\u307f\u3088\u3046\u3068\u601d\u3044\u307e\u3059\u3002<\/p>\n \u3053\u308c\u3082 \u5b9f\u884c\u3059\u308b\u3068\u4ee5\u4e0b\u306e\u3088\u3046\u306b\u7121\u4e8b\u30ea\u30f3\u30afURL\u304c\u30ed\u30b0\u306b\u51fa\u529b\u3055\u308c\u307e\u3057\u305f\u3002<\/p>\n 
\u30b5\u30a4\u30c8\u30de\u30c3\u30d7\u30d5\u30a1\u30a4\u30eb\uff08sitemap.xml\u3084robots.txt\uff09\u304b\u3089SitemapSpider\u3092\u4f7f\u3063\u3066\u30b5\u30a4\u30c8\u5185\u306eURL\u4e00\u89a7\u3092\u53d6\u5f97\u3059\u308b\u3053\u3068\u3082\u3067\u304d\u307e\u3059\u3002 \u5b9f\u969b\u306b\u52d5\u304b\u3057\u3066\u307f\u308b\u3068\u3001\u4ee5\u4e0b\u306e\u3088\u3046\u306a\u611f\u3058\u3067\u5927\u91cf\u306bURL\u3092\u62bd\u51fa\u3059\u308b\u3053\u3068\u304c\u3067\u304d\u307e\u3057\u305f\u3002<\/p>\n \u30b5\u30a4\u30c8\u30de\u30c3\u30d7\u304b\u3089\u62bd\u51fa\u3059\u308b\u306e\u3067\u3042\u308c\u3070\u3001\u76f8\u5f53\u901f\u3044\u306e\u304b\u3068\u671f\u5f85\u3057\u3066\u305f\u306e\u3067\u3059\u304c\u3001\u5168\u3066\u306eURL\u306bRequest\u3092\u6295\u3052\u308b\u306e\u3067\u3001\u666e\u901a\u306bLinkExtractor\u3068\u540c\u3058\u30ec\u30d9\u30eb\u3067\u6642\u9593\u304c\u304b\u304b\u308a\u307e\u3059\u3002\u307e\u305f\u3001\u30b5\u30a4\u30c8\u30de\u30c3\u30d7\u304c\u6700\u65b0\u306b\u66f4\u65b0\u3055\u308c\u3066\u3044\u308b\u3068\u3044\u3046\u4fdd\u8a3c\u3082\u306a\u3044\u305f\u3081\u3001\u3053\u308c\u3060\u3051\u306b\u983c\u308b\u306e\u306f\u5371\u967a\u305d\u3046\u3067\u3059\u3002<\/p>\n \u306e\u3088\u3046\u306b\u5b9f\u884c\u3059\u308b\u3068\u3001REPL\u304c\u7acb\u3061\u4e0a\u304c\u3063\u3066\u3001\u30b3\u30fc\u30c9\u65ad\u7247\u3092\u5b9f\u884c\u3059\u308b\u3053\u3068\u304c\u3067\u304d\u308b\u3088\u3046\u306b\u306a\u308a\u307e\u3059\u3002<\/p>\n \u305f\u3068\u3048\u3070\u3001 \u306a\u306e\u3067\u3001\u3053\u308c\u3092\u898b\u306a\u304c\u3089\u4ee5\u4e0b\u306e\u3088\u3046\u306a\u611f\u3058\u3067\u6c17\u8efd\u306b\u8a66\u305b\u308b\u306e\u304c\u826f\u3044\u611f\u3058\u3067\u3059\u3002<\/p>\n 
\u8a66\u3057\u3066\u306a\u3044\u306e\u3067\u3059\u304c\u3001scrapy-splash\u3092\u5229\u7528\u3059\u308b\u3068\u3001\u30d8\u30c3\u30c9\u30ec\u30b9\u30d6\u30e9\u30a6\u30b6\u4e0a\u3067\u5bfe\u8c61\u30da\u30fc\u30b8\u3092\u5b9f\u884c\u3059\u308b\u3053\u3068\u304c\u3067\u304d\u308b\u3088\u3046\u3067\u3059\u3002 README\u3092\u8efd\u304f\u8aad\u3093\u3060\u3060\u3051\u3067\u3059\u304c\u3001 JavaScript\u30b3\u30fc\u30c9\u5185\u3067\u306f\u5f53\u7136\u306a\u304c\u3089 \u8907\u6570\u306e\u753b\u9762\u3092\u307e\u305f\u3044\u3067\u753b\u9762\u5185\u306e\u30a8\u30ec\u30e1\u30f3\u30c8\u3092\u30af\u30ea\u30c3\u30af\u3057\u306a\u304c\u3089\u753b\u9762\u9077\u79fb\u3059\u308b\u3088\u3046\u306a\u3053\u3068\u306f\u304a\u305d\u3089\u304f\u3067\u304d\u306a\u3044\u306e\u3067\u3001\u904e\u5ea6\u306a\u671f\u5f85\u306f\u3067\u304d\u306a\u3044\u304b\u306a\u3001\u3001\u3068\u3044\u3046\u5370\u8c61\u3067\u3059\u3002\u30d8\u30c3\u30c9\u30ec\u30b9\u30d6\u30e9\u30a6\u30b6\u304c\u5fc5\u8981\u3067\u3042\u308c\u3070\u81ea\u5206\u306a\u3089\u7121\u96e3\u306bpyppeteer\u3092\u5229\u7528\u3059\u308b\u3068\u601d\u3044\u307e\u3059\u3002<\/p>\n CrawlerProcess(CrawlerRunner)\u3092\u4f7f\u3063\u3066\u30af\u30ed\u30fc\u30ea\u30f3\u30b0\u3059\u308b\u5834\u5408\u3001 \u4ee5\u4e0b\u306e\u3088\u3046\u306a\u30b5\u30f3\u30d7\u30eb\u30d7\u30ed\u30b0\u30e9\u30e0\u3092\u4f5c\u3063\u3066\u3001\u4e8c\u56de\u30af\u30ed\u30fc\u30ea\u30f3\u30b0\u3092\u5b9f\u884c\u3059\u308b\u3068<\/p>\n \u4ee5\u4e0b\u306e\u3088\u3046\u306b \u7d50\u8ad6\u304b\u3089\u8a00\u3046\u3068\u3001\u3053\u308c\u3092\u56de\u907f\u3059\u308b\u65b9\u6cd5\u306f\u3001\u5225\u30d7\u30ed\u30bb\u30b9\u3067\u5b9f\u884c\u3059\u308b\u65b9\u6cd5\u3057\u304b\u3046\u307e\u304f\u3044\u304d\u307e\u305b\u3093\u3067\u3057\u305f\u3002 \uff08 
\u3061\u306a\u307f\u306b\u3001\u81ea\u5206\u306fPySpark\u306eworker\u30ce\u30fc\u30c9\u4e0a\u3067\u30af\u30ed\u30fc\u30ea\u30f3\u30b0\u51e6\u7406\u3092\u5b9f\u884c\u3057\u305f\u969b\u306b\u3001\u3053\u306e\u554f\u984c\u306b\u906d\u9047\u3057\u307e\u3057\u305f\u3002worker\u30ce\u30fc\u30c9\u4e0a\u3067\u52d5\u304fTaskExecutor\u306f\u540c\u4e00\u30d7\u30ed\u30bb\u30b9\u3067\u8981\u6c42\u3055\u308c\u305f\u30bf\u30b9\u30af\u3092\u9806\u6b21\u51e6\u7406\u3059\u308b\u305f\u3081\u3001\u30af\u30ed\u30fc\u30ea\u30f3\u30b0\u51e6\u7406\u3092\u542b\u3080\u30bf\u30b9\u30af\u30922\u56de\u76ee\u306b\u5b9f\u884c\u3057\u305f\u969b\u306b\u3053\u306e\u30a8\u30e9\u30fc\u304c\u767a\u751f\u3057\u307e\u3057\u305f\u3002\u3002\u3002<\/p>\n \u4eca\u56de\u691c\u8a3c\u3057\u305f\u30bd\u30fc\u30b9\u30b3\u30fc\u30c9\u306f\u5168\u3066github\u306b\u3042\u3052\u3066\u3042\u308a\u307e\u3059\u306e\u3067\u3001\u5fc5\u8981\u304c\u3042\u308c\u3070\u3053\u3061\u3089<\/a>\u3092\u53c2\u7167\u304f\u3060\u3055\u3044\u3002<\/p>\n \u30c9\u30ad\u30e5\u30e1\u30f3\u30c8\u3092\u898b\u306a\u304c\u3089\u8272\u3005\u3068\u8a66\u3057\u3066\u307f\u307e\u3057\u305f\u304c\u3001\u5b9f\u88c5\u81ea\u4f53\u306f\u7d50\u69cb\u7c21\u5358\u3067\u3057\u305f\u3002 
個人的には、requests+BeautifulSoupやrequests_htmlのような「コードを見れば分かる」ライブラリとScrapyのどちらを採用するか？と聞かれると、正直かなり迷うと思います。たぶん、LinkExtractorやSitemapSpiderのように目的にばっちり合致するケースではScrapyを利用し、それ以外は他のライブラリを選択する気がします。<\/p>\n","protected":false},"excerpt":{"rendered":" Scrapy は、Webサイトをクロールし、ページから構造化データを抽出するために使用されるWebスクレイピングフレームワークです。Scrapyに関してはわかりやすい記事がたくさんあるので、ここでは実装サンプルを紹介しまくるスタンスにしようと思います。 インストール pip install scrapy # or poetry add scrapy チュートリアルを試す こちらにしたがって、チュートリアルを試してみます。 scrapy startproject tutorial or poetry run scrapy startproject tutorial 
\u3092\u5b9f\u884c\u3059\u308b\u3068tutorial\u30d5\u30a9\u30eb\u30c0\u304c\u3067\u304d\u3066\u305d\u306e\u4e0b\u306b\u30c6\u30f3\u30d7\u30ec\u30fc\u30c8\u306e\u30bd\u30fc\u30b9\u30b3\u30fc\u30c9\u4e00\u5f0f\u304c\u51fa\u529b\u3055\u308c\u307e\u3059\u3002 tutorial\/spiders\u30d5\u30a9\u30eb\u30c0\u306e\u4e0b\u306b\u4ee5\u4e0b\u306e\u5185\u5bb9\u3067quotes_spider.py\u3092\u4f5c\u308a\u307e\u3059\u3002 import scrapy <\/span>Continue Reading<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[7,9],"tags":[],"_links":{"self":[{"href":"http:\/\/localhost:8000\/wp-json\/wp\/v2\/posts\/209"}],"collection":[{"href":"http:\/\/localhost:8000\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/localhost:8000\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/localhost:8000\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"http:\/\/localhost:8000\/wp-json\/wp\/v2\/comments?post=209"}],"version-history":[{"count":1,"href":"http:\/\/localhost:8000\/wp-json\/wp\/v2\/posts\/209\/revisions"}],"predecessor-version":[{"id":211,"href":"http:\/\/localhost:8000\/wp-json\/wp\/v2\/posts\/209\/revisions\/211"}],"wp:attachment":[{"href":"http:\/\/localhost:8000\/wp-json\/wp\/v2\/media?parent=209"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/localhost:8000\/wp-json\/wp\/v2\/categories?post=209"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/localhost:8000\/wp-json\/wp\/v2\/tags?post=209"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}
\n<\/p>\n\u30a4\u30f3\u30b9\u30c8\u30fc\u30eb<\/h2>\n
pip install scrapy\n# or\npoetry add scrapy<\/code><\/pre>\n
\u30c1\u30e5\u30fc\u30c8\u30ea\u30a2\u30eb\u3092\u8a66\u3059<\/h2>\n
scrapy startproject tutorial\nor\npoetry run scrapy startproject tutorial<\/code><\/pre>\n
tutorial<\/code>\u30d5\u30a9\u30eb\u30c0\u304c\u3067\u304d\u3066\u305d\u306e\u4e0b\u306b\u30c6\u30f3\u30d7\u30ec\u30fc\u30c8\u306e\u30bd\u30fc\u30b9\u30b3\u30fc\u30c9\u4e00\u5f0f\u304c\u51fa\u529b\u3055\u308c\u307e\u3059\u3002<\/p>\n
tutorial\/spiders<\/code>\u30d5\u30a9\u30eb\u30c0\u306e\u4e0b\u306b\u4ee5\u4e0b\u306e\u5185\u5bb9\u3067
quotes_spider.py<\/code>\u3092\u4f5c\u308a\u307e\u3059\u3002<\/p>\n
import scrapy\nfrom typing import List\n\nclass QuotesSpider(scrapy.Spider):\n name: str = "quotes"\n\n def start_requests(self):\n urls: List[str] = [\n 'http:\/\/quotes.toscrape.com\/page\/1\/',\n 'http:\/\/quotes.toscrape.com\/page\/2\/',\n ]\n for url in urls:\n yield scrapy.Request(url=url, callback=self.parse)\n\n def parse(self, response):\n page: str = response.url.split("\/")[-2]\n filename: str = 'quotes-%s.html' % page\n with open(filename, 'wb') as f:\n f.write(response.body)\n self.log('Saved file %s' % filename)<\/code><\/pre>\n
scrapy crawl quotes\nor\npoetry run scrapy crawl quotes<\/code><\/pre>\n
quotes-1.html<\/code>\u3068
quotes-2.html<\/code>\u304c\u51fa\u529b\u3055\u308c\u3066\u3044\u308c\u3070\u6210\u529f\u3067\u3059\u3002<\/p>\n
Scrapy\u306e\u69cb\u6210\u8981\u7d20<\/h2>\n
Spiders<\/a><\/h3>\n
\n
start_requests()<\/code>\u3067\u6700\u521d\u306eRequest\u30aa\u30d6\u30b8\u30a7\u30af\u30c8\uff08\u306e\u30ea\u30b9\u30c8\uff09\u3092\u8fd4\u5374\u3057\u307e\u3059<\/li>\n
parse()<\/code>\u304c\u547c\u3073\u51fa\u3055\u308c\u308b\u3002\u5f15\u6570\u306bResponse\u30aa\u30d6\u30b8\u30a7\u30af\u30c8\u304c\u6e21\u3063\u3066\u304f\u308b<\/li>\n
parse()<\/code>\u3067\u306f\u3001Response\u30aa\u30d6\u30b8\u30a7\u30af\u30c8\u3092\u5143\u306b\u51e6\u7406\u3092\u884c\u3044\uff08\u4f8b\u3048\u3070\u30d5\u30a1\u30a4\u30eb\u3092\u51fa\u529b\u3057\u305f\u308a\u3068\u304b\u3001DB\u306b\u8a18\u9332\u3057\u305f\u308a\u3068\u304b\uff09\u3001\u6b21\u306eRequest\u30aa\u30d6\u30b8\u30a7\u30af\u30c8\uff08or \u30ea\u30b9\u30c8\uff09\u304bItem\u30aa\u30d6\u30b8\u30a7\u30af\u30c8\u3092\u8fd4\u5374\u3059\u308b<\/li>\n
parse()<\/code>\u304c\u547c\u3073\u51fa\u3055\u308c\u308b<\/li>\n
Item Pipeline<\/a><\/h3>\n
\n一般的な用途は、以下の通りとのことです。<\/p>\n<ul>\n<li>HTMLデータのクレンジング<\/li>\n<li>スクレイピングしたデータのバリデーション<\/li>\n<li>重複チェック（重複したItemの破棄）<\/li>\n<li>データベースへの保存<\/li>\n<\/ul>\n
process_item(self, item, spider)<\/code>\u3068\u3044\u3046\u95a2\u6570\u3092\u6301\u3064\u30b7\u30f3\u30d7\u30eb\u306a\u30af\u30e9\u30b9\u3067\u3059\u3002
\nprocess_item()\u3067\u306f\u4f55\u3089\u304b\u306e\u30a2\u30af\u30b7\u30e7\u30f3\u3092\u884c\u3063\u305f\u5f8c\u306b\u3001Item\u30aa\u30d6\u30b8\u30a7\u30af\u30c8 or Deferred<\/code>\u3092\u8fd4\u5374\u3059\u308b\u304b\u3001
DropItem<\/code>\u30a8\u30e9\u30fc\u3092raise\u3059\u308b\u304b\u3001\u306e\u3044\u305a\u308c\u304b\u306e\u51e6\u7406\u3092\u884c\u3046\u5fc5\u8981\u304c\u3042\u308a\u307e\u3059\u3002<\/p>\n
\nなので、何らかのアクションをした結果で新しくRequestを投げたくなることがあるかもしれませんが、それはおそらくPipelineで実装すべきではなく、Spiderで実装すべきだと思います。
\n\u4e0b\u306e\u65b9\u306b\u3053\u308c\u306b\u8a72\u5f53\u3059\u308b\u5b9f\u88c5\u30b5\u30f3\u30d7\u30eb\u304c\u3042\u308b\u306e\u3067\u898b\u3066\u307f\u3066\u304f\u3060\u3055\u3044\u3002<\/p>\nItems<\/a><\/h3>\n
dictionaries<\/code>\u30fb
Item objects<\/code>\u30fb
dataclass objects<\/code>\u30fb
attrs objects<\/code>\u306a\u3069\u3044\u304f\u3064\u304b\u306e\u7a2e\u985e\u304c\u3042\u308b\u305d\u3046\u3067\u3059\u3002<\/p>\n
\n
dict<\/code><\/li>\n
scrapy.item.Item<\/code>\u3092\u7d99\u627f\u3057\u305f\u30af\u30e9\u30b9\u306e\u30aa\u30d6\u30b8\u30a7\u30af\u30c8<\/li>\n
dataclass<\/code>\u3092\u4f7f\u3063\u3066\u5b9a\u7fa9\u3057\u305f\u30af\u30e9\u30b9\u306e\u30aa\u30d6\u30b8\u30a7\u30af\u30c8<\/li>\n
attr.s<\/code>\u3092\u4f7f\u3063\u3066\u5b9a\u7fa9\u3057\u305f\u30af\u30e9\u30b9\u306e\u30aa\u30d6\u30b8\u30a7\u30af\u30c8<\/li>\n<\/ul>\n
\u5b9f\u88c5\u30b5\u30f3\u30d7\u30eb<\/h2>\n
python\u30b9\u30af\u30ea\u30d7\u30c8\u304b\u3089\u547c\u3073\u51fa\u3059<\/h3>\n
\n\u30c1\u30e5\u30fc\u30c8\u30ea\u30a2\u30eb\u30b3\u30fc\u30c9\u3092\u5c11\u3057\u5909\u66f4\u3057\u3066\u3001\u6307\u5b9a\u3057\u305fURL\u306eHTML\u3092\u62bd\u51fa\u3057\u3066\u30d5\u30a1\u30a4\u30eb\u306b\u51fa\u529b\u3059\u308b\u30b5\u30f3\u30d7\u30eb\u3092\u66f8\u3044\u3066\u307f\u3088\u3046\u3068\u601d\u3044\u307e\u3059\u3002<\/p>\nmain.py<\/code>\u3092\u4f5c\u3063\u3066\u3001\u305d\u306e\u4e2d\u3067
MySpider<\/code>\u30af\u30e9\u30b9\u3092\u4f5c\u6210\u3057\u3001
CrawlerProcess<\/code>\u3067Spider\u30af\u30e9\u30b9\u3092\u5b9f\u884c\u3059\u308b\u51e6\u7406\u3092\u66f8\u304d\u307e\u3057\u305f\u3002<\/p>\n
import scrapy\nfrom scrapy.http import Response\nfrom scrapy.crawler import CrawlerProcess\nfrom typing import List, Dict, Any\n\nclass MySpider(scrapy.Spider):\n name = 'my_spider'\n\n def __init__(self, urls: List[str], *args, **kwargs):\n # Request\u5bfe\u8c61\u306eURL\u3092\u6307\u5b9a\n self.start_urls = urls\n super(MySpider, self).__init__(*args, **kwargs)\n\n def parse(self, response: Response):\n page: str = response.url.split("\/")[-2]\n filename: str = 'quotes-%s.html' % page\n with open(filename, 'wb') as f:\n f.write(response.body)\n self.log('Saved file %s' % filename)\n\ndef main():\n # \u30b9\u30af\u30ec\u30a4\u30d4\u30f3\u30b0\u8a2d\u5b9a see: https:\/\/docs.scrapy.org\/en\/latest\/topics\/settings.html\n settings: Dict[str, Any] = {\n 'DOWNLOAD_DELAY': 3,\n 'TELNETCONSOLE_ENABLED': False,\n }\n\n # \u30af\u30ed\u30fc\u30ea\u30f3\u30b0\u5b9f\u884c\n process: CrawlerProcess = CrawlerProcess(settings=settings)\n process.crawl(MySpider, ['http:\/\/quotes.toscrape.com\/page\/1\/', 'http:\/\/quotes.toscrape.com\/page\/2\/'])\n process.start() # the script will block here until the crawling is finished\n\nif __name__ == "__main__":\n main()<\/code><\/pre>\n
\n
__init__()<\/code>\uff09\u306b
urls<\/code>\u3068\u3044\u3046\u5f15\u6570\u3092\u8ffd\u52a0\u3057\u3066\u304a\u308a\u3001\u3053\u306e
urls<\/code>\u3092\u30a4\u30f3\u30b9\u30bf\u30f3\u30b9\u5909\u6570
start_urls<\/code>\u306b\u8a2d\u5b9a\u3059\u308b\u3053\u3068\u3067\u3001\u30ea\u30af\u30a8\u30b9\u30c8\u5bfe\u8c61\u306eURL\u3068\u3057\u3066\u8a2d\u5b9a\u3057\u3066\u3044\u308b<\/li>\n
parse()<\/code>\u306e\u4e2d\u8eab\u306f\u30c1\u30e5\u30fc\u30c8\u30ea\u30a2\u30eb\u30b3\u30fc\u30c9\u3068\u5168\u304f\u540c\u3058<\/li>\n
main()<\/code>\u95a2\u6570\u306e\u4e2d\u3067\u307e\u305a\u30b9\u30af\u30ec\u30a4\u30d4\u30f3\u30b0\u8a2d\u5b9a\u3092Dict\u3067\u5b9a\u7fa9\u3057\u3066\u3044\u308b\u3002\u3053\u308c\u306f\u30c6\u30f3\u30d7\u30ec\u30fc\u30c8\u3060\u3068
settings.py<\/code>\u3067\u5b9a\u7fa9\u3057\u3066\u3044\u308b\u5185\u5bb9\u3092\u79fb\u690d\u3057\u305f\u3082\u306e\n
\n
CrawlerProcess<\/code>\u3082\u3057\u304f\u306f
CrawlerRunner<\/code>\u3067\u5b9f\u884c\u3067\u304d\u308b\u304c\u4eca\u56de\u306f
CrawlerProcess<\/code>\u3092\u4f7f\u3063\u3066\u3044\u308b\uff08\u6b63\u76f4\u4f7f\u3044\u5206\u3051\u304c\u3088\u304f\u308f\u304b\u3089\u306a\u3044\uff09<\/li>\n
process.crawl()<\/code>\u306e\u7b2c\u4e8c\u5f15\u6570\u4ee5\u964d\u304c
Spider\u30af\u30e9\u30b9<\/code>\u306e\u30b3\u30f3\u30b9\u30c8\u30e9\u30af\u30bf\u306b\u5f15\u6570\u3068\u3057\u3066\u6e21\u3055\u308c\u308b\n
\n
urls=['http:\/\/hoge', 'http:\/\/fuga']<\/code>\uff09<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n
python main.py<\/code><\/pre>\n
quotes-1.html<\/code>\u3068
quotes-2.html<\/code>\u304c\u51fa\u529b\u3055\u308c\u308b\u3068\u601d\u3044\u307e\u3059\u3002<\/p>\n
CrawlerProcess<\/code>\u3092\u4f7f\u3063\u3066Spider\u3092\u5b9f\u884c\u3059\u308b\u30b9\u30bf\u30f3\u30b9\u3067\u66f8\u3044\u3066\u3044\u304d\u305f\u3044\u3068\u601d\u3044\u307e\u3059\u3002<\/p>\n
Item Pipeline\u3092\u4f7f\u3046<\/h3>\n
items.py<\/code>\u3092\u4f5c\u3063\u3066
MyItem<\/code>\u30af\u30e9\u30b9\u3092\u5b9a\u7fa9\u3057\u307e\u3059\u3002\u4eca\u56de\u306f
dataclass<\/code>\u3092\u4f7f\u3063\u3066\u5b9a\u7fa9\u3057\u3066\u307f\u307e\u3057\u305f\u3002<\/p>\n
from dataclasses import dataclass\n\n@dataclass\nclass MyItem:\n url: str\n status: int\n title: str\n body: str<\/code><\/pre>\n
pipelines.py<\/code>\u3092\u4f5c\u3063\u3066\u3001
MyItem<\/code>\u30af\u30e9\u30b9\u3092\u53d7\u3051\u53d6\u3063\u3066\u30d5\u30a3\u30eb\u30bf\u30fc\u3057\u3066\u3001\u30d5\u30a1\u30a4\u30eb\u3092\u51fa\u529b\u3059\u308bPipeline\u3092\u5b9a\u7fa9\u3057\u307e\u3059\u3002
\nPipelineクラスはprocess_item<\/code>という名前の関数が必須で、第二引数にItemオブジェクト、第三引数にSpiderオブジェクトを受け取り、Itemオブジェクト or defer を返却するか、DropItemをraiseする必要があります。<\/p>\n
from scrapy.exceptions import DropItem\nfrom scrapy import Spider\nfrom items import MyItem\n\nclass StatusFilterPipeline:\n """\u30b9\u30c6\u30fc\u30bf\u30b9\u3067\u30d5\u30a3\u30eb\u30bf\u30fc\u3059\u308b\u30d1\u30a4\u30d7\u30e9\u30a4\u30f3"""\n def process_item(self, item: MyItem, spider: Spider) -> MyItem:\n if item.status != 200:\n raise DropItem(f'Status is not 200. status: {item.status}')\n return item\n\nclass BodyLengthFilterPipeline:\n """BODY\u30b5\u30a4\u30ba\u3067\u30d5\u30a3\u30eb\u30bf\u30fc\u3059\u308b\u30d1\u30a4\u30d7\u30e9\u30a4\u30f3"""\n def process_item(self, item: MyItem, spider: Spider) -> MyItem:\n if len(item.body) < 11000:\n raise DropItem(f'Body length less than 11000. body_length: {len(item.body)}')\n return item\n\nclass OutputFilePipeline:\n """\u30d5\u30a1\u30a4\u30eb\u51fa\u529b\u3059\u308b\u30d1\u30a4\u30d7\u30e9\u30a4\u30f3"""\n def process_item(self, item: MyItem, spider: Spider):\n filename: str = f'quotes-{item.url.split("\/")[-2]}.html'\n with open(filename, 'wb') as f:\n f.write(item.body)<\/code><\/pre>\n
main.py<\/code>\u3092\u4f5c\u3063\u3066
MySpider<\/code>\u30af\u30e9\u30b9\u3092\u4f5c\u308a\u3001
CrawlerProcess<\/code>\u3067\u5b9f\u884c\u3057\u307e\u3059\u3002<\/p>\n
from scrapy import Spider\nfrom scrapy.http import Response\nfrom scrapy.crawler import CrawlerProcess\nfrom typing import List, Dict, Any, Iterator\nfrom items import MyItem\n\nclass MySpider(Spider):\n name = 'my_spider'\n\n def __init__(self, urls: List[str], *args, **kwargs):\n self.start_urls = urls\n super(MySpider, self).__init__(*args, **kwargs)\n\n def parse(self, response: Response) -> Iterator[MyItem]:\n yield MyItem(\n url=response.url,\n status=response.status,\n title=response.xpath('\/\/title\/text()').extract_first(),\n body=response.body,\n )\n\ndef main():\n # \u30b9\u30af\u30ec\u30a4\u30d4\u30f3\u30b0\u8a2d\u5b9a see: https:\/\/docs.scrapy.org\/en\/latest\/topics\/settings.html\n settings: Dict[str, Any] = {\n 'DOWNLOAD_DELAY': 3,\n 'TELNETCONSOLE_ENABLED': False,\n 'ITEM_PIPELINES': {\n 'pipelines.StatusFilterPipeline': 100,\n 'pipelines.BodyLengthFilterPipeline': 200,\n 'pipelines.OutputFilePipeline': 300,\n },\n }\n\n # \u30af\u30ed\u30fc\u30ea\u30f3\u30b0\u5b9f\u884c\n process: CrawlerProcess = CrawlerProcess(settings=settings)\n process.crawl(MySpider, ['http:\/\/quotes.toscrape.com\/page\/1\/', 'http:\/\/quotes.toscrape.com\/page\/2\/', 'http:\/\/quotes.toscrape.com\/page\/3\/'])\n process.start() # the script will block here until the crawling is finished\n\nif __name__ == "__main__":\n main()<\/code><\/pre>\n
\n
parse()<\/code>\u95a2\u6570\u306f\u3001\u30b8\u30a7\u30cd\u30ec\u30fc\u30bf\u30fc\u95a2\u6570\u306b\u306a\u3063\u3066\u3044\u3066\u3001
MyItem<\/code>\u30aa\u30d6\u30b8\u30a7\u30af\u30c8\u3092yield\u3057\u3066\u3044\u308b<\/li>\n
settings<\/code>\u306e
ITEM_PIPELINES<\/code>\u3067Pipeline\u306e\u9806\u756a\u3092\u5b9a\u7fa9\u3057\u3066\u3044\u308b\u3002\u3053\u3053\u306b\u306fPipeline\u30af\u30e9\u30b9\u306e\u30d1\u30b9\u3092\u6307\u5b9a\u3059\u308b\u5fc5\u8981\u304c\u3042\u308b\u3002\u307e\u305f\u6570\u5b57\uff080-1000\uff09\u306e\u9806\u306b\u30d1\u30a4\u30d7\u30e9\u30a4\u30f3\u306f\u5b9f\u884c\u3055\u308c\u308b<\/li>\n<\/ul>\n
python main.py<\/code><\/pre>\n
quotes-1.html<\/code>\u3068
quotes-2.html<\/code>\u306e\u4e8c\u3064\u306e\u30d5\u30a1\u30a4\u30eb\u304c\u51fa\u529b\u3055\u308c\u3066\u3044\u308c\u3070\u6210\u529f\u3067\u3059\u3002
\n\u3061\u306a\u307f\u306b\u3001\u30d5\u30a3\u30eb\u30bf\u30fc\u3067DropItem<\/code>\u3092raise\u3057\u3066\u9664\u5916\u3057\u305fItem\u306b\u3064\u3044\u3066\u306f\u3001\u4ee5\u4e0b\u306e\u3088\u3046\u306bWARNING\u30ed\u30b0\u304c\u51fa\u529b\u3055\u308c\u3066\u3044\u307e\u3059\u3002<\/p>\n
2020-08-13 19:11:28 [scrapy.core.scraper] WARNING: Dropped: Body length less than 11000. body_length: 10018<\/code><\/pre>\n
\nPipeline\u306b\u3088\u3063\u3066\u8cac\u52d9\u304c\u660e\u78ba\u306b\u306a\u308b\u306e\u304c\u826f\u3055\u305d\u3046\u3067\u3059\u304c\u3001\u7406\u89e3\u3059\u3079\u304d\u4e8b\u9805\u304c\u5897\u3048\u3066\u3057\u307e\u3046\u306e\u3067\u3001\u500b\u4eba\u7684\u306b\u306f\u7279\u5225\u306a\u77e5\u8b58\u304c\u306a\u304f\u3066\u3082\u30b3\u30fc\u30c9\u3092\u898b\u308c\u3070\u308f\u304b\u308b\u3088\u3046\u306b\u3001\u305f\u3060\u306e\u95a2\u6570\u3068\u3057\u3066\u5b9a\u7fa9\u3057\u3066\u3057\u307e\u3046\u65b9\u304c\u597d\u307f\u3067\u3059\u306d\u3002<\/p>\nItem Pipeline\u3092\u4f7f\u3046\u30b1\u30fc\u30b9\u3067parse\u6642\u306bItem\u3068\u6b21\u306eRequest\u306eyield\u3092\u4e21\u65b9\u5b9f\u884c\u3059\u308b<\/h3>\n
parse()<\/code>\u306e\u4e2d\u3067Item\u3068Request\u306e\u4e21\u65b9\u3092yield\u3059\u308c\u3070\u826f\u3044\u306f\u305a\u3067\u3059\u3002<\/p>\n
main.py<\/code>\u306e
parse()<\/code>\u306b3\u884c\u8ffd\u52a0\u3057\u3066\u3001
main()<\/code>\u3067
process.crawl()<\/code>\u3059\u308b\u969b\u306b\u6e21\u3059URL\u30921\u30da\u30fc\u30b8\u76ee\u3060\u3051\u306b\u7d5e\u3063\u305f\u3060\u3051\u3067\u3059\u3002<\/p>\n
from scrapy import Request # <= 追加: Requestをimport\nfrom typing import Union # <= 追加: Union型をimport\n\nclass MySpider(Spider):\n ### (省略) ###\n\n def parse(self, response: Response) -> Iterator[Union[MyItem, Request]]:\n yield MyItem(\n url=response.url,\n status=response.status,\n title=response.xpath('\/\/title\/text()').extract_first(),\n body=response.body,\n )\n if len(response.css('li.next > a')) > 0: # <= \u3053\u308c\u4ee5\u4e0b3\u884c\u3092\u8ffd\u52a0\n next_url: str = response.urljoin(response.css('li.next > a').attrib['href'])\n yield Request(url=next_url, callback=self.parse)\n\ndef main():\n ### (省略) ###\n\n # クローリング実行\n process: CrawlerProcess = CrawlerProcess(settings=settings)\n process.crawl(MySpider, ['http:\/\/quotes.toscrape.com\/page\/1\/']) # <= 開始URLをpage=1だけに変更\n process.start() # the script will block here until the crawling is finished<\/code><\/pre>\n
quotes-1.html<\/code>\u30fb
quotes-2.html<\/code>\u30fb
quotes-8.html<\/code>\u30fb
quotes-9.html<\/code>\u306e4\u30da\u30fc\u30b8\u5206\u306eHTML\u304c\u51fa\u529b\u3055\u308c\u3066\u3001\u305d\u306e\u4ed6\u306e\u30da\u30fc\u30b8\u306f
DropItem<\/code>\u3055\u308c\u305f\u65e8\u306e\u30ed\u30b0\u304c\u51fa\u3066\u3044\u305f\u306e\u3067\u3001\u6b63\u5e38\u306bSpider\u306e
parse()<\/code>\u51e6\u7406\u3067\u3001Item\u3068\u6b21\u306eRequest\u306e\u4e21\u65b9\u3092yield\u3059\u308b\u3053\u3068\u304c\u3067\u304d\u3066\u3044\u305f\u3088\u3046\u3067\u3059\u3002<\/p>\n
LinkExtractor<\/a>\u3092\u5229\u7528\u3059\u308b<\/h3>\n
main.py<\/code>\u3092\u4f5c\u3063\u3066\u3001
MyLinkExtractSpider<\/code>\u30af\u30e9\u30b9\u3092\u5b9f\u88c5\u3057\u3001Rule\u306b
LinkExtractor<\/code>\u3092\u6307\u5b9a\u3057\u305f\u3060\u3051\u306e\u30b5\u30f3\u30d7\u30eb\u3092\u5b9f\u88c5\u3057\u307e\u3057\u305f\u3002<\/p>\n
from scrapy.http import Response\nfrom scrapy.crawler import CrawlerProcess\nfrom scrapy.spiders import CrawlSpider, Rule\nfrom scrapy.linkextractors import LinkExtractor\nfrom typing import Dict, Any\n\nclass MyLinkExtractSpider(CrawlSpider):\n name = 'my_link_extract_spider'\n start_urls = ['http:\/\/quotes.toscrape.com\/']\n rules = (\n Rule(\n LinkExtractor(\n allow=r'http:\/\/quotes.toscrape.com\/page\/\\d+\/', # \u8a31\u53ef\u3059\u308bURL\u306e\u30d1\u30bf\u30fc\u30f3\u3092\u6b63\u898f\u8868\u73fe\u3067\n unique=True, # URL\u3092\u30e6\u30cb\u30fc\u30af\u306b\u3059\u308b\u304b\u3069\u3046\u304b\n tags=['a'], # \u5bfe\u8c61\u3068\u3059\u308b\u30bf\u30b0\n ),\n follow=True, # Response\u30aa\u30d6\u30b8\u30a7\u30af\u30c8\u306e\u30ea\u30f3\u30af\u3092\u305f\u3069\u308b\u304b\u3069\u3046\u304b\n callback='log_url' # \u5404\u30ea\u30f3\u30af\u306b\u5bfe\u3059\u308bResponse\u3092\u53d7\u3051\u53d6\u3063\u305f\u969b\u306b\u547c\u3073\u51fa\u3055\u308c\u308b\u30b3\u30fc\u30eb\u30d0\u30c3\u30af\u95a2\u6570\n ),\n )\n\n def log_url(self, response: Response):\n print(f'response.url: {response.url}')\n\ndef main():\n settings: Dict[str, Any] = {\n 'DOWNLOAD_DELAY': 3,\n 'TELNETCONSOLE_ENABLED': False,\n }\n\n process: CrawlerProcess = CrawlerProcess(settings=settings)\n process.crawl(MyLinkExtractSpider)\n process.start() # the script will block here until the crawling is finished\n\nif __name__ == "__main__":\n main()<\/code><\/pre>\n
\n
\n
process_value<\/code> accepts a function, so even if an href contains a JavaScript call, the function can parse it and pull out just the URL portion<\/li>\n<\/ul>\n<\/li>\n
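As a sketch of that idea (the `javascript:goTo(...)` href pattern and the helper name here are hypothetical, not from the original post), a `process_value` function might look like:

```python
import re

def extract_url(value: str) -> str:
    """Hypothetical process_value hook: pull the URL out of an href
    like "javascript:goTo('/page/2/')"; pass other values through unchanged."""
    match = re.search(r"javascript:goTo\('([^']+)'\)", value)
    return match.group(1) if match else value

# it would then be wired in roughly as:
#   LinkExtractor(allow=r'/page/\d+/', process_value=extract_url)
```

LinkExtractor calls `process_value` once per extracted attribute value; returning `None` from the hook drops that link entirely.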
python main.py<\/code> runs the sample.<\/p>\n
response.url: http:\/\/quotes.toscrape.com\/page\/1\/\nresponse.url: http:\/\/quotes.toscrape.com\/page\/2\/\nresponse.url: http:\/\/quotes.toscrape.com\/page\/3\/\n: (\u7701\u7565)\nresponse.url: http:\/\/quotes.toscrape.com\/page\/10\/<\/code><\/pre>\n
SitemapSpider<\/a> usage<\/h3>\n
\nThis sample<\/a> served as a reference for a quick-and-dirty Spider that just logs Qiita's URL list.<\/p>\nfrom scrapy.http import Response\nfrom scrapy.spiders import SitemapSpider\n\nclass MySitemapSpider(SitemapSpider):\n name = 'my_sitemap_spider'\n sitemap_urls = ['https:\/\/qiita.com\/robots.txt'] # a sitemap.xml URL or a robots.txt URL\n\n def parse(self, response: Response):\n print(f'response.url: {response.url}')<\/code><\/pre>\n
response.url: https:\/\/qiita.com\/tags\/comprehension\nresponse.url: https:\/\/qiita.com\/tags\/%23rute53\nresponse.url: https:\/\/qiita.com\/tags\/%23dns\nresponse.url: https:\/\/qiita.com\/tags\/%EF%BC%83lightsail\n:<\/code><\/pre>\n
Other notes<\/h2>\n
scrapy shell is handy<\/h3>\n
scrapy shell TARGET_URL<\/code><\/pre>\n
scrapy shell http:\/\/quotes.toscrape.com\/<\/code>, for example, makes the following objects available.<\/p>\n
[s] Available Scrapy objects:\n[s] scrapy scrapy module (contains scrapy.Request, scrapy.Selector, etc)\n[s] crawler <scrapy.crawler.Crawler object at 0x102d21610>\n[s] item {}\n[s] request <GET http:\/\/quotes.toscrape.com\/>\n[s] response <200 http:\/\/quotes.toscrape.com\/>\n[s] settings <scrapy.settings.Settings object at 0x102d219a0>\n[s] spider <DefaultSpider 'default' at 0x1030efb80>\n[s] Useful shortcuts:\n[s] fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)\n[s] fetch(req) Fetch a scrapy.Request and update local objects \n[s] shelp() Shell help (print this help)\n[s] view(response) View response in a browser<\/code><\/pre>\n
>>> response.css('li.next > a')\n[<Selector xpath="descendant-or-self::li[@class and contains(concat(' ', normalize-space(@class), ' '), ' next ')]\/a" data='<a href="\/page\/2\/">Next <span aria-hi...'>]\n>>> response.css('li.next > a').attrib['href']\n'\/page\/2\/'<\/code><\/pre>\n
Headless browser support<\/h3>\n
\nhttps:\/\/github.com\/scrapy-plugins\/scrapy-splash<\/a><\/p>\nscrapy.Request<\/code> can be swapped for a
SplashRequest <\/code>: yield one and the page is loaded in the Splash headless browser, the
SplashRequest<\/code>'s attached JavaScript code<\/a> is executed, and the value
return<\/code>ed from that code is apparently sent back as the HTTP Response.<\/p>\n
document.querySelector()<\/code> and the like can be executed, so fetching a specific element and reading its attributes looks feasible; how well it handles page navigation, though, remains an open question.<\/p>\n
Running CrawlerProcess (CrawlerRunner) twice in the same process<\/h3>\n
twisted.internet.reactor<\/code> is used to detect when crawling has finished. The trouble is that this
twisted.internet.reactor<\/code> cannot be started twice within the same process.<\/p>\n
process: CrawlerProcess = CrawlerProcess(settings=settings)\n\n # first run\n process.crawl(MySpider, ['http:\/\/quotes.toscrape.com\/page\/1\/', 'http:\/\/quotes.toscrape.com\/page\/2\/'])\n process.start() # the script will block here until the crawling is finished\n # second run: fails, because the reactor cannot be restarted\n process.crawl(MySpider, ['http:\/\/quotes.toscrape.com\/page\/1\/', 'http:\/\/quotes.toscrape.com\/page\/2\/'])\n process.start() # the script will block here until the crawling is finished<\/code><\/pre>\n
twisted.internet.error.ReactorNotRestartable<\/code> is raised.<\/p>\n
Traceback (most recent call last):\n File "main.py", line 41, in <module>\n main()\n File "main.py", line 37, in main\n process.start() # the script will block here until the crawling is finished\n File "~\/python_scraping_sample\/.venv\/lib\/python3.8\/site-packages\/scrapy\/crawler.py", line 327, in start\n reactor.run(installSignalHandlers=False) # blocking call\n File "~\/python_scraping_sample\/.venv\/lib\/python3.8\/site-packages\/twisted\/internet\/base.py", line 1282, in run\n self.startRunning(installSignalHandlers=installSignalHandlers)\n File "~\/python_scraping_sample\/.venv\/lib\/python3.8\/site-packages\/twisted\/internet\/base.py", line 1262, in startRunning\n ReactorBase.startRunning(self)\n File "~\/python_scraping_sample\/.venv\/lib\/python3.8\/site-packages\/twisted\/internet\/base.py", line 765, in startRunning\n raise error.ReactorNotRestartable()\ntwisted.internet.error.ReactorNotRestartable<\/code><\/pre>\n
process._stop_reactor()<\/code> made no difference, and everything else I tried failed as well）<\/p>\n
concurrent.futures.ProcessPoolExecutor<\/code> or
multiprocessing.Process<\/code> can be used instead; implemented along the following lines, it works without issue.<\/p>\n
from multiprocessing import Process\nfrom typing import Any, Dict, List\n\nfrom scrapy.crawler import CrawlerProcess\n\ndef start_crawl(settings: Dict[str, Any], urls: List[str]):\n process: CrawlerProcess = CrawlerProcess(settings=settings)\n process.crawl(MySpider, urls) # MySpider implementation omitted\n process.start() # the script will block here until the crawling is finished\n\ndef main():\n settings: Dict[str, Any] = {\n 'DOWNLOAD_DELAY': 3,\n 'TELNETCONSOLE_ENABLED': False,\n }\n\n # run the crawl twice, each in its own process\n Process(target=start_crawl, args=(settings, ['http:\/\/quotes.toscrape.com\/page\/1\/'])).start()\n Process(target=start_crawl, args=(settings, ['http:\/\/quotes.toscrape.com\/page\/2\/', 'http:\/\/quotes.toscrape.com\/page\/3\/'])).start()\n\nif __name__ == "__main__":\n main()<\/code><\/pre>\n
Repository<\/h2>\n
Closing thoughts<\/h2>\n
\nsettings<\/a> alone can change Scrapy's behavior, which is another point in its favor.
\nOn the other hand, my impression is that it is a library with a high learning cost: reading only the source code, without the documentation, makes it hard to understand how it actually behaves.<\/p>\n