{"id":300,"date":"2020-02-24T11:04:36","date_gmt":"2020-02-24T02:04:36","guid":{"rendered":"http:\/\/localhost:8000\/?p=300"},"modified":"2021-01-17T11:06:33","modified_gmt":"2021-01-17T02:06:33","slug":"nlp-preprocessing","status":"publish","type":"post","link":"http:\/\/localhost:8000\/2020\/02\/nlp-preprocessing.html","title":{"rendered":"NLP\u306e\u524d\u51e6\u7406"},"content":{"rendered":"

\u4ed5\u4e8b\u3067\u81ea\u7136\u8a00\u8a9e\u51e6\u7406\uff08NLP\uff09\u306b\u5c11\u3057\u53d6\u308a\u7d44\u3080\u5fc5\u8981\u304c\u51fa\u3066\u304d\u305f\u306e\u3067\u3001\u81ea\u5206\u306a\u308a\u306e\u7406\u89e3\u3092Tips\u3068\u3057\u3066\u307e\u3068\u3081\u3066\u3044\u3053\u3046\u3068\u601d\u3044\u307e\u3059\u3002<\/p>\n

\u5c0f\u6587\u5b57\u5316<\/h2>\n

\u6587\u5b57\u306e\u6b63\u898f\u5316\u3068\u3044\u3046\u610f\u5473\u3067\u3001\u30a2\u30eb\u30d5\u30a1\u30d9\u30c3\u30c8\u3092\u5c0f\u6587\u5b57\u5316\u3057\u307e\u3059\u3002\u65e5\u672c\u8a9e\u306e\u5834\u5408\u306f\u3001\u534a\u89d2\u3092\u5168\u89d2\u306b\u7d71\u4e00\u3059\u308b\u3001\u306a\u3069\u306e\u5bfe\u5fdc\u3082\u5fc5\u8981\u3068\u601d\u3044\u307e\u3059\u3002<\/p>\n

sentences: List[str] = ['I  have  a pen', 'That  is a window']\nprint(sentences)\n# -> ['I  have  a pen', 'That  is a window']\n\nlower_sentences: List[str] = list(\n    sentence.lower() for sentence in sentences\n)\nprint(lower_sentences)\n# -> ['i  have  a pen', 'that  is a window']<\/code><\/pre>\n

tokenize<\/h2>\n

\u6587\u66f8\u3092\u30c8\u30fc\u30af\u30f3\uff08\u6700\u5c0f\u5358\u4f4d\u306e\u6587\u5b57\u3084\u6587\u5b57\u5217\uff09\u306b\u5206\u5272\u3057\u307e\u3059\u3002\u5206\u5272\u306b\u306f\u3001nltk\uff08\u81ea\u7136\u8a00\u8a9e\u51e6\u7406\u306e\u30c4\u30fc\u30eb\u30ad\u30c3\u30c8\u3092\u63d0\u4f9b\u3059\u308b\u30e9\u30a4\u30d6\u30e9\u30ea\uff09\u3092\u5229\u7528\u3057\u307e\u3059\u3002\u82f1\u8a9e\u306a\u306e\u3067punkt<\/code>\u30d1\u30c3\u30b1\u30fc\u30b8\u3092\u4f7f\u3063\u3066\u3044\u307e\u3059\u3002
\n\u6587\u66f8\u306e\u30ea\u30b9\u30c8\u3092\u30c8\u30fc\u30af\u30f3\u306b\u5206\u5272\u3059\u308b\u30b5\u30f3\u30d7\u30eb\u306f\u4ee5\u4e0b\u306e\u901a\u308a\u3067\u3059\u3002<\/p>\n

import nltk\nnltk.download('punkt')\nfrom typing import List\nsentences: List[str] = ['i  have  a pen', 'that  is a window']\nprint(sentences)\n# -> ['i  have  a pen', 'that  is a window']\n\nwords_list: List[List[str]] = list(\n    nltk.tokenize.word_tokenize(sentence) for sentence in sentences\n)\nprint(words_list)\n# -> [['i', 'have', 'a', 'pen'], ['that', 'is', 'a', 'window']]<\/code><\/pre>\n

stop-words\u9664\u5916<\/h2>\n

stop-word\u3068\u306f\u3001the<\/code>,a<\/code>,for<\/code>,of<\/code>\u306e\u3088\u3046\u306a\u4e00\u822c\u8a9e\u306a\u3069\u3001\u5206\u6790\u306b\u5f71\u97ff\u3092\u4e0e\u3048\u306a\u3044\u5358\u8a9e\u306e\u3053\u3068\u3067\u3001\u3053\u308c\u3089\u3092\u9664\u5916\u3059\u308b\u3053\u3068\u306b\u3088\u3063\u3066\u5f8c\u7d9a\u51e6\u7406\u306e\u8a08\u7b97\u91cf\u3092\u4e0b\u3052\u308b\u3053\u3068\u304c\u3067\u304d\u307e\u3059\u3002nltk<\/code>\u3067stopwords<\/code>\u304c\u5b9a\u7fa9\u3055\u308c\u3066\u3044\u308b\u306e\u3067\u3001\u305d\u3061\u3089\u3092\u5229\u7528\u3059\u308b\u30b5\u30f3\u30d7\u30eb\u3092\u8a18\u8f09\u3057\u307e\u3059\u3002<\/p>\n

from typing import List\nimport nltk\nnltk.download('stopwords')\n\nwords_list: List[List[str]] = [['i', 'have', 'a', 'pen'], ['that', 'is', 'a', 'window']]\nstopwords: List[str] = nltk.corpus.stopwords.words('english')\nprint(stopwords)\n# -> ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', ...\uff08\u7701\u7565\uff09]\n\nnormalized_words_list: List[List[str]] = list(\n    list(word for word in words if word not in stopwords) for words in words_list\n)\nprint(normalized_words_list)\n# -> [['pen'], ['window']]<\/code><\/pre>\n

\u4e0a\u8a18\u4ee5\u5916\u306b\u3001\u8a18\u53f7\u30841\u6587\u5b57\u306e\u82f1\u6570\u5b57\u3092\u9664\u5916\u3057\u305f\u3044\u30b1\u30fc\u30b9\u3082\u3042\u308b\u3068\u601d\u3044\u307e\u3059\u3002\u305d\u306e\u5834\u5408\u306fstring<\/code>\u30d1\u30c3\u30b1\u30fc\u30b8\u3092\u4f7f\u3063\u3066\u6587\u5b57\u3092\u53d6\u5f97\u3057\u3066\u3001\u305d\u308c\u3089\u3092\u4f7f\u3063\u3066\u9664\u5916\u3059\u308b\u3068\u697d\u3060\u3068\u601d\u3044\u307e\u3059\u3002<\/p>\n

from typing import List\nimport string\nexclude_words: List[str] = list(string.ascii_lowercase) + list(string.digits) + list(string.punctuation)\nprint(exclude_words)\n# -> ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '!', '"', '#', '$', '%', '&', "'", '(', ')', '*', '+', ',', '-', '.', '\/', ':', ';', '<', '=', '>', '?', '@', '[', '\\\\', ']', '^', '_', '`', '{', '|', '}', '~']<\/code><\/pre>\n

\u30b7\u30ce\u30cb\u30e0\u306e\u9069\u7528<\/h2>\n

TBD<\/p>\n

\u30b9\u30c6\u30df\u30f3\u30b0\uff08stemming\uff09<\/h2>\n

\u30b9\u30c6\u30df\u30f3\u30b0\u3068\u306f\u3001\u8a9e\u5c3e\u304c\u5909\u5316\u3059\u308b\u5358\u8a9e\u306e\u8a9e\u5e79\u90e8\u5206\u3092\u629c\u304d\u51fa\u3059\u51e6\u7406\u306e\u3053\u3068\u3092\u8a00\u3044\u307e\u3059\u3002\u305d\u3053\u3060\u3051\u629c\u304d\u51fa\u3059\u3068\u4eba\u9593\u306b\u306f\u9055\u548c\u611f\u304c\u3042\u308b\u6587\u5b57\u5217\u306b\u306a\u308b\u3053\u3068\u3082\u3042\u308a\u307e\u3059\u3002
\nnltk<\/code>\u306ePorterStemmer<\/code>\u3092\u4f7f\u3046\u30b5\u30f3\u30d7\u30eb\u3092\u8a18\u8f09\u3057\u307e\u3057\u305f\u3002mechanical<\/code>\u304cmechan<\/code>\u306b\u306a\u3063\u305f\u308a\u3001pencils<\/code>\u304cpencil<\/code>\u306b\u5909\u63db\u3055\u308c\u308b\u4e00\u65b9\u3067\u3001went<\/code>\u306fgo<\/code>\u306b\u306f\u306a\u3089\u306a\u3044\u3088\u3046\u3067\u3059\uff08\u8a9e\u5e79\u90e8\u5206\u3092\u629c\u304d\u51fa\u3057\u305f\u3060\u3051\u306a\u306e\u3067\uff09\u3002<\/p>\n

import nltk\nfrom nltk.stem.porter import PorterStemmer\nps: PorterStemmer = PorterStemmer()\nwords: List[str] = ['mechanical', 'pencil', 'go', 'went', 'goes', 'pencils']\nprint(words)\n# -> ['mechanical', 'pencil', 'go', 'went', 'goes', 'pencils']\n\nstemmed_words: List[str] = list(ps.stem(word) for word in words)\nprint(stemmed_words)\n# -> ['mechan', 'pencil', 'go', 'went', 'goe', 'pencil']<\/code><\/pre>\n

\u30b3\u30fc\u30d1\u30b9\u5316<\/h2>\n

\u30b3\u30fc\u30d1\u30b9\u3068\u3044\u3046\u306e\u306f\u3001\u81ea\u7136\u8a00\u8a9e\u51e6\u7406\u3092\u884c\u3044\u3084\u3059\u3044\u5f62\u306b\u3001\u69cb\u9020\u5316\u3055\u308c\u305f\u30c7\u30fc\u30bf\u306e\u3053\u3068\u3092\u3044\u3046\u3089\u3057\u3044\u3067\u3059\u3002\u81ea\u5206\u304c\u3053\u308c\u307e\u3067\u898b\u305f\u3082\u306e\u306f\u30b3\u30fc\u30d1\u30b9\u5316\uff1d\u30d9\u30af\u30c8\u30eb\u5316\u306e\u3088\u3046\u3067\u3057\u305f\u3002
\nwords_list<\/code>\u3092\u3001\u5404\u8981\u7d20\u304c\u5358\u8a9e\u3001\u305d\u306e\u5024\u304c\u51fa\u73fe\u56de\u6570\u3092\u8868\u3059\u30d9\u30af\u30c8\u30eb\u306b\u5909\u63db\u3059\u308b\u30b5\u30f3\u30d7\u30eb\u306f\u4ee5\u4e0b\u306e\u901a\u308a\u3067\u3059\u3002<\/p>\n

from typing import List, Tuple\nfrom gensim import corpora\n\nwords_list: List[List[str]] = [['i', 'have', 'a', 'red', 'pen', 'and', 'a', 'blue', 'pen'], ['you', 'like', 'red']]\nprint(words_list)\n# -> [['i', 'have', 'a', 'red', 'pen', 'and', 'a', 'blue', 'pen'], ['you', 'like', 'red']]\n\ndictionary: corpora.Dictionary = corpora.Dictionary(words_list)\nprint(dictionary.token2id)\n# -> {'a': 0, 'and': 1, 'blue': 2, 'have': 3, 'i': 4, 'pen': 5, 'red': 6, 'like': 7, 'you': 8}\n\ncorpus: List[List[Tuple[int, int]]] = list(map(dictionary.doc2bow, words_list)) # \u30d9\u30af\u30c8\u30eb\u5316\nprint(corpus)\n# -> [[(0, 2), (1, 1), (2, 1), (3, 1), (4, 1), (5, 2), (6, 1)], [(6, 1), (7, 1), (8, 1)]]<\/code><\/pre>\n

dictionary<\/code>\u306f\u5358\u8a9e\u6587\u5b57\u5217 -> ID\uff08\u9023\u756a\u3002int\uff09\u306b\u5909\u63db\u3059\u308b\u8f9e\u66f8\u3067\u3059\u3002\u6587\u5b57\u5217\u3092\u305d\u306e\u307e\u307e\u6271\u3046\u3068\u30e1\u30e2\u30ea\u3092\u98df\u3046\u306e\u3067int\u306b\u5909\u63db\u3057\u3066\u3044\u308b\u306e\u3060\u3068\u3082\u601d\u3044\u307e\u3059\u3002
\n\u3053\u3053\u3067\u4f5c\u6210\u3057\u305fcorpus<\/code>\u306f\u3001\u5217\u304c\u5358\u8a9e\uff08\u6b63\u78ba\u306b\u306f\u5358\u8a9eID\uff09\u3001\u884c\u304c\u6587\u66f8\u3092\u610f\u5473\u3059\u308b\u884c\u5217\u3068\u8003\u3048\u308b\u3053\u3068\u304c\u3067\u304d\u307e\u3059\u30021\u884c\uff1d1\u6587\u66f8\u30d9\u30af\u30c8\u30eb\u3067\u3059\u3002
\n\u306a\u304adoc2bow<\/code>\u306ebow<\/code>\u306fbag-of-words<\/code>\u306e\u7565\u3067\u3001\u6587\u66f8\u3092bag-of-words<\/code>\u306b\u5909\u63db\u3059\u308b\u95a2\u6570\u306b\u306a\u308a\u307e\u3059\u3002bag-of-words<\/code>\u306f\u76f4\u8a33\u3059\u308b\u3068\u5358\u8a9e\u306e\u888b<\/code>\u3067\u3001\u5358\u8a9e\u3092\u888b\u306b\u307e\u3068\u3081\u3066\uff08\u3064\u307e\u308a\u5358\u8a9e\u3067\u30b0\u30eb\u30fc\u30d4\u30f3\u30b0\u3057\u3066\uff09\u3001\u305d\u306e\u888b\u306e\u4e2d\u306b\u30e1\u30bf\u60c5\u5831\u3092\u6301\u305f\u305b\u308b\u3068\u3044\u3046\u30a4\u30e1\u30fc\u30b8\u306e\u3088\u3046\u3067\u3059\u3002\u3053\u306e\u30b1\u30fc\u30b9\u3067\u306f\u30e1\u30bf\u60c5\u5831\u3068\u3057\u3066\u5358\u8a9e\u306e\u51fa\u73fe\u56de\u6570\u3092\u4fdd\u6301\u3057\u3066\u3044\u307e\u3059\u3002<\/p>\n","protected":false},"excerpt":{"rendered":"

\u4ed5\u4e8b\u3067\u81ea\u7136\u8a00\u8a9e\u51e6\u7406\uff08NLP\uff09\u306b\u5c11\u3057\u53d6\u308a\u7d44\u3080\u5fc5\u8981\u304c\u51fa\u3066\u304d\u305f\u306e\u3067\u3001\u81ea\u5206\u306a\u308a\u306e\u7406\u89e3\u3092Tips\u3068\u3057\u3066\u307e\u3068\u3081\u3066\u3044\u3053\u3046\u3068\u601d\u3044\u307e\u3059\u3002 \u5c0f\u6587\u5b57\u5316 \u6587\u5b57\u306e\u6b63\u898f\u5316\u3068\u3044\u3046\u610f\u5473\u3067\u3001\u30a2\u30eb\u30d5\u30a1\u30d9\u30c3\u30c8\u3092\u5c0f\u6587\u5b57\u5316\u3057\u307e\u3059\u3002\u65e5\u672c\u8a9e\u306e\u5834\u5408\u306f\u3001\u534a\u89d2\u3092\u5168\u89d2\u306b\u7d71\u4e00\u3059\u308b\u3001\u306a\u3069\u306e\u5bfe\u5fdc\u3082\u5fc5\u8981\u3068\u601d\u3044\u307e\u3059\u3002 sentences: List[str] = ['I have a pen', 'That is a window'] print(sentences) # -> ['I have a pen', 'That is a window'] lower_sentences: List[str] = list( sentence.lower() for sentence in sentences ) print(lower_sentences) # -> <\/span>Continue Reading<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[30],"tags":[],"_links":{"self":[{"href":"http:\/\/localhost:8000\/wp-json\/wp\/v2\/posts\/300"}],"collection":[{"href":"http:\/\/localhost:8000\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/localhost:8000\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/localhost:8000\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"http:\/\/localhost:8000\/wp-json\/wp\/v2\/comments?post=300"}],"version-history":[{"count":1,"href":"http:\/\/localhost:8000\/wp-json\/wp\/v2\/posts\/300\/revisions"}],"predecessor-version":[{"id":301,"href":"http:\/\/localhost:8000\/wp-json\/wp\/v2\/posts\/300\/revisions\/301"}],"wp:attachment":[{"href":"http:\/\/localhost:8000\/wp-json\/wp\/v2\/media?parent=300"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/localhost:8000\/wp-json\/wp
\/v2\/categories?post=300"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/localhost:8000\/wp-json\/wp\/v2\/tags?post=300"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}