1.
한동안 트위터 트래픽을 이용하여 주가를 예측하는 감성지수(Sentiment Index)가 유행이었습니다.
감성지수과 트레이딩
트위터 감성지수를 이용한 예측지수 모형
트럼프시대의 매매기술, 트위터연동 알고리즘
지금 해외의 경우 SNS를 이용한 감성지수와 관련한 글보다는 SNS의 텍스트를 Altenrnative Data의 한 구성으로 활용하는 방향이 많습니다. 기계학습과 자연어 처리를 결합함 모형입니다. 그런데 개인투자자가 미국시장에서 큰 세력으로 등장하면서 새로운 네트워크가 주목을 받고 이 네트워크에 오르락 내리락 하는 텍스트를 이용한 감성지수가 유행인 듯 합니다. Reddit을 무엇이라고 정의할지 애매하지만 옛날 다음의 아고라일 수도 있고 다양한 사이트의 게시판의 모음일 수도 있습니다. 다양한 주제가 워낙 방대합니다만 대중적으로 열려진 계기는 Wallstreetbets때문입니다.
Sentiment Index에 대한 관심이 다시금 높아지는 것은 몇가지 뉴스를 통해 확인할 수 있습니다. 먼저 작년 뉴스입니다. Reddit Becomes Must-Read for Wall Street Stock-Investing Crowd입니다.
Benn Eifert, chief investment officer of hedge fund QVR Advisors, points to the r/wallstreetbets thread on Reddit, which boasts 1.5 million users — “degenerates,” using the site’s own nomenclature.
“There are influencers within that community that will say, ‘Alright, today we’re buying the Tesla $2500 calls for next Friday,’ and the volumes that will print are huge,” Eifert said in an interview on Bloomberg’s Odd Lots podcast. “And you better believe that the most sophisticated options players in the world — the Susquehannas and Citadel Securities — are extremely focused on this flow and predicting it in real-time.”
There are plenty of examples of websites and platforms that attempt to scan r/wallstreetbets to create alert systems. A page on Medium.com — an online publishing platform — is dedicated to “Momentum Trading off Sentiment from r/wallstreetbets.” A blog post on a website called algotrading101.com reads “Web Scraping Tutorial – Reddit Data for Finance.”
또다른 뉴스는 Gamestop으로 커다란 수익을 올린 Dave Portnoy가 Sentiment Index를 활용한 ETF( VanEck Vectors Social Sentiment ETF)를 만들었다는 소식입니다.
Day-trading Reddit-readers nearly crashed the stock market. Now they’re in an ETF.
BUZZ는 AI기술을 기반으로 한 자체적인 Sentiment Index인 BUZZ NextGen AI US Sentiment Leaders Index를 기반으로 투자를 한다고 합니다. 비슷한 뉴스로 Quant trader turns to reddit for sentiment forecaster라는 이야기도 있습니다.
2.
그러면 Reddit의 뉴스나 댓글을 이용한 Reddit Sentiment 시스템을 만들려면 어떻게 시작할까요? 가장 먼저 Reddit에서 데이타를 가져와야 합니다. 한국처럼 API를 지원하지 않으면 Screen Scrapping 기술을 도입하여야 하지만 트위터처럼 API를 제공하면 API를 이용합니다. Reddit도 API를 제공합니다.
Web Scraping Tutorial – Reddit Data for Finance
Sentiment Analysis for Trading with Reddit Text Data은 Pushshift의 API와 VADER Model를 이용하여 분석모형을 만들고 있습니다. Pushshift는 API를 Reddit의 데이타를 수집할 수 있습니다.
이렇게 수집한 데이타는 VADER (Valence Aware Dictionary for sEntiment Reasoning) sentiment analyzer 를 이용하여 분석합니다. 아래는 모형입니다.
그리고 VADER-Sentiment-Analysis에서 가져온 Python 코드 전체입니다.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 624 625 626 627 628 629 630 631 632 633 634 635 636 637 638 639 640 641 642 643 644 645 646 647 648 649 650 651 652 653 654 655 656 657 658 659 660 661 662 663 664 665 666 667 668 669 670 671 672 673 674 675 676 677 678 679 680 681 682 683 684 685 686 687 688 |
# coding: utf-8 # Author: C.J. Hutto # Thanks to George Berry for reducing the time complexity from something like O(N^4) to O(N). # Thanks to Ewan Klein and Pierpaolo Pantone for bringing VADER into NLTK. Those modifications were awesome. # For license information, see LICENSE.TXT """ If you use the VADER sentiment analysis tools, please cite: Hutto, C.J. & Gilbert, E.E. (2014). VADER: A Parsimonious Rule-based Model for Sentiment Analysis of Social Media Text. Eighth International Conference on Weblogs and Social Media (ICWSM-14). Ann Arbor, MI, June 2014. """ import os import re import math import string import codecs import json from itertools import product from inspect import getsourcefile from io import open # ##Constants## # (empirically derived mean sentiment intensity rating increase for booster words) B_INCR = 0.293 B_DECR = -0.293 # (empirically derived mean sentiment intensity rating increase for using ALLCAPs to emphasize a word) C_INCR = 0.733 N_SCALAR = -0.74 NEGATE = \ ["aint", "arent", "cannot", "cant", "couldnt", "darent", "didnt", "doesnt", "ain't", "aren't", "can't", "couldn't", "daren't", "didn't", "doesn't", "dont", "hadnt", "hasnt", "havent", "isnt", "mightnt", "mustnt", "neither", "don't", "hadn't", "hasn't", "haven't", "isn't", "mightn't", "mustn't", "neednt", "needn't", "never", "none", "nope", "nor", "not", "nothing", "nowhere", "oughtnt", "shant", "shouldnt", "uhuh", "wasnt", "werent", "oughtn't", "shan't", "shouldn't", "uh-uh", "wasn't", "weren't", "without", "wont", "wouldnt", "won't", "wouldn't", "rarely", "seldom", "despite"] # booster/dampener 'intensifiers' or 'degree adverbs' # http://en.wiktionary.org/wiki/Category:English_degree_adverbs BOOSTER_DICT = \ {"absolutely": B_INCR, "amazingly": B_INCR, "awfully": B_INCR, "completely": B_INCR, "considerable": B_INCR, "considerably": B_INCR, "decidedly": B_INCR, "deeply": B_INCR, "effing": B_INCR, "enormous": B_INCR, "enormously": B_INCR, "entirely": B_INCR, "especially": B_INCR, "exceptional": B_INCR, "exceptionally": B_INCR, "extreme": B_INCR, "extremely": B_INCR, "fabulously": B_INCR, "flipping": B_INCR, "flippin": B_INCR, "frackin": B_INCR, "fracking": B_INCR, "fricking": B_INCR, "frickin": B_INCR, "frigging": B_INCR, "friggin": B_INCR, "fully": B_INCR, "fuckin": B_INCR, "fucking": B_INCR, "fuggin": B_INCR, "fugging": B_INCR, "greatly": B_INCR, "hella": B_INCR, "highly": B_INCR, "hugely": B_INCR, "incredible": B_INCR, "incredibly": B_INCR, "intensely": B_INCR, "major": B_INCR, "majorly": B_INCR, "more": B_INCR, "most": B_INCR, "particularly": B_INCR, "purely": B_INCR, "quite": B_INCR, "really": B_INCR, "remarkably": B_INCR, "so": B_INCR, "substantially": B_INCR, "thoroughly": B_INCR, "total": B_INCR, "totally": B_INCR, "tremendous": B_INCR, "tremendously": B_INCR, "uber": B_INCR, "unbelievably": B_INCR, "unusually": B_INCR, "utter": B_INCR, "utterly": B_INCR, "very": B_INCR, "almost": B_DECR, "barely": B_DECR, "hardly": B_DECR, "just enough": B_DECR, "kind of": B_DECR, "kinda": B_DECR, "kindof": B_DECR, "kind-of": B_DECR, "less": B_DECR, "little": B_DECR, "marginal": B_DECR, "marginally": B_DECR, "occasional": B_DECR, "occasionally": B_DECR, "partly": B_DECR, "scarce": B_DECR, "scarcely": B_DECR, "slight": B_DECR, "slightly": B_DECR, "somewhat": B_DECR, "sort of": B_DECR, "sorta": B_DECR, "sortof": B_DECR, "sort-of": B_DECR} # check for sentiment laden idioms that do not contain lexicon words (future work, not yet implemented) SENTIMENT_LADEN_IDIOMS = {"cut the mustard": 2, "hand to mouth": -2, "back handed": -2, "blow smoke": -2, "blowing smoke": -2, "upper hand": 1, "break a leg": 2, "cooking with gas": 2, "in the black": 2, "in the red": -2, "on the ball": 2, "under the weather": -2} # check for special case idioms and phrases containing lexicon words SPECIAL_CASES = {"the shit": 3, "the bomb": 3, "bad ass": 1.5, "badass": 1.5, "bus stop": 0.0, "yeah right": -2, "kiss of death": -1.5, "to die for": 3, "beating heart": 3.1, "broken heart": -2.9 } # #Static methods# # def negated(input_words, include_nt=True): """ Determine if input contains negation words """ input_words = [str(w).lower() for w in input_words] neg_words = [] neg_words.extend(NEGATE) for word in neg_words: if word in input_words: return True if include_nt: for word in input_words: if "n't" in word: return True '''if "least" in input_words: i = input_words.index("least") if i > 0 and input_words[i - 1] != "at": return True''' return False def normalize(score, alpha=15): """ Normalize the score to be between -1 and 1 using an alpha that approximates the max expected value """ norm_score = score / math.sqrt((score * score) + alpha) if norm_score < -1.0: return -1.0 elif norm_score > 1.0: return 1.0 else: return norm_score def allcap_differential(words): """ Check whether just some words in the input are ALL CAPS :param list words: The words to inspect :returns: `True` if some but not all items in `words` are ALL CAPS """ is_different = False allcap_words = 0 for word in words: if word.isupper(): allcap_words += 1 cap_differential = len(words) - allcap_words if 0 < cap_differential < len(words): is_different = True return is_different def scalar_inc_dec(word, valence, is_cap_diff): """ Check if the preceding words increase, decrease, or negate/nullify the valence """ scalar = 0.0 word_lower = word.lower() if word_lower in BOOSTER_DICT: scalar = BOOSTER_DICT[word_lower] if valence < 0: scalar *= -1 # check if booster/dampener word is in ALLCAPS (while others aren't) if word.isupper() and is_cap_diff: if valence > 0: scalar += C_INCR else: scalar -= C_INCR return scalar class SentiText(object): """ Identify sentiment-relevant string-level properties of input text. """ def __init__(self, text): if not isinstance(text, str): text = str(text).encode('utf-8') self.text = text self.words_and_emoticons = self._words_and_emoticons() # doesn't separate words from\ # adjacent punctuation (keeps emoticons & contractions) self.is_cap_diff = allcap_differential(self.words_and_emoticons) @staticmethod def _strip_punc_if_word(token): """ Removes all trailing and leading punctuation If the resulting string has two or fewer characters, then it was likely an emoticon, so return original string (ie ":)" stripped would be "", so just return ":)" """ stripped = token.strip(string.punctuation) if len(stripped) <= 2: return token return stripped def _words_and_emoticons(self): """ Removes leading and trailing puncutation Leaves contractions and most emoticons Does not preserve punc-plus-letter emoticons (e.g. :D) """ wes = self.text.split() stripped = list(map(self._strip_punc_if_word, wes)) return stripped class SentimentIntensityAnalyzer(object): """ Give a sentiment intensity score to sentences. """ def __init__(self, lexicon_file="vader_lexicon.txt", emoji_lexicon="emoji_utf8_lexicon.txt"): _this_module_file_path_ = os.path.abspath(getsourcefile(lambda: 0)) lexicon_full_filepath = os.path.join(os.path.dirname(_this_module_file_path_), lexicon_file) with codecs.open(lexicon_full_filepath, encoding='utf-8') as f: self.lexicon_full_filepath = f.read() self.lexicon = self.make_lex_dict() emoji_full_filepath = os.path.join(os.path.dirname(_this_module_file_path_), emoji_lexicon) with codecs.open(emoji_full_filepath, encoding='utf-8') as f: self.emoji_full_filepath = f.read() self.emojis = self.make_emoji_dict() def make_lex_dict(self): """ Convert lexicon file to a dictionary """ lex_dict = {} for line in self.lexicon_full_filepath.rstrip('\n').split('\n'): if not line: continue (word, measure) = line.strip().split('\t')[0:2] lex_dict[word] = float(measure) return lex_dict def make_emoji_dict(self): """ Convert emoji lexicon file to a dictionary """ emoji_dict = {} for line in self.emoji_full_filepath.rstrip('\n').split('\n'): (emoji, description) = line.strip().split('\t')[0:2] emoji_dict[emoji] = description return emoji_dict def polarity_scores(self, text): """ Return a float for sentiment strength based on the input text. Positive values are positive valence, negative value are negative valence. """ # convert emojis to their textual descriptions text_no_emoji = "" prev_space = True for chr in text: if chr in self.emojis: # get the textual description description = self.emojis[chr] if not prev_space: text_no_emoji += ' ' text_no_emoji += description prev_space = False else: text_no_emoji += chr prev_space = chr == ' ' text = text_no_emoji.strip() sentitext = SentiText(text) sentiments = [] words_and_emoticons = sentitext.words_and_emoticons for i, item in enumerate(words_and_emoticons): valence = 0 # check for vader_lexicon words that may be used as modifiers or negations if item.lower() in BOOSTER_DICT: sentiments.append(valence) continue if (i < len(words_and_emoticons) - 1 and item.lower() == "kind" and words_and_emoticons[i + 1].lower() == "of"): sentiments.append(valence) continue sentiments = self.sentiment_valence(valence, sentitext, item, i, sentiments) sentiments = self._but_check(words_and_emoticons, sentiments) valence_dict = self.score_valence(sentiments, text) return valence_dict def sentiment_valence(self, valence, sentitext, item, i, sentiments): is_cap_diff = sentitext.is_cap_diff words_and_emoticons = sentitext.words_and_emoticons item_lowercase = item.lower() if item_lowercase in self.lexicon: # get the sentiment valence valence = self.lexicon[item_lowercase] # check for "no" as negation for an adjacent lexicon item vs "no" as its own stand-alone lexicon item if item_lowercase == "no" and i != len(words_and_emoticons)-1 and words_and_emoticons[i + 1].lower() in self.lexicon: # don't use valence of "no" as a lexicon item. Instead set it's valence to 0.0 and negate the next item valence = 0.0 if (i > 0 and words_and_emoticons[i - 1].lower() == "no") \ or (i > 1 and words_and_emoticons[i - 2].lower() == "no") \ or (i > 2 and words_and_emoticons[i - 3].lower() == "no" and words_and_emoticons[i - 1].lower() in ["or", "nor"] ): valence = self.lexicon[item_lowercase] * N_SCALAR # check if sentiment laden word is in ALL CAPS (while others aren't) if item.isupper() and is_cap_diff: if valence > 0: valence += C_INCR else: valence -= C_INCR for start_i in range(0, 3): # dampen the scalar modifier of preceding words and emoticons # (excluding the ones that immediately preceed the item) based # on their distance from the current item. if i > start_i and words_and_emoticons[i - (start_i + 1)].lower() not in self.lexicon: s = scalar_inc_dec(words_and_emoticons[i - (start_i + 1)], valence, is_cap_diff) if start_i == 1 and s != 0: s = s * 0.95 if start_i == 2 and s != 0: s = s * 0.9 valence = valence + s valence = self._negation_check(valence, words_and_emoticons, start_i, i) if start_i == 2: valence = self._special_idioms_check(valence, words_and_emoticons, i) valence = self._least_check(valence, words_and_emoticons, i) sentiments.append(valence) return sentiments def _least_check(self, valence, words_and_emoticons, i): # check for negation case using "least" if i > 1 and words_and_emoticons[i - 1].lower() not in self.lexicon \ and words_and_emoticons[i - 1].lower() == "least": if words_and_emoticons[i - 2].lower() != "at" and words_and_emoticons[i - 2].lower() != "very": valence = valence * N_SCALAR elif i > 0 and words_and_emoticons[i - 1].lower() not in self.lexicon \ and words_and_emoticons[i - 1].lower() == "least": valence = valence * N_SCALAR return valence @staticmethod def _but_check(words_and_emoticons, sentiments): # check for modification in sentiment due to contrastive conjunction 'but' words_and_emoticons_lower = [str(w).lower() for w in words_and_emoticons] if 'but' in words_and_emoticons_lower: bi = words_and_emoticons_lower.index('but') for sentiment in sentiments: si = sentiments.index(sentiment) if si < bi: sentiments.pop(si) sentiments.insert(si, sentiment * 0.5) elif si > bi: sentiments.pop(si) sentiments.insert(si, sentiment * 1.5) return sentiments @staticmethod def _special_idioms_check(valence, words_and_emoticons, i): words_and_emoticons_lower = [str(w).lower() for w in words_and_emoticons] onezero = "{0} {1}".format(words_and_emoticons_lower[i - 1], words_and_emoticons_lower[i]) twoonezero = "{0} {1} {2}".format(words_and_emoticons_lower[i - 2], words_and_emoticons_lower[i - 1], words_and_emoticons_lower[i]) twoone = "{0} {1}".format(words_and_emoticons_lower[i - 2], words_and_emoticons_lower[i - 1]) threetwoone = "{0} {1} {2}".format(words_and_emoticons_lower[i - 3], words_and_emoticons_lower[i - 2], words_and_emoticons_lower[i - 1]) threetwo = "{0} {1}".format(words_and_emoticons_lower[i - 3], words_and_emoticons_lower[i - 2]) sequences = [onezero, twoonezero, twoone, threetwoone, threetwo] for seq in sequences: if seq in SPECIAL_CASES: valence = SPECIAL_CASES[seq] break if len(words_and_emoticons_lower) - 1 > i: zeroone = "{0} {1}".format(words_and_emoticons_lower[i], words_and_emoticons_lower[i + 1]) if zeroone in SPECIAL_CASES: valence = SPECIAL_CASES[zeroone] if len(words_and_emoticons_lower) - 1 > i + 1: zeroonetwo = "{0} {1} {2}".format(words_and_emoticons_lower[i], words_and_emoticons_lower[i + 1], words_and_emoticons_lower[i + 2]) if zeroonetwo in SPECIAL_CASES: valence = SPECIAL_CASES[zeroonetwo] # check for booster/dampener bi-grams such as 'sort of' or 'kind of' n_grams = [threetwoone, threetwo, twoone] for n_gram in n_grams: if n_gram in BOOSTER_DICT: valence = valence + BOOSTER_DICT[n_gram] return valence @staticmethod def _sentiment_laden_idioms_check(valence, senti_text_lower): # Future Work # check for sentiment laden idioms that don't contain a lexicon word idioms_valences = [] for idiom in SENTIMENT_LADEN_IDIOMS: if idiom in senti_text_lower: print(idiom, senti_text_lower) valence = SENTIMENT_LADEN_IDIOMS[idiom] idioms_valences.append(valence) if len(idioms_valences) > 0: valence = sum(idioms_valences) / float(len(idioms_valences)) return valence @staticmethod def _negation_check(valence, words_and_emoticons, start_i, i): words_and_emoticons_lower = [str(w).lower() for w in words_and_emoticons] if start_i == 0: if negated([words_and_emoticons_lower[i - (start_i + 1)]]): # 1 word preceding lexicon word (w/o stopwords) valence = valence * N_SCALAR if start_i == 1: if words_and_emoticons_lower[i - 2] == "never" and \ (words_and_emoticons_lower[i - 1] == "so" or words_and_emoticons_lower[i - 1] == "this"): valence = valence * 1.25 elif words_and_emoticons_lower[i - 2] == "without" and \ words_and_emoticons_lower[i - 1] == "doubt": valence = valence elif negated([words_and_emoticons_lower[i - (start_i + 1)]]): # 2 words preceding the lexicon word position valence = valence * N_SCALAR if start_i == 2: if words_and_emoticons_lower[i - 3] == "never" and \ (words_and_emoticons_lower[i - 2] == "so" or words_and_emoticons_lower[i - 2] == "this") or \ (words_and_emoticons_lower[i - 1] == "so" or words_and_emoticons_lower[i - 1] == "this"): valence = valence * 1.25 elif words_and_emoticons_lower[i - 3] == "without" and \ (words_and_emoticons_lower[i - 2] == "doubt" or words_and_emoticons_lower[i - 1] == "doubt"): valence = valence elif negated([words_and_emoticons_lower[i - (start_i + 1)]]): # 3 words preceding the lexicon word position valence = valence * N_SCALAR return valence def _punctuation_emphasis(self, text): # add emphasis from exclamation points and question marks ep_amplifier = self._amplify_ep(text) qm_amplifier = self._amplify_qm(text) punct_emph_amplifier = ep_amplifier + qm_amplifier return punct_emph_amplifier @staticmethod def _amplify_ep(text): # check for added emphasis resulting from exclamation points (up to 4 of them) ep_count = text.count("!") if ep_count > 4: ep_count = 4 # (empirically derived mean sentiment intensity rating increase for # exclamation points) ep_amplifier = ep_count * 0.292 return ep_amplifier @staticmethod def _amplify_qm(text): # check for added emphasis resulting from question marks (2 or 3+) qm_count = text.count("?") qm_amplifier = 0 if qm_count > 1: if qm_count <= 3: # (empirically derived mean sentiment intensity rating increase for # question marks) qm_amplifier = qm_count * 0.18 else: qm_amplifier = 0.96 return qm_amplifier @staticmethod def _sift_sentiment_scores(sentiments): # want separate positive versus negative sentiment scores pos_sum = 0.0 neg_sum = 0.0 neu_count = 0 for sentiment_score in sentiments: if sentiment_score > 0: pos_sum += (float(sentiment_score) + 1) # compensates for neutral words that are counted as 1 if sentiment_score < 0: neg_sum += (float(sentiment_score) - 1) # when used with math.fabs(), compensates for neutrals if sentiment_score == 0: neu_count += 1 return pos_sum, neg_sum, neu_count def score_valence(self, sentiments, text): if sentiments: sum_s = float(sum(sentiments)) # compute and add emphasis from punctuation in text punct_emph_amplifier = self._punctuation_emphasis(text) if sum_s > 0: sum_s += punct_emph_amplifier elif sum_s < 0: sum_s -= punct_emph_amplifier compound = normalize(sum_s) # discriminate between positive, negative and neutral sentiment scores pos_sum, neg_sum, neu_count = self._sift_sentiment_scores(sentiments) if pos_sum > math.fabs(neg_sum): pos_sum += punct_emph_amplifier elif pos_sum < math.fabs(neg_sum): neg_sum -= punct_emph_amplifier total = pos_sum + math.fabs(neg_sum) + neu_count pos = math.fabs(pos_sum / total) neg = math.fabs(neg_sum / total) neu = math.fabs(neu_count / total) else: compound = 0.0 pos = 0.0 neg = 0.0 neu = 0.0 sentiment_dict = \ {"neg": round(neg, 3), "neu": round(neu, 3), "pos": round(pos, 3), "compound": round(compound, 4)} return sentiment_dict if __name__ == '__main__': # --- examples ------- sentences = ["VADER is smart, handsome, and funny.", # positive sentence example "VADER is smart, handsome, and funny!", # punctuation emphasis handled correctly (sentiment intensity adjusted) "VADER is very smart, handsome, and funny.", # booster words handled correctly (sentiment intensity adjusted) "VADER is VERY SMART, handsome, and FUNNY.", # emphasis for ALLCAPS handled "VADER is VERY SMART, handsome, and FUNNY!!!", # combination of signals - VADER appropriately adjusts intensity "VADER is VERY SMART, uber handsome, and FRIGGIN FUNNY!!!", # booster words & punctuation make this close to ceiling for score "VADER is not smart, handsome, nor funny.", # negation sentence example "The book was good.", # positive sentence "At least it isn't a horrible book.", # negated negative sentence with contraction "The book was only kind of good.", # qualified positive sentence is handled correctly (intensity adjusted) "The plot was good, but the characters are uncompelling and the dialog is not great.", # mixed negation sentence "Today SUX!", # negative slang with capitalization emphasis "Today only kinda sux! But I'll get by, lol", # mixed sentiment example with slang and constrastive conjunction "but" "Make sure you :) or :D today!", # emoticons handled "Catch utf-8 emoji such as 💘 and 💋 and 😁", # emojis handled "Not bad at all" # Capitalized negation ] analyzer = SentimentIntensityAnalyzer() print("----------------------------------------------------") print(" - Analyze typical example cases, including handling of:") print(" -- negations") print(" -- punctuation emphasis & punctuation flooding") print(" -- word-shape as emphasis (capitalization difference)") print(" -- degree modifiers (intensifiers such as 'very' and dampeners such as 'kind of')") print(" -- slang words as modifiers such as 'uber' or 'friggin' or 'kinda'") print(" -- contrastive conjunction 'but' indicating a shift in sentiment; sentiment of later text is dominant") print(" -- use of contractions as negations") print(" -- sentiment laden emoticons such as :) and :D") print(" -- utf-8 encoded emojis such as 💘 and 💋 and 😁") print(" -- sentiment laden slang words (e.g., 'sux')") print(" -- sentiment laden initialisms and acronyms (for example: 'lol') \n") for sentence in sentences: vs = analyzer.polarity_scores(sentence) print("{:-<65} {}".format(sentence, str(vs))) print("----------------------------------------------------") print(" - About the scoring: ") print(""" -- The 'compound' score is computed by summing the valence scores of each word in the lexicon, adjusted according to the rules, and then normalized to be between -1 (most extreme negative) and +1 (most extreme positive). This is the most useful metric if you want a single unidimensional measure of sentiment for a given sentence. Calling it a 'normalized, weighted composite score' is accurate.""") print(""" -- The 'pos', 'neu', and 'neg' scores are ratios for proportions of text that fall in each category (so these should all add up to be 1... or close to it with float operation). These are the most useful metrics if you want multidimensional measures of sentiment for a given sentence.""") print("----------------------------------------------------") # input("\nPress Enter to continue the demo...\n") # for DEMO purposes... tricky_sentences = ["Sentiment analysis has never been good.", "Sentiment analysis has never been this good!", "Most automated sentiment analysis tools are shit.", "With VADER, sentiment analysis is the shit!", "Other sentiment analysis tools can be quite bad.", "On the other hand, VADER is quite bad ass", "VADER is such a badass!", # slang with punctuation emphasis "Without a doubt, excellent idea.", "Roger Dodger is one of the most compelling variations on this theme.", "Roger Dodger is at least compelling as a variation on the theme.", "Roger Dodger is one of the least compelling variations on this theme.", "Not such a badass after all.", # Capitalized negation with slang "Without a doubt, an excellent idea." # "without {any} doubt" as negation ] print("----------------------------------------------------") print(" - Analyze examples of tricky sentences that cause trouble to other sentiment analysis tools.") print(" -- special case idioms - e.g., 'never good' vs 'never this good', or 'bad' vs 'bad ass'.") print(" -- special uses of 'least' as negation versus comparison \n") for sentence in tricky_sentences: vs = analyzer.polarity_scores(sentence) print("{:-<69} {}".format(sentence, str(vs))) print("----------------------------------------------------") # input("\nPress Enter to continue the demo...\n") # for DEMO purposes... print("----------------------------------------------------") print( " - VADER works best when analysis is done at the sentence level (but it can work on single words or entire novels).") paragraph = "It was one of the worst movies I've seen, despite good reviews. Unbelievably bad acting!! Poor direction. VERY poor production. The movie was bad. Very bad movie. VERY BAD movie!" print(" -- For example, given the following paragraph text from a hypothetical movie review:\n\t'{}'".format( paragraph)) print( " -- You could use NLTK to break the paragraph into sentence tokens for VADER, then average the results for the paragraph like this: \n") # simple example to tokenize paragraph into sentences for VADER from nltk import tokenize sentence_list = tokenize.sent_tokenize(paragraph) paragraphSentiments = 0.0 for sentence in sentence_list: vs = analyzer.polarity_scores(sentence) print("{:-<69} {}".format(sentence, str(vs["compound"]))) paragraphSentiments += vs["compound"] print("AVERAGE SENTIMENT FOR PARAGRAPH: \t" + str(round(paragraphSentiments / len(sentence_list), 4))) print("----------------------------------------------------") # input("\nPress Enter to continue the demo...\n") # for DEMO purposes... print("----------------------------------------------------") print(" - Analyze sentiment of IMAGES/VIDEO data based on annotation 'tags' or image labels. \n") conceptList = ["balloons", "cake", "candles", "happy birthday", "friends", "laughing", "smiling", "party"] conceptSentiments = 0.0 for concept in conceptList: vs = analyzer.polarity_scores(concept) print("{:-<15} {}".format(concept, str(vs['compound']))) conceptSentiments += vs["compound"] print("AVERAGE SENTIMENT OF TAGS/LABELS: \t" + str(round(conceptSentiments / len(conceptList), 4))) print("\t") conceptList = ["riot", "fire", "fight", "blood", "mob", "war", "police", "tear gas"] conceptSentiments = 0.0 for concept in conceptList: vs = analyzer.polarity_scores(concept) print("{:-<15} {}".format(concept, str(vs['compound']))) conceptSentiments += vs["compound"] print("AVERAGE SENTIMENT OF TAGS/LABELS: \t" + str(round(conceptSentiments / len(conceptList), 4))) print("----------------------------------------------------") # input("\nPress Enter to continue the demo...") # for DEMO purposes... do_translate = input( "\nWould you like to run VADER demo examples with NON-ENGLISH text? \n (Note: requires Internet access and uses the 'requests' library) \n Type 'y' or 'n', then press Enter: ") if do_translate.lower().lstrip().__contains__("y"): import requests print("\n----------------------------------------------------") print(" - Analyze sentiment of NON ENGLISH text...for example:") print(" -- French, German, Spanish, Italian, Russian, Japanese, Arabic, Chinese(Simplified) , Chinese(Traditional)") print(" -- many other languages supported. \n") languages = ["English", "French", "German", "Spanish", "Italian", "Russian", "Japanese", "Arabic", "Chinese(Simplified)", "Chinese(Traditional)"] language_codes = ["en", "fr", "de", "es", "it", "ru", "ja", "ar", "zh-CN", "zh-TW"] nonEnglish_sentences = ["I'm surprised to see just how amazingly helpful VADER is!", "Je suis surpris de voir comment VADER est incroyablement utile !", "Ich bin überrascht zu sehen, nur wie erstaunlich nützlich VADER!", "Me sorprende ver sólo cómo increíblemente útil VADER!", "Sono sorpreso di vedere solo come incredibilmente utile VADER è!", "Я удивлен увидеть, как раз как удивительно полезно ВЕЙДЕРА!", "私はちょうどどのように驚くほど役に立つベイダーを見て驚いています!", "أنا مندهش لرؤية فقط كيف مثير للدهشة فيدر فائدة!", "我很惊讶地看到VADER是如此有用!", "我很驚訝地看到VADER是如此有用!" ] for sentence in nonEnglish_sentences: to_lang = "en" from_lang = language_codes[nonEnglish_sentences.index(sentence)] if (from_lang == "en") or (from_lang == "en-US"): translation = sentence translator_name = "No translation needed" else: # please note usage limits for My Memory Translation Service: http://mymemory.translated.net/doc/usagelimits.php # using MY MEMORY NET http://mymemory.translated.net api_url = "http://mymemory.translated.net/api/get?q={}&langpair={}|{}".format(sentence, from_lang, to_lang) hdrs = { 'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11', 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8', 'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3', 'Accept-Encoding': 'none', 'Accept-Language': 'en-US,en;q=0.8', 'Connection': 'keep-alive'} response = requests.get(api_url, headers=hdrs) response_json = json.loads(response.text) translation = response_json["responseData"]["translatedText"] translator_name = "MemoryNet Translation Service" vs = analyzer.polarity_scores(translation) print("- {: <8}: {: <69}\t {} ({})".format(languages[nonEnglish_sentences.index(sentence)], sentence, str(vs['compound']), translator_name)) print("----------------------------------------------------") print("\n\n Demo Done!") |
이와 달리 Predicting sentiment of comments to news on Reddit은 Naive Bayes classifier을 이용하여 분석한 결과를 정리한 논문입니다. Naive Bayes Classifier From Scratch in Python을 보면 코드수준에서 방법을 정리하고 있습니다.
사실 구글로 검색해보면 Reddit을 이용한 다양한 분석프로젝트들이 있습니다. 국내의 경우 곳곳의 카페나 게시판을 대상으로 위와 같은 분석시스템을 만들 수 있지만 가장 큰 문제는 데이타 수집으로 보입니다. API를 통하여 데이타를 수집할 수 있는 곳이 소수입니다. 웹 스크린 스캘핑으로 수집하려면 비용이 높습니다. 그럼에도 장벽이 있으므로 구축해볼만 하지 않을까 합니다. 참고로 Reddit’s Self-Organised Bull Runs이라는 논문이 있습니다.
This paper finds that users who comment on one discussion involving a particular asset are approximately four times more likely to start a new discussion about this asset in the future, with the probability increasing with each additional discussion the user engages in. This is a strong indication that investment strategies are reproduced through social interaction. This is further validated by findings that sentiments expressed in the linked submissions are strongly correlated in a set of spatial regression models. In particular, bearish sentiments seem to spread more than their bullish counterparts.