
Predicting Outcome from Tweets for Kickstarter Projects: Sentiment Analysis
Pratik Prabhakar – 117220752
Department of Computer Science
College of Science, Engineering and Food Science
Major: MSc. Data Science and Analytics
Module: CS6500 Dissertation in Data Analytics
Academic Supervisor:
Mr. Adrian O’Riordan
31st August 2018
Social networking platforms have become a current and trusted medium for sharing reviews or feedback about a new product or an upcoming project online. They give people a wide array of opportunities to put their opinions in the public domain. Social networking platforms, blogging websites, review websites, and similar services are used to share feeds, reviews, and content about a variety of products and projects that are currently live in the market [1]. Twitter, one of the most popular micro-blogging websites, has emerged as a widely used platform for sharing feeds or reviews about new products and projects. In this project, we use Twitter data to predict the outcome or performance of real-world projects, that is, whether a product will be a good buy for consumers.
In this project, we apply sentiment analysis to Twitter data about Kickstarter [2] projects and use text analytics algorithms to mine statistics about each project. Pre-processing of the tweets is the first step in the text analysis. We then use Term Frequency and Inverse Document Frequency (TF-IDF) techniques to weight each word and examine its impact. Finally, we employ five machine learning algorithms to predict the sentiment of users and compare their accuracy: Random Forest, Naïve Bayes, Support Vector Machine, Decision Trees, and KNN.

I, the undersigned, solemnly declare that the project report PREDICTING OUTCOME FROM TWEETS FOR KICKSTARTER PROJECTS: SENTIMENT ANALYSIS is based on my own work carried out during the course of my study under the supervision of Mr. Adrian O’Riordan.
I assert that the statements made and conclusions drawn are an outcome of my research work. I further certify that:
The work contained in the report is original and has been done by me under the general supervision of my supervisor.
The work has not been submitted to any other institution, or within this university, for any other degree, diploma, or certificate.


I have followed the guidelines provided by the university in writing the report.
Whenever I have used materials (data, theoretical analysis, and text) from other sources, I have given due credit to them in the text of the report and given their details in the references.
Table of Contents
1 INTRODUCTION
1.1 Motivation
1.2 Problem Statement
1.3 Approach
1.4 Structure of Report
2 BACKGROUND
2.1 Kickstarter
2.2 Technical Concepts
2.2.1 Tweet
2.2.2 Sentiment Analysis
2.3 Technologies Used
2.3.1 Python
2.3.2 R Programming Language
2.4 Literature Review
3 DESIGN AND METHODOLOGIES
3.1 Overview
3.2 Tools and Techniques used
3.2.1 Twitter API (TweePy)
3.2.2 R Libraries Used
3.3 Research Design
3.3.1 Data Collection
3.3.2 Data Exploration
3.3.3 Pre Processing
3.3.4 Sentiment Analysis
4 Implementation
4.1 Data Set Analysis
4.2 Pre Processing Implementation
4.3 Sentiment Analysis Implementation
4.4 Part of Speech Tagging
5 Result
5.1 Sentiment Analysis
5.2 Lexicon Analysis
5.3 Machine Learning Analysis
6 Conclusion & Future Scope
6.1 Conclusion
6.2 Future Scope
7 References

List of Figures
Figure 1: Tweets from a Kickstarter project containing text with username, multimedia (images), URL (shortened URL), likes (Heart sign), retweet (loop sign), comments (bubble sign) and private message (letter sign).
Figure 2: Google Trends for Sentiment Analysis for a year (Aug 2018 – Jul 2018)
Figure 3: Google Trends for related topics and queries on Sentiment Analysis
Figure 4: Various categories of Sentiment Analysis
Figure 5: A tweet describing the various elements in it and also from the text, it can be seen that the tweet have a ‘Positive’ Sentiment.
Figure 6: A flowchart of a typical text analysis for sentiment analysis.
Figure 7: Distribution of Number of Tweets across various categories of Kickstarter
Figure 8: Preprocessing Flowchart
Figure 9: A tree map showing the number of words associated with different categories
Figure 10: Word cloud for Art Category
Figure 11: Word cloud for Comics Category
Figure 12: Word cloud for Design & Tech Category
Figure 13: Word cloud for Publishing Category
Figure 14: Bag of Words for Feature Extraction
Figure 15: Flowchart for Hybrid approach which incorporates topic modeling using lexicon and machine learning approach
Figure 16: Code Snippet for Twitter API implementation
Figure 17: Sample Data from the dataset
Figure 18: Distribution of Favorite and Retweet Counts across Kickstarter Categories
Figure 19: Distribution of Favorite Count and Retweet Count for projects under Film Category
Figure 20: Distribution of Favorite Count and Retweet Count for projects under Music Category
Figure 21: Polarity result returned from tweets
Figure 22: Plot of Sentiment Categories from Polarity Approach
Figure 23: Distribution of sentiments using BING lexicon
Figure 24: Distribution of sentiments using AFINN lexicon
Figure 25: Distribution of Sentiment Categories using NRC Lexicon
Figure 26: POS tags for tweets
Figure 27: Distribution of Sentiment across categories using BING lexicon
Figure 28: Result of Decision Tree algorithm using word of bags as predictor
Figure 29: Result of Decision Tree algorithm using SVD technique
Figure 30: Result of KNN algorithm using SVD technique
Figure 31: Result using Random Forest algorithm on SVD

List of Tables
Table 1: Proportion of tweets for different Categories of Kickstarter Projects
Table 2: AFINN Lexicons
Table 3: Bing Lexicons
Table 4: NRC Emotion Lexicons
Table 5: Structure of the Dataset
Table 6: POS tags and their description
Table 7: Distribution of sentiment across categories
Table 8: Kickstarter Statistics
Table 9: Analysis Result of Sentiment using AFINN lexicon
Table 10: Analysis Result of Sentiment using BING lexicon
Table 11: Confusion Matrix using AFINN lexicon
Table 12: Confusion Matrix using BING lexicon
Table 13: Accuracy for various classifiers and lexicons
1 INTRODUCTION
1.1 Motivation
In recent years, social media has become a powerful and popular medium for learning the viewpoints and reviews of consumers and technology enthusiasts across the world. Social media platforms focus on creating a virtual bond between their users. They help users express their thoughts or views about a product or service that they may have used. Users tend to use posts, comments, likes, shares, and messages to put their opinions out in public [3].
Twitter is an online social networking platform which enables users to post and interact with messages known as “tweets”. Tweets are restricted to 280 characters, except in Japanese, Korean and Chinese. Twitter is a major source of breaking news, with over 319 million monthly active users; according to 2012 statistics, more than 100 million users posted more than 340 million tweets a day. By 2013, Twitter had become one of the most-visited websites and had been described as “the SMS of the Internet” [4].
Tweets are a form of data that can be analyzed to predict a user’s sentiment and perspective towards the product or service described in the tweet. These sentiments can then be categorized as positive, negative, or neutral [5]. Sentiment analysis has become a very popular technique for classifying text, and it is used in various fields to investigate consumers’ attitudes towards the market. People share messages on social platforms such as Twitter and give their viewpoints about products and projects which they think might be beneficial for the market or for consumers.
1.2 Problem Statement
This research focuses on predicting whether or not a Kickstarter project will be successfully funded. The prediction is based on the analysis of publicly available Twitter data. Kickstarter is an American public-benefit corporation based in Brooklyn, New York, that maintains a global crowdfunding platform focused on creativity and merchandising. Kickstarter aims to “help bring creative projects to life” [2]. “Kickstarter has reportedly received more than $3.7 billion in pledges from 15.08 million backers which funded 407,762 projects with 3,471 live projects across various categories such as films, music, comics, publishing, video games, technology, art and food-related projects.” [6]
The data used in this research was extracted from Twitter using the Twitter Application Programming Interface (API). The methodology combines a lexicon approach with machine learning models. This process helps us analyze and predict the point of view and opinions of users towards a project, campaign, product, or service. The general sentiment is then categorized as positive, negative or neutral. The research can be used to gauge the general sentiment of the public: by the end of it, we are able to analyze and visualize public sentiment quickly. This provides insight into the opinions, thoughts, feedback, and reviews posted by Twitter users.

1.3 Approach
Predicting the outcome of Kickstarter projects from tweets starts with gathering data for the projects through the Twitter API, which requires access tokens. We use Python to extract information such as Text, Favourite Count, Retweet Count, Category, and Project Name, because it gives us the full text of each tweet matching the search query; each request returns at most 100 tweets. We identify the tweets for each project, grouped by category, using hashtags (#), keywords, project names, etc.
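The collection step described above can be sketched without touching the network. The following is a minimal illustration, not the code used in this work: the helper names `build_query`, `tweet_to_row` and `batches` are hypothetical, and the payload keys `full_text`, `favorite_count` and `retweet_count` are assumed to follow the tweet objects returned by the standard Twitter search API in extended mode.

```python
from typing import Dict, Iterable, List

MAX_TWEETS_PER_REQUEST = 100  # per-request limit mentioned in the text


def build_query(project_name: str, hashtags: Iterable[str]) -> str:
    """Join a quoted project name and its hashtags into one OR-query (hypothetical helper)."""
    terms = ['"{}"'.format(project_name)]
    terms += ["#" + tag.lstrip("#") for tag in hashtags]
    return " OR ".join(terms)


def tweet_to_row(status: Dict, category: str, project_name: str) -> Dict:
    """Keep only the fields named in the text from one tweet payload (assumed key names)."""
    return {
        "text": status.get("full_text") or status.get("text", ""),
        "favorite_count": status.get("favorite_count", 0),
        "retweet_count": status.get("retweet_count", 0),
        "category": category,
        "project_name": project_name,
    }


def batches(total: int, size: int = MAX_TWEETS_PER_REQUEST) -> List[int]:
    """Split a desired number of tweets into per-request counts of at most `size`."""
    full, rest = divmod(total, size)
    return [size] * full + ([rest] if rest else [])
```

In practice each count returned by `batches` would correspond to one API request, and every returned tweet would be passed through `tweet_to_row` before being written to the dataset.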
The major contributions of this work are as follows:
We design a pre-processing technique for cleaning the tweets using various functions.
We use the polarity feature to determine the polarity of the tweets.
We design the bag of words which will be used to build the word cloud for various categories across Kickstarter.
We design the TF-IDF functionality for determining the most informative terms across the tweets, which might be useful in determining the sentiment of the user.
We apply various machine learning models to extract aspects and the associated sentiments.
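To make the TF-IDF contribution concrete, the sketch below computes one common variant of the weighting, term frequency multiplied by the logarithm of inverse document frequency, over tokenized tweets. It is only an illustration under that assumption; the function name `tf_idf` is hypothetical, and library implementations often apply additional smoothing or normalization.

```python
import math
from collections import Counter
from typing import Dict, List


def tf_idf(docs: List[List[str]]) -> List[Dict[str, float]]:
    """Return one {term: weight} mapping per tokenized document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs at least once.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights
```

A term that occurs in every tweet receives a weight of zero, while a term concentrated in a few tweets is weighted more heavily, which is why TF-IDF highlights distinctive rather than merely frequent words.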

1.4 Structure of Report
The structure of this report is as follows:
Section 2 describes the background and the related work on sentiment analysis of Twitter data.
In Section 3, we present a detailed design of the workflow followed in the analysis of the dataset and the prediction of the outcome.
Implementation of the algorithm and lexicons is discussed in Section 4.
Experimental results are discussed in Section 5.
Section 6 concludes this report and discusses future research opportunities which can be pursued.
2 BACKGROUND
This chapter introduces the technical concepts needed to carry out sentiment analysis on tweets.

2.1 Kickstarter
Kickstarter is a global community built around creativity and creative projects. Its projects connect people from various domains and are backed by over 10 million people across the globe [2]. In this research, the tweets concern Kickstarter projects listed as popular or new on the platform, and we collected data from Twitter across various categories. Kickstarter also holds data on each project, such as the amount pledged, the number of backers, the number of days left for pledging, comments from users, and updates about the funding or the project itself. Each project also offers the option to back it, in return for some benefit, before it is successfully funded.

2.2 Technical Concepts
2.2.1 Tweet
Twitter is one of the most popular micro-blogging websites in the world. It has a user base of around 200 million which generates approximately 340 million tweets per day. Twitter, as a platform, represents one of the largest and most dynamic datasets of user-generated content in the world [7] [8].
Like most other social networking websites, Twitter carries real-time content. Tweets about an event, festival, calamity, or celebration are posted during and immediately after the occurrence. This gives brands and organizations the opportunity to hear what existing and potential customers have to say about their products and services. It is a time-efficient and low-cost way of measuring the performance or success of a product, brand, campaign, or project without expensive and time-consuming surveys and tedious requests for consumer feedback [7].
Tweets are public, but a user can also send private messages. Users can tweet via the Twitter website, the Twitter app on smartphones, and “Short Message Service (SMS)”, which is available in certain countries [4]. Users can follow other users and communities, and can share tweets by retweeting them. Twitter also allows hashtags in posts, which group tweets together by the hashtags used. Similarly, a user can address a tweet to a particular user with the “@” sign followed by the username [4].
Based on the web-traffic analysis of the website Alexa [9], Twitter is ranked the 12th most visited website in the world. According to a blog entry [9], Twitter ranked as the 10th most used social network on the basis of 319 million monthly visitors in April 2017.

Twitter follows the principle of followers. When a user chooses to follow another user, that user’s tweets appear in reverse chronological order on the follower’s Twitter main page [4].
A tweet can comprise the following elements [10]:
Text: The text or message the user wants to deliver. Messages are short, limited to 280 characters, and are used to express opinions. The text can include hashtags, written as “#” followed by text. Each hashtag has a community associated with it, and hashtags are a good way of communicating and taking notes through crowdsourcing. The text can also mention a user by writing “@” followed by the username.
URL: A shortened URL lets the user save characters and often provides some analytic measurements. The user can see the full URL by hovering over the shortened link.
Image or Video: Users can attach an image or video to a tweet to support their opinion on a topic, and this is not restricted by the 280-character limit. In some cases these multimedia features help the reader understand the context of the tweet. Media may not be useful in determining the sentiment, so we do not consider them in this work.
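The textual elements listed above can be pulled out of a tweet with regular expressions. The sketch below is illustrative only: the patterns are simplified assumptions (real hashtags and usernames follow stricter platform rules), and `tweet_elements` is a hypothetical helper name.

```python
import re
from typing import Dict, List

HASHTAG_RE = re.compile(r"#(\w+)")        # "#" followed by word characters
MENTION_RE = re.compile(r"@(\w+)")        # "@" followed by a username
URL_RE = re.compile(r"https?://\S+")      # http(s) links up to the next whitespace


def tweet_elements(text: str) -> Dict[str, List[str]]:
    """Split a tweet's text into its hashtags, mentions and URLs."""
    urls = URL_RE.findall(text)
    # Remove URLs first so that "#fragment" parts of links are not read as hashtags.
    without_urls = URL_RE.sub(" ", text)
    return {
        "hashtags": HASHTAG_RE.findall(without_urls),
        "mentions": MENTION_RE.findall(without_urls),
        "urls": urls,
    }
```

For example, calling `tweet_elements` on a tweet that thanks `@backer_1` and ends with `#Kickstarter` and a link separates the three element types into their own lists.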

Figure 1: Tweets from a Kickstarter project containing text with username, multimedia (images), URL (shortened URL), likes (Heart sign), retweet (loop sign), comments (bubble sign) and private message (letter sign).
A tweet can be retweeted if the user wants to share the opinion with their followers, or it can be liked by someone who agrees with the opinion shared by the original writer.
Tweets have also been used in electoral campaigns across different countries, and in sports and movie reviews, to gather opinion about particular topics; users can also vote in polls on some topics [10].
2.2.2 Sentiment Analysis
Sentiment Analysis refers to the use of Natural Language Processing together with statistics to systematically identify, extract and study information by characterizing the sentiment elements of textual data. Sentiment Analysis aims to determine the attitude of the user with respect to the overall contextual polarity and/or emotional reaction to a document, interaction, or event [11].
According to Google Trends [12], sentiment analysis has gradually become popular globally, and below is a graph that shows how it has been trending over the years:

Figure 2: Google Trends for Sentiment Analysis for a year (Aug 2018 – Jul 2018) [12]
Some of the trending topics which were used for sentiment analysis can be seen below:

Figure 3: Google Trends for related topics and queries on Sentiment Analysis [12]
Sentiment analysis (SA) is used to classify text as positive, negative or neutral. SA is one of the emerging technologies helping people utilize the vast amount of user-generated content that is publicly available online in the form of reviews, blogs, and content from social media platforms [13].
Sentiment analysis uses data mining processes to extract and analyze the subjective opinions expressed in a collection of documents such as blog posts, reviews, and content from social media.
This method of analysis can be used by organizations for [14]:
tracking the way in which a brand is received by consumers.
measuring the popularity of a product.
learning how a product is perceived by consumers.
gauging the company’s reputation.
According to most researchers, the term “sentiment” is used in the context of analyzing text automatically and tracking the predictive judgment it contains. Many publications also describe sentiment analysis as focused on classifying reviews by their polarity (either positive or negative). Classifying text as positive or negative is important enough that many problems can be formulated as classification, regression, or ranking applied to the textual content [13].

Dr. Furu Wei describes sentiment analysis, also known as opinion mining, as a way “to understand the attitude of a speaker, or a writer with respect to some topic. The attitude may be their judgment or evaluation, their affective state or the intended emotional communication” [15].
Sentiment can be analyzed at three levels: Document Level, Sentence Level, or Expression Level, each of which can be used for different topics. A sentiment consists of three main parts: the person who expresses the sentiment (opinion holder), the audience to whom the feeling is expressed (opinion target), and the nature of the feeling (opinion content/context), positive or negative.
Sentiment analysis usually studies text and posts uploaded by users to microblogging websites, forums, and review websites, which can reveal opinions about a product, service, event, person or idea [16]. The image below shows the categories in which sentiments can be defined or considered.

Figure 4: Various categories of Sentiment Analysis [16]
Xiaodan, Svetlana, and Saif Mohammad [17] discussed that automatically detecting the sentiment of tweets has attracted wide interest from both academia and industry.

Figure 5: A tweet showing its various elements; from the text, it can be seen that the tweet has a ‘Positive’ sentiment. [8]
2.3 Technologies Used
2.3.1 Python
Python is an interpreted, object-oriented, high-level, interactive programming language for general-purpose programming. It was created by Guido van Rossum at Centrum Wiskunde & Informatica, where work on it began in 1989. Python supports multiple programming paradigms (imperative, functional, object-oriented and procedural), features a dynamic type system with automatic memory management, and has a very large standard library [9] [18].
In this research, we use Python 3, whose first release (3.0) appeared in December 2008 and is not fully backward compatible with Python 2. Python’s large standard library is one of its greatest strengths, providing support for a wide range of tasks.
Python has been used in various AI projects and as a scripting language in modular architectures. It is also widely used for Natural Language Processing.
2.3.2 R Programming Language
R is a free programming language and environment for statistical computing. It offers an extensive variety of statistical and graphical techniques for modeling and analyzing data. According to the software quality company [19], R is ranked 10th among popular programming languages. R is built largely on the C programming language and is extensible through the libraries and packages contributed by its active community [9].
We use R version 3.4 with RStudio as the graphical user interface. The R distribution is supplied with about eight packages, and many more packages covering a wide range of statistical functions are available through CRAN. R also has its own LaTeX-like documentation format, which is used for comprehensive documentation [20]. We use various text-mining packages for cleaning the tweet data, and further packages for analyzing and predicting the sentiments with different models.

2.4 Literature Review
“Twitter Sentiment Analysis (TSA) tackles the problem of analyzing the messages posted on Twitter in terms of the sentiments they express. Twitter is a novel domain for sentiment analysis (SA)” [21].
A large amount of research on sentiment analysis has been carried out, mainly on data from social media and forums. This section gives an overview of the related research that serves as a foundation for this work.
Sentiment analysis is mainly carried out using three approaches, as discussed by Emma, Xiaohui and Yong [5]: machine learning based, lexicon based, and linguistic analysis. The machine learning approach trains an algorithm, usually a classifier, on one set of features and then tests it on another set of data to check whether it classifies correctly. The lexicon approach depends on a predefined list (lexicon or corpus) of words, each with a score or polarity. The algorithm searches the data set for those words and combines their weights into an overall score for the text. The linguistic approach focuses on the syntactic characteristics of words or phrases to determine the orientation of the text, and is mainly used together with the lexicon-based method.
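The lexicon approach described above can be illustrated in a few lines. The sketch below is an assumption-laden toy: `TOY_LEXICON` is a hypothetical mini-lexicon whose words and scores are invented for illustration and are not taken from any published lexicon, and summing word scores is only one simple way of combining them.

```python
from typing import Dict, Iterable

# Hypothetical mini-lexicon: words and scores are illustrative only.
TOY_LEXICON: Dict[str, int] = {
    "love": 3, "great": 3, "good": 2,
    "bad": -2, "late": -1, "scam": -3,
}


def lexicon_score(tokens: Iterable[str], lexicon: Dict[str, int] = TOY_LEXICON) -> int:
    """Sum the scores of all tokens found in the lexicon; unknown words count as 0."""
    return sum(lexicon.get(token.lower(), 0) for token in tokens)


def label(score: int) -> str:
    """Map an overall score to one of the three sentiment classes."""
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```

With this toy lexicon, a tweet containing "love" and "great" receives a positive score, whereas a tweet containing none of the listed words is labelled neutral, which shows how strongly the result depends on lexicon coverage.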

Sentiment analysis originated in computer science rather than in linguistics. According to B. Liu, sentiment refers to the underlying positive, negative or neutral thoughts implied by an opinion [22].
Given the character limit on tweets, classifying messages to predict sentiment is similar to sentence-level sentiment analysis [23].
Prerna, Soujanya, and Erik developed a Sentiment Analysis System that integrates two classifiers, rule-based and supervised. For the supervised classifier, they treated sentiment as a multiclass classification problem after removing the emoticons, labelling tweets as positive, negative or neutral, and used a Support Vector Machine with a linear kernel and L1 regularization. For the rule-based classifier, they took emoticons into consideration when classifying the tweets into those three categories [24].
Xiaodan, Svetlana, and Saif Mohammad built supervised statistical systems for term-level and message-level sentiment of tweets. They used surface-form, semantic and sentiment features, and their system was a significant improvement over the previous one [17].
Agarwal experimented with three types of model: unigram, feature-based, and tree kernel-based machine learning models. The feature-based model used Twitter-specific features such as emoticons, hashtags, and URLs, but these affected the model only marginally. The features that combine the prior polarity of terms with POS (part-of-speech) tagging were the most important for classifying the sentiments. The tree kernel-based model represented tweets in a tree format, and the unigram model served as a baseline for predicting the sentiments of the tweets [25].
One key to analyzing sentiment is to recognize and review the text, assign it weights, and identify it as positive, negative, or neutral. There are two main methods for summarizing text: abstractive and extractive [26].
Khaled [1] suggests that sentiment analysis can be done at four different levels (sentence, document, aspect, and user), using different machine learning techniques as well as NLP, ontology, or hybrid methodologies. The work also discusses Sentiment Analysis Enhancement methods, which comprise data cleaning, dimensionality reduction, etc.

Grouping texts and classifying them into sentiments depends on polarity, which is described in three parts: the first is the direct conclusion the user expresses towards the target audience, the second is the likelihood that the user thinks similarly to the target, and the third is a conclusion drawn from the parts of speech or from communication the user makes in contempt. There is also a dictionary-based approach to classifying sentiment, which treats a dictionary as the authoritative reference for the words in the vocabulary [27].
Feature vectors built from a collection of corpora are used for modeling, and these features are vital to the success rate of the prediction. A wide range of unigram and n-gram techniques is also used to build feature vectors. The Naïve Bayes algorithm classifies sentiments based on probability and works on strings of words as a feature matrix. The Support Vector Machine separates the classes with a hyperplane, which can correspond to a non-linear decision boundary in the original feature space [28].
SA is the ability to extract independent information from natural-language text in order to create a structured, actionable system for use by decision-making algorithms. The connections (followers) between users also play a significant role in determining sentiment, but it has been argued that this might be a weak assumption, so a system was designed that takes polarity and approval relations into account and can better fulfill the principle of homophily [29].
A corpus can be built from tweets and classified by sentiment. Two basic types of emoticons are identified: Happy (“:-)”, “:)”, “=)”, “:D”, etc.) and Sad (“:-(”, “:(”, “=(”, “;(”, etc.), and these can be used as features to classify the sentiments [30].
Finn [31] developed a new word list with a sentiment strength associated with each word, including Internet slang and offensive words. He added words from the public-domain “Original Balanced Affective Word List” [32], slang and acronyms from the Urban Dictionary [33], and some word lists from the “Compass DeRose Guide to Emotion Words” [34]. These words were compared with the mean valence of ANEW (Affective Norms for English Words) [35].
Alec [36] classified sentiments into two classes, personal positive or negative feeling, using different machine learning classifiers: keyword-based, Naïve Bayes, maximum entropy and SVM. The feature vectors mainly used were the emoticons present in tweets, giving 80% accuracy with training data alone.
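The happy and sad emoticon sets listed above can serve as a simple sentiment signal. The sketch below shows one possible way to turn them into labels; it is an illustration rather than the procedure of any cited work, the function name `emoticon_label` is hypothetical, and leaving tweets with mixed or no emoticons unlabelled is an assumption made here for simplicity.

```python
from typing import Optional

HAPPY = (":-)", ":)", "=)", ":D")   # happy emoticons listed in the text
SAD = (":-(", ":(", "=(", ";(")     # sad emoticons listed in the text


def emoticon_label(text: str) -> Optional[str]:
    """Label a tweet by its emoticons; return None when the signal is absent or mixed."""
    has_happy = any(emoticon in text for emoticon in HAPPY)
    has_sad = any(emoticon in text for emoticon in SAD)
    if has_happy and not has_sad:
        return "positive"
    if has_sad and not has_happy:
        return "negative"
    return None
```

Tweets labelled this way could then be used as noisy training examples for a classifier such as Naïve Bayes or SVM, with the unlabelled tweets set aside.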
Bei [37] evaluated sentiment classification for various topics; the best-performing model was the Support Vector Machine, which gave high accuracy in classifying erotic poems. Feature selection improved the performance of both Naïve Bayes and SVM. Although relevant features were selected in different frequency ranges, the evaluation was done after reducing the features with approaches such as stemming and stop-word removal.
Sánchez-Mirabal et al. [38] designed a system that predicts sentiment on the basis of word polarity and text similarity. The features are numbers associated with each word, from which scores are calculated. Their model achieved average performance using the KNN algorithm with just 61 features in the training set.

DESIGN AND METHODOLOGIES
This chapter focusses on the design concepts and methodologies used to carry out the research. It also describes the lexicon-based approaches available for analysing sentiments.
Overview
The approach used in this research consists of three major phases: obtaining tweets from the Twitter API, cleaning the tweets, and exploring the tweet data. After cleaning, the tweets are classified into their contextual sentiments and labelled accordingly. The polarity of the text is identified using an available library. In the next step, the sentiment score is calculated to identify the three sentiment classes: positive, negative and neutral. Machine learning algorithms are applied after summarizing the text, and the sentiments are visualized.

The process followed for carrying out sentiment analysis is demonstrated below in the form of a flowchart:

Figure 6: A flowchart of a typical text analysis for sentiment analysis.
Tools and Techniques used
Twitter API (TweePy)
An application programming interface (API) is a set of functions, procedures and protocols that facilitates building software by letting components communicate through a defined interface. In this research, real-time Twitter data is analysed. This data is acquired using the Twitter API. Below is a stepwise process of how the tweets were acquired:
Create a Twitter App, for which one needs to sign up for a Twitter account CITATION Twi18 l 2057 8.

Using the Twitter Developer portal CITATION Twi181 l 2057 39, one needs to select ‘My Applications’.

Then select ‘Create a new Application’ by filling the necessary fields and then click on ‘Create your Twitter Application’.

Now, create the Access Token, in order to get the tweets on a particular topic or query.

Now, select the Application access type; by default it is set to Read.

Upon completion of the above steps, you will be directed towards the OAuth settings with Consumer Key, Consumer Secret, OAuth Access Token, and OAuth Access Token Secret keys which will be used in the API to fetch the tweets by query or topic.

TweePy is an easy-to-use Python library built for accessing the Twitter API CITATION Pab09 l 2057 40. Functions from this library are used to get the tweets for various Kickstarter projects. TweePy handles authentication using the OAuth tokens and connects to Twitter. It provides several functions to extract tweets, whether from a user’s timeline, from direct messages to or from a user, or by querying a topic. We use the ‘search’ method of TweePy, which takes a topic, hashtag, or keyword to extract the tweets using the streaming feature.
R Libraries Used
caret Package
caret (short for Classification And REgression Training) is used to streamline model training for complex regression and classification problems. This package is used to partition the data into training and test datasets. It is also helpful in the feature extraction process. Its major functionality is the train function, which is used to evaluate the model using resampling techniques. It is also important to choose an optimal model across the tuning parameters, which is then used to estimate the model performance from a training dataset. CITATION CRA18 l 2057 41 The caret package is simple and easy to use.

Syntax: library(caret)
qdap Package
qdap (short for Quantitative Discourse Analysis Package) is used for bridging the gap between numerical and qualitative data analysis. It provides parsing tools for transforming the data, including frequency counts of sentences, words, etc. CITATION CRA18 l 2057 41. Some of the key functionalities of qdap are:
Preparing transcripts of data.

Summarizing the frequency counts of words, sentence types, sentences, and syllables.

Aggregating data by grouping the variables.

Extracting words and visualizing it.

Analysing the data statistically.

CITATION Wik181 l 2057 9
quanteda Package
quanteda (short for Quantitative Analysis of Textual Data) is built for creating corpora from text data. It is a fast and flexible tool for processing, managing, and analysing textual data in R. It provides functions to tokenize text into words and to turn the tokens into a document-feature matrix.
Some of the features of this package include:
powerful and flexible tools for working with dictionaries
ability to identify keywords associated with documents or groups of documents
ability to explore texts using key-words-in-context
fast computation of a variety of readability indexes
fast computation of a variety of lexical diversity measures
quick computation of word or document similarities, for clustering or to compute distances for other purposes
comprehensive suite of descriptive statistics on text such as the number of sentences, words, characters, or syllables per document
graphical tools to analyse the data
CITATION CRA18 l 2057 41
syuzhet Package
This package is mainly used to extract sentiment and sentiment-derived plot arcs from text using a variety of lexicons: AFINN, Bing, NRC, etc. Syuzhet tries to reveal the latent semantics of a sentence by means of sentiment analysis. The lexicons it uses are publicly available for research purposes. Using this package, one can also define custom sentiment lexicons and analyse text with them to gain further insight into the data. CITATION CRA18 l 2057 41
ggplot2 Package
ggplot2 is based on the Grammar of Graphics, which provides a declarative system for creating graphics. This package is used for data visualisation and takes a data frame and a set of aesthetics as input. It is very flexible and can be used to plot at a high level of abstraction. CITATION CRA18 l 2057 41
Various other libraries are used in this research to carry out the sentiment analysis. Some of these packages are used for modeling with machine learning algorithms.

Research Design
The research is divided into five stages:
Data Collection
Data Exploration
Pre-processing
Sentiment Analysis
Modeling and predicting the accuracy
Data Collection
The data is collected for Kickstarter projects across 8 categories. To build the dataset, we required a minimum of 15 tweets per project and took at most 30 projects per category. The data goes through initial pre-processing tasks in R, such as cleaning the tweets and removing special characters, uninformative content, and URLs. This provides a comparative study of the various datasets along several dimensions, including the total number of tweets, vocabulary size, and sparsity. We also gathered statistical data for each project, such as the percentage funded, the number of backers, the amount pledged, and whether the goal was fulfilled, to compare with the percentage of positive sentiments and predict whether that particular project has done well or not.

Data Exploration
We used the data visualization tool Tableau, which produces interactive visualizations and makes it easier for novice and experienced users to develop capabilities in areas such as data analysis, data preparation, natural-language query, and prediction. CITATION Tab17 l 2057 42
Data exploration is an important part of the process because it gives an initial visual understanding of the characteristics of the dataset, rather than relying on traditional data management systems. It also covers ad-hoc analysis of the data to identify potential relationships or insights that may be concealed among the features of the dataset. This technique can be useful in formulating hypotheses that lead to new insights. CITATION Tab17 l 2057 42
Tableau is used to see the overall structure of the datasets, including the number of retweets, the favourite counts, and the number of tweets per category, and to see whether any category is interesting enough to suggest that its projects are doing well and/or that funding them is worth the money in terms of market response.

The proportion of tweets for various categories can be summarized as below:
Category                  Proportion of tweets
Art                       12.7%
Comics and Illustration   14.1%
Design and Tech           12.7%
Film                      14.1%
Food and Craft            10.9%
Games                     13.4%
Music                     9%
Publishing                12.8%
Table 1: Proportion of tweets for different Categories of Kickstarter Projects
Figure 7 shows the proportion of data across the different Kickstarter project categories, each of which contains 30 projects.

Figure 7: Distribution of Number of Tweets across various categories of Kickstarter
Pre Processing
“Data pre-processing is a data mining technique which involves transforming raw data into an understandable format.” CITATION Tec l 2057 14
Data pre-processing consists of various stages:
Data Cleaning: Cleaning of data is done using various processes such as detecting noisy data, filling missing values present in data, and determining the inconsistency within the data.

Data Integration: Data with different representations are put together and conflicts within the data are resolved.

Data Transformation: Normalization of data is done in order to aggregate it and generalize it.

Data Reduction: Reducing the data and representing it in a way that the feature is restored.

CITATION Tec l 2057 14
Figure 8: Preprocessing Flowchart
The pre-processing includes analysing the most popular category, which can be compared with the actual statistics available on the Kickstarter website. The cleaned tweets are tokenized into words, which are used to form a word cloud for each category. This gives insight into the most popular project under each category. The tokenized words are then used to find the term frequency across the categories. We also used different lexicons to get the initial sentiments of the users. The following sections describe data gathering, the lexicons used, the methodology for normalizing the text data, the classifiers used, and the future scope of research.
We will use tweets as a source of social media messages. Using the Twitter API, there are libraries and packages available in R and Python that can be used for extracting the tweets. We tokenize the tweets using the lexical normalization or stemming of the different words, to identify the related words. We chose a simple, high-precision approach based on the presence of hashtags in tweets. We manually create a list of hashtags and keywords associated with each project, using search queries on Twitter.
We have used the following methods to clean the tweets (text data) using the various libraries in R:
Contraction Replacement: Tweets contain many contractions, and we expand them so that each sentence is explicit. For example, I’m, I’ll, We’ve, and We’re are converted to I am, I will, We have, and We are respectively, because leaving contractions unexpanded (for example negations such as can’t) may invert the detected sentiment of the sentence.

Elimination of Numbers: Numbers are removed from the tweets because they do not carry any sentiment; they are replaced with a blank.

Elimination of stop-words: Stop-words are function words that occur with high frequency across all sentences. Analysing them is not useful because they carry no useful information. In our implementation, we used the standard stop-words provided by the ‘quanteda’ library, which returns various kinds of stop-words with support for different languages. In our case, we use the English stop-words.

Elimination of special characters: Special characters such as @, $, %, #, (, ), |, :, ;, !, etc. are removed from the tweets because they have no meaning in the sentences and cannot be used in feature engineering. We use the pattern matching and replacement function ‘gsub’ available in R.

Replacing the URL: In Twitter, almost every other tweet contains a URL of the related project. The URLs do not contain any useful information that can be used for analysing the sentiment, so we replace them with a blank using the ‘gsub’ function in R.

Replacing Retweet (RT) tokens: Tweets may contain the text ‘RT’, which signifies a Retweet. Its presence does not carry any useful information for predicting sentiment, so we replace it with a blank using the ‘gsub’ function in R.

Replacing user IDs from tweets: Tweets may contain @userid/@username, whose presence does not carry any useful information for analysing sentiment, so we replace those texts with an empty text.

Standardising the text: Standardisation is important to bring all the tweets in the dataset into the same format. In this case, we use the tokens_tolower function to convert the tweets to lower case and standardize them for further analysis.

Stemming: This is the process of removing the endings of words in order to detect their root form. By doing so, many words are merged and the dimensionality is reduced. It is a widely used method that generally provides good results; we used the ‘quanteda’ package available in R, which has the function “tokens_wordstem()” that applies Snowball word stemming for the English language.
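The regular-expression steps in the list above can be sketched compactly. The following Python sketch is illustrative only and is not the R implementation used in this research: the contraction table is a small hypothetical subset, the function name clean_tweet is an assumption, and stop-word removal and stemming are left out because they rely on library word lists.

```python
import re

# Small illustrative contraction table (hypothetical subset); a real run
# would need a much fuller list.
CONTRACTIONS = {
    "i'm": "i am",
    "i'll": "i will",
    "we've": "we have",
    "we're": "we are",
    "can't": "can not",
}


def clean_tweet(text: str) -> str:
    """Apply the cleaning steps listed above to a single tweet."""
    text = text.replace("\u2019", "'")            # normalise curly apostrophes
    text = text.lower()                           # standardise the case
    for short, full in CONTRACTIONS.items():      # contraction replacement
        text = re.sub(r"\b" + re.escape(short) + r"\b", full, text)
    text = re.sub(r"\brt\b", " ", text)           # retweet (RT) tokens
    text = re.sub(r"https?://\S+", " ", text)     # URLs
    text = re.sub(r"@\w+", " ", text)             # user ids
    text = re.sub(r"\d+", " ", text)              # numbers
    # Special characters; note this keeps the word after "#" and drops
    # only the symbol, which is a choice made for this sketch.
    text = re.sub(r"[^a-z\s]", " ", text)
    return " ".join(text.split())                 # collapse whitespace


if __name__ == "__main__":
    print(clean_tweet("RT @kenradio: I\u2019m sure we\u2019ve seen 3 demos https://t.co/abc #Scribit!"))
```

The order matters in this sketch: URLs and user ids are removed before the special-character step, otherwise the punctuation inside them would be stripped first and the patterns would no longer match.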

Sentiment Analysis
The sentiment is classified into three categories: Positive, Negative, and Neutral. This is carried out using four different approaches: Polarity Approach, Lexicon Approach, Machine Learning Approach, and Hybrid Approach. Out of these, the best one is used to predict the sentiment of the Kickstarter projects.

Polarity Approach
Polarity, also known as orientation, expresses the emotion present in a sentence. The polarity can be positive or negative, and a score is associated with each word in the sentence. Computing the polarity returns the word count of the sentence, its positive words, its negative words, and the polarity score of the sentence. The polarity score is used to categorise the sentiments into Positive, Negative, or Neutral, and the data is labelled on the basis of the summary statistics of the polarity scores for the training data.
Lexicon Approach
Most researchers use the lexicon-based approach to identify sentiments. After the pre-processing task, the tokenized words are matched against the lexicons, which associate a score with each word. In this approach, each tokenized word from a tweet is assigned a sentiment score, and that score is used to help classify the sentiment.

There are three types of lexicons used in this research: the AFINN, Bing, and NRC lexicons. All three are based on unigrams, that is, single words.
AFINN lexicon was built by Finn Årup Nielsen in 2009-2011, and has a list of English words which have ratings ranging from minus five (negative) to plus five (positive). These words are manually labelled in a file and are tab-separated. The new word list consists of 2477 words and phrases, and is known as AFINN-111. CITATION Åru11 l 2057 31
# A tibble: 2,476 x 2
   word       score
   <chr>      <int>
 1 abandon       -2
 2 abandoned     -2
 3 abandons      -2
 4 abducted      -2
 5 abduction     -2
 6 abductions    -2
 7 abhor         -3
 8 abhorred      -3
 9 abhorrent     -3
10 abhors        -3
# … with 2,466 more rows
Table 2: AFINN Lexicon
The Bing lexicon was created by Bing Liu. It consists of a list of 6,788 positive and negative English words, categorized in a binary fashion as either positive or negative. CITATION Liu15 l 2057 22
# A tibble: 6,788 x 2
   word        sentiment
   <chr>       <chr>
 1 2-faced     negative
 2 2-faces     negative
 3 a+          positive
 4 abnormal    negative
 5 abolish     negative
 6 abominable  negative
 7 abominably  negative
 8 abominate   negative
 9 abomination negative
10 abort       negative
# … with 6,778 more rows
Table 3: Bing Lexicon
The NRC lexicon is a list of English words associated with ten categories: eight emotions (anger, anticipation, disgust, fear, joy, sadness, surprise, trust) and two sentiments (positive and negative). CITATION Moh13 l 2057 43 The lexicon consists of a total of 14182 unigram English words. It was built by Saif Mohammad through crowdsourcing and is available in different languages. CITATION Moh11 l 2057 44
# A tibble: 13,901 x 2
   word        sentiment
   <chr>       <chr>
 1 abacus      trust
 2 abandon     fear
 3 abandon     negative
 4 abandon     sadness
 5 abandoned   anger
 6 abandoned   fear
 7 abandoned   negative
 8 abandoned   sadness
 9 abandonment anger
10 abandonment fear
# … with 13,891 more rows
Table 4: NRC Emotion Lexicon
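To make the lexicon approach concrete, the following Python sketch scores a tokenized tweet against a tiny AFINN-style word list and maps the total to a sentiment class. It is an illustration under stated assumptions rather than the code used in this research: only "abandon" and "abhor" are taken from the AFINN excerpt above, the positive entries are hypothetical placeholders, and the rule "sum above zero is Positive, below zero is Negative" is an assumed convention.

```python
# Tiny AFINN-style lexicon: word -> integer valence.
# "abandon" and "abhor" come from the excerpt above; the positive
# entries are hypothetical placeholders for illustration only.
SAMPLE_LEXICON = {"abandon": -2, "abhor": -3, "good": 3, "love": 3}


def lexicon_score(tokens):
    """Sum the valence of every token; unknown words contribute 0."""
    return sum(SAMPLE_LEXICON.get(token, 0) for token in tokens)


def label_from_score(score):
    """Map a summed score to one of the three sentiment classes."""
    if score > 0:
        return "Positive"
    if score < 0:
        return "Negative"
    return "Neutral"


if __name__ == "__main__":
    for tweet in (["i", "love", "this", "project"],
                  ["they", "abandon", "it"],
                  ["wall", "drawing", "bot"]):
        score = lexicon_score(tweet)
        print(tweet, score, label_from_score(score))
```

A binary lexicon such as Bing could be handled the same way by counting positive words as +1 and negative words as -1.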
Figure 9: A tree map showing the number of words associated with different categories CITATION Moh11 l 2057 44
Machine Learning Approach
After pre-processing, we used different approaches in order to train our model through machine learning algorithms: Word Cloud, Term Frequency and Inverse Document Frequency, and Supervised vs Unsupervised Learning.

Word Cloud:
We built a word cloud for each category to see which project or word is the most popular among the tweets. A word cloud (or tag cloud) can be a handy tool when you need to highlight the most commonly cited words in a text using a quick visualization. CITATION Wik181 l 2057 9
We are using the “wordcloud” library available in R to visualize the frequency of words in tweets for various categories. Below are some of the wordclouds generated using R for various categories present in Kickstarter projects.

Figure 10: Word cloud for Art Category
Figure 11: Word cloud for Comics Category
Figure 12: Word cloud for Design & Tech Category
Figure 13: Word cloud for Publishing Category
TF-IDF
The second technique we used was extracting features using the bag-of-words technique. Each sentence/tweet is mapped to a bag of words. We then calculate the term frequency of those words in the document.

Figure 14: Bag of Words for Feature Extraction
Term Frequency (TF) measures how frequently a term occurs in a document. Because documents differ in length, a term may appear many more times in a long document than in a short one. Thus, the term count is divided by the document length for normalization.

TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document).

Inverse Document Frequency (IDF) measures how important a term is. When computing TF, all terms are treated as equally important; IDF weighs down terms that occur in many documents and scales up the rare ones.
IDF(t) = log(Total number of documents / Number of documents with term t in it).

The tf-idf weight is widely used in text mining. It evaluates how important a word is to a document in a collection or corpus. The importance of the word increases proportionally to the number of times it appears in the document and is counterbalanced by the frequency of the word in the corpus. In this research, tf-idf can be used effectively for filtering stop-words from the tweets in tasks such as text summarization and classification. CITATION TFI l 2057 45
After the term frequency, we calculate the inverse document frequency for those words, and from the two the TF-IDF weight is calculated for the subsequent steps. As per Wikipedia, “tf–idf or TF-IDF, short for term frequency-inverse document frequency is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus. It is often used as a weighting factor in searches of information retrieval, text mining, and user modeling.” CITATION Wik181 l 2057 9 The same concept is used to identify the term frequency across the tweets, so that it is clear which terms are the most used across the documents. We also normalize the text using this technique and eliminate the documents that do not contain any of the terms/words.
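The TF and IDF formulas above can be worked through on a toy corpus. The Python sketch below follows those two formulas directly and multiplies them to obtain the tf-idf weight; the three token lists are made-up examples, the natural logarithm is an assumption (the text does not state the base), and returning 0 for a term that occurs in no document is a choice made here to avoid division by zero.

```python
import math


def tf(term, doc):
    """Term frequency: occurrences of the term divided by document length."""
    return doc.count(term) / len(doc)


def idf(term, docs):
    """Inverse document frequency: log(total documents / documents with term)."""
    containing = sum(1 for doc in docs if term in doc)
    if containing == 0:
        return 0.0  # assumption: a term absent from the corpus gets weight 0
    return math.log(len(docs) / containing)


def tf_idf(term, doc, docs):
    """tf-idf weight of a term in one document of the corpus."""
    return tf(term, doc) * idf(term, docs)


# Toy corpus of three tokenized tweets (hypothetical).
docs = [
    ["cool", "pen", "mechanism"],
    ["cool", "wall", "drawing", "bot"],
    ["wall", "canvas"],
]

if __name__ == "__main__":
    for term in ("cool", "pen"):
        print(term, round(tf_idf(term, docs[0], docs), 4))
```

In this corpus "pen" occurs in one document and "cool" in two, so "pen" receives the higher weight in the first tweet even though both words appear there once.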
We will be using different machine learning algorithms to train our model to predict the sentiment.

Supervised Learning vs Unsupervised Learning
Supervised learning is the process which uses the input variable (X) to predict the output variable (Y) using some functions. A supervised learning algorithm can be grouped into classification and regression problems which can be used for categorical variable problem and a quantitative variable respectively. In this type of learning, all the data is labelled and it learns to predict the output from the input after training. Some of the supervised learning algorithms are Linear Regression, Random Forest, Support Vector machine, etc.

Unsupervised learning is the process where we only have the input variable (X) and the goal of this learning is to train the model in order to learn the structure of data and underlying distribution of the data. This contains unlabelled data and can be grouped into Clustering and Association. Some of the unsupervised learning algorithms are k-means clustering, Apriori algorithm.

In this research, we used supervised learning to classify the sentiments into three classes, since the output was already known. We manually labelled the tweets in the training set as Positive, Negative, or Neutral using the polarity functionality, and we used the trained model to test our predictions on the test dataset. The dataset is first split into training and test datasets of 9036 and 3867 tweets respectively.

We used Naïve Bayes, KNN, Random Forest and Support Vector Machine to train our models and assess their performance in terms of accuracy. The features are the words extracted as a bag-of-words, and the output variable is the sentiment.
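As an illustration of how a bag-of-words classifier of this kind works, the following Python sketch implements a small multinomial Naïve Bayes with add-one (Laplace) smoothing. It is not the R model trained in this research: the six training tweets and their labels are invented, the function names are assumptions, and words never seen in training are simply ignored at prediction time.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, labels):
    """Count documents per class and word occurrences per class."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab


def predict_nb(model, tokens):
    """Return the class with the highest log posterior for the tokens."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_logp = None, float("-inf")
    for label, n_docs in class_counts.items():
        logp = math.log(n_docs / total_docs)               # class prior
        denom = sum(word_counts[label].values()) + len(vocab)
        for token in tokens:
            if token in vocab:                             # skip unseen words
                logp += math.log((word_counts[label][token] + 1) / denom)
        if logp > best_logp:
            best_label, best_logp = label, logp
    return best_label


# Invented, already-cleaned training tweets (hypothetical data).
train_docs = [
    ["love", "cool", "idea"], ["great", "project", "love"],
    ["disappoint", "bad"], ["bad", "delay"],
    ["wall", "draw", "bot"], ["pen", "mechan"],
]
train_labels = ["Positive", "Positive", "Negative", "Negative", "Neutral", "Neutral"]
model = train_nb(train_docs, train_labels)

if __name__ == "__main__":
    print(predict_nb(model, ["love", "project"]))
```

Working in log space avoids numerical underflow when a tweet contains many words, and the smoothing keeps a single unseen word-class pair from driving the probability to zero.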

Hybrid Approach
This approach combines the Lexicon and Machine Learning approaches. The scores for each tweet are computed through the Lexicon approach and combined with the TF-IDF features to train the model, after reducing the dimension of the features using Singular Value Decomposition (SVD) and Latent Semantic Analysis (LSA). The flowchart below gives an idea of how the process works:

Figure 15: Flowchart for the Hybrid approach, which incorporates topic modeling using the lexicon and machine learning approaches CITATION Sil17 l 2057 46
Latent Semantic Analysis (LSA) and Singular Value Decomposition (SVD)
“Latent semantic analysis (LSA) is a technique in natural language processing, for analyzing relationships between a set of documents and the terms they contain by producing a set of concepts related to the documents and terms.
A matrix containing word counts per paragraph (rows represent unique words and columns represent each paragraph) is constructed from a large piece of text and a mathematical technique called singular value decomposition (SVD) is used to reduce the number of rows while preserving the similarity structure among columns.
Words are then compared by taking the cosine of the angle between the two vectors (or the dot product between the normalizations of the two vectors) formed by any two rows. Values close to 1 represent very similar words while values close to 0 represent very dissimilar words.” CITATION Wik181 l 2057 9
The mathematical form of SVD can be shown as: “Suppose M is a matrix of dimension (m x n); there exists a singular value decomposition of M of the form:
M = U Σ V*
U is a unitary matrix of dimension (m x m).

Σ is a diagonal matrix of dimension (m x n) with non-negative real numbers.

V is a unitary matrix of dimension (n x n).

V* is the conjugate transpose of the n x n unitary matrix V.

The diagonal values of Σ are known as the singular values of M.” CITATION Sin17 l 2057 47
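The cosine comparison that LSA applies to the reduced word vectors can be sketched in a few lines. The Python function below is a minimal illustration with made-up vectors; the function name and the choice to return 0 for an all-zero vector are assumptions of this sketch, and the SVD step itself is not shown.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumption: treat an all-zero vector as dissimilar
    return dot / (norm_u * norm_v)


if __name__ == "__main__":
    # Parallel vectors are maximally similar, orthogonal ones are not.
    print(cosine_similarity([1, 2, 0], [2, 4, 0]))
    print(cosine_similarity([1, 0], [0, 1]))
```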
IMPLEMENTATION
Data Set Analysis
The datasets acquired from Twitter to perform sentiment analysis for various Kickstarter projects cover the newly launched and most popular projects across 8 categories, namely: Art, Design & Tech, Food & Craft, Comics & Illustrations, Film, Music, Games, and Publishing. Projects were selected by their number of tweets, with a minimum of 15 tweets per project in this research. The tweets were collected using the Kickstarter keyword and the project name as the query parameters, and various result-set parameters were collected using the Twitter API (TweePy). The following code snippet shows the implementation of the Twitter API using TweePy:

Figure 16: Code Snippet for Twitter API implementation
We collected a few of these result-set parameters to identify the ones useful for the research. These parameters are written into a data frame using the Pandas library available in Python, which is then saved to a Comma Separated Values (CSV) file.
The following table shows the structure of the dataset built for the research:
Attribute       Type        Meaning
FavoriteCount   Number      Likes of the tweet
RetweetCount    Number      Retweet count of the tweet
Source          Character   From which source the tweet was done
Text            Character   Text of the tweet
Category        Factor      Category of Kickstarter project
Project         Factor      Project name of Kickstarter
Table 5: Structure of the Dataset
The following snapshot shows the data which is used for the research:

Figure 17: Sample Data from the dataset
We also visualized the relationship between the Favorite Count and Retweet Count to gain insight into which Kickstarter category is the most popular and most successful. This visualization showed that Food & Craft has the highest Retweet Count, while Art does better in terms of the number of users liking the projects launched in that category. The best overall is Film, which has the second-highest Retweet Count and Favorite Count. Based on the Kickstarter statistics, Film is doing well in terms of the number of successfully funded projects to date (25,233 projects), which makes it the second most funded category on Kickstarter. CITATION sta18 l 2057 6
Figure 18: Distribution of Favorite and Retweet Counts across Kickstarter Categories
The distribution of Favorite Counts and Retweet Counts for various projects under each category was also visualised. A Retweet can be read as a sign that the audience is sharing the tweet with their community or friends, and a Favorite as a sign of appreciation by the audience. Although it is difficult to judge whether a Retweet is worth considering when analysing the overall outcome, these counts give some insight into whether a project is doing well overall.

The graphs below show the Favorite and Retweet Counts for the popular projects on Kickstarter.

Figure 19: Distribution of Favorite Count and Retweet Count for projects under Film Category
Figure 20: Distribution of Favorite Count and Retweet Count for projects under Music Category
Pre Processing Implementation
Preprocessing consists of several steps, as discussed in Section 3.3.3. Their implementation is discussed in detail as follows:
Contraction Replacement
The tweets are taken as input for the contraction replacement. The “qdap” library is used to replace contractions in the tweets: the function “replace_contraction” takes the text as an argument and returns a vector with contractions replaced by their long form.
The following code shows the contraction replacement in action:
I’m a bit disappointed I can’t backup @scribit @Kickstarter, which turns your wall into an interactive canvas. Really want to have it and hope to see an additional option to back it up or an additional way to buy it
train.tweet <- replace_contraction(train$Text)
I am a bit disappointed I can not backup @scribit @Kickstarter, which turns your wall into an interactive canvas. Really want to have it and hope to see an additional option to back it up or an additional way to buy it
Elimination of Numbers, Special Characters, URL, RT, User-ID
In this step, numbers, special characters, URLs, RT tokens, user IDs starting with “@”, and hashtags starting with “#” are replaced with empty/blank characters in the tweets. This is done using the gsub substitution function in R, which takes the pattern as a regular expression and returns a character vector of the same length.
The following code shows the substitution of the characters present in the tweet:
"RT @kenradio: Scribit – Turn your wall into an interactive canvas"
# removes RT tokens together with the user IDs that follow them
train.tweet = gsub("RT((?:\\b\\W*@\\w+)+)", "", train.tweet)
# removes http:// links
train.tweet = gsub("http[^[:blank:]]+", "", train.tweet)
# removes user IDs (@ followed by word characters)
train.tweet = gsub("@\\w+", "", train.tweet)
# removes hashtags (# followed by word characters)
train.tweet = gsub("#\\w+", "", train.tweet)
# replaces punctuation with a space
train.tweet = gsub("[[:punct:]]", " ", train.tweet)
# replaces every remaining non-alphanumeric character with a space
train.tweet = gsub("[^[:alnum:]]", " ", train.tweet)
" Scribit Turn your wall into an interactive canvas "
Elimination of Stop-words
To eliminate stop-words from the tweets, we use the stopwords function available in the “quanteda” package. Before eliminating the stop-words, we convert the tweet strings into tokens and then use the tokens_select function to remove the stop-words; the word “kickstarter” is added to the list as an additional stop-word. Stop-words are removed in order to standardize the text.

The following lines of code show the removal of stop-words:
“I got inspired to make a wall drawing bot after seeing the Scribit on Kickstarter I like the motor and electronics in one place I want to expose some cool looking mechanisms so I thought some hubless spools might be the way to go I need a cool pen mechanism idea ”
# drop the standard stop-words plus the project-specific word "kickstarter"
train.tokens <- tokens_select(train.tokens, c(stopwords(), "kickstarter"),
                              selection = "remove")
text21 :
“got” “inspired” “make” “wall” “drawing” “bot” “seeing” “scribit” “like” “motor” “electronics” “one” “place” “want” “expose” “cool” “looking” “mechanisms” “thought” “hubless” “spools” “might” “way” “go” “need” “cool” “pen” “mechanism” “idea”
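The filtering step shown above amounts to dropping every token that appears in a stop-word set. The Python sketch below mirrors that idea for illustration only: the stop-word set is a small hypothetical subset rather than the quanteda list, and the function name and the extra parameter are assumptions.

```python
# Small illustrative subset of English stop-words (hypothetical; the
# research uses the full list supplied by quanteda).
STOPWORDS = {"i", "to", "a", "the", "on", "and", "in", "so", "some", "be", "after"}


def remove_stopwords(tokens, extra=("kickstarter",)):
    """Drop stop-words and any extra project-specific words, ignoring case."""
    drop = STOPWORDS | set(extra)
    return [token for token in tokens if token.lower() not in drop]


if __name__ == "__main__":
    tokens = ["I", "got", "inspired", "to", "make", "a", "wall",
              "drawing", "bot", "on", "Kickstarter"]
    print(remove_stopwords(tokens))
```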
Standardising the text
The text can be standardized by converting it all to lower case using the tokens_tolower function of the “quanteda” package, which takes tokens as input and returns the lower-cased tokens.

Stemming is used to reduce words to the root form from which they are derived. This process is important for standardizing the text used in further analysis. We use the tokens_wordstem function of the “quanteda” package, which takes tokens as input, and we restrict the stemming to the English language in this research.

The following code describes the stemming process in R:
“got” “inspired” “make” “wall” “drawing” “bot” “seeing” “scribit” “like” “motor” “electronics” “one” “place” “want” “expose” “cool” “looking” “mechanisms” “thought” “hubless” “spools” “might” “way” “go” “need” “cool” “pen” “mechanism” “idea”
train.tokens <- tokens_wordstem(train.tokens, language = "english")
tokens from 1 document.

text21 :
“got” “inspir” “make” “wall” “draw” “bot” “see” “scribit” “like” “motor” “electron” “one” “place” “want” “expos” “cool” “look” “mechan” “thought” “hubless” “spool” “might” “way” “go” “need” “cool” “pen” “mechan” “idea”
Sentiment Analysis Implementation
Section 3.3.4 introduced the different approaches to sentiment analysis. Here we discuss the implementation of those approaches and how each affects the sentiments assigned to the tweets in our research.
Polarity Approach
The polarity approach interprets sentiment from the orientation (emotion) of the words in a tweet. It is carried out with the polarity function of the “qdap” library, which takes the cleaned tweets and returns a result set that approximates the sentiment and summarises it by grouping variable. The result set has two components: all and group. The “all” component is a dataframe with the following columns [41]:
group.var: the grouping variable.

wc: word count of the particular tweet/text.

polarity: sentence polarity score.

pos.words: words in the sentence that are considered positive.

neg.words: words in the sentence that are considered negative.

text.var: the text variable.
The “group” component of the result set is a dataframe with the average polarity score by grouping variable and contains the following columns:
group.var: the grouping variable.

total.sentences: total number of sentences used.

total.words: total number of words used in the sentences.

ave.polarity: sum of all polarity scores for a particular group divided by the number of sentences used.

sd.polarity: standard deviation of the group’s sentence-level polarity scores.

stan.mean.polarity: standardized polarity score, calculated as the average polarity score for a group divided by its standard deviation.

The following code illustrates the polarity calculation and the categorization of sentiments carried out for this research:
tweets.polarity <- polarity(train.tweet)
tweets_each_polarity <- tweets.polarity$all
polarity <- ifelse(tweets_each_polarity$polarity < 0, "Negative", ifelse(tweets_each_polarity$polarity > 0.15, "Positive", "Neutral"))
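To make the thresholding rule concrete, the following base-R sketch applies the same nested ifelse to a handful of hypothetical scores. The scores are made up for illustration and do not come from the tweet data.

```r
# Hypothetical polarity scores for five tweets (not from the study's data).
scores <- c(-0.40, 0.00, 0.10, 0.15, 0.60)

# Same rule as in the text: below 0 is Negative, above 0.15 is Positive,
# and everything from 0 to 0.15 inclusive is Neutral.
sentiment <- ifelse(scores < 0, "Negative",
                    ifelse(scores > 0.15, "Positive", "Neutral"))

print(data.frame(score = scores, sentiment = sentiment))
print(table(sentiment))
```

Note that a score of exactly 0.15 falls in the Neutral class, because the Positive test is a strict inequality.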
The following figure shows the result returned from polarity:
all total.sentences total.words ave.polarity sd.polarity stan.mean.polarity
all 9036 170511 0.156 0.298 0.522

Figure 21: Polarity result returned from tweets
Figure 22: Plot of Sentiment Categories from Polarity Approach
Lexicon Approach
In the lexicon approach, each word has a score associated with it. We used three different lexicons in this research, which were discussed in an earlier section. Using the BING lexicon, the sentiment is categorized into two classes: Positive and Negative. The lists of positive and negative words were taken from the BING lexicon and used to calculate the sentiment of the tweets. We built a function for calculating the sentiment score, which takes the sentences and the positive and negative word lists as arguments. The code snippet below shows the function:
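# laply is in the plyr package; str_split is in the stringr package;
# stopwords() is assumed to come from quanteda, as elsewhere in this chapter
library(plyr)
library(stringr)
score.sentiment = function(sentences, pos.words, neg.words, .progress = 'none') {
  scores = laply(sentences, function(sentence, pos.words, neg.words) {
    # remove punctuation, control characters and digits
    sentence = gsub('[[:punct:]]', '', sentence)
    sentence = gsub('[[:cntrl:]]', '', sentence)
    sentence = gsub('\\d+', '', sentence)
    sentence = tolower(sentence)
    # split into words
    word.list = str_split(sentence, '\\s+')
    words = unlist(word.list)
    # drop stop words and the search term itself
    words = words[!words %in% c(stopwords(), 'kickstarter')]
    # match() returns NA for words that are not in the list
    pos.matches = !is.na(match(words, pos.words))
    neg.matches = !is.na(match(words, neg.words))
    # score = positive matches minus negative matches
    score = sum(pos.matches) - sum(neg.matches)
    return(score)
  }, pos.words, neg.words, .progress = .progress)
  scores.df = data.frame(score = scores, text = sentences)
  return(scores.df)
}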
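The scoring idea behind the function above (positive matches minus negative matches) can be tried out with base R alone. In this sketch the word lists are tiny hypothetical stand-ins for the BING lists, the tweets are invented, and score.one is an illustrative helper name rather than part of the original code.

```r
# Tiny hypothetical word lists standing in for the BING positive/negative lists.
pos.words <- c("cool", "inspired", "like", "love")
neg.words <- c("bad", "broken", "late")

score.one <- function(sentence, pos.words, neg.words) {
  # Strip punctuation and digits, lower-case, then split on whitespace.
  cleaned <- tolower(gsub("[[:punct:][:digit:]]", "", sentence))
  words <- unlist(strsplit(cleaned, "\\s+"))
  # Score = number of positive matches minus number of negative matches.
  sum(words %in% pos.words) - sum(words %in% neg.words)
}

# Invented example tweets.
tweets <- c("I like the cool pen mechanism!",
            "Delivery was late and the motor arrived broken.",
            "Backed it on day 1.")
scores <- vapply(tweets, score.one, numeric(1),
                 pos.words = pos.words, neg.words = neg.words,
                 USE.NAMES = FALSE)
print(scores)
```

A positive score corresponds to a Positive tweet, a negative score to a Negative tweet, and a score of zero means no lexicon word outweighed the others.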

The following graph shows the distribution of sentiments using BING lexicon:

Figure 23: Distribution of sentiments using BING lexicon
Using the AFINN lexicon, the score ranges from -5 to +5 and the sentiments are categorized into two categories: Positive and Negative. We used the get_sentiment function from the “syuzhet” library, which takes the tweet/sentence and returns a numeric vector of sentiment values. The following code snippet shows the implementation for the AFINN lexicon:
sentiment_afinn <- get_sentiment(word.df)

The following graph shows the distribution of sentiments using AFINN lexicon:

Figure 24: Distribution of sentiments using AFINN lexicon
The NRC lexicon is based on the emotions of the words, which are grouped into ten different categories. We used the get_nrc_sentiment function of the “syuzhet” library, which takes the tweets as input and returns a score for each emotion type. The following code snippet shows the implementation of the NRC lexicon:
nrc_sentiment <- get_nrc_sentiment(train.tweet)
Sentiment_Scores <- data.frame(colSums(nrc_sentiment))
names(Sentiment_Scores) <- "Score"
Sentiment_Scores <- cbind("Sentiment" = rownames(Sentiment_Scores), Sentiment_Scores)
rownames(Sentiment_Scores) <- NULL

The below graph shows the distribution of different categories from NRC lexicon:

Figure 25: Distribution of Sentiment Categories using NRC Lexicon
Machine Learning Approach
In the machine-learning approach, the data is converted to a bag-of-words representation for further analysis. The bag-of-words is used to calculate tf-idf weights, which give the machine-learning algorithm numeric features from which to predict the sentiment of the tweets. This was done with the dfm function from the “quanteda” library, which constructs a sparse document-feature matrix; the term frequency and inverse document frequency are then calculated from its features.
The following code snippet shows the function used for calculating the tf, idf, and tf-idf weights:
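train.tokens.dfm <- dfm(train.tokens, tolower = FALSE)
train.tokens.matrix <- as.matrix(train.tokens.dfm)
# Term frequency: counts in one document divided by the document's total
term.frequency <- function(row) {
  row / sum(row)
}
# Inverse document frequency: log of corpus size over documents containing the term
inverse.doc.freq <- function(col) {
  corpus.size <- length(col)
  doc.count <- length(which(col > 0))
  log10(corpus.size / doc.count)
}
# tf-idf: term frequency weighted by inverse document frequency
tf.idf <- function(x, idf) {
  x * idf
}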
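As a worked illustration of the three quantities defined above, the following base-R sketch computes tf, idf and tf-idf on a toy count matrix. It uses vectorised equivalents (rowSums, colSums, sweep) instead of the three functions, and the counts and term names are made up.

```r
# Toy document-term count matrix: 3 documents (rows) x 3 terms (columns).
# The counts are made up for illustration.
counts <- matrix(c(2, 1, 0,
                   0, 1, 1,
                   1, 1, 0),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(c("doc1", "doc2", "doc3"),
                                 c("pen", "cool", "motor")))

# Term frequency: each row divided by its row total.
tf <- counts / rowSums(counts)

# Inverse document frequency: log10(number of docs / docs containing the term).
idf <- log10(nrow(counts) / colSums(counts > 0))

# tf-idf: scale every column of tf by that term's idf.
tfidf <- sweep(tf, 2, idf, "*")
print(round(tfidf, 3))
```

The term "cool" occurs in every document, so its idf is log10(1) = 0 and its tf-idf weight is zero everywhere, while the rarer term "motor" receives the largest weight.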
After the tf-idf is calculated, we used different machine learning algorithms to predict the sentiment on the test dataset. The following machine learning algorithms were used:
Decision Trees: Decision trees can be used for both classification and regression problems. In our research, we trained the model with 10-fold cross-validation repeated 3 times.
The following code snippet shows rpart being used to train the model for predicting the sentiments:
library(caret)    # createMultiFolds, trainControl, train
library(doSNOW)   # registerDoSNOW; makeCluster comes from snow
# 10-fold cross-validation repeated 3 times
cv.folds <- createMultiFolds(tweet.new_df$polarity, k = 10, times = 3)
cv.cntrl <- trainControl(method = "repeatedcv", number = 10,
repeats = 3, index = cv.folds)
start.time <- Sys.time()
# Run the training in parallel on 3 socket workers
cl <- makeCluster(3, type = "SOCK")
registerDoSNOW(cl)
# rpart.cv is a placeholder name; the variable name is missing in the source
rpart.cv <- train(Sentiment ~ ., data =
train.tokens.tfidf_user.df, method = "rpart",
trControl = cv.cntrl, tuneLength = 7)
stopCluster(cl)
total.time <- Sys.time() - start.time
# Check out our results.
rpart.cv
Support Vector Machine: SVM is well suited to classification problems because it constructs a hyperplane, or a set of hyperplanes, in a high-dimensional space, which can be used to separate the three sentiment classes: positive, negative, and neutral. This algorithm is helpful in text categorization and achieves high accuracy compared to traditional query refinement schemes [9].

Random Forest: Random Forest is an ensemble learning algorithm that combines multiple decision trees and can be used for both classification and regression problems. It also lets us extract the variable importance from the dataset, which can be used to tune our model at a later phase.

KNN: This algorithm classifies an observation according to the classes of its k nearest neighbors. A good value of k depends on the data and is selected by hyperparameter tuning. For our research, the optimal value was k = 5.

Naïve Bayes: This algorithm is based on a probabilistic approach and is highly scalable. Although it does not give high accuracy in predicting the sentiments, it is useful for estimating the posterior probabilities of the sentiment.

The hybrid approach is carried out once the dimension of the features has been reduced using the SVD and LSA techniques. SVD gives us a linear transformation of the data, and it is applied to the tf-idf scores of the feature matrix. We use the “irlba” package to reduce the dimension and combine the result with our lexicon sentiment categories to predict the actual sentiment of the data set using the machine learning algorithms mentioned above. The irlba function takes the matrix and returns a list of entries, from which we use only the approximate right singular vectors, restricted to a 300-dimensional document semantic space. The hybrid approach thus uses both the document features, i.e., the bag of words, and the sentiment score features to train our model.
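The dimensionality reduction step can be illustrated with base R's svd function, used here as a stand-in for irlba on a matrix small enough for a full decomposition. The matrix values are random, the orientation (terms in rows, documents in columns) is an assumption made so that the right singular vectors describe documents, and k = 2 replaces the 300 dimensions used in the research.

```r
# Toy tf-idf matrix with terms in rows and documents in columns
# (random values, purely illustrative).
set.seed(1)
term.doc <- matrix(runif(6 * 4), nrow = 6, ncol = 4)

k <- 2  # number of latent dimensions to keep (the study keeps 300)

# Truncated SVD: keep only the first k left and right singular vectors.
dec <- svd(term.doc, nu = k, nv = k)

# Each row of dec$v is one document in the k-dimensional semantic space.
doc.features <- dec$v
print(dim(doc.features))

# Rank-k reconstruction, to see how much information is lost.
term.doc.k <- dec$u %*% diag(dec$d[1:k], nrow = k) %*% t(dec$v)
print(max(abs(term.doc - term.doc.k)))
```

In a hybrid setup, the rows of doc.features would be combined column-wise with the lexicon sentiment scores before training a classifier.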

Part of Speech Tagging
Part-of-speech tagging, also known as POS tagging, marks each word in a corpus/text with its part of speech, based on its definition or context. This is harder than just having a list of words and their parts of speech, because “some words can represent more than one part of speech at different times, and because some parts of speech are complex or unspoken.” [9]
POS tagging is also performed to predict the sentiment. It shows that nouns, verbs, and adjectives, i.e., tags such as NN, VBN, VB, and VBD, are among the parts of speech that help in predicting true positive sentiments from the tweets.
The following table shows the part-of-speech tags used in R [48]:
Tag Description
CC Coordinating conjunction
DT Determiner
IN Preposition or subordinating conjunction
JJ Adjective
NN Noun
NNP Proper noun
VB Verb
VBN Verb, past participle
PRP Personal pronoun
RB Adverb
Table 6: POS tags and their description
The following code snippet shows the implementation of POS tagging in R, which gives the result shown in Figure 26:
pos_tag <- pos(train.tweet)
tagged_pos <- pos_tag$POStagged

Figure 26: POS tags for tweets
RESULT
This chapter discusses the results obtained from the research. The performance and accuracy of the algorithms and the lexicons in predicting the sentiment are described in the following sections.

Sentiment Analysis
In this research, we obtained a total of 12903 tweets and used the training dataset for training the various models. The proportion of tweets in each sentiment category is the same in both the training and test datasets. The labelled data is categorised into three types of sentiment (Positive, Negative and Neutral) with the help of the polarity approach: the mean polarity score, taken from the summary statistics, was used to categorise the sentiment. We also cross-checked the tweets to verify that the sentiment was categorized correctly by this approach. The table below shows the distribution of sentiments across all categories of Kickstarter projects, labelled using the polarity score.
Table 7: Distribution of sentiment across categories
Categories | Negative | Neutral | Positive
Art | 125 | 439 | 588
Comics & Illustration | 234 | 530 | 510
Design & Tech | 105 | 451 | 592
Film | 126 | 520 | 631
Food & Craft | 50 | 303 | 636
Games | 141 | 680 | 393
Music | 73 | 374 | 375
Publishing | 135 | 481 | 544
From the above data, we have the counts of Positive and Negative tweets for each category. The Food & Craft category seems to be the most popular, with the fewest Negative tweets, followed by Music. The Kickstarter statistics likewise show Music as the most popular category, with the largest number of successful projects. Table 8 shows the statistics from the Kickstarter website [6].
Table 8: Kickstarter Statistics
Category | Launched Projects | Successful Dollars (in millions) | Successfully Funded
Art | 30921 | 88.34 | 12740
Comics & Illustration | 12228 | 76.45 | 6748
Design & Tech | 35665 | 657.22 | 11876
Film | 68151 | 350.09 | 25268
Food & Craft | 26422 | 114.87 | 6544
Games | 39759 | 799.7 | 14503
Music | 56856 | 200.53 | 28037
Publishing | 42984 | 125.98 | 13377
Lexicon Analysis
We used two lexicons, AFINN and BING, for predicting the sentiment of tweets. As the NRC lexicon has ten categories, we did not consider it for predicting the sentiment. Using the AFINN lexicon, we obtained a large number of positive tweets and few negative tweets. We adjusted the score thresholds for categorizing the tweets into Positive, Negative and Neutral: a score below -0.5 is a Negative tweet, a score above 0.49 is a Positive tweet, and a score between -0.5 and 0.49 is a Neutral tweet. The sentiment score ranges from -5 to +5.

Table 9 shows the distribution of the sentiment categorization of tweets using the AFINN lexicon.

Table 9: Analysis Result of Sentiment using AFINN lexicon
Categories | Negative | Neutral | Positive
Art | 78 | 322 | 752
Comics & Illustration | 125 | 462 | 687
Design & Tech | 18 | 398 | 732
Film | 63 | 323 | 891
Food & Craft | 5 | 257 | 727
Games | 107 | 498 | 609
Music | 45 | 212 | 565
Publishing | 75 | 335 | 750
With the BING lexicon, the adjusted sentiment score gives a larger number of Neutral tweets: a score below 0 is categorized as Negative, a score above 0 as Positive, and a score of 0 as Neutral. The Music category has the fewest Negative tweets, and this lexicon can be considered more accurate, based on the Kickstarter statistics in Table 8. The sentiment score ranges from -4 to +6 in our case.

Table 10 shows the distribution of sentiment across the categories using the BING lexicon.

Table 10: Analysis Result of Sentiment using BING lexicon
Categories | Negative | Neutral | Positive
Art | 98 | 544 | 510
Comics & Illustration | 179 | 668 | 427
Design & Tech | 80 | 674 | 394
Film | 154 | 571 | 552
Food & Craft | 73 | 333 | 583
Games | 90 | 806 | 318
Music | 49 | 427 | 346
Publishing | 111 | 650 | 399
Figure 27 shows the distribution of sentiment across the categories, obtained using the BING lexicon.

Figure 27: Distribution of Sentiment across categories using BING lexicon
Machine Learning Analysis
In this research, we used supervised machine learning: we trained our models on the training dataset, with Sentiment as the output variable and the bag-of-words as features. The TF-IDF weights are used for predicting the sentiments of the tweets. The Decision Tree was trained with repeated 10-fold cross-validation on 9036 samples and 5757 predictors to classify three types of sentiment: Positive, Negative and Neutral. It gave 71.13% accuracy using the TF-IDF weights.

The below figure shows the result using the decision tree machine learning algorithm.

Figure 28: Result of Decision Tree algorithm using bag of words as predictors
As the dimension of the predictors was very large, we applied the Singular Value Decomposition technique and restricted the dimension to 300 to train our model using the decision tree. After reducing the dimensionality, the model gave us 65.4% accuracy using the repeated 10-fold CV technique. The figure below shows the output.

Figure 29: Result of Decision Tree algorithm using SVD technique
We also used the KNN machine learning algorithm on the reduced dimension and achieved an improved accuracy of 78.13% with k = 5.

Figure 30: Result of KNN algorithm using SVD technique
The random forest machine learning algorithm combines multiple decision trees; the default number of trees, 500, was used. At each split, 17 variables (features) were considered when training the model. The figure below shows the output for the random forest.

Figure 31: Result using Random Forest algorithm on SVD
The SVM model gives us the best separation between the three categories of sentiment, with 88.18% accuracy. We used the radial kernel to train our model for predicting the sentiments of tweets.

The confusion matrix function was used to see the overall accuracy of each model, and it also indicates which lexicon gives better accuracy. A confusion matrix describes the performance of a classification model on data for which the true values are known.
The results below show the confusion matrices for the AFINN and BING lexicons.

Table 11: Confusion Matrix using AFINN lexicon
Confusion Matrix and Statistics
Prediction Negative Neutral Positive
Negative 322 167 27
Neutral 418 2170 219
Positive 249 1441 4023
Overall Statistics

Accuracy : 0.721
95% CI : (0.7116, 0.7302)
No Information Rate : 0.4724
P-Value Acc > NIR : < 2.2e-16

Kappa : 0.5063
Mcnemar’s Test P-Value : < 2.2e-16
Statistics by Class:
Class: Negative Class: Neutral Class: Positive
Sensitivity 0.32558 0.5744 0.9424
Specificity 0.97589 0.8789 0.6455
Pos Pred Value 0.62403 0.7731 0.7042
Neg Pred Value 0.92171 0.7419 0.9260
Prevalence 0.10945 0.4181 0.4724
Detection Rate 0.03564 0.2402 0.4452
Detection Prevalence 0.05710 0.3106 0.6322
Balanced Accuracy 0.65074 0.7266 0.7939
Table 12: Confusion Matrix using BING lexicon
Confusion Matrix and Statistics
Prediction Negative Neutral Positive
Negative 627 146 61
Neutral 257 3436 980
Positive 105 196 3228
Overall Statistics

Accuracy : 0.8069
95% CI : (0.7986, 0.815)
No Information Rate : 0.4724
P-Value Acc > NIR : < 2.2e-16

Kappa : 0.6722
Mcnemar’s Test P-Value : < 2.2e-16
Statistics by Class:
Class: Negative Class: Neutral Class: Positive
Sensitivity 0.63397 0.9095 0.7561
Specificity 0.97428 0.7647 0.9369
Pos Pred Value 0.75180 0.7353 0.9147
Neg Pred Value 0.95586 0.9216 0.8110
Prevalence 0.10945 0.4181 0.4724
Detection Rate 0.06939 0.3803 0.3572
Detection Prevalence 0.09230 0.5172 0.3905
Balanced Accuracy 0.80412 0.8371 0.8465
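The headline statistics in these tables follow directly from the cell counts. As a check, the base-R sketch below recomputes accuracy, sensitivity and positive predictive value from the AFINN matrix in Table 11, assuming that rows are predictions and columns are reference labels; the per-class values it produces agree with the table under that assumption.

```r
# Confusion matrix from Table 11 (AFINN): rows are predictions,
# columns are the reference labels.
classes <- c("Negative", "Neutral", "Positive")
cm <- matrix(c(322,  167,   27,
               418, 2170,  219,
               249, 1441, 4023),
             nrow = 3, byrow = TRUE,
             dimnames = list(Prediction = classes, Reference = classes))

# Overall accuracy: correctly classified tweets over all tweets.
accuracy <- sum(diag(cm)) / sum(cm)

# Sensitivity (recall) per class: diagonal over the reference column totals.
sensitivity <- diag(cm) / colSums(cm)

# Positive predictive value (precision): diagonal over the prediction row totals.
ppv <- diag(cm) / rowSums(cm)

print(round(accuracy, 4))
print(round(sensitivity, 4))
print(round(ppv, 4))
```

The low sensitivity for the Negative class shows that most truly negative tweets are scored as Neutral or Positive by this lexicon, even though the overall accuracy is above 72%.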
Some projects were predicted by this sentiment analysis to be successful, or on the verge of success, and Kickstarter confirms that they were successfully funded. Projects that were successfully funded and identified by the sentiment analysis include OAXIS Entertainment from the Film category, Villagers from the Games category, and Smart Belt 2.0 from the Design & Tech category. We also looked at the funds they received beyond their goal and the number of backers who pledged to the project. The Music category, which has very few negative sentiments, has more successful projects, as predicted from the sentiment, and this was compared against the Kickstarter statistics.

CONCLUSION & FUTURE SCOPE
Conclusion
Social media data is growing at a remarkable rate, and it will keep growing in the coming years. In this research, sentiment analysis was carried out on Twitter data, using tweets about Kickstarter projects. Kickstarter is one of the best platforms for showcasing a new product or a new initiative for society. Using tweets, we predicted the sentiments for various projects that were successful, and the predictions show the same for some of the categories across Kickstarter projects. We can conclude that Music and Food & Craft are the two most popular categories on Kickstarter.

Sentiment analysis is widely used to form an opinion about a product or an initiative, here through the funding of a project and the response it receives. This research consisted of three stages:
Data Gathering from Twitter: The data was collected through the Twitter API using each Kickstarter project’s name, and these were restricted to a minimum number of tweets.

Data Pre-Processing and Visualization: Tweets were preprocessed and cleaned using different techniques, and then used to visualize the data and gain insights from the different features gathered from Twitter.

Sentiment Analysis: We used these tweets to classify the sentiment into three categories: Positive, Negative, and Neutral. We also used different lexicons for prediction.

In this research, the machine learning approach gave better classification than the lexicon approach. The most accurate model was the SVM on the feature set, with 88.18% accuracy. The SVD technique plays an important role in classifying the sentiment in this research. Among the lexicons, BING is the more accurate, with 80.69% accuracy.

Table 13 shows the accuracy for the different classifiers and lexicons:
Table 13: Accuracy for various classifiers and lexicons
Classifier/Lexicon | Accuracy
Decision Tree | 65.44%
KNN | 78.13%
Naïve Bayes | 49.85%
Random Forest | 83.51%
Support Vector Machine | 88.18%
AFINN Lexicon | 72.1%
BING Lexicon | 80.69%
Future Scope
Future work could predict the sentiment for Kickstarter projects and determine whether a project launched on the Kickstarter platform will be a success. The following can be done in future research:
Kickstarter project data could help predict the real outcome if more of it were available, such as the number of backers, the number of users actually supporting at the moment, data about the backers, the funds raised to date, any other projects initiated by the same author or creator, and the time trend of the project by category.

The tweets were restricted to 15, and the default version of the Twitter Developer account only allows 100 tweets per request or query; a larger number of tweets might give more insight into the Kickstarter projects.

More analysis can be done using Natural Language Processing techniques to obtain the sentiment and predict the outcome for the product or project.

In our research, we used only unigrams to classify the sentiments. Different n-gram techniques could be used in future to improve the overall accuracy and the precision of the prediction.

The comments or reviews available on the Kickstarter website for each project may be helpful if the data can be obtained directly from Kickstarter. Web scraping was beyond the scope of this research because of the terms and conditions of the Kickstarter platform.

The Favorite and Retweet counts could be considered for predicting the success of a project, as there were a large number of Retweets for various projects.
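As a small illustration of the n-gram idea mentioned in the list above, the following base-R sketch builds unigrams and bigrams from a vector of stemmed tokens. The helper name make.ngrams is hypothetical and is not part of the code used in this research.

```r
# Build word n-grams from a token vector (base R only).
# `make.ngrams` is a hypothetical helper written for this illustration.
make.ngrams <- function(tokens, n) {
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1)
  # Join each window of n consecutive tokens with an underscore.
  vapply(starts,
         function(i) paste(tokens[i:(i + n - 1)], collapse = "_"),
         character(1))
}

tokens <- c("cool", "pen", "mechan", "idea")
print(make.ngrams(tokens, 1))  # unigrams, as used in this research
print(make.ngrams(tokens, 2))  # bigrams
```

Bigrams such as "cool_pen" keep some word order that a unigram bag of words discards, at the cost of a much larger feature space.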

1 K. Ahmed, N. E. Tazi and A. H. Hossny, “Sentiment Analysis Over Social Networks: An Overview,” IEEE, 2015.
2 Kickstarter, “Kickstarter, PBC © 2018,” 28 April 2009. Online. Available:

3 Nazan Öztürk, Serkan Ayvaz, “Sentiment analysis on Twitter: A text mining approach to the Syrian refugee crisis,” Telematics and Informatics, Elsevier, Volume 35, pp. 136-147, April 2018.
4 Wikipedia- Twitter, “Twitter Wikipedia,” 2018. Online. Available:

5 E. Haddi, X. Liu and Y. Shi, “The Role of Text Pre-processing in Sentiment Analysis,” Procedia Computer Science 17, p. 26 – 32, 2013.
6 Kickstarter, “Kickstarter Statistics,” 05 June 2018. Online. Available:

7 M. Speriosu, N. Sudan, S. Upadhyay and J. Baldridge, “Twitter Polarity Classification with Label Propagation over Lexical Links and the Follower Graph,” Proceedings of EMNLP 2011, Conference on Empirical Methods in Natural Language Processing, pp. 53-63, 2011.
8 Twitter, “Twitter,” 2018. Online. Available:

9 Wikipedia, “Wikipedia,” 2018. Online. Available:

10 Social Media, “The Structure of a Perfect Tweet,” 24 Feb 2011. Online. Available:

11 “Sentiment Analysis,” 07 06 2018. Online. Available:

12 “Google Trends,” 2018. Online. Available:

13 B. Pang and L. Lee, “Opinion mining and sentiment analysis,” Foundations and Trends in Information Retrieval, pp. 1–135, 2008.
14 Techopedia, “Techopedia,” Online. Available:

15 D. F. Wei, Sentiment Analysis and Opinion Mining, Natural Language Computing Group, Microsoft Research Asia.
16 N. Kishor, “5 Things You Need to Know about Sentiment Analysis and Classification,” 27 March 2018. Online. Available:

17 X. Zhu, S. Kiritchenko and S. M. Mohammad, “Recent Improvements in the Sentiment Analysis of Tweets,” NRC-Canada-2014, 2014.
18 MatsWichmann, “The Python Wiki,” 2017. Online. Available:

19 TIOBE, “TIOBE,” 6 June 2018. Online. Available:

20 R Foundation, “What is R?,” 2018. Online. Available:

21 A. Giachanou and F. Crestani, “Like It or Not: A survey of Twitter Sentiment,” ACM Computing Surveys vol. 49, no. 2, pp. 1-41, 2016.
22 B. Liu, “Sentiment analysis: Mining opinions, sentiments, and emotions,” Cambridge University Press, 2015.
23 H. Yu and V. Hatzivassiloglou, “Towards answering opinion questions: Separating facts from opinions and identifying the polarity of opinion sentences.,” In Proc. of EMNLP, 2003.
24 P. Chikersal, S. Poria and E. Cambria, “SeNTU: Sentiment Analysis of Tweets by Combining a Rule-based Classifier with Supervised Learning,” Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015), p. 647–651, 2015.
25 A. Agarwal, B. Xie, I. Vovsha, O. Rambow and R. Passonneau, “Sentiment Analysis of Twitter Data,” Proceedings of the Workshop on Language in Social Media, LSM , pp. 30-38, 2011.
26 H. Bagheri and M. J. Islam, “Sentiment analysis of twitter data,” Cornell University Library, p. 5, 2017.
27 S. Alhojely, “Different Applications and Techniques for Sentiment Analysis,” International Journal of Computer Applications, Volume 154 – No.5, 2016.
28 W. Kaur and V. Balakrishnan, “Sentiment Analysis Technique: A Look into Support Vector Machine and Naive Bayes,” Proceedings of 2016 International Conference on IT, Mechanical ; Communication Engineering (ICIMCE), 2016.
29 F. Alberto Pozzi, D. Maccagnola, E. Fersini and E. Messina, “Enhance User-Level Sentiment Analysis on Microblogs with Approval Relations,” Springer International Publishing Switzerland, AI*IA, pp. 133-144, 2013.
30 A. Pak and P. Paroubek, “Twitter as a Corpus for Sentiment Analysis and Opinion Mining,” Proceedings of the International Conference on Language Resources and Evaluation, LREC, 2010.
31 F. Årup Nielsen, “A new ANEW: Evaluation of a word list for sentiment analysis in microblogs,” Proceedings of the ESWC2011 Workshop on ‘Making Sense of Microposts’: Big things come in small packages, pp. 93-98, 2011.
32 G. Siegle, “San Diego State University,” Online. Available:

33 Urban Dictionary, “Urban Dictionary,” Online. Available:

34 Steven J. DeRose, “The Compass DeRose Guide to Emotion Words,” 2005. Online. Available:

35 M. Bradley and P. Lang, “Instruction manual and affective ratings,” The Center for Research in Psychophysiology, University of Florida, Florida, 1999.

36 A. Go, R. Bhayani and L. Huang, “Twitter Sentiment Classification using Distant Supervision,” Processing, 150, 2009.
37 B. Yu, “An evaluation of text classification methods for literary study,” Literary and Linguistic Computing 23, 2008.
38 P. Sánchez-Mirabal, Y. Torres, S. Alvarado, Y. Gutiérrez, A. Montoyo and R. Muñoz, “Sentiment Analysis in Twitter using Polirity Lexicons and Tweet Similarity,” UMCC_DLSI, 2014.
39 Twitter-Developer, “Twitter- Dev,” 2018. Online. Available:

40 Tweepy, “Tweepy,” 2009. Online. Available:

41 R Documentation, “CRAN- R Documentation,” 2018. Online. Available:

42 Tableau Software-Wiki, “Wikipedia,” 2017. Online. Available:

43 S. M. Mohammad and P. D. Turney, Computational Intelligence- Crowdsourcing a Word-Emotion Association Lexicon, 2013.
44 S. Mohammad, “NRC Word-Emotion Association Lexicon,” National Research Council Canada (NRC), 2011.
45 TF-IDF, “Tf-idf,” Online. Available:

46 J. Silge and D. Robinson, Text Mining with R a Tidy Approach, Sebastopol : O’Reilly, 1 edition, 2017.
47 R. Singh, “Singular Value Decomposition. Elucidated.,” 18 May 2017. Online. Available:

48 P. Treebank, “Alphabetical list of part-of-speech tags used in the Penn Treebank Project,” Online. Available:

49 H. P. Luhn, “A Statistical Approach to Mechanized Encoding and Searching of Literary Information,” IBM Journal of research and development. IBM, 1957.
50 K. Spärck Jones, “A Statistical Interpretation of Term Specificity and Its Application in Retrieval,” Journal of Documentation, 1972.
51 N. Öztürk and S. Ayvaz, “Sentiment analysis on Twitter: A text mining approach to the Syrian refugee crisis,” Elsevier BV, Telematics and Informatics, pp. 136-147, April 2018.
52 F. Å. Nielsen, “AFINN,” Informatics and Mathematical Modelling, Technical University of Denmark, 2011.
