One of the major objectives stated in Chapter One of this work was to conduct extensive research on how to combat spam (unsolicited or unwanted advertisements, messages, or telemarketer calls) through user behaviour analysis.
A further objective was to review the various types of spam, how they function, and how they affect the user environment.
Accordingly, this chapter presents an overview of the different types of spam and how they can be identified through content management, HTML meta-tags and URLs, together with user behaviour analysis.
This chapter also explains why spammers create spam, what their motives are, and why they focus on the user community: whether they do so for profit or for other business purposes.
Is spam harmful? How can it be prevented from causing embarrassment in the users’ environment? Why are spammers so interested in using users’ email addresses for unwanted advertisements?
Spam is the range of unwanted pop-ups, links, data and emails that we receive in our daily interactions while using the Internet (web).
The word spam has its roots in a product of the American food company Hormel Foods: Spam luncheon meat, first produced in 1937, which during World War II was often unwanted but ever present.
The use of the word spam to describe unsolicited messages started in April 1993, in reference not to email but to unwanted postings on the USENET newsgroup network made by a mercenary programmer.
He accidentally posted 200 messages on USENET, which was considered the world’s largest online conferencing system. After that incident one person called these messages spam, which coined the name that came to be used worldwide. Later, on January 18, 1994, the first known large-scale spam occurred on the same USENET network, and it was followed by numerous others.
Spam is not only unwanted; it is also harmful, misleading and problematic for websites in many ways that website owners do not intend.
There are many types of spam, but one of the best known is email spam. It can take the form of newsletters or emails that appear to come from friends or relatives, often promising a huge inheritance or carrying advertisements. Email spam has become increasingly popular since January 18, 1994.
Spam is written in all languages and has no language barrier. English is the most common, but spam also appears in Chinese, Korean and other Asian languages.
In recent years Internet users have become used to spam as part of daily life: anyone who surfs the web and receives email will encounter spam, whether using Gmail, Outlook, Yahoo or another system. Online systems have been developed to automatically filter and remove unwanted emails from our inboxes.
In the past, spam favoured email because it was the primary communication tool. Email addresses were relatively easy to harvest from chat rooms, websites, customer lists and compromised user address books. Eventually, email filters became more sophisticated and more effective at keeping spam from clogging the inbox.
Apart from email spam, another form of spam grew into pop-up ads embedded in desktop browsers. This has been one of the best-known forms of spam and was still causing problems as of 2016. However, with the help of various techniques, many of these invasive ads are now denied access to computers by antivirus programs and ad blockers. Like email spam filters, these applications work in the background without user input.
As technology advanced, search engines began to take over the market, with Google in the lead. Spammers realized that they could benefit from search engine results pages (SERPs), and this led to an outbreak of keyword- and link-related strategies to manipulate search results.
These spam strategies became known as black hat SEO. In time, search engines took notice and began to look for solutions.
Bing, Google, Yahoo and others quickly took aim at the tactics used to manipulate their results, strengthening their security and improving their systems by updating their algorithms.
Google named its updates Panda and Penguin (among others). These updates helped push black hat SEO strategies out by reducing the ranking of the sites that used them.
While search engine providers have become smarter about the way spam behaves, so have spammers. In this battle, spammers have advanced to traffic spam and other kinds of web spam, and search engines have responded with further updates to their algorithms.
This contest between spammers and search engines could continue indefinitely, but digital marketers and SEO practitioners have since weighed in: in trying to understand how search engines work, they have learned how to block, avoid and eliminate spam.
2.2.3 Types of spam
Spammers produce many different types of spam on a daily basis, and there are many different ways to combat them.
Although there are many different types of spam, for the purpose of this research we focus on the most common ones encountered on a daily basis, and on what can be done to minimize their negative impact on the user environment.
These are some of the most common types of spam and their categories:
Search spam
Referral spam and link spam
Traffic spam
Social media spam
Clickbaiting and likejacking
The main aim of search spam is to raise a website’s ranking in search results, usually on the assumption that minimal input will yield higher rewards. Typically it is used to raise the ranking of the spammer’s own website above those of competitors. One of the most common search spam tactics is known as keyword stuffing.
Search spam relies on the heavy use of keywords in the places where they would be expected, such as the page title, the page description of a business, and the content. Spammers may also hide keywords where users will not see them but search engines will.
Site content is one of the factors that help a site rank in search. When content is spammed, keyword cloaking and keyword stuffing play a key part in turning that content into search spam.
The information a website carries matters to search engines as much as to users. Some website owners have attempted to increase their keyword density by adding machine-generated content or by scraping content from other sites to bolster their own.
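Keyword density, as mentioned above, can be illustrated with a minimal sketch. The function name `keyword_density`, the simple word tokenization and the restriction to single-word keywords are assumptions made for illustration; search engines do not publish how they measure keyword stuffing.

```python
import re


def keyword_density(text, keyword):
    """Return the fraction of words in `text` that equal `keyword`.

    Illustrative only: handles single-word keywords and a naive
    tokenization into lowercase letters, digits and apostrophes.
    """
    words = re.findall(r"[a-z0-9']+", text.lower())
    if not words:
        return 0.0
    return words.count(keyword.lower()) / len(words)


# A stuffed snippet repeats the keyword far more than natural prose would.
stuffed = "cheap watches cheap deals"
print(keyword_density(stuffed, "cheap"))
```

An unusually high density for one term is only a hint of keyword stuffing; any threshold used to flag a page would have to be chosen empirically.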
Referral spam and link spam
Referral or link spam is one of the most dangerous forms of spam for search engines. Every site has links, and links from one site to another create connections that spammers exploit to boost their own sites through other sites. Bad links can cause major damage to a site.
In recent years Google, the leading search engine, has developed its algorithm to better identify genuine sites and links in terms of quality, although it is still not perfect.
While the search engine was working on the algorithm, rules governing sites were already in force: if a site was found to be spam or to have spam attached to it, it was penalized.
In 2016 Google improved the system by introducing an update to Penguin. To avoid penalization, a site owner has to either remove the bad links or improve the existing ones.
The reason many sites have to deal with link spam is that link building has often been carried out deceptively. Black hat SEOs and those looking for quick results exploited the basic nature of early search engine algorithms to attract what would become known as unnatural links.
Traffic spam is the third major form of spam you may come across. It can take one of two forms, which are not necessarily exclusive of each other. It is related to link spam: a site receives a link from a bad site and then receives traffic from that same site, which produces high bounce rates. Whatever strategy the spammer uses, such connections are considered traffic spam because of their unwanted, negative impact on the site.
Fixing social media spam
Spam is very common on social media because creating a fake account in a social application is very easy. Each account created may have to go through an identity verification process, but this process is often easy for spammers to bypass.
These methods of identity verification include email verification only, password only and phone number only. Email verification alone is problematic because a single user can create many email accounts in a short time, which can then be used to create fake accounts in the social application.
Password-only verification is also problematic because spammers may use an automated tool called an “account checker” to test different username and password combinations on the social application in the hope that a few will work and give them access to these accounts.
Captchas are supposed to deter the automated creation of accounts, but they can be quickly and inexpensively outsourced to people who solve them.
The use of phone verification upon creation of accounts can prevent spammers from creating fake accounts. This involves sending a one-time password (OTP) to a user over a communication channel (SMS or voice) that is separate from the IP channel (Internet) used by the social application.
If an account can only be created after the user has correctly entered the OTP in the social application, this will make creation of fake accounts a more tedious process.
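The OTP check described above can be sketched as follows. This is a minimal illustration, not the method of any particular social application: the class name `OtpChallenge`, the six-digit code length and the five-minute validity period are all assumptions, and delivery of the code over SMS or voice is left out.

```python
import hmac
import secrets
import time


def generate_otp(digits=6):
    """Generate a random numeric one-time password (length is an assumption)."""
    return "".join(secrets.choice("0123456789") for _ in range(digits))


class OtpChallenge:
    """One pending verification; the code would be sent over SMS or voice."""

    def __init__(self, ttl_seconds=300, now=None):
        start = time.time() if now is None else now
        self.code = generate_otp()
        self.expires_at = start + ttl_seconds  # hypothetical validity period
        self.used = False

    def verify(self, submitted, now=None):
        """Accept the code only once and only before it expires."""
        current = time.time() if now is None else now
        if self.used or current > self.expires_at:
            return False
        # Constant-time comparison avoids leaking how many digits matched.
        if hmac.compare_digest(self.code.encode(), submitted.encode()):
            self.used = True
            return True
        return False
```

The account would be created only when `verify` returns `True`, so each fake account costs the spammer access to a distinct phone number.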
Clickbaiting and likejacking
Clickbaiting is practised by spammers who create sensational posts that encourage users to click through to their site for the sole purpose of generating online advertising revenue. When the user clicks through to the page, the content usually doesn’t exist or is radically different from what the headline made it out to be.
Likejacking is the act of tricking users into posting a Facebook status update for a certain site without the user’s prior knowledge or intent.
The user may be thinking that they are just visiting a page but the click can trigger a script in the background to share the link on Facebook.
This will then create a vicious cycle as other friends of the genuine user will click on the link and share it to more people on their network.
These activities negatively impact the user experience, as they can waste the user’s time and attention and potentially compromise the user’s security or steal their data.
The presence of fake accounts and spam is therefore a problem for social networks, over-the-top (OTT) messaging and other mobile or gaming applications, because a negative user experience can lead to user attrition and affect the monetization potential and valuation of the service.
Although social networks are starting to clamp down on fake accounts and spam, spammers can easily create new fake accounts to continue their activities.
Bulk messaging is a form of broadcast messaging that spammers use to send a large quantity of messages to a group within a short period of time. Several spam accounts can also post duplicate messages simultaneously.
Bulk messaging can artificially cause a certain topic to trend if enough people view that topic, which in turn gives the spammers an opportunity to broadcast more.
In 2009, a spam website offering a job with Google tricked users into believing that the site was genuine, so more people visited it. Many such messages can also spread malware or direct a user to a specific site.
2.2.4 Anti-spam Filtering Techniques
Spam-reduction techniques have developed rapidly over the last few years, as spam volumes have increased. We believe that the spam problem requires a multi-faceted solution that combines a broad array of filtering techniques with various infrastructural changes, changes in financial incentives for spammers, legal approaches, and more.
Spam Guru addresses the part of this multi-faceted approach that can be handled by technology on the recipient’s side, using plug-in tokenizers and parsers, plug-in classification modules, and machine-learning techniques to achieve high hit rates and low false-positive rates. Effective controls need to be deployed to counter the ever-growing spam problem.
Machine learning provides better protective mechanisms that are able to control spam. This paper summarizes the most common techniques used for anti-spam filtering by analyzing e-mail content, and also looks into machine learning algorithms such as Naïve Bayes, support vector machines and neural networks that have been adopted to detect and control spam. Each machine learning algorithm has its own strengths and limitations; as such, appropriate preprocessing needs to be carefully considered to increase the effectiveness of any given algorithm.
According to Ferris (2009), spam can be categorized as follows:
1. Health, such as fake pharmaceuticals;
2. Promotional products, such as fake fashion items (for example, watches);
3. Adult content, such as pornography and prostitution;
4. Financial and refinancing, such as stock kiting, tax solutions, loan packages;
5. Phishing and other fraud, such as “Nigerian 419” and “Spanish Prisoner”;
6. Malware and viruses: Trojan horses attempting to infect your PC with malware;
7. Education, such as online diplomas;
8. Marketing, such as direct marketing material, sexual enhancement products;
9. Political, such as US presidential votes.
SPAMMER TRICKS
In order to send spam, spammers first obtain e-mail addresses by harvesting them from the Internet using specialized software.
This software systematically gathers e-mail addresses from discussion groups or websites (Schaub, 2002). Spammers are also able to purchase or rent collections of e-mail addresses from other spammers or service providers. Schaub also indicated that spammers use many tricks to avoid detection by spam filters. Much work has gone into finding solutions to the spam problem from different dimensions and directions. Various anti-spam solutions are available and have been surveyed by many authors (Blanzieri and Bryl, 2008; Caruana and Li, 2012; Guzella and Caminhas, 2009; Yu and Xu, 2008; Paswan et al., 2012).
These include blacklists, whitelists, greylists, content-based filtering, feature selection methods, bag-of-words, machine learning techniques (such as Naïve Bayes, support vector machines, artificial neural networks and lazy learning), reputation-based techniques, artificial immune systems, protocol-based procedures, and so on.
Caruana and Li (2012) also list some emerging approaches such as peer-to-peer computing, grid computing, social networks and ontology-based semantics, along with a few other approaches.
These solutions can be grouped into various categories such as list-based techniques and filtering techniques; another categorisation is prevention, detection and reaction techniques (Nakulas et al., 2009). Paswan et al. (2012) categorise email spam filtering techniques as origin-based spam filtering, content-based filtering, feature selection methods, feature extraction methods, and traffic-based filtering.
The scope of this paper is content-based filtering, specifically learning-based filters. Hence, we do not go into the detail of each of these solutions but limit ourselves to the Bayes algorithm.
Spammers are insensitive to the consequences of their activities and need to be dissuaded by being made to pay internet service providers for the bandwidth wasted by unwanted spam that the servers block. This would be a feasible deterrent to reduce spam. To execute this, all service providers must act in unison and agree to make spammers pay for the inconvenience of spam and for server clean-up.
Users in various domains have different preferences regarding the emails they would classify as spam or ham. An email from a bookseller trying to sell books on computer science would be spam for a pharmacist.
The most difficult question is how these filters can identify which emails are spam and which are harmless to the person receiving them.
Of course, in some mail clients such as Gmail there is a user preference setting where the user can choose topic areas of interest, and the filter can then use that information to classify incoming emails accordingly. Very little work has gone into considering specific user preferences when designing anti-spam filters.
Kim et al. (2007) construct a user preference ontology on the basis of user profile and user actions and train the filter on the basis of that ontology.
Kim et al. (2006) suggest user-action-based adaptive learning in which they attach weights to Bayesian classification on the basis of user actions. However, none of this work addresses users belonging to a particular domain and their corresponding preferences.
The SpamBayes filter classifies emails using a Bayesian classifier. It extracts the email content and uses the presence of each word (token) to determine the class of the email: spam, ham (legitimate), or unsure.
Each word (token) has a spam score which is used to calculate the overall message score. An email with a high message score will be classified as spam, i.e. when an email contains many words that have a high spam score, the email is likely to be classified as spam. Nelson et al. and Tsai explain the SpamBayes classification mechanism in more detail.
In the training phase, when SpamBayes sees an email of the spam class, it increases the spam score for all of the tokens present in the email. Similarly, when the system sees a ham (legitimate) email, all of the tokens present have their scores reduced.
SpamBayes retrains itself periodically with the new data that it has collected. This creates a way for an attacker to place malicious data (email) into the training data set of the SpamBayes filter. Simply sending the malicious email to the target user’s inbox causes the user to label it as spam and, later, when the filter retrains itself, the malicious email becomes part of the training dataset.
Knowing the classification mechanism and the training algorithm of SpamBayes, attackers can craft malicious emails to attack the classifier, either making the overall classification performance so poor that the whole classification system becomes useless (the dictionary attack), or making it unable to classify a certain email properly, i.e. causing a wrong classification of a particular email (the targeted attack).
2.2.5 TYPES OF SPAM-FILTERING METHODS
Blacklist
This popular spam-filtering method attempts to stop unwanted email by blocking messages from a preset list of senders that you or your organization’s system administrator create. Blacklists are records of email addresses or Internet Protocol (IP) addresses that have been previously used to send spam. When an incoming message arrives, the spam filter checks to see if its IP or email address is on the blacklist; if so, the message is considered spam and rejected.
Though blacklists ensure that known spammers cannot reach users’ inboxes, they can also misidentify legitimate senders as spammers. These so-called false positives can result if a spammer happens to be sending junk mail from an IP address that is also used by legitimate email users. Also, since many clever spammers routinely switch IP addresses and email addresses to cover their tracks, a blacklist may not immediately catch the newest outbreaks.
Real-Time Blackhole List
This spam-filtering method works almost identically to a traditional blacklist but requires less hands-on maintenance. That’s because most real-time blackhole lists are maintained by third parties, who take the time to build comprehensive blacklists on the behalf of their subscribers. Your filter simply has to connect to the third-party system each time an email comes in, to compare the sender’s IP address against the list. Since blackhole lists are large and frequently maintained, your organization’s IT staff won’t have to spend time manually adding new IP addresses to the list, increasing the chances that the filter will catch the newest junk-mail outbreaks. But like blacklists, real-time blackhole lists can also generate false positives if spammers happen to use a legitimate IP address as a conduit for junk mail. Also, since the list is likely to be maintained by a third party, you have less control over what addresses are on — or not on — the list.
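The lookup against a third-party list described above is commonly done over DNS: the octets of the sender’s IPv4 address are reversed and prepended to the list’s zone name, and the sender is treated as listed if that name resolves. A minimal sketch follows; the zone `dnsbl.example.org` is a placeholder assumption, not a real blackhole list, and the function names are hypothetical.

```python
import socket


def dnsbl_query_name(ip_address, zone):
    """Build the DNS name to query: reversed IPv4 octets followed by the zone."""
    octets = ip_address.split(".")
    if len(octets) != 4:
        raise ValueError("expected a dotted IPv4 address")
    return ".".join(reversed(octets)) + "." + zone


def is_listed(ip_address, zone="dnsbl.example.org"):
    """Return True if the list's zone has a record for this sender IP.

    Any resolution failure (including network problems) is treated
    here as "not listed", which is a simplifying assumption.
    """
    try:
        socket.gethostbyname(dnsbl_query_name(ip_address, zone))
        return True
    except socket.gaierror:
        return False


print(dnsbl_query_name("192.0.2.10", "dnsbl.example.org"))
```

Because the list is hosted by the third party, the filter needs no local copy; the trade-off, as noted above, is less control over which addresses are listed.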
Whitelist
A whitelist blocks spam using a system almost exactly opposite to that of a blacklist. Rather than letting you specify which senders to block mail from, a whitelist lets you specify which senders to allow mail from; these addresses are placed on a trusted-users list. Most spam filters let you use a whitelist in addition to another spam-fighting feature as a way to cut down on the number of legitimate messages that accidentally get flagged as spam. However, using a very strict filter that only uses a whitelist would mean that anyone who was not approved would automatically be blocked. Some anti-spam applications use a variation of this system known as an automatic whitelist. In this system, an unknown sender’s email address is checked against a database; if they have no history of spamming, their message is sent to the recipient’s inbox and they are added to the whitelist.
Greylist
A relatively new spam-filtering technique, greylists take advantage of the fact that many spammers only attempt to send a batch of junk mail once. Under the greylist system, the receiving mail server initially rejects messages from unknown users and sends a failure message to the originating server. If the originating server attempts to send the message a second time, a step most legitimate servers will take, the greylist assumes the message is not spam and lets it proceed to the recipient’s inbox. At this point, the greylist filter will add the sender’s email or IP address to a list of allowed senders. Though greylist filters require fewer system resources than some other types of spam filters, they also may delay mail delivery, which could be inconvenient when you are expecting time-sensitive messages.
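The greylisting behaviour described above can be sketched as a small state machine. This is an illustration under stated assumptions: senders are keyed by IP address only, and the minimum delay before a retry is accepted (`min_retry_delay`) is a hypothetical parameter that the text above does not specify.

```python
class Greylist:
    """Temporarily reject unknown senders; accept them when they retry."""

    def __init__(self, min_retry_delay=300.0):
        self.min_retry_delay = min_retry_delay  # seconds, assumed value
        self.first_seen = {}   # sender IP -> time of first attempt
        self.allowed = set()   # senders that have retried successfully

    def check(self, sender_ip, now):
        """Return "accept" or "tempfail" for a delivery attempt at time `now`."""
        if sender_ip in self.allowed:
            return "accept"
        first = self.first_seen.get(sender_ip)
        if first is None:
            # Unknown sender: remember it and ask the server to try again.
            self.first_seen[sender_ip] = now
            return "tempfail"
        if now - first >= self.min_retry_delay:
            # A retry after the delay suggests a legitimate mail server.
            self.allowed.add(sender_ip)
            return "accept"
        return "tempfail"


greylist = Greylist(min_retry_delay=60.0)
print(greylist.check("192.0.2.1", now=0.0))
print(greylist.check("192.0.2.1", now=120.0))
```

The delay between the first attempt and the accepted retry is the source of the delivery lag mentioned above.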
Content-Based Filters
Rather than enforcing across-the-board policies for all messages from a particular email or IP address, content-based filters evaluate words or phrases found in each individual message to determine whether an email is spam or legitimate.
Word-Based Filters
A word-based spam filter is the simplest type of content-based filter. Generally speaking, word-based filters simply block any email that contains certain terms. Since many spam messages contain terms not often found in personal or business communications, word filters can be a simple yet capable technique for fighting junk email. However, if configured to block messages containing more common words, these types of filters may generate false positives. For instance, if the filter has been set to stop all messages containing the word “discount,” emails from legitimate senders offering your nonprofit hardware or software at a reduced price may not reach their destination. Also note that since spammers often purposefully misspell keywords in order to evade word-based filters, your IT staff will need to make time to routinely update the filter’s list of blocked words.
Heuristic Filters
Heuristic (or rule-based) filters take things a step beyond simple word-based filters. Rather than blocking messages that contain a suspicious word, heuristic filters take multiple terms found in an email into consideration. Heuristic filters scan the contents of incoming emails and assign points to words or phrases. Suspicious words that are commonly found in spam messages, such as “Rolex” or “Viagra,” receive higher points, while terms frequently found in normal emails receive lower scores. The filter then adds up all the points and calculates a total score. If the message receives a certain score or higher (determined by the anti-spam application’s administrator), the filter identifies it as spam and blocks it. Messages that score lower than the target number are delivered to the user. Heuristic filters work fast, minimizing email delay, and are quite effective as soon as they have been installed and configured. However, heuristic filters configured to be aggressive may generate false positives if a legitimate contact happens to send an email containing a certain combination of words. Similarly, some savvy spammers might learn which words to avoid including, thereby fooling the heuristic filter into believing they are benign senders.
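The point-scoring scheme described above can be sketched in a few lines. The word list, the point values and the threshold below are invented for illustration; a real filter’s rules are set by its administrator.

```python
import re

# Hypothetical point table: suspicious terms score high, everyday terms low.
SCORES = {"rolex": 3.0, "viagra": 3.0, "discount": 1.0, "meeting": -1.0}


def heuristic_score(text, scores=SCORES):
    """Add up the points of every known word found in the message."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum(scores.get(word, 0.0) for word in words)


def is_spam(text, threshold=4.0):
    """Flag the message when its total reaches the administrator's threshold."""
    return heuristic_score(text) >= threshold


print(is_spam("Cheap Rolex and Viagra discount"))
print(is_spam("Meeting about the discount"))
```

A single word such as “discount” does not reach the threshold on its own here, which is how a heuristic filter avoids some of the false positives of a plain word-based filter.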
Bayesian Filters
Bayesian filters, considered the most advanced form of content-based filtering, employ the laws of mathematical probability to determine which messages are legitimate and which are spam. In order for a Bayesian filter to effectively block spam, the end user must initially “train” it by manually flagging each message as either junk or legitimate. Over time, the filter takes words and phrases found in legitimate emails and adds them to a list; it does the same with terms found in spam. To determine which incoming messages are classified as spam, the Bayesian filter scans the contents of the email and then compares the text against its two word lists to calculate the probability that the message is spam. For instance, if the word “valium” has appeared 62 times in the spam word list but only three times in legitimate emails, there is roughly a 95 percent chance (62 out of 65 occurrences) that an incoming email containing the word “valium” is junk. Because a Bayesian filter is constantly building its word list based on the messages that an individual user receives, it theoretically becomes more effective the longer it’s used. However, since this method does require a training period before it starts working well, you will need to exercise patience and will probably have to manually delete a few junk messages, at least at first.
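The “valium” figure above can be reproduced with a short sketch. It assumes that the per-word probability is simply the share of the word’s occurrences that were in spam (which implicitly treats the spam and legitimate training sets as equally large) and that words are combined with the naive independence assumption. Real filters such as SpamBayes use more elaborate combining rules, so this is an illustration rather than their algorithm; the function names are hypothetical.

```python
def token_spam_probability(spam_count, ham_count):
    """Share of a word's occurrences that were seen in spam messages."""
    total = spam_count + ham_count
    if total == 0:
        return 0.5  # unseen word: no evidence either way
    return spam_count / total


def combine(probabilities):
    """Combine per-word probabilities, assuming words are independent."""
    spam_product = 1.0
    ham_product = 1.0
    for p in probabilities:
        p = min(max(p, 0.01), 0.99)  # clamp so no single word is decisive
        spam_product *= p
        ham_product *= 1.0 - p
    return spam_product / (spam_product + ham_product)


# "valium": 62 occurrences in spam, 3 in legitimate mail.
print(round(token_spam_probability(62, 3), 2))
```

With several suspicious words in one message, `combine` pushes the overall probability closer to 1 than any single word does, which is why the filter improves as its word lists grow.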
2.2.6 World Wide Web (WWW)
The World Wide Web (also known as “WWW”, “Web”, or “W3”) is the universe of network-accessible information, the embodiment of human knowledge. It allows people to share information globally.
That means it allows anyone to read and publish documents freely. The World Wide Web hides all the detail of communication protocols, machine locations, and operating systems from the user. It allows users to point to any other Web pages without any restrictions.
2.2.7 Web Spam
Web spam has a negative impact on the search quality and users’ satisfaction and forces search engines to waste resources to crawl, index, and rank it. Thus search engines are compelled to make significant efforts in order to fight web spam.
Traffic from search engines plays a great role in online economics. It causes a tough competition for high positions in search results and increases the motivation of spammers to invent new spam techniques.
At the same time, ranking algorithms become more complicated, as do web spam detection methods. Web spam therefore constantly evolves, which makes the problem of web spam detection always relevant and challenging. As the most popular search engine, Google faces the problem of web spam and has some expertise in this matter.
This research explains our experience in detecting different types of web spam based on content, links, clicks, and user behaviour. We also review aggressive advertising and fraud because they affect the user experience. In addition, we demonstrate the connection between classic web spam and modern social engineering approaches in fraud.
Web spam is annoying to search engine users and disruptive to search engines; therefore, most commercial search engines try to combat web spam. Combating web spam consists of identifying spam content with high probability and, depending on policy, downgrading it during ranking, eliminating it from the index, no longer crawling it, and tainting affiliated content.
The first step, identifying likely spam pages, is a classification problem amenable to machine learning techniques. Spam classifiers take a large set of diverse features as input, including content-based features, link-based features, DNS and domain-registration features, and implicit user feedback. Commercial search engines treat their precise set of spam-prediction features as extremely proprietary, and features (as well as spamming techniques) evolve continuously as search engines and web spammers are engaged in a continuing “arms race.”
Web spam filtering is the area of devising methods to identify useless Web content whose sole purpose is to manipulate search engine results. While Web spam is targeted at the high commercial value of top-ranked search-engine results, Web archives observe quality deterioration and resource waste as a side effect. So far, Web spam filtering technologies are rarely used by Web archivists, but their use is planned for the future, as indicated in a survey with responses from more than 20 institutions worldwide.
These archives typically operate on a modest level of budget that prohibits the operation of standalone Web spam filtering but collaborative efforts could lead to a high quality solution for the
2.2.8 Web Logs
A Web log file records activity information when a Web user submits a request to a Web server. The main source of raw data is the web access log, which we shall refer to as the log file. Log files were originally meant for debugging purposes. The log file can be located in three different places: 1. Web servers, 2. Web proxy servers, and 3. client browsers. Each suffers from two major drawbacks.
Server-side logs: These logs generally supply the most complete and accurate usage data, but they have two major drawbacks:
These logs contain sensitive, personal information; therefore, server owners usually keep them closed.
The logs do not record cached pages visited. The cached pages are summoned from local storage of browsers or proxy servers, not from web servers.
Proxy-side logs: A proxy server takes the HTTP requests from users and passes them to a Web server, then returns to users the results passed to it by the Web server. The two disadvantages are:
Proxy-server construction is a difficult task. Advanced network programming, such as TCP/IP, is required for this construction.
The request interception is limited, rather than covering most requests.
With the proxy logger implementation in Web Quilt, a Web logging system, performance declines because each page request needs to be processed by the proxy simulator.
Client-side logs: Participants remotely test a Web site by downloading special software that records Web usage or by modifying the source code of an existing browser. HTTP cookies could also be used for this purpose. These are pieces of information generated by a web server and stored in the users’ computers, ready for future access. The drawbacks of this approach are:
The design team must deploy the special software and have the end-users install it.
The technique makes it hard to achieve compatibility with a range of operating systems and Web browsers.
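To make the notion of a server-side access log concrete, the following minimal sketch parses one log entry. It assumes the widely used Common Log Format; the text above does not say which format the logs use, and the sample line, field names and function name are invented for illustration.

```python
import re

# Assumed layout (Common Log Format):
# host ident user [timestamp] "method path protocol" status size
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+)'
)


def parse_log_line(line):
    """Return the fields of one access-log entry, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["status"] = int(record["status"])
    return record


sample = '192.0.2.7 - - [10/Oct/2023:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326'
print(parse_log_line(sample))
```

Each parsed record gives the requesting host, the time and the page requested, which is the raw material for the user behaviour analysis discussed in this chapter. Pages served from a browser or proxy cache never produce such a record, which is the second server-side drawback noted above.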
2.2.9 Identifying User Behavior by Analysis
The user environment has contributed immensely to the sharing of spam pages and spam sites, and to making them appear more relevant.
Many users who visit the web are not well informed about spam pages or sites; as a result, they tend to click on any page that presents an interesting topic.
Over the years, knowledge of Web users’ interests, navigational actions and preferences has gained importance because of the objectives of organizations, companies and online social networks.
Traditionally this field has been studied from the Web Mining perspective, particularly through the concept of Web Usage Mining (WUM), which consists of the application of machine learning techniques to data originating from the Web (Web data) for the automatic extraction of behavioral patterns of Web users.
WUM makes use of data sources that approximate users’ behavior, such as weblogs or clickstreams among others; however these sources imply a considerable degree of subjectivity to interpret.
For that reason, the application of biometric tools with the possibility of measuring actual responses to the stimuli presented via websites has become of interest in this field.
Instead of doing separate analyses, information fusion (IF) tries to improve results by developing efficient methods for transforming information from different sources into a single representation, which then could be used to guide biometric data fusion to complement the traditional WUM studies and obtain better results.