Proposed System


The proposed system collects data from the Twitter social networking site
and processes it using NLP techniques. We use two approaches: sentiment mining
and data mining. Sentiment mining is applied to unstructured, real-time data,
while data mining is applied to structured, historical data.




The system consists of the
following modules:

1) Data collection module

2) Sentiment mining

3) Data mining

4) Output classification



Data collection module

The tweets are fetched using the
Twitter API. The API provides a user-friendly programming interface through
which tweets are downloaded in tweet object format. This object format helps
in extracting specific tweet attributes such as
user name, location, time, and re-tweet count. Once the data is fetched, the
gathered data is pre-processed to extract features.
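The attribute extraction described above can be sketched as follows. This is a minimal illustration, not the system's actual code: it assumes the tweet object arrives as JSON with fields named as in the Twitter API v1.1 tweet object (`created_at`, `text`, `retweet_count`, `user.screen_name`, `user.location`). The sample payload and the function name `extract_attributes` are hypothetical.

```python
import json

# Hypothetical sample payload, shaped like a Twitter API v1.1 tweet object.
SAMPLE_TWEET_JSON = """
{
  "created_at": "Wed Oct 10 20:19:24 +0000 2018",
  "text": "Car stolen near the station last night",
  "retweet_count": 3,
  "user": {"screen_name": "example_user", "location": "Pune"}
}
"""


def extract_attributes(tweet):
    """Pull the attributes named in the text out of one tweet object (a dict)."""
    user = tweet.get("user", {})
    return {
        "user_name": user.get("screen_name"),
        "location": user.get("location"),
        "time": tweet.get("created_at"),
        "retweet_count": tweet.get("retweet_count", 0),
        "text": tweet.get("text", ""),
    }


attributes = extract_attributes(json.loads(SAMPLE_TWEET_JSON))
print(attributes)
```

Using `dict.get` with defaults keeps the sketch tolerant of tweets where a field such as the user's location is missing.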


Sentiment mining:
The sentiment mining system identifies the sentiment
of a tweet without knowing its previous
background. Before the algorithm is applied, the data must be pre-processed. The data
undergoes the following processes:

1) Stop word removal

2) Repeated letters removal

3) Noise data removal

4) Parsing and tokenization

Stop word removal: Stop
words are words that generally carry no useful information but are needed to make
a sentence grammatical. For example, prepositions like on, in, to, and above,
articles like a, the, and an, and question words like who, what, where, and how
generally do not add any information to the content, yet they occur
in large numbers in a sentence. These
words are therefore removed from a sentence before the algorithm is applied.
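As a rough sketch of this step, the snippet below filters a sentence against a small stop-word set. The set contains only the examples listed above and is an assumption for illustration; a real system would likely use a much larger list. The name `remove_stop_words` is hypothetical.

```python
# Illustrative stop-word list built only from the examples in the text.
STOP_WORDS = {
    "on", "in", "to", "above",        # prepositions
    "a", "the", "an",                 # articles
    "who", "what", "where", "how",    # question words
}


def remove_stop_words(sentence):
    """Return the sentence with stop words dropped (case-insensitive match)."""
    kept = [word for word in sentence.split() if word.lower() not in STOP_WORDS]
    return " ".join(kept)


print(remove_stop_words("The thief ran to a car parked in the street"))
# prints: thief ran car parked street
```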


Repeated letters removal: People tend to show
their emotional state by repeating the letters of words in tweets, as in
‘happpppyyyyy’. In English, a letter is repeated consecutively at most twice
within a word. If a letter is repeated more than twice consecutively, the number of
its occurrences is reduced to two. Thus ‘happpppy’ becomes ‘happy’.
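The rule can be expressed as a single regular-expression substitution. The sketch below is one possible way to do it, assuming only ASCII letters need handling; the function name `squeeze_repeats` is hypothetical.

```python
import re

# A letter followed by two or more copies of itself (three or more in total).
_REPEATED_LETTER = re.compile(r"([A-Za-z])\1{2,}")


def squeeze_repeats(text):
    """Reduce any run of three or more identical letters to exactly two."""
    return _REPEATED_LETTER.sub(r"\1\1", text)


print(squeeze_repeats("happpppy"))      # prints: happy
print(squeeze_repeats("happpppyyyyy"))  # prints: happyy
```

Note that the rule only caps runs at two letters, so ‘happpppyyyyy’ becomes ‘happyy’ rather than ‘happy’; mapping such forms back to dictionary words would need an extra step that the text does not describe.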


Noise data removal: By
noise data we mean the unwanted
data in tweets, such as URLs, hashed words, and names. The URLs
present in the tweets are removed.
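A minimal sketch of noise removal with regular expressions follows. The text only states that URLs are removed; treating "hashed words" as `#hashtags` and "names" as `@mentions`, and dropping them as well, is an assumption made here, so both are optional flags. The function name `remove_noise` is hypothetical.

```python
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")   # assumption: "names" means @user mentions
HASHTAG_RE = re.compile(r"#\w+")   # assumption: "hashed words" means #hashtags


def remove_noise(text, drop_mentions=True, drop_hashtags=True):
    """Strip URLs, and optionally mentions and hashtags, from a tweet."""
    text = URL_RE.sub(" ", text)
    if drop_mentions:
        text = MENTION_RE.sub(" ", text)
    if drop_hashtags:
        text = HASHTAG_RE.sub(" ", text)
    # Collapse the whitespace left behind by the removals.
    return " ".join(text.split())


print(remove_noise("Robbery reported @cityPolice http://t.co/abc #crime downtown"))
# prints: Robbery reported downtown
```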


Parsing and tokenization: Once the data is
cleansed, parsing and tokenization are done. Tokenization
breaks a stream of text into tokens, usually by looking for whitespace, which
helps in identifying the part each word plays in a sentence. A parser
then takes the stream of tokens as its input.
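Whitespace tokenization can be sketched in a few lines. Lower-casing the tokens and trimming punctuation from their edges are assumptions added for illustration, not steps stated in the text; the function name `tokenize` is hypothetical.

```python
import string


def tokenize(text):
    """Split text on whitespace, then trim punctuation from each token."""
    tokens = []
    for raw in text.lower().split():
        token = raw.strip(string.punctuation)
        if token:  # skip tokens that were punctuation only
            tokens.append(token)
    return tokens


print(tokenize("Police arrested two men, officials said."))
# prints: ['police', 'arrested', 'two', 'men', 'officials', 'said']
```

The resulting list is the token stream that a parser would consume in the next step.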


Data mining:

Data mining is used to find the
intensity of crime using the Naive Bayes algorithm. Before the algorithm is applied,
the data is pre-processed in the following steps:

1) Feature extraction

2) Normalization

3) Data training and train module



Feature extraction:
After the tweets are downloaded, specific tweet attributes
such as user name, location, time, and re-tweet
count are extracted. All the extracted features are stored in a database.


Normalization: The extracted
tweets are converted into a normal form for easy use and access. During normalization,
every attribute is given an index, and the attributes are subsequently used as indexes.



Data training and train module:

Algorithms learn from data. They
find relationships, make decisions, and evaluate their confidence from the
training data they are given. The better the training data is, the better
the model performs.

Training is applied to the data
set. New data is then input to the trained module, which predicts the crime intensity as its output.
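To make the train-then-predict flow concrete, here is a small multinomial Naive Bayes classifier with Laplace smoothing, written in plain Python. It is a sketch under stated assumptions rather than the system's implementation: the training tweets, the "high" and "low" intensity labels, and the class name `NaiveBayes` are all hypothetical, and the text does not say which Naive Bayes variant or feature set was used.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Minimal multinomial Naive Bayes over token lists (Laplace smoothing)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                       # smoothing constant
        self.class_counts = Counter()            # documents seen per label
        self.word_counts = defaultdict(Counter)  # token counts per label
        self.vocab = set()

    def fit(self, documents, labels):
        for tokens, label in zip(documents, labels):
            self.class_counts[label] += 1
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, tokens):
        total_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocab)
        best_label, best_score = None, -math.inf
        for label, doc_count in self.class_counts.items():
            # log prior + sum of smoothed log likelihoods
            score = math.log(doc_count / total_docs)
            total_words = sum(self.word_counts[label].values())
            for token in tokens:
                if token not in self.vocab:
                    continue  # ignore tokens never seen in training
                count = self.word_counts[label][token]
                score += math.log(
                    (count + self.alpha) / (total_words + self.alpha * vocab_size)
                )
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical, hand-made training data (already tokenized and cleaned).
train_docs = [
    ["armed", "robbery", "bank"],
    ["murder", "suspect", "arrested"],
    ["bike", "stolen", "park"],
    ["wallet", "lost", "bus"],
]
train_labels = ["high", "high", "low", "low"]

model = NaiveBayes().fit(train_docs, train_labels)
print(model.predict(["armed", "murder"]))   # prints: high
print(model.predict(["stolen", "wallet"]))  # prints: low
```

Working in log space avoids numeric underflow when many token probabilities are multiplied together.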


Output classification

If a tweet is related to crime, it is
assigned to a type of crime. The system mainly uses four types of crime:

1) Crime against person

2) Crime against property

3) Crime against country

4) Other





We chose the negation algorithm as our main classifier, and the
results are based on experiments with it. To determine the accuracy of the
system, we worked on a random sample of 1000 tweets, of which 60% were no
crime and the rest were crime. The classes for these were already known. Of
those 1000 tweets, 93-95% were classified correctly.
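The accuracy figure is the share of tweets whose predicted class matches the known class. The sketch below shows the arithmetic on made-up labels: the 600/400 split mirrors the 60% no-crime proportion stated above, but the 60 misclassified tweets are purely hypothetical and are not the experiment's data. The function name `accuracy` is hypothetical as well.

```python
def accuracy(true_labels, predicted_labels):
    """Fraction of predictions that match the known labels."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    if not true_labels:
        raise ValueError("label lists must not be empty")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


# Hypothetical labels: 600 "no crime" and 400 "crime" tweets.
true_labels = ["no crime"] * 600 + ["crime"] * 400

# Hypothetical predictions: pretend the first 60 tweets were misclassified.
predicted_labels = list(true_labels)
for i in range(60):
    predicted_labels[i] = "crime"

acc = accuracy(true_labels, predicted_labels)
print(f"{acc:.0%}")  # prints: 94%
```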




