
In the current era of information, huge volumes of data have become available
for decision making, so there is a need for tools and storage media that can
hold and mine such data. The volume of this data grows every second. The term
Big Data was coined for data that is not only large in size but also high in
variety and velocity, which makes it difficult to handle using traditional
tools and techniques. Because of this rapid growth, solutions are required to
handle such data and to extract interesting patterns from these datasets using
data mining algorithms. Such value can be provided by big data analytics, which
is the application of advanced analytics techniques to big data. The emergence
of Big Data has also given rise to many security and privacy issues that need
to be handled; otherwise, Big Data analytics will not meet the needs it
addresses or realize the opportunities it offers.

 

Keywords: Big Data, security, privacy

 

1. Introduction

 

A world without data storage, social websites, or banking transactions is now
almost impossible to imagine. In today’s information era we are surrounded by
large and complex data stores. Social websites such as Facebook and Twitter
need to store every profile detail: who liked a picture, who commented on it,
and what the comment said. All of this information has to be stored somewhere.

 

New analysis methods and different storage and visualization techniques are
required to analyse such a sheer amount of data and to visualize the extracted
patterns. John Mashey is credited with introducing the term Big Data in a
Silicon Graphics (SGI) slide deck in 1998.

 

With advancements in science and technology, the growing number of internet
users and their social interactions, the digital recording of data, sensor
data, medical data, and pictures from geostationary satellites are all adding
to the rapidly growing dataset. But, as is said, everything comes with pros and
cons. Datasets are irregular and uncertain because of their different formats
and sources, and they are often stored in the cloud to promote sharing of
resources. The large number of software platforms used in cloud infrastructure
has increased the probability of attacks on the entire system. The different
types of problems that have arisen with the emergence of Big Data are discussed
in the following sections.

 

The term “Big data” refers to a collection of data sets so large and complex
that it becomes difficult to process using on-hand database management tools or
traditional data processing applications.

Big Data is relevant to a much wider pool of organizations than just the
largest companies. It extends to any company or government agency that applies
statistical algorithms and data mining techniques to large datasets in order to
improve decision making and enhance efficiency. The various sources that add
data to these datasets are listed below:

· Media/entertainment: The media and entertainment industry is increasing the
volume and variety of data in the form of text, JPEG files, Twitter posts and
videos. It has started recording and delivering everything digitally, which
requires modern processing tools.

· Medical: The healthcare industry records data in electronic medical records
and images, which are used for health monitoring and epidemiological research
programs.

· Video surveillance: Security services have improved, which helps industries
analyse their data more effectively. Video recording for surveillance has moved
from CCTV (Closed Circuit Television) to IP (Internet Protocol) cameras.

· Logistics, retail, utilities, and telecommunications: GPS transceivers, Radio
Frequency Identification (RFID) tags and cell phones generate data through
their embedded sensors. This data needs to be stored and handled so that
industries can use it to optimise their business activities and enhance
operational Business Intelligence.

· Data through social networks and location trackers: Social networking sites
and mobile applications such as Facebook, Twitter, Google Maps, and Flipkart
and other online shopping sites generate data in various formats, such as
comments and likes on posted photos or on tweets by celebrities. These social
networks provide free services to their users, who share photos and videos and
write blogs to keep in touch with their friends.

 

Fig.1: Sources of Big Data

 

2. Characteristics of Big Data

 

Big Data refers to datasets that require new analytical and processing tools to
extract patterns from large-scale, diversified and distributed data. Three main
features characterize big data: volume, variety, and velocity, known as the
three V’s.

1. Volume: Volume is the first feature that comes to mind whenever Big Data is
discussed. There is general agreement that data in the gigabyte range is
probably not Big Data, but that data at the terabyte or petabyte level and
beyond may well be. Volume is the main reason why relational database
management systems cannot be used, or have failed, to analyse Big Data. Apart
from sheer size, other issues include the mix of structured, semi-structured
and unstructured formats, complexity, cost and reliability. Real-time data from
sensors and devices (often termed the IoT), tweets from Twitter, experimental
data from research labs, customer orders from Domino's, transactional data from
shopping sites and banks including payment information, YouTube videos and many
other sources generate very large amounts of data every second, so the volume
is increasing at an exponential rate.

2. Variety: Variety describes the different formats in which data is generated,
many of which do not fit into structured relational database systems. These
include documents and text data in PDF, Excel or DOCX format; emails, audio and
video; activity records from electronic devices; social media messages in the
form of images, tweets and other text; and machine-generated data from sensors,
devices, RFID tags, machine logs and cell phone GPS signals, as well as stock
prices, purchase histories and much more. Storing and retrieving data of
different types in a cost-efficient manner, and visualizing the extracted
patterns to support decisions, is a challenge for data analysts.

3. Velocity: Velocity describes data in motion, that is, data arriving every
second at a rapid rate. Examples are the stream of web log entries recording
page visits and clicks by each visitor to a web site, or readings taken from a
sensor. This can be thought of as data coming from a pipeline that needs to be
captured, stored, and analysed so that it can be used for strategic decision
making by top-level management. The consistency and completeness of fast-moving
data streams is one concern. Matching them to specific outcome events, a
challenge also raised under Variety, is another. Timeliness, or latency, is a
further characteristic of data that relates to velocity.

 

Two other V’s that describe Big Data are Veracity and Value. Veracity signifies
the quality of data. Because inaccurate, noisy and uncertain data is of little
use, veracity also refers to the trustworthiness of data. Value, on the other
hand, is the importance of data to an organisation, that is, its business value
in monetary terms.

 

Fig.2: Characteristics of Big Data

 

3. Big Data Analytics Tools and Methods

Faster and more efficient methods are required to handle the multitudes of data
flowing in and out of organizations daily. Traditional techniques for data
management and analysis have failed to handle and mine such noisy data sets.
There is therefore a need for new tools and methods specialized for big data
analytics, as well as architectures for storing and managing such data.
Accordingly, the emergence of big data affects everything from the data itself
and its collection, to its processing, to the final extracted decisions. The
main areas in which Big Data differs from ordinary data sets are the way it is
processed, the amount of storage required, and the techniques that can be used
to extract patterns for decision making. The Hadoop framework was introduced
for Big Data analytics. Hadoop implements the MapReduce programming model, in
which the analysis takes place in two phases: mapping and reducing. Various
other tools, such as R, MongoDB and Cloudera, are also available for the
analysis of Big Data.
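
The two MapReduce phases can be illustrated with a minimal in-memory sketch in
Python. This is not Hadoop's API: the function names map_phase, shuffle and
reduce_phase and the sample input are assumptions made for illustration, and a
real framework would distribute the map and reduce calls across the nodes of a
cluster.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group emitted values by key, as the framework would do."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle and reduce in sequence on a list of input splits."""
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical input splits, invented for illustration.
splits = ["big data needs new tools", "big data grows fast"]
counts = word_count(splits)
```

Here counts maps each word to the number of times it occurs across both splits.
Because each call to map_phase sees only its own split and each call to
reduce_phase sees only one key, both phases could in principle run in parallel.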

4. Issues and Challenges

 

1) Fault Tolerance: Fault tolerance means that the damage caused by a failure
should be minimal, that is, below a threshold level, so that only a subset of
the whole task needs to be redone. This can be achieved by dividing the problem
into parts and assigning each part to a node, with the nodes working in
parallel. Checkpoints can also be inserted at regular intervals.
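
The checkpoint idea can be sketched in a few lines of Python. This is a
single-process illustration under stated assumptions, not a distributed
implementation: process_with_checkpoints, flaky_square and the simulated
failure are hypothetical names and data introduced here, and the sketch does
not cover the parallel assignment of parts to nodes.

```python
def process_with_checkpoints(items, work, checkpoint_every, state=None):
    """Process items in order, saving progress at regular intervals.

    `state` is a dict {"next_index": int, "results": list}. Passing a saved
    state resumes from the last checkpoint instead of from the start.
    Returns (state, finished).
    """
    checkpoint = state or {"next_index": 0, "results": []}
    results = list(checkpoint["results"])
    try:
        for i in range(checkpoint["next_index"], len(items)):
            results.append(work(items[i]))
            if (i + 1) % checkpoint_every == 0:
                # Save a copy so later appends do not alter the checkpoint.
                checkpoint = {"next_index": i + 1, "results": list(results)}
    except Exception:
        # On failure, hand back the last saved checkpoint.
        return checkpoint, False
    return {"next_index": len(items), "results": results}, True


calls = []          # records every item that `work` was asked to process
failed_once = set()


def flaky_square(x):
    """Hypothetical task that fails the first time it sees the value 7."""
    calls.append(x)
    if x == 7 and x not in failed_once:
        failed_once.add(x)
        raise RuntimeError("simulated node failure")
    return x * x


data = list(range(10))
state, finished = process_with_checkpoints(data, flaky_square, checkpoint_every=3)
first_run_finished = finished
resume_index = state["next_index"]
state, finished = process_with_checkpoints(data, flaky_square, 3, state)
```

After the simulated failure, the second call resumes from the last checkpoint,
so only the items processed since that checkpoint are redone rather than the
whole task.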

2) Data Quality: As discussed under veracity, the quality of data is an
important factor. Because big data is so large in volume, quality should be
examined: there is no sense in wasting storage on low-quality and irrelevant
data, which will result in useless patterns and conclusions.

3) Scalability: The increase in the number of Internet users, and the resulting
growth of Big Data, has driven the growth of cloud computing, which allows
expensive resources to be shared and large volumes of data to be processed on
large clusters in a distributed manner to increase performance. Solid state
devices are replacing hard disk drives, but their performance for random and
sequential data transfer is not the same. The choice of storage device is
therefore a major challenge from an analytics point of view.

4) Security and Privacy issues: The use of social media on such a large scale
has raised various security and privacy concerns. The following gives an
overview of the security and privacy issues involved in a Big Data environment:

· Access to online content and services, such as text documents, images on
shopping sites, audio songs, YouTube videos, and online money transfer using
netbanking facilities, is increasing at such a rapid rate that it is very
difficult to ensure the privacy of personal data. Growing use of the Internet
has increased the threat of cyber attacks, with an attack said to occur every
10 minutes in our country. Various applications also keep track of our device’s
location.

5) Processing Issues: New analytic algorithms and parallel processing are
required for effective and rapid processing of Big Data. One of the challenges
is to identify the important data points from which useful patterns, and the
maximum benefit, can be extracted.

6) Storage and Transport Issues

This issue is well explained by the authors of [4] with an example. Each time a
new storage medium is invented, the quantity of data grows further.
Transferring data from the storage device to the processing point over a 1
gigabit per second network with an effective transfer rate of 80% gives a
sustainable bandwidth of about 100 megabytes per second.
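
The figures in the transfer example are consistent if the link speed is read as
1 gigabit per second: 80% of 1 gigabit per second is 0.8 gigabits per second,
which is 100 megabytes per second. The short Python sketch below works through
that arithmetic. The function names and the 1 petabyte dataset size are
assumptions chosen for illustration, and decimal units (1 MB = 10^6 bytes) are
assumed throughout.

```python
def effective_bandwidth_mb_per_s(link_gbit_per_s, efficiency):
    """Sustainable throughput in megabytes per second (decimal units)."""
    bits_per_s = link_gbit_per_s * 1e9 * efficiency
    return bits_per_s / 8 / 1e6  # 8 bits per byte, 10^6 bytes per megabyte


def transfer_time_hours(dataset_bytes, mb_per_s):
    """Hours needed to move a dataset at a given sustained throughput."""
    seconds = dataset_bytes / (mb_per_s * 1e6)
    return seconds / 3600


# 1 Gbit/s link at 80% efficiency, as in the text.
bandwidth = effective_bandwidth_mb_per_s(1, 0.8)

# Hypothetical dataset of 1 petabyte (10^15 bytes), not a figure from the text.
hours_for_one_petabyte = transfer_time_hours(1e15, bandwidth)
```

At this rate a single petabyte would need on the order of thousands of hours to
move, which is why transport is treated as a challenge in its own right.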

 

Fig.3: Processing in Big Data Environment

5. Conclusion and Future Scope

Since the invention of the PC, data has been the biggest thing to hit the
industry. In this paper we have reviewed the topic of big data, which has
recently become one of the most researched areas of the IT industry because it
is revealing remarkable and unusual opportunities. The increase in the number
of internet users, together with technological advances that have lowered
storage costs, has made Big Data a heavily researched topic. The PC changed the
world; now the data movement is doing the same. The future of Big Data will
involve ever larger volumes because of the exponential growth in the number of
portable and handheld devices. Tools such as Spark and Kafka have enabled users
to take decisions in real time. Data mining techniques such as binning,
normalization and sampling are used to pre-process and transform data,
eliminating outliers and other uncertainties. Growing use of the Internet is
increasing the security challenges to our personal data. The increase in
embedded sensors in devices, loosely termed the Internet of Things, enables
communication between devices in a network and has opened a path for automated
vehicles and robots, which are expected to be a future trend.
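
The three pre-processing techniques named above can be sketched with the Python
standard library alone. The variants shown (min-max normalization, equal-width
binning, simple random sampling without replacement) are one common choice for
each technique and are assumptions, as are the function names and the sample
readings. Outlier removal is not shown.

```python
import random


def min_max_normalize(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def equal_width_bins(values, n_bins):
    """Assign each value the index of its equal-width bin, 0 to n_bins - 1."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # avoid division by zero for flat data
    # The maximum value falls on the upper edge, so clamp it into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def random_sample(values, k, seed=0):
    """Draw k items without replacement, seeded for repeatability."""
    return random.Random(seed).sample(values, k)


# Hypothetical sensor readings, invented for illustration.
readings = [4, 8, 15, 16, 23, 42]
normalized = min_max_normalize(readings)
bins = equal_width_bins(readings, n_bins=3)
sample = random_sample(readings, k=3)
```

Normalization puts features on a common scale, binning replaces raw values with
a small number of intervals, and sampling reduces the amount of data that later
mining steps have to process.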

Big data has found applications in many areas, such as customer segmentation,
transportation, biomedicine, geostationary satellite imagery, retail
purchasing, telecom and manufacturing. Industries and organisations need to
train their employees to work with tools and techniques that can process data
in varied formats, so that decision making improves by taking into account the
hidden and previously unknown patterns extracted by mining voluminous datasets.
Big data analytics is of great importance and, if utilized properly, can lead
to technological and scientific advances.

 

6. References

 

1. Adams, M.N.: Perspectives on Data Mining. International Journal of Market
Research 52(1), 11–19 (2010)

 

2. Asur, S., Huberman, B.A.: Predicting the Future with Social Media. In: ACM
International Conference on Web Intelligence and Intelligent Agent Technology,
vol. 1, pp. 492–499 (2010)

 

3. Bakshi, K.: Considerations for Big Data: Architecture and Approaches. In:
Proceedings of the IEEE Aerospace Conference, pp. 1–7 (2012)

 

4. Cebr: Data equity, Unlocking the value of big data. In: SAS Reports,
pp. 1–44 (2012)

 

5. Cohen, J., Dolan, B., Dunlap, M., Hellerstein, J.M., Welton, C.: MAD Skills:
New Analysis Practices for Big Data. Proceedings of the ACM VLDB Endowment
2(2), 1481–1492 (2009)

 

6. Cuzzocrea, A., Song, I., Davis, K.C.: Analytics over Large-Scale
Multidimensional Data: The Big Data Revolution! In: Proceedings of the ACM
International Workshop on Data Warehousing and OLAP, pp. 101–104 (2011)

 

7. Economist Intelligence Unit: The Deciding Factor: Big Data & Decision
Making. In: Capgemini Reports, pp. 1–24 (2012)

 

8. Elgendy, N.: Big Data Analytics in Support of the Decision Making Process.
MSc Thesis, German University in Cairo, p. 164 (2013)

 

9. EMC: Data Science and Big Data Analytics. In: EMC Education Services,
pp. 1–508 (2012)

 

10. He, Y., Lee, R., Huai, Y., Shao, Z., Jain, N., Zhang, X., Xu, Z.: RCFile: A
Fast and Space-efficient Data Placement Structure in MapReduce-based Warehouse
Systems. In: IEEE International Conference on Data Engineering (ICDE),
pp. 1199–1208 (2011)

 

11. Herodotou, H., Lim, H., Luo, G., Borisov, N., Dong, L., Cetin, F.B., Babu,
S.: Starfish: A Self-tuning System for Big Data Analytics. In: Proceedings of
the Conference on Innovative Data Systems Research, pp. 261–272 (2011)