Big Data Introduction for AKTU
Introduction to Big Data
Data is a set of values of qualitative or quantitative variables; in other words, each piece of data is an individual piece of information. Data can be collected, measured, and reported.
Properties of data:
- Clarity
- Accuracy
- Compression
- Usability
- Refinement
Digital Data:
Digital data refers to any information or content that is stored in a digital format. It includes digital files of many types: documents, images, videos, audio recordings, databases, etc. Digital data is encoded as strings of binary digits (bits) and can be transmitted and processed using electronic devices and computer systems.
Types of Digital Data:
- Structured digital data
- Semi-structured digital data
- Unstructured digital data
Structured Digital Data:
Refers to data that is highly organized and stored in a fixed format, usually in databases. Examples include spreadsheets, tables, and relational databases.
Semi-structured Digital Data:
Refers to data that has some organization but does not fit into a traditional structured data model. It has some identifiable structure, such as tags or keys. Examples include XML files, NoSQL databases, and JSON documents.
Unstructured Digital Data:
Refers to data that has no predefined structure or organization. It is often free-form text, audio, or video content. Examples include text messages, audio recordings, social media posts, and images.
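The three types above can be illustrated with a minimal sketch using only Python's standard library. The sample values (names, ages, the message text) are made up for illustration and are not from any real data set.

```python
import csv
import io
import json

# Structured: fixed columns, and every row follows the same schema.
csv_text = "id,name,age\n1,Asha,21\n2,Ravi,23\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))

# Semi-structured: keys give some structure, but records may differ in shape.
json_text = '[{"id": 1, "name": "Asha", "tags": ["student"]}, {"id": 2, "name": "Ravi"}]'
records = json.loads(json_text)

# Unstructured: free-form text with no predefined fields.
message = "Hey, are we still meeting after the lecture today?"


def describe(rows, records, message):
    """Summarize how much structure each kind of data exposes."""
    return {
        "structured_columns": sorted(rows[0].keys()),
        "semi_structured_keys": [sorted(r.keys()) for r in records],
        "unstructured_word_count": len(message.split()),
    }


summary = describe(rows, records, message)
print(summary)
```

Note that every CSV row has the same columns, the two JSON records have different keys, and the text message can only be inspected as raw words.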
Big Data:
Big data refers to the large and complex volumes of data that are generated in today's digital age, and it is characterized by its "V's". A big data platform combines the features and capabilities of several big data applications. Big data is used to gain insights and knowledge that help organizations make better decisions.
Architecture of Big Data:
Refers to the set of technologies, tools, and techniques used to design, develop, and deploy big data systems. A typical architecture includes the following components, listed roughly in the order data flows through them:
- Data Source
- Data Storage
- Batch processing
- Real-time message ingestion
- Stream Processing
- Analytical data store
- Analytics and reporting
- Orchestration
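How some of these components fit together can be sketched as a toy pipeline. This is only an illustration under simplifying assumptions: the function names, the click events, and the in-memory list standing in for storage are all hypothetical, and a real system would use dedicated tools for each stage.

```python
from collections import defaultdict


def data_source():
    """Data source: returns raw events (hypothetical page-view records)."""
    return [
        {"user": "u1", "page": "home"},
        {"user": "u2", "page": "home"},
        {"user": "u1", "page": "cart"},
    ]


def store(events, storage):
    """Data storage: append raw events to a list standing in for a data lake."""
    storage.extend(events)
    return storage


def batch_process(storage):
    """Batch processing: aggregate all stored events into page-view counts."""
    counts = defaultdict(int)
    for event in storage:
        counts[event["page"]] += 1
    return dict(counts)


def report(analytical_store):
    """Analytics and reporting: format the aggregated results for a reader."""
    return [f"{page}: {views} views" for page, views in sorted(analytical_store.items())]


def orchestrate():
    """Orchestration: run the stages in the required order."""
    storage = []
    store(data_source(), storage)
    analytical_store = batch_process(storage)
    return report(analytical_store)


print(orchestrate())  # ['cart: 1 views', 'home: 2 views']
```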
5V's of Big Data:
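- Volume: The amount of data generated and stored.
- Variety: The range of data types and formats (structured, semi-structured, unstructured).
- Velocity: The speed at which data is generated and must be processed.
- Value: The usefulness of the insights that can be extracted from the data.
- Veracity: The accuracy and trustworthiness of the data.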
The five V's of big data provide a framework for understanding the unique characteristics of big data.
Components of Big Data Technology:
- Machine Learning: The science of enabling computers to learn from data without being explicitly programmed.
- Natural Language Processing (NLP): The ability of a computer to understand human language.
- Business Intelligence (BI): Technology for analyzing data and delivering actionable information.
- Cloud Computing: Delivery of computing services over the Internet.
Big Data Analytics:
Refers to the process of examining large and complex data sets to uncover patterns and correlations. It involves collecting data from different sources and managing it for analysis.
Applications include:
- Marketing & Advertising
- Health Care
- Finance
- Manufacturing
Advantages:
- Detects and corrects errors in data sets through data cleansing.
- Improves quality of data.
- Removes duplicate information from data sets.
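The cleansing steps listed above (detecting errors, correcting them, and removing duplicates) can be sketched in a few lines of Python. The validation rules here (a non-empty name, an integer age between 0 and 120) are assumptions chosen for illustration, and the records are made up.

```python
def cleanse(records):
    """Return cleaned records: normalized names, valid ages only, no duplicates."""
    seen = set()
    cleaned = []
    for rec in records:
        # Correct formatting errors: trim whitespace and normalize capitalization.
        name = (rec.get("name") or "").strip().title()
        age = rec.get("age")
        # Detect errors: skip records with a missing name or an implausible age.
        if not name or not isinstance(age, int) or not (0 <= age <= 120):
            continue
        # Remove duplicates: keep only the first occurrence of each (name, age).
        key = (name, age)
        if key in seen:
            continue
        seen.add(key)
        cleaned.append({"name": name, "age": age})
    return cleaned


raw = [
    {"name": " asha ", "age": 21},
    {"name": "Asha", "age": 21},   # duplicate after normalization
    {"name": "Ravi", "age": -5},   # invalid age
    {"name": "", "age": 30},       # missing name
    {"name": "meena", "age": 34},
]
print(cleanse(raw))  # [{'name': 'Asha', 'age': 21}, {'name': 'Meena', 'age': 34}]
```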
Intelligent Data Analysis (IDA):
A decision support process that uses advanced analytical techniques to analyze and extract valuable information from data sets. The goal is to identify patterns, relationships, and trends in data.
Techniques of IDA:
- Cluster analysis: Identifying groups of similar data points.
- Classification: Identifying the class or category of data points.
- Regression analysis: Identifying relationships between variables in data.
- Neural Networks: Identifying patterns in data and making predictions.
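Two of the techniques above, regression analysis and cluster analysis, are small enough to sketch in plain Python. This is a teaching sketch, not a production implementation: the function names, the sample numbers, and the starting centroids are all assumptions, and real projects would normally use a library instead.

```python
def linear_regression(xs, ys):
    """Regression analysis: least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def kmeans_1d(values, centroids, iterations=10):
    """Cluster analysis: tiny 1-D k-means starting from the given centroids."""
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: put each value in the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters


slope, intercept = linear_regression([1, 2, 3, 4], [3, 5, 7, 9])
print(slope, intercept)  # 2.0 1.0

centroids, clusters = kmeans_1d([1, 2, 3, 10, 11, 12], [0.0, 5.0])
print(centroids)  # [2.0, 11.0]
print(clusters)   # [[1, 2, 3], [10, 11, 12]]
```

The regression recovers the relationship y = 2x + 1 between the two variables, and the clustering separates the values into a low group and a high group.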
Modern Data Analytics Tools:
Modern analytics tools concentrate on three classes:
- Batch Processing Tools: Automate the processing of large volumes of data in batches.
- Stream Processing Tools: Process data continuously as it arrives, so that trends can be detected in real time.
- Interactive Analysis Tools: Enable users to interact with and explore large data sets in real time.
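The difference between batch and stream processing can be shown with a minimal sketch. The sensor readings and function names are hypothetical; the point is only that the batch version needs the whole data set before it starts, while the stream version updates its result as each value arrives.

```python
def batch_average(readings):
    """Batch: the full data set is available before processing starts."""
    return sum(readings) / len(readings)


def stream_averages(readings):
    """Stream: update the running average as each reading arrives."""
    count = 0
    total = 0.0
    for value in readings:
        count += 1
        total += value
        yield total / count


readings = [10, 20, 30, 40]
print(batch_average(readings))                # 25.0
print(list(stream_averages(iter(readings))))  # [10.0, 15.0, 20.0, 25.0]
```

The last value produced by the stream matches the batch result, but the stream also gives an up-to-date answer after every reading.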
Examples of tools:
- Batch Processing Tools: Apache Hadoop, Apache Mahout, Talend Open Studio
- Stream Processing Tools: Apache Storm, Apache Flink, Amazon Kinesis
- Interactive Analysis Tools: Google's Dremel, Tableau, Apache Superset, Power BI