Big Data
Work through every question currently mapped to this canonical topic.
- Explain what exactly Big Data is
Answer
As defined by Doug Laney's three Vs (Volume, Velocity, Variety), with further Vs added later by others:
- Volume: Extremely large volumes of data
- Velocity: Data arriving in real time, in batches, or as streams
- Variety: Various forms of data: structured, semi-structured and unstructured
- Veracity or Variability (later additions): Inconsistent, sometimes inaccurate, varying data
- What is DataOps? How is it related to DevOps?
Answer
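DataOps seeks to reduce the end-to-end cycle time of data analytics, from the origin of ideas to the literal creation of charts, graphs and models that create value. DataOps combines Agile development, DevOps and statistical process control and applies them to data analytics. It relates to DevOps by borrowing its practices (automation, continuous integration and delivery, monitoring) and applying them to data pipelines rather than to application code.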
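To make the "statistical process control" part concrete, here is a minimal sketch of a control-limit check on a pipeline metric. The metric (daily row counts), the sample values and the function names are all hypothetical and chosen only for illustration; real DataOps tooling would typically track many such metrics.

```python
from statistics import mean, stdev

def control_limits(history, sigmas=3.0):
    """Return (lower, upper) limits as mean +/- sigmas * sample standard deviation."""
    center = mean(history)
    spread = stdev(history)
    return center - sigmas * spread, center + sigmas * spread

def in_control(value, history, sigmas=3.0):
    """True if a new observation falls inside the control limits of past runs."""
    lower, upper = control_limits(history, sigmas)
    return lower <= value <= upper

# Hypothetical row counts loaded by a nightly job over the past week.
daily_row_counts = [1000, 1020, 980, 1010, 990, 1005, 995]

normal_run_ok = in_control(1003, daily_row_counts)   # close to the usual volume
broken_run_ok = in_control(400, daily_row_counts)    # far below the usual volume
```

A run that falls outside the limits would be flagged for investigation before its output reaches charts or models.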
- What is Data Architecture?
Answer
An answer from talend.com:
"Data architecture is the process of standardizing how organizations collect, store, transform, distribute, and use data. The goal is to deliver relevant data to people who need it, when they need it, and help them make sense of it."
- Explain the different formats of data
Answer
- Structured - data that has a defined format and length (e.g. numbers, words)
- Semi-structured - doesn't conform to a specific format but is self-describing (e.g. XML, SWIFT)
- Unstructured - does not follow a specific format (e.g. images, text messages)
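The three formats above can be illustrated with a small sketch. The sample values are made up, and JSON is used for the semi-structured case as an assumption (the list above names XML and SWIFT); it is self-describing in the same sense.

```python
import csv
import io
import json

# Structured: every row has the same columns in the same order.
structured = "id,name,amount\n1,alice,10.5\n2,bob,7.0\n"
rows = list(csv.DictReader(io.StringIO(structured)))

# Semi-structured: no fixed schema, but keys describe the values,
# and records may nest or omit fields.
semi_structured = '{"id": 1, "name": "alice", "address": {"city": "Oslo"}}'
record = json.loads(semi_structured)

# Unstructured: free text (or images, audio) with no fields to parse;
# extracting meaning requires extra processing.
unstructured = "Hi, can you send the report by Friday?"

first_name = rows[0]["name"]
city = record["address"]["city"]
word_total = len(unstructured.split())
```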
- What is a Data Warehouse?
- What is a Data Lake?
Answer
- Can you explain the difference between a data lake and a data warehouse?
🚧 Answer not written yet.
- What is "Data Versioning"? What models of "Data Versioning" are there?
🚧 Answer not written yet.
- What is ETL?
🚧 Answer not written yet.
- Explain what Hadoop is
Answer
- Explain Hadoop YARN
Answer
Responsible for managing the compute resources in clusters and scheduling users' applications
- Explain Hadoop MapReduce
Answer
A programming model for large-scale data processing
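The model can be sketched in plain Python with the classic word-count example. This is a single-process illustration of the map, shuffle and reduce phases, not Hadoop's API; the function names are hypothetical, and a real job would run these phases in parallel across a cluster.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key before reducing."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)

def word_count(documents):
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    grouped = shuffle_phase(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())

counts = word_count(["big data is big", "data lakes hold raw data"])
```

Because each map call sees only one record and each reduce call sees only one key, both phases can be distributed across machines without coordination beyond the shuffle.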
- Explain Hadoop Distributed File System (HDFS)
Answer
- Distributed file system providing high aggregate bandwidth across the cluster.
- To a user it looks like a regular file system, but behind the scenes the data is distributed across multiple machines in a cluster
- Typical files are gigabytes to terabytes in size, and a single instance can scale to millions of files
- It's fault tolerant, which means it provides automatic recovery from faults
- It's best suited for long-running batch operations rather than live analysis
- What do you know about HDFS architecture?
Answer
- Master-slave architecture
- Namenode - master, Datanodes - slaves
- Files split into blocks
- Blocks stored on datanodes
- Namenode controls all metadata
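The points above can be sketched as a toy in-memory model. This is not the HDFS API: `ToyCluster`, its methods and the 4-byte block size are hypothetical, blocks are placed round-robin, and replication is left out for brevity (real HDFS replicates each block, 3 copies by default, and uses 128 MB blocks by default in recent versions).

```python
class ToyCluster:
    """Toy model: a namenode holds only metadata, datanodes hold the blocks."""

    def __init__(self, datanode_names, block_size=4):
        self.block_size = block_size
        # Datanodes: name -> {block_id: bytes}
        self.datanodes = {name: {} for name in datanode_names}
        # Namenode metadata: filename -> ordered list of (block_id, datanode)
        self.namenode = {}

    def write(self, filename, data):
        names = list(self.datanodes)
        placements = []
        for index, start in enumerate(range(0, len(data), self.block_size)):
            block_id = f"{filename}#{index}"
            target = names[index % len(names)]  # round-robin placement
            self.datanodes[target][block_id] = data[start:start + self.block_size]
            placements.append((block_id, target))
        self.namenode[filename] = placements

    def read(self, filename):
        # Ask the namenode where the blocks are, then fetch them in order.
        return b"".join(
            self.datanodes[node][block_id]
            for block_id, node in self.namenode[filename]
        )

cluster = ToyCluster(["dn1", "dn2", "dn3"])
cluster.write("log.txt", b"hello hdfs!")
block_map = cluster.namenode["log.txt"]
restored = cluster.read("log.txt")
```

Note that the namenode entry contains no file contents, only the mapping from file to blocks to datanodes, which is why losing the namenode's metadata makes the stored blocks unusable.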