Data Types: Structured, Unstructured, and Semi-Structured Data

Different data types, their characteristics, file formats, and applications in modern data processing
Author

Ravi Kalia

Published

February 25, 2025

Introduction

Data can be broadly categorized into three main types: structured, unstructured, and semi-structured. Each type has its own characteristics, storage formats, and processing requirements. In this post, we’ll explore these data types in detail, with practical examples and real-world applications.

Structured Data

Structured data is highly organized and follows a predefined schema or model. It’s typically stored in relational databases and can be easily processed using traditional data processing tools.

Characteristics

  • Predefined schema
  • Tabular format
  • Easy to query and analyze
  • Strong data integrity
  • Well-defined relationships
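
To make "predefined schema" and "strong data integrity" concrete, here is a minimal sketch using SQLite from the standard library. The `NOT NULL` and `CHECK` constraints and the `try_insert` helper are illustrative choices, not part of the examples that follow; the point is only that the database rejects rows that break the declared schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE employees (
        id INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        age INTEGER CHECK (age >= 0)
    )
""")

def try_insert(row):
    """Return True if the row is accepted, False if a constraint rejects it."""
    try:
        conn.execute(
            "INSERT INTO employees (id, name, age) VALUES (?, ?, ?)", row
        )
        return True
    except sqlite3.IntegrityError:
        return False

accepted = try_insert((1, "John Doe", 30))
rejected_null = try_insert((2, None, 25))      # violates NOT NULL on name
rejected_check = try_insert((3, "Bob", -5))    # violates CHECK (age >= 0)
rejected_dup = try_insert((1, "Jane", 40))     # duplicate primary key

row_count = conn.execute("SELECT COUNT(*) FROM employees").fetchone()[0]
print(accepted, rejected_null, rejected_check, rejected_dup, row_count)
```

Only the first row ends up in the table; the other three are refused by the schema itself rather than by application code.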

Common File Formats

CSV (Comma-Separated Values)

Code
import pandas as pd
import io
# Example CSV data
csv_data = """id,name,age,department
1,John Doe,30,Engineering
2,Jane Smith,25,Marketing
3,Bob Johnson,35,Finance"""

# Read CSV data
df = pd.read_csv(io.StringIO(csv_data))
print(df)
   id         name  age   department
0   1     John Doe   30  Engineering
1   2   Jane Smith   25    Marketing
2   3  Bob Johnson   35      Finance
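
One caveat worth illustrating: a CSV file stores only text, so the column types shown above are inferred by pandas rather than enforced by the file. The sketch below reads the same data with the standard-library `csv` module, where every value arrives as a string and must be cast explicitly. The names `typed_rows` and `average_age` are just for this example.

```python
import csv
import io

csv_data = """id,name,age,department
1,John Doe,30,Engineering
2,Jane Smith,25,Marketing
3,Bob Johnson,35,Finance"""

# csv.DictReader yields one dict per row; all values are strings
rows = list(csv.DictReader(io.StringIO(csv_data)))
raw_age = rows[0]["age"]

# Apply the schema ourselves by casting the numeric columns
typed_rows = [
    {**row, "id": int(row["id"]), "age": int(row["age"])} for row in rows
]
average_age = sum(row["age"] for row in typed_rows) / len(typed_rows)

print(repr(raw_age), average_age)
```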

SQL Databases

Code
import sqlite3

# Create a simple SQLite database
conn = sqlite3.connect(':memory:')
cursor = conn.cursor()

# Create table
cursor.execute('''
    CREATE TABLE employees (
        id INTEGER PRIMARY KEY,
        name TEXT,
        age INTEGER,
        department TEXT
    )
''')

# Insert data (parameterized, so values are never spliced into the SQL string)
cursor.execute(
    'INSERT INTO employees (id, name, age, department) VALUES (?, ?, ?, ?)',
    (1, 'John Doe', 30, 'Engineering'),
)
conn.commit()

# Query data
cursor.execute('SELECT * FROM employees')
print(cursor.fetchall())
conn.close()
[(1, 'John Doe', 30, 'Engineering')]

Applications

  • Financial transactions
  • Customer relationship management (CRM)
  • Inventory management
  • Human resources systems
  • Traditional business applications

Unstructured Data

Unstructured data lacks a predefined data model and doesn’t fit neatly into traditional databases. It’s often text-heavy, with dates, numbers, and facts embedded in free-form prose, but it also includes images, audio, and video.

Characteristics

  • No predefined schema
  • Difficult to process using traditional methods
  • Requires specialized tools for analysis
  • Often contains rich information
  • Flexible but complex to manage

Common File Formats

Text Documents

Code
# Example text document
text_document = """
Project Report
Date: 2024-04-09
Author: John Doe

Summary:
This report analyzes the performance of our new product line.
Customer feedback has been overwhelmingly positive, with particular
emphasis on the improved user interface and faster processing times.

Key Findings:
1. 85% customer satisfaction rate
2. 40% increase in user engagement
3. Reduced processing time by 60%
"""
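
Because a document like this has no schema, any structure has to be recovered by the reader or by code. As a rough sketch, the regular expressions below pull a few fields out of the same report. The patterns assume this exact layout ("Date:" and "Author:" at the start of a line) and would need adjusting for other documents.

```python
import re

text_document = """
Project Report
Date: 2024-04-09
Author: John Doe

Summary:
This report analyzes the performance of our new product line.
Customer feedback has been overwhelmingly positive, with particular
emphasis on the improved user interface and faster processing times.

Key Findings:
1. 85% customer satisfaction rate
2. 40% increase in user engagement
3. Reduced processing time by 60%
"""

# re.MULTILINE lets ^ and $ match at the start and end of each line
date_match = re.search(r"^Date:\s*(\d{4}-\d{2}-\d{2})$", text_document, re.MULTILINE)
author_match = re.search(r"^Author:\s*(.+)$", text_document, re.MULTILINE)

report = {
    "date": date_match.group(1) if date_match else None,
    "author": author_match.group(1) if author_match else None,
    # Every number immediately followed by a percent sign
    "percentages": [int(p) for p in re.findall(r"(\d+)%", text_document)],
}
print(report)
```

The numbers are extracted, but their meaning (satisfaction, engagement, processing time) still lives in the surrounding prose, which is where NLP tooling comes in.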

Images

Code
from PIL import Image
import numpy as np

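# Example image processing
# Note: This is a placeholder; actual image processing requires an image file
def process_image(image_path):
    """Return the pixel array shape, e.g. (height, width, channels) for an RGB image."""
    # The context manager closes the file once the pixels have been copied
    with Image.open(image_path) as img:
        img_array = np.array(img)
    return img_array.shape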

Audio Files

Code
import librosa

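# Example audio processing
# Note: This is a placeholder; actual audio processing requires an audio file
def process_audio(audio_path):
    """Return the waveform shape and the sample rate of an audio file."""
    # By default librosa.load resamples to 22,050 Hz mono; pass sr=None to keep the native rate
    y, sr = librosa.load(audio_path)
    return y.shape, sr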
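
If you don't have an audio file at hand, the sketch below uses only the standard library to show what audio looks like as data. It synthesizes a short tone, writes it as WAV into an in-memory buffer, and reads it back. The sample rate, duration, and frequency are arbitrary values picked for the example.

```python
import io
import math
import struct
import wave

SAMPLE_RATE = 8000   # samples per second (arbitrary for this sketch)
DURATION_S = 0.5
FREQ_HZ = 440.0

# Synthesize a sine wave as 16-bit signed integers at half amplitude
n_samples = int(SAMPLE_RATE * DURATION_S)
samples = [
    int(32767 * 0.5 * math.sin(2 * math.pi * FREQ_HZ * i / SAMPLE_RATE))
    for i in range(n_samples)
]

# Write a mono, 16-bit WAV file into memory instead of onto disk
buffer = io.BytesIO()
with wave.open(buffer, "wb") as wav_out:
    wav_out.setnchannels(1)
    wav_out.setsampwidth(2)
    wav_out.setframerate(SAMPLE_RATE)
    wav_out.writeframes(struct.pack(f"<{n_samples}h", *samples))

# Read it back: a small header plus a flat run of samples
buffer.seek(0)
with wave.open(buffer, "rb") as wav_in:
    frame_count = wav_in.getnframes()
    frame_rate = wav_in.getframerate()
    raw = wav_in.readframes(frame_count)

decoded = struct.unpack(f"<{frame_count}h", raw)
print(frame_count, frame_rate, decoded[:5])
```

Apart from the header, the file is just a sequence of amplitudes. Nothing in it says "speech" or "music", which is why analysis needs specialized tools.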

Applications

  • Natural language processing
  • Computer vision
  • Speech recognition
  • Social media analysis
  • Document management systems

Semi-Structured Data

Semi-structured data doesn’t conform to the rigid, tabular structure of relational data models, but it carries tags or other markers that separate semantic elements and express hierarchies of records and fields.

Characteristics

  • Flexible schema
  • Self-describing
  • Contains tags or markers
  • Hierarchical structure
  • Easier to process than unstructured data

Common File Formats

JSON (JavaScript Object Notation)

Code
import json

# Example JSON data
json_data = {
    "employee": {
        "id": 1,
        "name": "John Doe",
        "age": 30,
        "department": "Engineering",
        "skills": ["Python", "SQL", "Machine Learning"],
        "projects": [
            {
                "name": "AI Implementation",
                "status": "Completed"
            },
            {
                "name": "Data Pipeline",
                "status": "In Progress"
            }
        ]
    }
}

# Convert to JSON string
json_string = json.dumps(json_data, indent=2)
print(json_string)
{
  "employee": {
    "id": 1,
    "name": "John Doe",
    "age": 30,
    "department": "Engineering",
    "skills": [
      "Python",
      "SQL",
      "Machine Learning"
    ],
    "projects": [
      {
        "name": "AI Implementation",
        "status": "Completed"
      },
      {
        "name": "Data Pipeline",
        "status": "In Progress"
      }
    ]
  }
}
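
A flexible schema means two records in the same collection don't have to share the same fields. The sketch below parses a small JSON string and reads optional fields defensively with `dict.get`. The `employees` list, the `manager` field, and the `summarize` helper are hypothetical and exist only for this illustration.

```python
import json

json_string = """
{
  "employees": [
    {"id": 1, "name": "John Doe", "skills": ["Python", "SQL"]},
    {"id": 2, "name": "Jane Smith", "manager": {"name": "John Doe"}}
  ]
}
"""

records = json.loads(json_string)["employees"]

def summarize(record):
    """Build a uniform summary from a record whose fields may be missing."""
    return {
        "name": record["name"],                            # assumed to be required
        "skill_count": len(record.get("skills", [])),      # optional list
        "manager": record.get("manager", {}).get("name"),  # optional nested object
    }

summaries = [summarize(record) for record in records]
print(summaries)
```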

XML (eXtensible Markup Language)

Code
import xml.etree.ElementTree as ET

# Example XML data
xml_data = """
<employee>
    <id>1</id>
    <name>John Doe</name>
    <age>30</age>
    <department>Engineering</department>
    <skills>
        <skill>Python</skill>
        <skill>SQL</skill>
        <skill>Machine Learning</skill>
    </skills>
</employee>
"""

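# Parse XML
root = ET.fromstring(xml_data)
for child in root:
    if len(child):
        # Element has nested children (e.g. <skills>), so collect their text
        print(f"{child.tag}: {[item.text for item in child]}")
    else:
        print(f"{child.tag}: {child.text}")
id: 1
name: John Doe
age: 30
department: Engineering
skills: ['Python', 'SQL', 'Machine Learning']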

YAML (YAML Ain’t Markup Language)

Code
import yaml

# Example YAML data
yaml_data = """
employee:
  id: 1
  name: John Doe
  age: 30
  department: Engineering
  skills:
    - Python
    - SQL
    - Machine Learning
  projects:
    - name: AI Implementation
      status: Completed
    - name: Data Pipeline
      status: In Progress
"""

# Parse YAML
data = yaml.safe_load(yaml_data)
# yaml.dump sorts keys alphabetically by default; pass sort_keys=False to keep the original order
print(yaml.dump(data, default_flow_style=False))
employee:
  age: 30
  department: Engineering
  id: 1
  name: John Doe
  projects:
  - name: AI Implementation
    status: Completed
  - name: Data Pipeline
    status: In Progress
  skills:
  - Python
  - SQL
  - Machine Learning

Applications

  • Web APIs
  • Configuration files
  • Data exchange between systems
  • NoSQL databases
  • Log files

Data Processing Considerations

Storage

  • Structured: Relational databases (MySQL, PostgreSQL)
  • Unstructured: Object storage (S3, Blob storage)
  • Semi-structured: Document databases (MongoDB, Couchbase)

Processing Tools

  • Structured: SQL, pandas, traditional ETL tools
  • Unstructured: NLP libraries, computer vision tools, audio processing libraries
  • Semi-structured: JSON/XML parsers, NoSQL databases

Analysis Approaches

  • Structured: Statistical analysis, business intelligence
  • Unstructured: Machine learning, deep learning
  • Semi-structured: Hybrid approaches combining structured and unstructured methods
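
One common hybrid step is to flatten a nested record into a single row so that structured tools such as SQL or pandas can work with it. Here is a minimal sketch. The `flatten` helper, the dotted key names, and the sample record (including the email address) are assumptions made for illustration, and joining lists into one string is a simplification.

```python
def flatten(record, parent_key="", sep="."):
    """Flatten nested dicts into one level, joining key names with `sep`."""
    flat = {}
    for key, value in record.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Recurse into nested objects, carrying the key prefix along
            flat.update(flatten(value, full_key, sep))
        elif isinstance(value, list):
            # Simplification: store list items as one comma-separated string
            flat[full_key] = ", ".join(str(item) for item in value)
        else:
            flat[full_key] = value
    return flat

employee = {
    "id": 1,
    "name": "John Doe",
    "contact": {"email": "john@example.com", "city": "New York"},
    "skills": ["Python", "SQL"],
}

row = flatten(employee)
print(row)
```

In a real pipeline, lists are often split into a separate child table instead of being joined into one string, which keeps the one-to-many relationship queryable.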

Real-World Examples

E-commerce Platform

Code
# Example of mixed data types in e-commerce
ecommerce_data = {
    "order": {
        "order_id": "ORD12345",  # Structured
        "customer": {
            "name": "Jane Smith",
            "email": "jane@example.com"
        },
        "items": [
            {
                "product_id": "PRD001",
                "name": "Wireless Headphones",
                "price": 99.99,
                "reviews": [  # Unstructured
                    "Great sound quality!",
                    "Battery life could be better."
                ]
            }
        ],
        "shipping_address": {  # Semi-structured
            "street": "123 Main St",
            "city": "New York",
            "state": "NY",
            "zip": "10001"
        }
    }
}
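
To show how the pieces of such a record might be handled differently, the sketch below sums the structured `price` fields and pulls the free-text reviews out for separate analysis. It uses a trimmed copy of the order above. The word count is only a stand-in for real text analysis, and the variable names are illustrative.

```python
order = {
    "order_id": "ORD12345",
    "items": [
        {
            "product_id": "PRD001",
            "name": "Wireless Headphones",
            "price": 99.99,
            "reviews": [
                "Great sound quality!",
                "Battery life could be better.",
            ],
        }
    ],
}

# Structured part: numeric fields can be aggregated directly
order_total = round(sum(item["price"] for item in order["items"]), 2)

# Unstructured part: collect the review text for downstream NLP
review_texts = [
    review for item in order["items"] for review in item["reviews"]
]
word_counts = [len(review.split()) for review in review_texts]

print(order_total, review_texts, word_counts)
```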

Healthcare System

Code
# Example of mixed data types in healthcare
healthcare_data = {
    "patient": {
        "patient_id": "PAT123",  # Structured
        "name": "John Doe",
        "age": 45,
        "medical_history": [  # Semi-structured
            {
                "date": "2024-01-15",
                "diagnosis": "Hypertension",
                "treatment": "Medication A"
            }
        ],
        # Unstructured (the comment must sit outside the string, or it becomes part of the notes)
        "doctor_notes": """
        Patient presented with elevated blood pressure.
        Recommended lifestyle changes and prescribed medication.
        Follow-up scheduled in 2 weeks.
        """
    }
}
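
The value of combining these types shows up when a fact buried in the notes is linked back to a structured field. As a rough sketch, the code below parses the visit date, extracts the follow-up interval from the free text, and computes a follow-up date. It assumes the note was written on the visit date and uses this exact phrasing; neither is guaranteed in real clinical notes.

```python
import re
from datetime import date, timedelta

visit = {"date": "2024-01-15", "diagnosis": "Hypertension"}
doctor_notes = (
    "Patient presented with elevated blood pressure. "
    "Recommended lifestyle changes and prescribed medication. "
    "Follow-up scheduled in 2 weeks."
)

# Semi-structured part: an ISO 8601 date string parses directly
visit_date = date.fromisoformat(visit["date"])

# Unstructured part: the pattern is tied to this particular wording
match = re.search(r"Follow-up scheduled in (\d+) weeks?", doctor_notes)
follow_up_date = (
    visit_date + timedelta(weeks=int(match.group(1))) if match else None
)

print(follow_up_date)
```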

Conclusion

Understanding different data types is essential for:

  1. Choosing appropriate storage solutions
  2. Selecting the right processing tools
  3. Designing efficient data pipelines
  4. Implementing effective data analysis strategies

Modern applications often deal with a mix of data types, requiring flexible and scalable solutions. The key is to understand the characteristics of each data type and choose the appropriate tools and techniques for processing and analysis.

Remember that data types are not mutually exclusive: many real-world applications handle several of them at once. The ability to work with different data types and choose the right tools for each is a crucial skill in modern data engineering and analysis.