Data Formatter Module: Implementation Guide

by Dimemap Team

Hey everyone! Today, we're diving deep into the implementation of a crucial data formatter module (data_formatter.py). This module's primary job is to transform data from various sources—Salesforce, call centers, and SMS—into a unified DeepMost format, all while ensuring the data's quality through metadata validation. Think of it as the universal translator for your data, making sure everyone's speaking the same language!

Overview

The core objective of this module is to convert and standardize data from diverse sources into the DeepMost format. This involves not just reformatting the data but also validating it against predefined metadata requirements. This ensures that the data fed into subsequent processes is both consistent and reliable. Specifically, we'll be focusing on handling data from three main sources:

  1. SMS Conversations
  2. Salesforce Activities
  3. Call Center Transcripts

File Location

You'll find the module at src/data_formatter.py, inside the src/ directory. This location helps maintain a clean project structure and makes the module easy to locate and manage.

Key Functions

The module is built around several key functions, each designed to handle a specific aspect of data formatting and validation. Let's break down these functions and what they do:

  1. parse_sms_conversations(sms_df): This function is your go-to for parsing SMS data. It takes in a DataFrame containing SMS messages and transforms them into a structured conversation format.

  2. parse_salesforce_activities(activities_df): Salesforce data can be a beast, but this function tames it! It converts Salesforce opportunity activities into the desired DeepMost format.

  3. parse_call_transcripts(transcripts_df): Got call transcripts? This function is your friend. It parses call transcripts, complete with speaker labels, into a structured format.

  4. parse_transcript_text(transcript): This is a helper function that extracts turns (speaker-message pairs) from a given transcript. It's a crucial step in making sense of the raw transcript text; a minimal sketch of this helper follows this list.

  5. clean_message(message): Data cleaning is paramount, and this function is on the front lines. It cleans and normalizes messages by removing noise and formatting inconsistencies.

  6. validate_data(df): Last but not least, this function is the quality control officer. It validates the formatted data to ensure it meets the required standards and metadata specifications. It's like the final exam for your data!
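
Since parse_transcript_text isn't implemented later in this guide, here's a minimal sketch of what it might look like. It assumes transcripts arrive as plain text with speaker-prefixed lines (e.g., "Agent: Hello" / "Customer: Hi there"); that label format and the speaker_map below are illustrative assumptions, not part of the module's confirmed specification:

import re
from typing import Dict, List

def parse_transcript_text(transcript: str) -> List[Dict[str, str]]:
    """Extract speaker-message turns from a speaker-labeled transcript (sketch)."""
    turns = []
    if not isinstance(transcript, str) or not transcript.strip():
        return turns

    # Hypothetical mapping from transcript speaker labels to DeepMost speaker names
    speaker_map = {'agent': 'sales_rep', 'rep': 'sales_rep', 'customer': 'customer'}

    for line in transcript.splitlines():
        # Match lines shaped like "Agent: Hello there"
        match = re.match(r'^\s*(\w+)\s*:\s*(.+)$', line)
        if not match:
            continue
        label, message = match.group(1).lower(), match.group(2).strip()
        speaker = speaker_map.get(label)
        if speaker and message:
            turns.append({'speaker': speaker, 'message': message})

    return turns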

Required Metadata Fields

Before we dive deeper, let's talk metadata. Metadata is data about data, and it's crucial for providing context and ensuring consistency. All formatted data must include these key fields:

| Field | Type | Description | Valid Range |
|---|---|---|---|
| lead_source | string | Origin of the lead | "web", "referral", "partner", "cold_call" |
| debt_amount | float | Current debt in dollars | >= 0 |
| credit_score | int | Credit score | 300-850 |
| annual_income | float | Annual income in dollars | > 0 |
| state | string | US state | 2-letter uppercase abbreviation |

These metadata fields provide essential context for each data record, allowing for more effective analysis and decision-making. They act as the backbone for understanding the origin and financial status of each lead or customer.

Deep Dive into Required Metadata

Let's zoom in on why these specific metadata fields are so vital. The lead_source field, for instance, is a cornerstone for marketing analytics. Knowing where a lead originated (web, referral, partner, or cold_call) allows businesses to allocate resources more effectively. If leads from a particular source consistently convert at a higher rate, it makes sense to invest more in that channel.

Financial data points such as debt_amount and annual_income are critical for assessing the financial health of a potential customer. These figures help businesses gauge the customer's ability to repay a loan or afford a service. The valid range for debt_amount being ">= 0" reflects the practical reality that debt cannot be negative. Similarly, annual_income must be greater than zero, as a negative or zero income is not a viable scenario for most financial assessments.

credit_score, with its strict range of 300-850, is a key indicator of creditworthiness. This range aligns with standard credit scoring models, and any value outside this range is considered invalid. A customer's credit score is a primary factor in determining loan eligibility and interest rates.

Finally, the state field, requiring a 2-letter uppercase abbreviation, ensures consistency and accuracy in geographic data. This standardization is crucial for regional analysis and compliance with state-specific regulations.

In summary, these required metadata fields are not just arbitrary data points; they are carefully selected to provide a comprehensive snapshot of each customer, enabling informed business decisions and effective data analysis.

SMS Conversation Format

Now, let's get specific about the SMS conversation format. This is where we define how SMS data should be structured for optimal processing and analysis. Here's what you need to know:

Input CSV Columns

The input CSV files for SMS data should include the following columns:

  • conversation_id or opportunity_id or lead_id: A unique identifier for the conversation.
  • timestamp: The timestamp of the message.
  • message_text: The content of the SMS message.
  • is_from_sales_rep (boolean): A flag indicating whether the message was sent by a sales representative.
  • converted (0 or 1): An indicator of whether the conversation resulted in a conversion.
  • Required metadata columns: lead_source, debt_amount, credit_score, annual_income, state

These columns provide all the necessary information to reconstruct the conversation and associate it with relevant metadata.
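
To make this concrete, here's what a few rows of a valid input CSV might look like; the values mirror the examples used throughout this guide and use conversation_id as the identifier column:

conversation_id,timestamp,message_text,is_from_sales_rep,converted,lead_source,debt_amount,credit_score,annual_income,state
conv_001,2024-01-01 10:00,I need debt relief,False,1,web,15000.00,680,55000.00,CA
conv_001,2024-01-01 10:01,I can help,True,1,web,15000.00,680,55000.00,CA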

Output Format

The output format for SMS conversations should be a CSV with the following structure:

conversation_id,conversation,converted,metadata,lead_source,debt_amount,credit_score,annual_income,state
"sms_conv_001","[{\"speaker\":\"customer\",\"message\":\"I need debt relief\"},{\"speaker\":\"sales_rep\",\"message\":\"I can help\"}]",1,"{\"lead_source\":\"web\",...\"}","web",15000.00,680,55000.00,"CA"

Let's break down the key columns in this output format:

  • conversation_id: A unique identifier for the SMS conversation. This helps link related messages together.
  • conversation: A JSON-formatted string representing the conversation turns. Each turn includes the speaker (customer or sales_rep) and the message text. This format allows for easy parsing and analysis of the conversation flow.
  • converted: A binary value (0 or 1) indicating whether the conversation led to a conversion. This is a crucial metric for evaluating the effectiveness of SMS communication.
  • metadata: A JSON-formatted string containing the metadata associated with the conversation. This includes information like lead source, debt amount, credit score, annual income, and state. Storing metadata in JSON format allows for flexibility and easy access to individual fields.
  • lead_source, debt_amount, credit_score, annual_income, state: These are individual metadata columns, included for easier analysis. Having these fields as separate columns can speed up querying and reporting.

By adhering to this output format, the data formatter module ensures that SMS conversation data is structured consistently, making it easier to process and analyze alongside other data sources.

Deep Dive into Conversation Structure

The conversation column's JSON format is particularly noteworthy. It structures the conversation as an array of turns, where each turn is a dictionary with speaker and message keys. This structure preserves the chronological order of the conversation and clearly identifies who said what.

For example:

[
    {"speaker": "customer", "message": "I need debt relief"},
    {"speaker": "sales_rep", "message": "I can help"}
]

This format is highly flexible and can accommodate conversations of any length. It also allows for easy extraction of individual turns for further analysis, such as sentiment analysis or keyword extraction. The speaker labels (customer and sales_rep) are essential for understanding the roles and perspectives within the conversation.

The inclusion of both the metadata column (as a JSON string) and the individual metadata columns (lead_source, debt_amount, etc.) reflects a balance between flexibility and ease of use. The JSON metadata column allows for the inclusion of additional metadata fields in the future without altering the schema. The individual metadata columns, on the other hand, provide quick access to commonly used metadata fields, improving query performance and simplifying reporting.
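
As a quick illustration of that trade-off, here's a small sketch of reading one formatted row back: the JSON columns are parsed for flexible access, while the flat metadata columns are used directly for filtering (the file path is just an example):

import json
import pandas as pd

# Hypothetical path to the formatted output produced by this module
df = pd.read_csv('formatted_sms_conversations.csv')
row = df.iloc[0]

# Flexible access: parse the JSON columns
turns = json.loads(row['conversation'])    # list of {"speaker": ..., "message": ...} dicts
metadata = json.loads(row['metadata'])     # full metadata dict, including extras like "channel"
customer_messages = [t['message'] for t in turns if t['speaker'] == 'customer']

# Fast access: use the flat metadata columns directly
high_score_leads = df[df['credit_score'] >= 700]
print(len(customer_messages), len(high_score_leads))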

In essence, the SMS conversation format is designed to capture the richness and complexity of SMS interactions while ensuring consistency and ease of analysis. This careful structure is a key element in the data formatter module's ability to transform raw data into actionable insights.

Example: parse_sms_conversations Implementation

Let's dive into a practical example of how the parse_sms_conversations function can be implemented in Python using the pandas library. This function is the heart of SMS data formatting, and understanding its implementation is crucial.

import pandas as pd
import json

def parse_sms_conversations(sms_df: pd.DataFrame) -> pd.DataFrame:
    """
    Parse SMS messages into conversation format with metadata.
    
    Expected input columns:
    - conversation_id (or opportunity_id/lead_id)
    - timestamp
    - message_text
    - is_from_sales_rep (boolean)
    - converted (0 or 1)
    - REQUIRED metadata: lead_source, debt_amount, credit_score, annual_income, state
    
    Returns:
    - DataFrame with conversation_id, conversation, converted, metadata, individual metadata columns
    """
    # Define required metadata fields
    REQUIRED_METADATA = ['lead_source', 'debt_amount', 'credit_score', 'annual_income', 'state']
    
    # Validate required columns
    required_cols = {'conversation_id', 'timestamp', 'message_text', 'is_from_sales_rep', 'converted'}
    missing_required = required_cols - set(sms_df.columns)
    if missing_required:
        raise ValueError(f"Missing required columns: {missing_required}")
    
    missing_metadata = set(REQUIRED_METADATA) - set(sms_df.columns)
    if missing_metadata:
        raise ValueError(f"Missing required metadata columns: {missing_metadata}")
    
    # Identify all metadata columns (required + optional)
    metadata_cols = [col for col in sms_df.columns if col not in required_cols]
    
    conversations = []
    
    # Group by conversation
    for conv_id, group in sms_df.groupby('conversation_id'):
        # Sort by timestamp
        group = group.sort_values('timestamp')
        
        # Extract metadata from first row
        metadata = {}
        for col in metadata_cols:
            value = group[col].iloc[0]
            
            # Validate required metadata
            if col in REQUIRED_METADATA and pd.isna(value):
                raise ValueError(f"Missing {col} for conversation {conv_id}")
            
            # Type validation
            if col == 'credit_score':
                value = int(value)
                if not (300 <= value <= 850):
                    raise ValueError(f"Invalid credit_score {value} for {conv_id} (must be 300-850)")
            elif col == 'debt_amount':
                value = float(value)
                if value < 0:
                    raise ValueError(f"Invalid debt_amount {value} for {conv_id} (must be >= 0)")
            elif col == 'annual_income':
                value = float(value)
                if value <= 0:
                    raise ValueError(f"Invalid annual_income {value} for {conv_id} (must be > 0)")
            elif col == 'state':
                value = str(value).upper()
                if len(value) != 2:
                    raise ValueError(f"Invalid state {value} for {conv_id} (must be 2-letter code)")
            elif col == 'lead_source':
                value = str(value).lower()
            
            metadata[col] = value
        
        metadata['channel'] = 'sms'
        
        # Build conversation turns
        turns = []
        for _, row in group.iterrows():
            speaker = 'sales_rep' if row['is_from_sales_rep'] else 'customer'
            message = clean_message(row['message_text'])
            
            if message:
                turns.append({
                    'speaker': speaker,
                    'message': message
                })
        
        # Only include conversations with at least 2 turns
        if len(turns) >= 2:
            conv_record = {
                'conversation_id': f"sms_{conv_id}",
                'conversation': json.dumps(turns),
                'converted': int(group['converted'].iloc[0]),
                'metadata': json.dumps(metadata),
                # Also add as individual columns for easy analysis
                'lead_source': metadata['lead_source'],
                'debt_amount': metadata['debt_amount'],
                'credit_score': metadata['credit_score'],
                'annual_income': metadata['annual_income'],
                'state': metadata['state']
            }
            conversations.append(conv_record)
    
    return pd.DataFrame(conversations)

Dissecting the parse_sms_conversations Function

Let's break down what's happening in this code snippet. First, the function takes a pandas DataFrame (sms_df) as input, which is expected to contain SMS message data. It then defines a constant REQUIRED_METADATA to list the essential metadata fields.

The function begins by validating that all required columns and metadata fields are present in the input DataFrame. This is a crucial step for data quality. If any required columns are missing, the function raises a ValueError with a descriptive message, preventing further processing and potential errors.

Next, the function identifies all metadata columns, including both required and optional ones. It then initializes an empty list called conversations to store the processed conversation records.

The core logic of the function involves grouping the SMS messages by conversation_id. For each conversation, the messages are sorted by timestamp to maintain the correct order. The metadata for the conversation is extracted from the first row of the group. This assumes that metadata is consistent across all messages within the same conversation. The function then performs type validation and range checks on the metadata values, ensuring that they meet the specified criteria. For example, it checks that credit_score is within the range of 300-850 and that debt_amount is not negative. These checks are vital for data integrity.

After metadata extraction and validation, the function builds the conversation turns. It iterates through each row in the group, determines the speaker (customer or sales_rep), cleans the message text using the clean_message function (which we'll discuss later), and appends the turn to the turns list. Only conversations with at least two turns are included in the output. This filtering step helps ensure that the conversations are meaningful and not just single messages.

Finally, the function constructs a conversation record, which is a dictionary containing the conversation_id, the JSON-formatted conversation turns, the conversion status, the JSON-formatted metadata, and the individual metadata columns. This record is appended to the conversations list. The function returns a pandas DataFrame created from the conversations list. This DataFrame is the formatted output that can be used for further analysis and processing.

This step-by-step breakdown highlights the meticulous nature of the parse_sms_conversations function. It not only transforms the data but also validates it, ensuring that the output is of high quality and ready for use in downstream applications.
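
As a usage note, once parse_sms_conversations has produced the formatted DataFrame, writing it out in the CSV layout shown earlier is straightforward. A minimal sketch, assuming the input and output file paths below (they're just examples):

import pandas as pd
from src.data_formatter import parse_sms_conversations

# Hypothetical paths; adjust to your project layout
sms_df = pd.read_csv('data/raw_sms_messages.csv')
formatted = parse_sms_conversations(sms_df)

# index=False keeps the CSV limited to the columns defined in the output format
formatted.to_csv('data/formatted_sms_conversations.csv', index=False)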

Example: clean_message Implementation

The clean_message function is a small but mighty tool in our data formatting arsenal. It's responsible for scrubbing the message text, removing noise, and ensuring consistency. Let's take a look at its implementation:

import re
import pandas as pd

def clean_message(message: str) -> str:
    """Clean and normalize message text"""
    if pd.isna(message):
        return ""
    
    # Remove extra whitespace
    message = ' '.join(message.split())
    
    # Remove common noise
    message = re.sub(r'\[.*?\]', '', message)  # Remove [brackets]
    message = re.sub(r'<.*?>', '', message)    # Remove <tags>
    
    return message.strip()

Unpacking the clean_message Function

This function takes a single argument, message, which is the text string to be cleaned. The first thing it does is check if the message is null or NaN using pd.isna(message). If it is, the function immediately returns an empty string. This prevents errors and ensures that null messages don't propagate through the system.

The next step is to remove extra whitespace. This is achieved by splitting the message into words using message.split() and then joining them back together with a single space using ' '.join(...). This effectively eliminates any leading, trailing, or multiple spaces within the message.

Then comes the noise removal. The function uses regular expressions (re module) to remove common noise patterns from the message. Specifically, it removes anything enclosed in square brackets ([brackets]) and angle brackets (<tags>). These patterns are often used for metadata or formatting tags within messages, which are not relevant for the core conversation content. The regular expressions r'\[.*?\]' and r'<.*?>' are used to match these patterns. The re.sub() function replaces the matched patterns with an empty string, effectively removing them from the message.

Finally, the function calls message.strip() to remove any leading or trailing whitespace that might have remained after the previous steps. This ensures that the cleaned message is free of unnecessary whitespace.

The clean_message function might seem simple, but it plays a crucial role in ensuring the quality of the formatted data. By removing noise and inconsistencies, it makes the message text more suitable for analysis and modeling.
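
Here's a quick illustration of the effect on a made-up noisy message:

from src.data_formatter import clean_message

noisy = "  Hi   there, [auto-reply]I need <b>debt</b> relief  "
print(clean_message(noisy))
# Prints: "Hi there, I need debt relief"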

The Importance of Text Cleaning

Text cleaning is a fundamental step in natural language processing (NLP) and data analysis. Raw text data often contains a lot of noise, such as extra whitespace, special characters, HTML tags, and other formatting elements. This noise can interfere with the accuracy and effectiveness of NLP models and analytical tools. By cleaning the text, we reduce the noise and make the data more consistent and reliable.

For example, if we were to perform sentiment analysis on the raw SMS messages, the presence of HTML tags or special characters could skew the results. Similarly, extra whitespace can affect the performance of text matching algorithms. By removing these elements, we ensure that the analysis is based on the actual content of the message, rather than on irrelevant noise.

Moreover, text cleaning improves the consistency of the data. By standardizing the format of the messages, we make it easier to compare and analyze them. For instance, removing extra whitespace ensures that two messages with the same content but different formatting are treated as equivalent.

In the context of the data formatter module, the clean_message function is a critical component in preparing the SMS messages for the DeepMost format. By ensuring that the message text is clean and consistent, it contributes to the overall quality of the formatted data and the accuracy of subsequent analyses.

Example: validate_data Implementation

Data validation is the final safety net in our data formatting process. The validate_data function checks the formatted data for quality issues and provides metrics on the data's overall health. Let's dissect its implementation:

import json
import pandas as pd
from typing import Dict

def validate_data(df: pd.DataFrame) -> Dict:
    """
    Validate formatted data quality including metadata.
    
    Returns quality metrics and issues found.
    """
    REQUIRED_METADATA = ['lead_source', 'debt_amount', 'credit_score', 'annual_income', 'state']
    
    issues = {
        'total_conversations': len(df),
        'missing_conversation': df['conversation'].isna().sum(),
        'missing_outcome': df['converted'].isna().sum(),
        'missing_metadata': 0,
        'invalid_credit_score': 0,
        'invalid_debt_amount': 0,
        'invalid_income': 0,
        'too_short': 0,
        'invalid_format': 0,
        'conversion_rate': df['converted'].mean() if len(df) > 0 else 0
    }
    
    # Check metadata columns exist
    for col in REQUIRED_METADATA:
        if col not in df.columns:
            issues['missing_metadata'] += 1
    
    # Validate metadata values
    if 'credit_score' in df.columns:
        issues['invalid_credit_score'] = ((df['credit_score'] < 300) | (df['credit_score'] > 850)).sum()
    if 'debt_amount' in df.columns:
        issues['invalid_debt_amount'] = (df['debt_amount'] < 0).sum()
    if 'annual_income' in df.columns:
        issues['invalid_income'] = (df['annual_income'] <= 0).sum()
    
    # Check conversation format and length
    for idx, row in df.iterrows():
        try:
            conv = json.loads(row['conversation'])
            if len(conv) < 2:
                issues['too_short'] += 1
            
            # Check required keys
            for turn in conv:
                if 'speaker' not in turn or 'message' not in turn:
                    issues['invalid_format'] += 1
                    break
        except json.JSONDecodeError:
            issues['invalid_format'] += 1
    
    return issues

Deconstructing the validate_data Function

This function takes a pandas DataFrame (df) as input, which is expected to contain the formatted data. It returns a dictionary (Dict) containing quality metrics and issues found in the data.

The function starts by defining the REQUIRED_METADATA list, just like in the parse_sms_conversations function. It then initializes a dictionary called issues. This dictionary will store the various quality metrics and issue counts. Let's take a closer look at the keys in this dictionary:

  • total_conversations: The total number of conversation records in the DataFrame.
  • missing_conversation and missing_outcome: Counts of rows with a missing conversation or converted value.
  • missing_metadata: The number of required metadata columns that are absent from the DataFrame.
  • invalid_credit_score, invalid_debt_amount, invalid_income: Counts of rows whose metadata values fall outside the valid ranges.
  • too_short and invalid_format: Counts of conversations with fewer than two turns, or with malformed JSON or missing speaker/message keys.
  • conversion_rate: The mean of the converted column, i.e. the overall conversion rate of the dataset.

After initializing the issues dictionary, the function proceeds to perform various validation checks. First, it checks for missing metadata columns. It iterates through the REQUIRED_METADATA list and increments the missing_metadata count for each missing column.

Next, the function validates the metadata values. It checks for invalid credit_score, debt_amount, and annual_income values, using the same validation rules as in the parse_sms_conversations function. The counts of invalid values are stored in the corresponding keys in the issues dictionary.

The function then checks the conversation format and length. It iterates through each row in the DataFrame using df.iterrows(). For each row, it attempts to parse the conversation column as JSON using json.loads(). If the parsing fails (i.e., a json.JSONDecodeError is raised), the invalid_format count is incremented. If the parsing is successful, the function checks if the conversation has fewer than 2 turns. If it does, the too_short count is incremented. The function also checks if each turn in the conversation has the required keys (speaker and message). If any turn is missing these keys, the invalid_format count is incremented.

Finally, the function returns the issues dictionary, which contains a comprehensive set of quality metrics and issue counts. This dictionary can be used to assess the overall quality of the data and to identify specific areas that need attention.

The Value of Data Validation

The validate_data function is a crucial component of the data formatter module because it ensures that the formatted data meets the required quality standards. By identifying issues such as missing values, invalid metadata, and incorrect conversation formats, it helps prevent errors in downstream analyses and models.

Data validation is particularly important in data-driven decision-making. If the data is flawed, the decisions based on it may also be flawed. By validating the data, we increase the confidence in the insights derived from it and improve the quality of the decisions made.

The metrics returned by the validate_data function can also be used to track the data quality over time. By monitoring these metrics, we can identify trends and patterns that may indicate underlying issues in the data collection or formatting processes. This allows us to proactively address these issues and maintain the data quality.
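
For example, a lightweight quality gate built on top of validate_data might look like the sketch below; the 1% threshold and the choice of which counters to treat as errors are illustrative, not part of the module's specification:

import logging
from src.data_formatter import validate_data

logger = logging.getLogger(__name__)

def check_quality(formatted_df, max_invalid_ratio: float = 0.01) -> bool:
    """Return True if the formatted data passes a simple quality gate (sketch)."""
    issues = validate_data(formatted_df)
    total = max(issues['total_conversations'], 1)

    # Sum the error-style counters reported by validate_data
    error_keys = ['missing_conversation', 'missing_outcome', 'invalid_credit_score',
                  'invalid_debt_amount', 'invalid_income', 'too_short', 'invalid_format']
    error_count = sum(int(issues[key]) for key in error_keys)

    logger.info("Validation issues: %s", issues)
    return (error_count / total) <= max_invalid_ratio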

In summary, the validate_data function is a critical safeguard in the data formatting pipeline. It ensures that the formatted data is accurate, consistent, and reliable, making it a valuable asset for analysis and decision-making.

Testing

Testing is a critical part of the development process. It ensures that our code works as expected and that we catch any bugs or issues early on. Here’s an example of how you can test the parse_sms_conversations and validate_data functions:

import pandas as pd
from src.data_formatter import parse_sms_conversations, validate_data

# Test with sample data
test_data = {
    'conversation_id': ['conv_001', 'conv_001', 'conv_001'],
    'timestamp': ['2024-01-01 10:00', '2024-01-01 10:01', '2024-01-01 10:02'],
    'message_text': ['I need help', 'How can I help?', 'I need debt relief'],
    'is_from_sales_rep': [False, True, False],
    'converted': [1, 1, 1],
    'lead_source': ['web', 'web', 'web'],
    'debt_amount': [15000.0, 15000.0, 15000.0],
    'credit_score': [680, 680, 680],
    'annual_income': [55000.0, 55000.0, 55000.0],
    'state': ['CA', 'CA', 'CA']
}

df = pd.DataFrame(test_data)
formatted = parse_sms_conversations(df)
print(formatted)

# Validate
issues = validate_data(formatted)
print(issues)

This code snippet creates a sample DataFrame with SMS conversation data. It then calls the parse_sms_conversations function to format the data and prints the formatted DataFrame. Finally, it calls the validate_data function to validate the formatted data and prints the issues dictionary. This simple test can help you verify that the functions are working correctly and that the formatted data meets the required quality standards.
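
If your project uses pytest, the same scenario can be turned into an automated test with explicit assertions; the expectations below follow directly from the sample data above (one conversation, three turns, converted, no validation issues):

import json
import pandas as pd
from src.data_formatter import parse_sms_conversations, validate_data

def test_parse_sms_conversations_formats_single_conversation():
    df = pd.DataFrame({
        'conversation_id': ['conv_001', 'conv_001', 'conv_001'],
        'timestamp': ['2024-01-01 10:00', '2024-01-01 10:01', '2024-01-01 10:02'],
        'message_text': ['I need help', 'How can I help?', 'I need debt relief'],
        'is_from_sales_rep': [False, True, False],
        'converted': [1, 1, 1],
        'lead_source': ['web'] * 3,
        'debt_amount': [15000.0] * 3,
        'credit_score': [680] * 3,
        'annual_income': [55000.0] * 3,
        'state': ['CA'] * 3,
    })

    formatted = parse_sms_conversations(df)

    # One conversation, prefixed ID, turns in chronological order
    assert len(formatted) == 1
    assert formatted.loc[0, 'conversation_id'] == 'sms_conv_001'
    turns = json.loads(formatted.loc[0, 'conversation'])
    assert [t['speaker'] for t in turns] == ['customer', 'sales_rep', 'customer']

    # Validation should report no issues for this well-formed input
    issues = validate_data(formatted)
    assert issues['too_short'] == 0
    assert issues['invalid_format'] == 0
    assert issues['invalid_credit_score'] == 0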

Acceptance Criteria

To ensure that the data formatter module meets the required standards, the following acceptance criteria should be met:

  • Data from all three sources (SMS conversations, Salesforce activities, call center transcripts) is converted into the unified DeepMost output format described above.
  • Every output record includes the required metadata fields (lead_source, debt_amount, credit_score, annual_income, state), validated against their valid ranges.
  • The conversation column contains valid JSON of speaker/message turns, and conversations with fewer than two turns are excluded.
  • validate_data runs on the formatted output and reports no invalid records for correctly formatted input.

These criteria provide a clear checklist for verifying the functionality and quality of the module. Meeting these criteria ensures that the module is robust, reliable, and ready for use in production.

Dependencies

The data formatter module relies on the following Python libraries:

  • pandas: DataFrame manipulation and CSV input/output.
  • json: Serializing conversation turns and metadata.
  • re: Regular-expression-based text cleaning.

Make sure pandas is installed in your environment before using the module. You can install it using pip:

pip install pandas

The json and re libraries are part of the Python standard library, so you don't need to install them separately.

Notes

Here are a few important notes to keep in mind when working with the data formatter module:

  • The required metadata fields are what make downstream analytics possible; treat them as mandatory, not optional.
  • All three data sources are converted into the same output structure, so downstream code only needs to handle a single schema.
  • Data is validated as early as possible during parsing, so bad records fail fast with clear error messages rather than surfacing later in the pipeline.
  • The format specifications in this guide (input columns, valid ranges, output structure) are the detailed reference for implementation.

These notes highlight key aspects of the module's design and usage. They emphasize the importance of metadata for analytics, the consistency of the output structure, the early validation of data, and the availability of detailed specifications for implementation.

Conclusion

Alright, folks! We've covered a lot of ground in this guide. Implementing a robust data formatter module is no small feat, but with a clear understanding of the requirements and the right tools, it's definitely achievable. Remember, the goal is to transform raw data into a consistent, validated format that can be used for analysis and decision-making. By following the guidelines and best practices outlined in this guide, you can build a data formatter module that meets your needs and contributes to the success of your projects. Keep formatting, keep validating, and keep those insights coming!