Data Formatter Module: Implementation Guide

by Dimemap Team

Hey everyone! Today, we're diving deep into the implementation of a crucial data formatter module (data_formatter.py). This module's primary job is to transform data from various sources—Salesforce, call centers, and SMS—into a unified DeepMost format, all while ensuring the data's quality through metadata validation. Think of it as the universal translator for your data, making sure everyone's speaking the same language!

Overview

The core objective of this module is to convert and standardize data from diverse sources into the DeepMost format. This involves not just reformatting the data but also validating it against predefined metadata requirements. This ensures that the data fed into subsequent processes is both consistent and reliable. Specifically, we'll be focusing on handling data from three main sources:

  1. SMS Conversations
  2. Salesforce Activities
  3. Call Center Transcripts

File Location

You'll find the module at src/data_formatter.py, inside the src/ directory. This location helps maintain a clean project structure and makes the module easy to locate and manage.

Key Functions

The module is built around several key functions, each designed to handle a specific aspect of data formatting and validation. Let's break down these functions and what they do:

  1. parse_sms_conversations(sms_df): This function is your go-to for parsing SMS data. It takes in a DataFrame containing SMS messages and transforms them into a structured conversation format.

  2. parse_salesforce_activities(activities_df): Salesforce data can be a beast, but this function tames it! It converts Salesforce opportunity activities into the desired DeepMost format.

  3. parse_call_transcripts(transcripts_df): Got call transcripts? This function is your friend. It parses call transcripts, complete with speaker labels, into a structured format.

  4. parse_transcript_text(transcript): This is a helper function that extracts turns (speaker-message pairs) from a given transcript. It's a crucial step in making sense of the raw transcript text; a minimal sketch of this helper follows this list.

  5. clean_message(message): Data cleaning is paramount, and this function is on the front lines. It cleans and normalizes messages by removing noise and formatting inconsistencies.

  6. validate_data(df): Last but not least, this function is the quality control officer. It validates the formatted data to ensure it meets the required standards and metadata specifications. It's like the final exam for your data!
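
Since parse_transcript_text isn't implemented later in this guide, here's a minimal sketch of what it might look like. It assumes transcripts arrive as plain text with speaker-prefixed lines (e.g., "Agent: Hello" / "Customer: Hi there"); that label format and the speaker_map below are illustrative assumptions, not part of the module's confirmed specification:

import re
from typing import Dict, List

def parse_transcript_text(transcript: str) -> List[Dict[str, str]]:
    """Extract speaker-message turns from a speaker-labeled transcript (sketch)."""
    turns = []
    if not isinstance(transcript, str) or not transcript.strip():
        return turns

    # Hypothetical mapping from transcript speaker labels to DeepMost speaker names
    speaker_map = {'agent': 'sales_rep', 'rep': 'sales_rep', 'customer': 'customer'}

    for line in transcript.splitlines():
        # Match lines shaped like "Agent: Hello there"
        match = re.match(r'^\s*(\w+)\s*:\s*(.+)$', line)
        if not match:
            continue
        label, message = match.group(1).lower(), match.group(2).strip()
        speaker = speaker_map.get(label)
        if speaker and message:
            turns.append({'speaker': speaker, 'message': message})

    return turns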

Required Metadata Fields

Before we dive deeper, let's talk metadata. Metadata is data about data, and it's crucial for providing context and ensuring consistency. All formatted data must include these key fields:

| Field | Type | Description | Valid Range |
|---|---|---|---|
| lead_source | string | Origin of the lead | "web", "referral", "partner", "cold_call" |
| debt_amount | float | Current debt in dollars | >= 0 |
| credit_score | int | Credit score | 300-850 |
| annual_income | float | Annual income in dollars | > 0 |
| state | string | US state | 2-letter uppercase abbreviation |

These metadata fields provide essential context for each data record, allowing for more effective analysis and decision-making. They act as the backbone for understanding the origin and financial status of each lead or customer.

Deep Dive into Required Metadata

Let's zoom in on why these specific metadata fields are so vital. The lead_source field, for instance, is a cornerstone for marketing analytics. Knowing where a lead originated (web, referral, partner, or cold_call) allows businesses to allocate resources more effectively. If leads from a particular source consistently convert at a higher rate, it makes sense to invest more in that channel.

Financial data points such as debt_amount and annual_income are critical for assessing the financial health of a potential customer. These figures help businesses gauge the customer's ability to repay a loan or afford a service. The valid range for debt_amount being ">= 0" reflects the practical reality that debt cannot be negative. Similarly, annual_income must be greater than zero, as a negative or zero income is not a viable scenario for most financial assessments.

credit_score, with its strict range of 300-850, is a key indicator of creditworthiness. This range aligns with standard credit scoring models, and any value outside this range is considered invalid. A customer's credit score is a primary factor in determining loan eligibility and interest rates.

Finally, the state field, requiring a 2-letter uppercase abbreviation, ensures consistency and accuracy in geographic data. This standardization is crucial for regional analysis and compliance with state-specific regulations.

In summary, these required metadata fields are not just arbitrary data points; they are carefully selected to provide a comprehensive snapshot of each customer, enabling informed business decisions and effective data analysis.

SMS Conversation Format

Now, let's get specific about the SMS conversation format. This is where we define how SMS data should be structured for optimal processing and analysis. Here's what you need to know:

Input CSV Columns

The input CSV files for SMS data should include the following columns:

  • conversation_id or opportunity_id or lead_id: A unique identifier for the conversation.
  • timestamp: The timestamp of the message.
  • message_text: The content of the SMS message.
  • is_from_sales_rep (boolean): A flag indicating whether the message was sent by a sales representative.
  • converted (0 or 1): An indicator of whether the conversation resulted in a conversion.
  • Required metadata columns: lead_source, debt_amount, credit_score, annual_income, state

These columns provide all the necessary information to reconstruct the conversation and associate it with relevant metadata.
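
To make this concrete, here's what a few rows of a valid input CSV might look like; the values mirror the examples used throughout this guide and use conversation_id as the identifier column:

conversation_id,timestamp,message_text,is_from_sales_rep,converted,lead_source,debt_amount,credit_score,annual_income,state
conv_001,2024-01-01 10:00,I need debt relief,False,1,web,15000.00,680,55000.00,CA
conv_001,2024-01-01 10:01,I can help,True,1,web,15000.00,680,55000.00,CA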

Output Format

The output format for SMS conversations should be a CSV with the following structure:

conversation_id,conversation,converted,metadata,lead_source,debt_amount,credit_score,annual_income,state
"sms_conv_001","[{\"speaker\":\"customer\",\"message\":\"I need debt relief\"},{\"speaker\":\"sales_rep\",\"message\":\"I can help\"}]",1,"{\"lead_source\":\"web\",...\"}","web",15000.00,680,55000.00,"CA"

Let's break down the key columns in this output format:

  • conversation_id: A unique identifier for the SMS conversation. This helps link related messages together.
  • conversation: A JSON-formatted string representing the conversation turns. Each turn includes the speaker (customer or sales_rep) and the message text. This format allows for easy parsing and analysis of the conversation flow.
  • converted: A binary value (0 or 1) indicating whether the conversation led to a conversion. This is a crucial metric for evaluating the effectiveness of SMS communication.
  • metadata: A JSON-formatted string containing the metadata associated with the conversation. This includes information like lead source, debt amount, credit score, annual income, and state. Storing metadata in JSON format allows for flexibility and easy access to individual fields.
  • lead_source, debt_amount, credit_score, annual_income, state: These are individual metadata columns, included for easier analysis. Having these fields as separate columns can speed up querying and reporting.

By adhering to this output format, the data formatter module ensures that SMS conversation data is structured consistently, making it easier to process and analyze alongside other data sources.

Deep Dive into Conversation Structure

The conversation column's JSON format is particularly noteworthy. It structures the conversation as an array of turns, where each turn is a dictionary with speaker and message keys. This structure preserves the chronological order of the conversation and clearly identifies who said what.

For example:

[
    {"speaker": "customer", "message": "I need debt relief"},
    {"speaker": "sales_rep", "message": "I can help"}
]

This format is highly flexible and can accommodate conversations of any length. It also allows for easy extraction of individual turns for further analysis, such as sentiment analysis or keyword extraction. The speaker labels (customer and sales_rep) are essential for understanding the roles and perspectives within the conversation.

The inclusion of both the metadata column (as a JSON string) and the individual metadata columns (lead_source, debt_amount, etc.) reflects a balance between flexibility and ease of use. The JSON metadata column allows for the inclusion of additional metadata fields in the future without altering the schema. The individual metadata columns, on the other hand, provide quick access to commonly used metadata fields, improving query performance and simplifying reporting.
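
As a quick illustration of that trade-off, here's a small sketch of reading one formatted row back: the JSON columns are parsed for flexible access, while the flat metadata columns are used directly for filtering (the file path is just an example):

import json
import pandas as pd

# Hypothetical path to the formatted output produced by this module
df = pd.read_csv('formatted_sms_conversations.csv')
row = df.iloc[0]

# Flexible access: parse the JSON columns
turns = json.loads(row['conversation'])    # list of {"speaker": ..., "message": ...} dicts
metadata = json.loads(row['metadata'])     # full metadata dict, including extras like "channel"
customer_messages = [t['message'] for t in turns if t['speaker'] == 'customer']

# Fast access: use the flat metadata columns directly
high_score_leads = df[df['credit_score'] >= 700]
print(len(customer_messages), len(high_score_leads))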

In essence, the SMS conversation format is designed to capture the richness and complexity of SMS interactions while ensuring consistency and ease of analysis. This careful structure is a key element in the data formatter module's ability to transform raw data into actionable insights.

Example: parse_sms_conversations Implementation

Let's dive into a practical example of how the parse_sms_conversations function can be implemented in Python using the pandas library. This function is the heart of SMS data formatting, and understanding its implementation is crucial.

import pandas as pd
import json

def parse_sms_conversations(sms_df: pd.DataFrame) -> pd.DataFrame:
    """
    Parse SMS messages into conversation format with metadata.
    
    Expected input columns:
    - conversation_id (or opportunity_id/lead_id)
    - timestamp
    - message_text
    - is_from_sales_rep (boolean)
    - converted (0 or 1)
    - REQUIRED metadata: lead_source, debt_amount, credit_score, annual_income, state
    
    Returns:
    - DataFrame with conversation_id, conversation, converted, metadata, individual metadata columns
    """
    # Define required metadata fields
    REQUIRED_METADATA = ['lead_source', 'debt_amount', 'credit_score', 'annual_income', 'state']
    
    # Validate required columns
    required_cols = {'conversation_id', 'timestamp', 'message_text', 'is_from_sales_rep', 'converted'}
    missing_required = required_cols - set(sms_df.columns)
    if missing_required:
        raise ValueError(f"Missing required columns: {missing_required}")
    
    missing_metadata = set(REQUIRED_METADATA) - set(sms_df.columns)
    if missing_metadata:
        raise ValueError(f"Missing required metadata columns: {missing_metadata}")
    
    # Identify all metadata columns (required + optional)
    metadata_cols = [col for col in sms_df.columns if col not in required_cols]
    
    conversations = []
    
    # Group by conversation
    for conv_id, group in sms_df.groupby('conversation_id'):
        # Sort by timestamp
        group = group.sort_values('timestamp')
        
        # Extract metadata from first row
        metadata = {}
        for col in metadata_cols:
            value = group[col].iloc[0]
            
            # Validate required metadata
            if col in REQUIRED_METADATA and pd.isna(value):
                raise ValueError(f"Missing {col} for conversation {conv_id}")
            
            # Type validation
            if col == 'credit_score':
                value = int(value)
                if not (300 <= value <= 850):
                    raise ValueError(f"Invalid credit_score {value} for {conv_id} (must be 300-850)")
            elif col == 'debt_amount':
                value = float(value)
                if value < 0:
                    raise ValueError(f"Invalid debt_amount {value} for {conv_id} (must be >= 0)")
            elif col == 'annual_income':
                value = float(value)
                if value <= 0:
                    raise ValueError(f"Invalid annual_income {value} for {conv_id} (must be > 0)")
            elif col == 'state':
                value = str(value).upper()
                if len(value) != 2:
                    raise ValueError(f"Invalid state {value} for {conv_id} (must be 2-letter code)")
            elif col == 'lead_source':
                value = str(value).lower()
            
            metadata[col] = value
        
        metadata['channel'] = 'sms'
        
        # Build conversation turns
        turns = []
        for _, row in group.iterrows():
            speaker = 'sales_rep' if row['is_from_sales_rep'] else 'customer'
            message = clean_message(row['message_text'])
            
            if message:
                turns.append({
                    'speaker': speaker,
                    'message': message
                })
        
        # Only include conversations with at least 2 turns
        if len(turns) >= 2:
            conv_record = {
                'conversation_id': f"sms_{conv_id}",
                'conversation': json.dumps(turns),
                'converted': int(group['converted'].iloc[0]),
                'metadata': json.dumps(metadata),
                # Also add as individual columns for easy analysis
                'lead_source': metadata['lead_source'],
                'debt_amount': metadata['debt_amount'],
                'credit_score': metadata['credit_score'],
                'annual_income': metadata['annual_income'],
                'state': metadata['state']
            }
            conversations.append(conv_record)
    
    return pd.DataFrame(conversations)

Dissecting the parse_sms_conversations Function

Let's break down what's happening in this code snippet. First, the function takes a pandas DataFrame (sms_df) as input, which is expected to contain SMS message data. It then defines a constant REQUIRED_METADATA to list the essential metadata fields.

The function begins by validating that all required columns and metadata fields are present in the input DataFrame. This is a crucial step for data quality. If any required columns are missing, the function raises a ValueError with a descriptive message, preventing further processing and potential errors.

Next, the function identifies all metadata columns, including both required and optional ones. It then initializes an empty list called conversations to store the processed conversation records.

The core logic of the function involves grouping the SMS messages by conversation_id. For each conversation, the messages are sorted by timestamp to maintain the correct order. The metadata for the conversation is extracted from the first row of the group. This assumes that metadata is consistent across all messages within the same conversation. The function then performs type validation and range checks on the metadata values, ensuring that they meet the specified criteria. For example, it checks that credit_score is within the range of 300-850 and that debt_amount is not negative. These checks are vital for data integrity.

After metadata extraction and validation, the function builds the conversation turns. It iterates through each row in the group, determines the speaker (customer or sales_rep), cleans the message text using the clean_message function (which we'll discuss later), and appends the turn to the turns list. Only conversations with at least two turns are included in the output. This filtering step helps ensure that the conversations are meaningful and not just single messages.

Finally, the function constructs a conversation record, which is a dictionary containing the conversation_id, the JSON-formatted conversation turns, the conversion status, the JSON-formatted metadata, and the individual metadata columns. This record is appended to the conversations list. The function returns a pandas DataFrame created from the conversations list. This DataFrame is the formatted output that can be used for further analysis and processing.

This step-by-step breakdown highlights the meticulous nature of the parse_sms_conversations function. It not only transforms the data but also validates it, ensuring that the output is of high quality and ready for use in downstream applications.
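
As a usage note, once parse_sms_conversations has produced the formatted DataFrame, writing it out in the CSV layout shown earlier is straightforward. A minimal sketch, assuming the input and output file paths below (they're just examples):

import pandas as pd
from src.data_formatter import parse_sms_conversations

# Hypothetical paths; adjust to your project layout
sms_df = pd.read_csv('data/raw_sms_messages.csv')
formatted = parse_sms_conversations(sms_df)

# index=False keeps the CSV limited to the columns defined in the output format
formatted.to_csv('data/formatted_sms_conversations.csv', index=False)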

Example: clean_message Implementation

The clean_message function is a small but mighty tool in our data formatting arsenal. It's responsible for scrubbing the message text, removing noise, and ensuring consistency. Let's take a look at its implementation:

import re
import pandas as pd

def clean_message(message: str) -> str:
    """Clean and normalize message text"""
    if pd.isna(message):
        return ""
    
    # Remove extra whitespace
    message = ' '.join(message.split())
    
    # Remove common noise
    message = re.sub(r'\[.*?\]', '', message)  # Remove [brackets]
    message = re.sub(r'<.*?>', '', message)    # Remove <tags>
    
    return message.strip()

Unpacking the clean_message Function

This function takes a single argument, message, which is the text string to be cleaned. The first thing it does is check if the message is null or NaN using pd.isna(message). If it is, the function immediately returns an empty string. This prevents errors and ensures that null messages don't propagate through the system.

The next step is to remove extra whitespace. This is achieved by splitting the message into words using message.split() and then joining them back together with a single space using ' '.join(...). This effectively eliminates any leading, trailing, or multiple spaces within the message.

Then comes the noise removal. The function uses regular expressions (re module) to remove common noise patterns from the message. Specifically, it removes anything enclosed in square brackets ([brackets]) and angle brackets (<tags>). These patterns are often used for metadata or formatting tags within messages, which are not relevant for the core conversation content. The regular expressions r'\[.*?\]' and r'<.*?>' are used to match these patterns. The re.sub() function replaces the matched patterns with an empty string, effectively removing them from the message.

Finally, the function calls message.strip() to remove any leading or trailing whitespace that might have remained after the previous steps. This ensures that the cleaned message is free of unnecessary whitespace.

The clean_message function might seem simple, but it plays a crucial role in ensuring the quality of the formatted data. By removing noise and inconsistencies, it makes the message text more suitable for analysis and modeling.
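
Here's a quick illustration of the effect on a made-up noisy message:

from src.data_formatter import clean_message

noisy = "  Hi   there, [auto-reply]I need <b>debt</b> relief  "
print(clean_message(noisy))
# Prints: "Hi there, I need debt relief"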

The Importance of Text Cleaning

Text cleaning is a fundamental step in natural language processing (NLP) and data analysis. Raw text data often contains a lot of noise, such as extra whitespace, special characters, HTML tags, and other formatting elements. This noise can interfere with the accuracy and effectiveness of NLP models and analytical tools. By cleaning the text, we reduce the noise and make the data more consistent and reliable.

For example, if we were to perform sentiment analysis on the raw SMS messages, the presence of HTML tags or special characters could skew the results. Similarly, extra whitespace can affect the performance of text matching algorithms. By removing these elements, we ensure that the analysis is based on the actual content of the message, rather than on irrelevant noise.

Moreover, text cleaning improves the consistency of the data. By standardizing the format of the messages, we make it easier to compare and analyze them. For instance, removing extra whitespace ensures that two messages with the same content but different formatting are treated as equivalent.

In the context of the data formatter module, the clean_message function is a critical component in preparing the SMS messages for the DeepMost format. By ensuring that the message text is clean and consistent, it contributes to the overall quality of the formatted data and the accuracy of subsequent analyses.

Example: validate_data Implementation

Data validation is the final safety net in our data formatting process. The validate_data function checks the formatted data for quality issues and provides metrics on the data's overall health. Let's dissect its implementation:

import json
import pandas as pd
from typing import Dict

def validate_data(df: pd.DataFrame) -> Dict:
    """
    Validate formatted data quality including metadata.
    
    Returns quality metrics and issues found.
    """
    REQUIRED_METADATA = ['lead_source', 'debt_amount', 'credit_score', 'annual_income', 'state']
    
    issues = {
        'total_conversations': len(df),
        'missing_conversation': df['conversation'].isna().sum(),
        'missing_outcome': df['converted'].isna().sum(),
        'missing_metadata': 0,
        'invalid_credit_score': 0,
        'invalid_debt_amount': 0,
        'invalid_income': 0,
        'too_short': 0,
        'invalid_format': 0,
        'conversion_rate': df['converted'].mean() if len(df) > 0 else 0
    }
    
    # Check metadata columns exist
    for col in REQUIRED_METADATA:
        if col not in df.columns:
            issues['missing_metadata'] += 1
    
    # Validate metadata values
    if 'credit_score' in df.columns:
        issues['invalid_credit_score'] = ((df['credit_score'] < 300) | (df['credit_score'] > 850)).sum()
    if 'debt_amount' in df.columns:
        issues['invalid_debt_amount'] = (df['debt_amount'] < 0).sum()
    if 'annual_income' in df.columns:
        issues['invalid_income'] = (df['annual_income'] <= 0).sum()
    
    # Check conversation format and length
    for idx, row in df.iterrows():
        try:
            conv = json.loads(row['conversation'])
            if len(conv) < 2:
                issues['too_short'] += 1
            
            # Check required keys
            for turn in conv:
                if 'speaker' not in turn or 'message' not in turn:
                    issues['invalid_format'] += 1
                    break
        except json.JSONDecodeError:
            issues['invalid_format'] += 1
    
    return issues

Deconstructing the validate_data Function

This function takes a pandas DataFrame (df) as input, which is expected to contain the formatted data. It returns a dictionary (Dict) containing quality metrics and issues found in the data.

The function starts by defining the REQUIRED_METADATA list, just like in the parse_sms_conversations function. It then initializes a dictionary called issues. This dictionary will store the various quality metrics and issue counts. Let's take a closer look at the keys in this dictionary:

  • total_conversations: The total number of conversation records in the DataFrame.
  • missing_conversation and missing_outcome: Counts of rows with a missing conversation or converted value.
  • missing_metadata: The number of required metadata columns that are absent from the DataFrame.
  • invalid_credit_score, invalid_debt_amount, invalid_income: Counts of rows whose metadata values fall outside the valid ranges.
  • too_short and invalid_format: Counts of conversations with fewer than two turns, or with malformed JSON or missing speaker/message keys.
  • conversion_rate: The mean of the converted column, i.e. the overall conversion rate of the dataset.

After initializing the issues dictionary, the function proceeds to perform various validation checks. First, it checks for missing metadata columns. It iterates through the REQUIRED_METADATA list and increments the missing_metadata count for each missing column.

Next, the function validates the metadata values. It checks for invalid credit_score, debt_amount, and annual_income values, using the same validation rules as in the parse_sms_conversations function. The counts of invalid values are stored in the corresponding keys in the issues dictionary.

The function then checks the conversation format and length. It iterates through each row in the DataFrame using df.iterrows(). For each row, it attempts to parse the conversation column as JSON using json.loads(). If the parsing fails (i.e., a json.JSONDecodeError is raised), the invalid_format count is incremented. If the parsing is successful, the function checks if the conversation has fewer than 2 turns. If it does, the too_short count is incremented. The function also checks if each turn in the conversation has the required keys (speaker and message). If any turn is missing these keys, the invalid_format count is incremented.

Finally, the function returns the issues dictionary, which contains a comprehensive set of quality metrics and issue counts. This dictionary can be used to assess the overall quality of the data and to identify specific areas that need attention.

The Value of Data Validation

The validate_data function is a crucial component of the data formatter module because it ensures that the formatted data meets the required quality standards. By identifying issues such as missing values, invalid metadata, and incorrect conversation formats, it helps prevent errors in downstream analyses and models.

Data validation is particularly important in data-driven decision-making. If the data is flawed, the decisions based on it may also be flawed. By validating the data, we increase the confidence in the insights derived from it and improve the quality of the decisions made.

The metrics returned by the validate_data function can also be used to track the data quality over time. By monitoring these metrics, we can identify trends and patterns that may indicate underlying issues in the data collection or formatting processes. This allows us to proactively address these issues and maintain the data quality.
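
For example, a lightweight quality gate built on top of validate_data might look like the sketch below; the 1% threshold and the choice of which counters to treat as errors are illustrative, not part of the module's specification:

import logging
from src.data_formatter import validate_data

logger = logging.getLogger(__name__)

def check_quality(formatted_df, max_invalid_ratio: float = 0.01) -> bool:
    """Return True if the formatted data passes a simple quality gate (sketch)."""
    issues = validate_data(formatted_df)
    total = max(issues['total_conversations'], 1)

    # Sum the error-style counters reported by validate_data
    error_keys = ['missing_conversation', 'missing_outcome', 'invalid_credit_score',
                  'invalid_debt_amount', 'invalid_income', 'too_short', 'invalid_format']
    error_count = sum(int(issues[key]) for key in error_keys)

    logger.info("Validation issues: %s", issues)
    return (error_count / total) <= max_invalid_ratio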

In summary, the validate_data function is a critical safeguard in the data formatting pipeline. It ensures that the formatted data is accurate, consistent, and reliable, making it a valuable asset for analysis and decision-making.

Testing

Testing is a critical part of the development process. It ensures that our code works as expected and that we catch any bugs or issues early on. Here’s an example of how you can test the parse_sms_conversations and validate_data functions:

import pandas as pd
from src.data_formatter import parse_sms_conversations, validate_data

# Test with sample data
test_data = {
    'conversation_id': ['conv_001', 'conv_001', 'conv_001'],
    'timestamp': ['2024-01-01 10:00', '2024-01-01 10:01', '2024-01-01 10:02'],
    'message_text': ['I need help', 'How can I help?', 'I need debt relief'],
    'is_from_sales_rep': [False, True, False],
    'converted': [1, 1, 1],
    'lead_source': ['web', 'web', 'web'],
    'debt_amount': [15000.0, 15000.0, 15000.0],
    'credit_score': [680, 680, 680],
    'annual_income': [55000.0, 55000.0, 55000.0],
    'state': ['CA', 'CA', 'CA']
}

df = pd.DataFrame(test_data)
formatted = parse_sms_conversations(df)
print(formatted)

# Validate
issues = validate_data(formatted)
print(issues)

This code snippet creates a sample DataFrame with SMS conversation data. It then calls the parse_sms_conversations function to format the data and prints the formatted DataFrame. Finally, it calls the validate_data function to validate the formatted data and prints the issues dictionary. This simple test can help you verify that the functions are working correctly and that the formatted data meets the required quality standards.
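
If your project uses pytest, the same scenario can be turned into an automated test with explicit assertions; the expectations below follow directly from the sample data above (one conversation, three turns, converted, no validation issues):

import json
import pandas as pd
from src.data_formatter import parse_sms_conversations, validate_data

def test_parse_sms_conversations_formats_single_conversation():
    df = pd.DataFrame({
        'conversation_id': ['conv_001', 'conv_001', 'conv_001'],
        'timestamp': ['2024-01-01 10:00', '2024-01-01 10:01', '2024-01-01 10:02'],
        'message_text': ['I need help', 'How can I help?', 'I need debt relief'],
        'is_from_sales_rep': [False, True, False],
        'converted': [1, 1, 1],
        'lead_source': ['web'] * 3,
        'debt_amount': [15000.0] * 3,
        'credit_score': [680] * 3,
        'annual_income': [55000.0] * 3,
        'state': ['CA'] * 3,
    })

    formatted = parse_sms_conversations(df)

    # One conversation, prefixed ID, turns in chronological order
    assert len(formatted) == 1
    assert formatted.loc[0, 'conversation_id'] == 'sms_conv_001'
    turns = json.loads(formatted.loc[0, 'conversation'])
    assert [t['speaker'] for t in turns] == ['customer', 'sales_rep', 'customer']

    # Validation should report no issues for this well-formed input
    issues = validate_data(formatted)
    assert issues['too_short'] == 0
    assert issues['invalid_format'] == 0
    assert issues['invalid_credit_score'] == 0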

Acceptance Criteria

To ensure that the data formatter module meets the required standards, the following acceptance criteria should be met:

  • Data from all three sources (SMS conversations, Salesforce activities, call center transcripts) is converted into the unified DeepMost output format described above.
  • Every output record includes the required metadata fields (lead_source, debt_amount, credit_score, annual_income, state), validated against their valid ranges.
  • The conversation column contains valid JSON of speaker/message turns, and conversations with fewer than two turns are excluded.
  • validate_data runs on the formatted output and reports no invalid records for correctly formatted input.

These criteria provide a clear checklist for verifying the functionality and quality of the module. Meeting these criteria ensures that the module is robust, reliable, and ready for use in production.

Dependencies

The data formatter module relies on the following Python libraries:

  • pandas: DataFrame manipulation and CSV input/output.
  • json: Serializing conversation turns and metadata.
  • re: Regular-expression-based text cleaning.

Make sure pandas is installed in your environment before using the module. You can install it using pip:

pip install pandas

The json and re libraries are part of the Python standard library, so you don't need to install them separately.

Notes

Here are a few important notes to keep in mind when working with the data formatter module:

  • The required metadata fields are what make downstream analytics possible; treat them as mandatory, not optional.
  • All three data sources are converted into the same output structure, so downstream code only needs to handle a single schema.
  • Data is validated as early as possible during parsing, so bad records fail fast with clear error messages rather than surfacing later in the pipeline.
  • The format specifications in this guide (input columns, valid ranges, output structure) are the detailed reference for implementation.

These notes highlight key aspects of the module's design and usage. They emphasize the importance of metadata for analytics, the consistency of the output structure, the early validation of data, and the availability of detailed specifications for implementation.

Conclusion

Alright, folks! We've covered a lot of ground in this guide. Implementing a robust data formatter module is no small feat, but with a clear understanding of the requirements and the right tools, it's definitely achievable. Remember, the goal is to transform raw data into a consistent, validated format that can be used for analysis and decision-making. By following the guidelines and best practices outlined in this guide, you can build a data formatter module that meets your needs and contributes to the success of your projects. Keep formatting, keep validating, and keep those insights coming!