Data Formatter Module: Implementation Guide
Hey everyone! Today, we're diving deep into the implementation of a crucial data formatter module (`data_formatter.py`). This module's primary job is to transform data from various sources—Salesforce, call centers, and SMS—into a unified DeepMost format, all while ensuring the data's quality through metadata validation. Think of it as the universal translator for your data, making sure everyone's speaking the same language!
Overview
The core objective of this module is to convert and standardize data from diverse sources into the DeepMost format. This involves not just reformatting the data but also validating it against predefined metadata requirements. This ensures that the data fed into subsequent processes is both consistent and reliable. Specifically, we'll be focusing on handling data from three main sources:
- SMS Conversations
- Salesforce Activities
- Call Center Transcripts
File Location
You'll find the module nestled comfortably at `src/data_formatter.py`. This location helps maintain a clean project structure and makes it easy to locate and manage the module.
Key Functions
The module is built around several key functions, each designed to handle a specific aspect of data formatting and validation. Let's break down these functions and what they do:
- `parse_sms_conversations(sms_df)`: This function is your go-to for parsing SMS data. It takes in a DataFrame containing SMS messages and transforms them into a structured conversation format.
- `parse_salesforce_activities(activities_df)`: Salesforce data can be a beast, but this function tames it! It converts Salesforce opportunity activities into the desired DeepMost format.
- `parse_call_transcripts(transcripts_df)`: Got call transcripts? This function is your friend. It parses call transcripts, complete with speaker labels, into a structured format.
- `parse_transcript_text(transcript)`: This is a helper function that extracts turns (speaker-message pairs) from a given transcript. It's a crucial step in making sense of the raw transcript text.
- `clean_message(message)`: Data cleaning is paramount, and this function is on the front lines. It cleans and normalizes messages by removing noise and formatting inconsistencies.
- `validate_data(df)`: Last but not least, this function is the quality control officer. It validates the formatted data to ensure it meets the required standards and metadata specifications. It's like the final exam for your data!
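To make the flow concrete, here's a minimal end-to-end sketch of how these functions might chain together for the SMS source. The file paths are hypothetical, and we assume the functions are imported from `src/data_formatter.py`:

```python
import pandas as pd

from src.data_formatter import parse_sms_conversations, validate_data

# Hypothetical input path; adjust to wherever your raw exports live.
sms_df = pd.read_csv("data/raw/sms_messages.csv")

# Transform raw messages into the unified DeepMost format.
formatted_df = parse_sms_conversations(sms_df)

# The final exam: raise early if anything violates the metadata rules.
validate_data(formatted_df)

formatted_df.to_csv("data/formatted/sms_conversations.csv", index=False)
```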
Required Metadata Fields
Before we dive deeper, let's talk metadata. Metadata is data about data, and it's crucial for providing context and ensuring consistency. All formatted data must include these key fields:
| Field | Type | Description | Valid Range |
|---|---|---|---|
| `lead_source` | string | Origin of the lead | "web", "referral", "partner", "cold_call" |
| `debt_amount` | float | Current debt in dollars | >= 0 |
| `credit_score` | int | Credit score | 300-850 |
| `annual_income` | float | Annual income in dollars | > 0 |
| `state` | string | US state (2-letter uppercase abbreviation) | 2-letter uppercase |
These metadata fields provide essential context for each data record, allowing for more effective analysis and decision-making. They act as the backbone for understanding the origin and financial status of each lead or customer.
Deep Dive into Required Metadata
Let's zoom in on why these specific metadata fields are so vital. The `lead_source` field, for instance, is a cornerstone for marketing analytics. Knowing where a lead originated (web, referral, partner, or cold_call) allows businesses to allocate resources more effectively. If leads from a particular source consistently convert at a higher rate, it makes sense to invest more in that channel.
Financial data points such as `debt_amount` and `annual_income` are critical for assessing the financial health of a potential customer. These figures help businesses gauge the customer's ability to repay a loan or afford a service. The valid range for `debt_amount` being >= 0 reflects the practical reality that debt cannot be negative. Similarly, `annual_income` must be greater than zero, as a negative or zero income is not a viable scenario for most financial assessments.
`credit_score`, with its strict range of 300-850, is a key indicator of creditworthiness. This range aligns with standard credit scoring models, and any value outside it is considered invalid. A customer's credit score is a primary factor in determining loan eligibility and interest rates.
Finally, the `state` field, requiring a 2-letter uppercase abbreviation, ensures consistency and accuracy in geographic data. This standardization is crucial for regional analysis and compliance with state-specific regulations.
In summary, these required metadata fields are not just arbitrary data points; they are carefully selected to provide a comprehensive snapshot of each customer, enabling informed business decisions and effective data analysis.
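The guide doesn't reproduce `validate_data` in full, but based on the table above, a minimal sketch of the checks might look like this (the real module's implementation may differ in details such as error handling):

```python
import pandas as pd

VALID_LEAD_SOURCES = {"web", "referral", "partner", "cold_call"}

def validate_data(df: pd.DataFrame) -> None:
    """Raise ValueError on the first record that violates the metadata rules."""
    for idx, row in df.iterrows():
        if row["lead_source"] not in VALID_LEAD_SOURCES:
            raise ValueError(f"Row {idx}: invalid lead_source {row['lead_source']!r}")
        if row["debt_amount"] < 0:
            raise ValueError(f"Row {idx}: debt_amount must be >= 0")
        if not 300 <= row["credit_score"] <= 850:
            raise ValueError(f"Row {idx}: credit_score must be in 300-850")
        if row["annual_income"] <= 0:
            raise ValueError(f"Row {idx}: annual_income must be > 0")
        state = row["state"]
        if not (isinstance(state, str) and len(state) == 2 and state.isupper()):
            raise ValueError(f"Row {idx}: state must be a 2-letter uppercase code")
```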
SMS Conversation Format
Now, let's get specific about the SMS conversation format. This is where we define how SMS data should be structured for optimal processing and analysis. Here's what you need to know:
Input CSV Columns
The input CSV files for SMS data should include the following columns:
- `conversation_id` or `opportunity_id` or `lead_id`: A unique identifier for the conversation.
- `timestamp`: The timestamp of the message.
- `message_text`: The content of the SMS message.
- `is_from_sales_rep` (boolean): A flag indicating whether the message was sent by a sales representative.
- `converted` (0 or 1): An indicator of whether the conversation resulted in a conversion.
- Required metadata columns: `lead_source`, `debt_amount`, `credit_score`, `annual_income`, `state`
These columns provide all the necessary information to reconstruct the conversation and associate it with relevant metadata.
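For concreteness, a hypothetical input file might look like this (two messages in one conversation; all values are illustrative):

```csv
conversation_id,timestamp,message_text,is_from_sales_rep,converted,lead_source,debt_amount,credit_score,annual_income,state
conv_001,2024-01-05 09:14:00,I need debt relief,False,1,web,15000.00,680,55000.00,CA
conv_001,2024-01-05 09:15:30,I can help,True,1,web,15000.00,680,55000.00,CA
```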
Output Format
The output format for SMS conversations should be a CSV with the following structure:
```csv
conversation_id,conversation,converted,metadata,lead_source,debt_amount,credit_score,annual_income,state
"sms_conv_001","[{\"speaker\":\"customer\",\"message\":\"I need debt relief\"},{\"speaker\":\"sales_rep\",\"message\":\"I can help\"}]",1,"{\"lead_source\":\"web\",...\"}","web",15000.00,680,55000.00,"CA"
```
Let's break down the key columns in this output format:
- `conversation_id`: A unique identifier for the SMS conversation. This helps link related messages together.
- `conversation`: A JSON-formatted string representing the conversation turns. Each turn includes the speaker (customer or sales_rep) and the message text. This format allows for easy parsing and analysis of the conversation flow.
- `converted`: A binary value (0 or 1) indicating whether the conversation led to a conversion. This is a crucial metric for evaluating the effectiveness of SMS communication.
- `metadata`: A JSON-formatted string containing the metadata associated with the conversation. This includes information like lead source, debt amount, credit score, annual income, and state. Storing metadata in JSON format allows for flexibility and easy access to individual fields.
- `lead_source`, `debt_amount`, `credit_score`, `annual_income`, `state`: These are individual metadata columns, included for easier analysis. Having these fields as separate columns can speed up querying and reporting.
By adhering to this output format, the data formatter module ensures that SMS conversation data is structured consistently, making it easier to process and analyze alongside other data sources.
Deep Dive into Conversation Structure
The `conversation` column's JSON format is particularly noteworthy. It structures the conversation as an array of turns, where each turn is a dictionary with `speaker` and `message` keys. This structure preserves the chronological order of the conversation and clearly identifies who said what.
For example:
```json
[
  {"speaker": "customer", "message": "I need debt relief"},
  {"speaker": "sales_rep", "message": "I can help"}
]
```
This format is highly flexible and can accommodate conversations of any length. It also allows for easy extraction of individual turns for further analysis, such as sentiment analysis or keyword extraction. The speaker labels (`customer` and `sales_rep`) are essential for understanding the roles and perspectives within the conversation.
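Because the `conversation` column is plain JSON, downstream code can rehydrate it with nothing but the standard library. A small sketch (the file path is hypothetical):

```python
import json

import pandas as pd

df = pd.read_csv("data/formatted/sms_conversations.csv")

# Rehydrate the JSON string back into a list of {speaker, message} dicts.
turns = json.loads(df.loc[0, "conversation"])

# Count turns per speaker: a typical first step before sentiment or keyword analysis.
per_speaker = {}
for turn in turns:
    per_speaker[turn["speaker"]] = per_speaker.get(turn["speaker"], 0) + 1

print(per_speaker)  # e.g. {'customer': 1, 'sales_rep': 1}
```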
The inclusion of both the `metadata` column (as a JSON string) and the individual metadata columns (`lead_source`, `debt_amount`, etc.) reflects a balance between flexibility and ease of use. The JSON `metadata` column allows for the inclusion of additional metadata fields in the future without altering the schema. The individual metadata columns, on the other hand, provide quick access to commonly used metadata fields, improving query performance and simplifying reporting.
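A quick illustration of that tradeoff, using the same hypothetical file as above: filtering on the flat column is a one-liner, while the JSON blob requires parsing but can carry fields the schema doesn't know about yet.

```python
import json

import pandas as pd

df = pd.read_csv("data/formatted/sms_conversations.csv")

# Fast path: the flat column is immediately filterable.
high_scores = df[df["credit_score"] >= 700]

# Flexible path: any field in the JSON blob is reachable, including future additions.
also_high_scores = df[df["metadata"].apply(lambda m: json.loads(m)["credit_score"] >= 700)]
```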
In essence, the SMS conversation format is designed to capture the richness and complexity of SMS interactions while ensuring consistency and ease of analysis. This careful structure is a key element in the data formatter module's ability to transform raw data into actionable insights.
Example: `parse_sms_conversations` Implementation
Let's dive into a practical example of how the `parse_sms_conversations` function can be implemented in Python using the `pandas` library. This function is the heart of SMS data formatting, and understanding its implementation is crucial.
```python
import pandas as pd
import json


def parse_sms_conversations(sms_df: pd.DataFrame) -> pd.DataFrame:
    """
    Parse SMS messages into conversation format with metadata.

    Expected input columns:
    - conversation_id (or opportunity_id/lead_id)
    - timestamp
    - message_text
    - is_from_sales_rep (boolean)
    - converted (0 or 1)
    - REQUIRED metadata: lead_source, debt_amount, credit_score, annual_income, state

    Returns:
    - DataFrame with conversation_id, conversation, converted, metadata, individual metadata columns
    """
    # Define required metadata fields
    REQUIRED_METADATA = ['lead_source', 'debt_amount', 'credit_score', 'annual_income', 'state']

    # Validate required columns
    required_cols = {'conversation_id', 'timestamp', 'message_text', 'is_from_sales_rep', 'converted'}
    missing_required = required_cols - set(sms_df.columns)
    if missing_required:
        raise ValueError(f"Missing required columns: {missing_required}")

    missing_metadata = set(REQUIRED_METADATA) - set(sms_df.columns)
    if missing_metadata:
        raise ValueError(f"Missing required metadata columns: {missing_metadata}")

    # Identify all metadata columns (required + optional)
    metadata_cols = [col for col in sms_df.columns if col not in required_cols]

    conversations = []

    # Group by conversation
    for conv_id, group in sms_df.groupby('conversation_id'):
        # Sort by timestamp
        group = group.sort_values('timestamp')

        # Extract metadata from first row
        metadata = {}
        for col in metadata_cols:
            value = group[col].iloc[0]

            # Validate required metadata
            if col in REQUIRED_METADATA and pd.isna(value):
                raise ValueError(f"Missing {col} for conversation {conv_id}")

            # Type validation
            if col == 'credit_score':
                value = int(value)
                if not (300 <= value <= 850):
                    raise ValueError(f"Invalid credit_score {value} for {conv_id} (must be 300-850)")
            elif col == 'debt_amount':
                value = float(value)
                if value < 0:
                    raise ValueError(f"Invalid debt_amount {value} for {conv_id} (must be >= 0)")
            elif col == 'annual_income':
                value = float(value)
                if value <= 0:
                    raise ValueError(f"Invalid annual_income {value} for {conv_id} (must be > 0)")
            elif col == 'state':
                value = str(value).upper()
                if len(value) != 2:
                    raise ValueError(f"Invalid state {value} for {conv_id} (must be 2-letter code)")
            elif col == 'lead_source':
                value = str(value).lower()

            metadata[col] = value

        metadata['channel'] = 'sms'

        # Build conversation turns
        turns = []
        for _, row in group.iterrows():
            speaker = 'sales_rep' if row['is_from_sales_rep'] else 'customer'
            message = clean_message(row['message_text'])
            if message:
                turns.append({
                    'speaker': speaker,
                    'message': message
                })

        # Only include conversations with at least 2 turns
        if len(turns) >= 2:
            conv_record = {
                'conversation_id': f"sms_{conv_id}",
                'conversation': json.dumps(turns),
                'converted': int(group['converted'].iloc[0]),
                'metadata': json.dumps(metadata),
                # Also add as individual columns for easy analysis
                'lead_source': metadata['lead_source'],
                'debt_amount': metadata['debt_amount'],
                'credit_score': metadata['credit_score'],
                'annual_income': metadata['annual_income'],
                'state': metadata['state']
            }
            conversations.append(conv_record)

    return pd.DataFrame(conversations)
```
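Here's a quick usage sketch with a tiny in-memory DataFrame (all values are illustrative, and `clean_message`, shown in the next section, must also be defined):

```python
raw = pd.DataFrame({
    "conversation_id": ["001", "001"],
    "timestamp": ["2024-01-05 09:14:00", "2024-01-05 09:15:30"],
    "message_text": ["I need debt relief", "I can help"],
    "is_from_sales_rep": [False, True],
    "converted": [1, 1],
    "lead_source": ["web", "web"],
    "debt_amount": [15000.0, 15000.0],
    "credit_score": [680, 680],
    "annual_income": [55000.0, 55000.0],
    "state": ["CA", "CA"],
})

formatted = parse_sms_conversations(raw)
print(formatted.loc[0, "conversation_id"])  # sms_001
print(formatted.loc[0, "conversation"])
# [{"speaker": "customer", "message": "I need debt relief"}, ...]
```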
Dissecting the `parse_sms_conversations` Function
Let's break down what's happening in this code snippet. First, the function takes a `pandas` DataFrame (`sms_df`) as input, which is expected to contain SMS message data. It then defines a constant `REQUIRED_METADATA` to list the essential metadata fields.

The function begins by validating that all required columns and metadata fields are present in the input DataFrame. This is a crucial step for data quality. If any required columns are missing, the function raises a `ValueError` with a descriptive message, preventing further processing and potential errors.
Next, the function identifies all metadata columns, including both required and optional ones. It then initializes an empty list called `conversations` to store the processed conversation records.

The core logic of the function involves grouping the SMS messages by `conversation_id`. For each conversation, the messages are sorted by timestamp to maintain the correct order. The metadata for the conversation is extracted from the first row of the group; this assumes that metadata is consistent across all messages within the same conversation. The function then performs type validation and range checks on the metadata values, ensuring that they meet the specified criteria. For example, it checks that `credit_score` is within the range of 300-850 and that `debt_amount` is not negative. These checks are vital for data integrity.
After metadata extraction and validation, the function builds the conversation turns. It iterates through each row in the group, determines the speaker (customer or sales_rep), cleans the message text using the `clean_message` function (which we'll discuss later), and appends the turn to the `turns` list. Only conversations with at least two turns are included in the output. This filtering step helps ensure that the conversations are meaningful and not just single messages.
Finally, the function constructs a conversation record: a dictionary containing the `conversation_id`, the JSON-formatted conversation turns, the conversion status, the JSON-formatted metadata, and the individual metadata columns. This record is appended to the `conversations` list, and the function returns a `pandas` DataFrame built from that list. This DataFrame is the formatted output that can be used for further analysis and processing.

This step-by-step breakdown highlights the meticulous nature of the `parse_sms_conversations` function. It not only transforms the data but also validates it, ensuring that the output is of high quality and ready for use in downstream applications.
Example: `clean_message` Implementation
The `clean_message` function is a small but mighty tool in our data formatting arsenal. It's responsible for scrubbing the message text, removing noise, and ensuring consistency. Let's take a look at its implementation:
```python
import re
import pandas as pd


def clean_message(message: str) -> str:
    """Clean and normalize message text"""
    if pd.isna(message):
        return ""

    # Remove extra whitespace
    message = ' '.join(message.split())

    # Remove common noise
    message = re.sub(r'\[.*?\]', '', message)  # Remove [brackets]
    message = re.sub(r'<.*?>', '', message)    # Remove <tags>

    return message.strip()
```
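A couple of quick calls show the behavior (outputs shown in comments):

```python
print(clean_message("  Got it!   <b>Call me</b> [auto-signature]  "))
# -> "Got it! Call me"

print(clean_message(None))
# -> "" (null/NaN inputs come back as an empty string)
```

One subtlety worth noting: whitespace is collapsed before the noise patterns are removed, so a bracketed span in the middle of a sentence can leave a double space behind. If that matters for your use case, a second whitespace pass at the end would tidy it up.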
Unpacking the `clean_message` Function
This function takes a single argument, `message`, which is the text string to be cleaned. The first thing it does is check if the message is null or `NaN` using `pd.isna(message)`. If it is, the function immediately returns an empty string. This prevents errors and ensures that null messages don't propagate through the system.

The next step is to remove extra whitespace. This is achieved by splitting the message into words using `message.split()` and then joining them back together with a single space using `' '.join(...)`. This effectively eliminates any leading, trailing, or multiple spaces within the message.
Then comes the noise removal. The function uses regular expressions (the `re` module) to remove common noise patterns from the message. Specifically, it removes anything enclosed in square brackets (`[brackets]`) and angle brackets (`<tags>`). These patterns are often used for metadata or formatting tags within messages, which are not relevant for the core conversation content. The regular expressions `r'\[.*?\]'` and `r'<.*?>'` both use the non-greedy quantifier `.*?`, so each bracketed or tagged span is removed individually rather than everything between the first and last delimiter.
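To see why the non-greedy form matters, compare the two quantifiers on a message containing multiple bracketed spans:

```python
import re

text = "keep [tag1] this and [tag2] that"

# Non-greedy: each bracketed span is removed on its own.
print(re.sub(r'\[.*?\]', '', text))  # "keep  this and  that"

# Greedy: one match swallows everything from the first '[' to the last ']'.
print(re.sub(r'\[.*\]', '', text))   # "keep  that"
```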