From Hyphens to Pipes: Surviving the Delimiter Nightmare in Ledger Labels

A dash or a pipe, a tilde just right—delimiters keep data in sight.

November 15, 2025 by

Frex Cuadillera

In my recent personal project, I started creating a vendor–product master dataset from general ledger labels that used hyphens (-) as separators. While the ETL worked fine, splitting these labels caused a cascading nightmare because both vendor and product names often contain hyphens.

The Problem

Example 1 (vendor name with hyphen):
- Label: "7-Eleven - Crunch Time"
- Vendor: "7-Eleven"
- Product: "Crunch Time"
- Issue: The value "7" is captured as the vendor.
Example 2 (product name with hyphen):
- Label: "Uniqlo - AIRism Cotton Crew Neck Oversized T-Shirt"
- Vendor: "Uniqlo"
- Product: "AIRism Cotton Crew Neck Oversized T-Shirt"
- Issue: An unwanted third column with the value "Shirt" is added.
Example 3 (both vendor and product name with hyphen):
- Label: "Coca-Cola - Zero-Sugar 500ml"
- Vendor: "Coca-Cola"
- Product: "Zero-Sugar 500ml"
- Issue: Combination of issue 1 and 2.

Naively splitting on hyphens gave incorrect fields:

"7-Eleven - Crunch Time" → "7", "Eleven", "Crunch Time"

"Uniqlo - AIRism Cotton Crew Neck Oversized T-Shirt" -> "Uniqlo", "AIRism Cotton Crew Neck Oversized T", "Shirt"

"Coca-Cola - Zero-Sugar 500ml" → "Coca", "Cola ", "Zero", "Sugar 500ml"

Temporary Fix: Regex Replacement

To handle most labels automatically, I wrote a regex that replaces only the first hyphen in each line with a pipe (|), leaving hyphens inside product names untouched.

/(?<=^[^-]*)-/g

Matches the first hyphen in the label
Leaves all other hyphens intact
Automates the majority of cases

Example:

Starbucks Coffee - Caramel Macchiato Grande → Starbucks Coffee | Caramel

This automated most cases while allowing manual fixes for exceptions.

Example exceptions (replace the 2nd hyphen delimiter):

7-Eleven - Crunch Time → 7-Eleven | Crunch Time Coca-Cola - Zero-Sugar 500ml → Coca-Cola | Zero-Sugar 500ml

This hybrid approach allowed me to generate the vendor–product master dataset reliably while handling edge cases manually.

Choosing the right delimiter like a pro: Google uses ~~~ to keep their master data clean.

Interestingly, I had a chance to work on one of the datasets created by Google, and they had already solved this problem in a smart way.

Instead of hyphens, commas, or pipes, they used a triple tilde (~~~) as the delimiter, which ensured that:

The delimiter was guaranteed never to appear in the data
Parsing and ETL processes remained clean and error-free

Final Thoughts

Choosing the right delimiter might seem like a small decision, but it can make or break your data workflow. A carefully selected delimiter ensures that your fields are parsed correctly, prevents data corruption, and saves hours of troubleshooting. Whether you use a pipe (|), tab (\t), tilde (~), or even a multi-character string, the key is to pick one that will never appear in your actual data. Good delimiter choices make your datasets clean, scalable, and future-proof.

Seeing Google’s approach with ~~~ reinforced the lesson: a good delimiter is simple, unambiguous, and future-proof.

In data engineering, even small decisions like choosing a delimiter can have cascading consequences. Solving them carefully is part of the craft.

#MasterData #DataEngineering #ETL #DataManagement #DataProcessing #DataParsing #DataCleaning #GoogleDataset #Delimiter #DataTips #VendorProductData #FlatFile

in Tech

# Technology