
PHP Problem Solving: Mastering String Manipulation and Text Processing

String manipulation is fundamental to PHP programming. From data validation and sanitization to text parsing and transformation, string operations appear in almost every application. This tutorial works through three problems: validating and sanitizing multilingual input, extracting structured data with regular expressions, and processing multibyte text, each with a complete code example.

FOCUS AREA: This tutorial covers multibyte string handling, regex patterns, parsing techniques, validation methods, and practical solutions for real-world string processing challenges with complete code examples.

Problem 1: Advanced String Validation and Sanitization

Challenge: You need to validate and sanitize user input for a multilingual registration form, handling special characters, encoding issues, and preventing injection attacks while preserving legitimate international characters.

<?php
/**
 * Advanced String Validation and Sanitization
 * Problem: Multilingual input handling with security
 */
class StringValidator
{
    /**
     * Validate and sanitize international names
     */
    public function validateName(string $input, int $maxLength = 100): array
    {
        $result = ['valid' => false, 'sanitized' => '', 'errors' => []];

        // Check encoding
        if (!mb_check_encoding($input, 'UTF-8')) {
            $result['errors'][] = 'Invalid character encoding';
            return $result;
        }

        // Remove zero-width characters and invisible Unicode
        $sanitized = preg_replace('/[\x{200B}-\x{200D}\x{FEFF}]/u', '', $input);

        // Normalize Unicode (NFC form); requires the intl extension
        $sanitized = normalizer_normalize($sanitized, Normalizer::FORM_C);
        if ($sanitized === false) {
            $result['errors'][] = 'Unable to normalize input';
            return $result;
        }

        // Trim whitespace (including Unicode whitespace)
        $sanitized = preg_replace('/^\s+|\s+$/u', '', $sanitized);
        $sanitized = preg_replace('/\s+/u', ' ', $sanitized);

        // Validate length
        $length = mb_strlen($sanitized, 'UTF-8');
        if ($length === 0) {
            $result['errors'][] = 'Name cannot be empty';
            return $result;
        }
        if ($length > $maxLength) {
            $result['errors'][] = "Name exceeds maximum length of {$maxLength} characters";
            return $result;
        }

        // Validate characters (letters, combining marks, spaces, hyphens,
        // apostrophes, periods). \p{M} is included because \p{L} alone
        // rejects scripts that rely on combining marks.
        if (!preg_match('/^[\p{L}\p{M}\s\-\'.]+$/u', $sanitized)) {
            $result['errors'][] = 'Name contains invalid characters';
            return $result;
        }

        $result['valid'] = true;
        $result['sanitized'] = $sanitized;
        return $result;
    }

    /**
     * Validate email addresses.
     * Note: FILTER_VALIDATE_EMAIL accepts ASCII addresses only by default.
     */
    public function validateEmail(string $email): array
    {
        $result = ['valid' => false, 'normalized' => '', 'errors' => []];

        // Convert to lowercase and trim
        $normalized = mb_strtolower(trim($email), 'UTF-8');

        // Check for common injection patterns
        if (preg_match('/[\r\n\x00]/', $normalized)) {
            $result['errors'][] = 'Email contains invalid characters';
            return $result;
        }

        // Validate using filter_var
        if (!filter_var($normalized, FILTER_VALIDATE_EMAIL)) {
            $result['errors'][] = 'Invalid email format';
            return $result;
        }

        // Additional validation: check domain (performs a DNS lookup)
        $domain = substr(strrchr($normalized, '@'), 1);
        if (!checkdnsrr($domain, 'MX') && !checkdnsrr($domain, 'A')) {
            $result['errors'][] = 'Domain does not have a valid MX or A record';
            return $result;
        }

        $result['valid'] = true;
        $result['normalized'] = $normalized;
        return $result;
    }

    /**
     * Sanitize HTML content while preserving safe tags
     */
    public function sanitizeHtml(string $input, array $allowedTags = []): string
    {
        // No tags allowed: convert special characters to HTML entities
        if (empty($allowedTags)) {
            return htmlspecialchars($input, ENT_QUOTES | ENT_HTML5, 'UTF-8');
        }

        // Specific tags allowed: strip everything else.
        // Caution: strip_tags() keeps all attributes on allowed tags
        // (including event handlers), so use HTMLPurifier or similar
        // for untrusted input.
        return strip_tags($input, '<' . implode('><', $allowedTags) . '>');
    }

    /**
     * Validate phone numbers with international format
     */
    public function validatePhone(string $phone, string $countryCode = 'US'): array
    {
        $result = ['valid' => false, 'normalized' => '', 'errors' => []];

        // Remove all non-numeric characters except +
        $normalized = preg_replace('/[^\d+]/', '', $phone);

        // Basic length validation
        $digitsOnly = preg_replace('/\D/', '', $normalized);
        if (strlen($digitsOnly) < 7 || strlen($digitsOnly) > 15) {
            $result['errors'][] = 'Invalid phone number length';
            return $result;
        }

        // Country-specific validation patterns, matched against digits only
        // (the leading + has already been removed from $digitsOnly)
        $patterns = [
            'US' => '/^1?\d{10}$/',
            'UK' => '/^44\d{10}$/',
            'DE' => '/^49\d{11}$/',
            'FR' => '/^33\d{9}$/'
        ];

        if (isset($patterns[$countryCode]) && !preg_match($patterns[$countryCode], $digitsOnly)) {
            $result['errors'][] = "Invalid phone number format for {$countryCode}";
            return $result;
        }

        $result['valid'] = true;
        $result['normalized'] = $normalized;
        return $result;
    }
}

// Usage examples
$validator = new StringValidator();

// Name validation
$nameResult = $validator->validateName('José María O\'Connor-Smith', 50);
echo "Name validation: " . ($nameResult['valid'] ? 'Valid' : 'Invalid') . "\n";
if ($nameResult['valid']) {
    echo "Sanitized: " . $nameResult['sanitized'] . "\n";
}

// Email validation
$emailResult = $validator->validateEmail('user@example.com');
echo "Email validation: " . ($emailResult['valid'] ? 'Valid' : 'Invalid') . "\n";

// HTML sanitization
$safeHtml = $validator->sanitizeHtml('<p>Hello</p>', ['p', 'br']);
echo "Sanitized HTML: " . $safeHtml . "\n";
?>
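One caveat about the tag-allowlist approach in sanitizeHtml() is worth seeing in isolation: strip_tags() removes disallowed tags but keeps their text content, and it leaves attributes on allowed tags untouched. The sketch below compares it with plain escaping; escapeForHtml() is a hypothetical helper introduced for illustration and is not part of the class above.

```php
<?php
// Hypothetical helper (not part of StringValidator): escape text for
// output inside an HTML body or a quoted attribute value.
function escapeForHtml(string $text): string
{
    return htmlspecialchars($text, ENT_QUOTES | ENT_SUBSTITUTE | ENT_HTML5, 'UTF-8');
}

$input = '<p onclick="x()">Hi</p><script>a()</script>';

// strip_tags() drops the <script> tags but keeps the text between them,
// and the onclick attribute on the allowed <p> tag survives.
$stripped = strip_tags($input, '<p>');

// Escaping turns all markup into inert text, which is the safer default
// when no HTML is meant to be rendered.
$escaped = escapeForHtml($input);

echo $stripped, "\n";
echo $escaped, "\n";
```

If user-supplied markup really has to be rendered, a dedicated sanitizer such as HTMLPurifier, which the code comment already points to, is likely the better fit.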

Problem 2: Pattern Matching and Text Extraction

Challenge: You need to extract structured data from unstructured text, including parsing log files, extracting URLs, validating credit card numbers, and processing CSV data with complex quoted fields.

<?php
/**
 * Pattern Matching and Text Extraction
 * Problem: Extract structured data from unstructured text
 */
class TextExtractor
{
    /**
     * Extract all URLs from text with validation
     */
    public function extractUrls(string $text): array
    {
        $pattern = '/https?:\/\/[^\s<>"{}|\\^`\[\]]+/i';
        preg_match_all($pattern, $text, $matches);

        $urls = [];
        foreach ($matches[0] as $url) {
            // Validate URL structure
            if (filter_var($url, FILTER_VALIDATE_URL)) {
                $urls[] = [
                    'url' => $url,
                    'domain' => parse_url($url, PHP_URL_HOST),
                    'scheme' => parse_url($url, PHP_URL_SCHEME)
                ];
            }
        }
        return $urls;
    }

    /**
     * Parse and validate credit card numbers
     */
    public function extractCreditCards(string $text): array
    {
        // Pattern to match common credit card formats
        $pattern = '/\b(?:\d[ -]*?){13,16}\b/';
        preg_match_all($pattern, $text, $matches);

        $cards = [];
        foreach ($matches[0] as $match) {
            // Remove spaces and dashes
            $number = preg_replace('/\D/', '', $match);

            if ($this->validateLuhn($number)) {
                $cardType = $this->detectCardType($number);
                $cards[] = [
                    // Mask everything except the first and last four digits
                    'number' => substr($number, 0, 4) . '****' . substr($number, -4),
                    'type' => $cardType,
                    'valid' => true
                ];
            }
        }
        return $cards;
    }

    /**
     * Luhn algorithm validation
     */
    private function validateLuhn(string $number): bool
    {
        $sum = 0;
        $alternate = false;

        // Walk from the rightmost digit, doubling every second digit
        for ($i = strlen($number) - 1; $i >= 0; $i--) {
            $n = (int)$number[$i];
            if ($alternate) {
                $n *= 2;
                if ($n > 9) {
                    $n -= 9;
                }
            }
            $sum += $n;
            $alternate = !$alternate;
        }
        return $sum % 10 === 0;
    }

    /**
     * Detect credit card type
     */
    private function detectCardType(string $number): string
    {
        $patterns = [
            'visa' => '/^4[0-9]{12}(?:[0-9]{3})?$/',
            'mastercard' => '/^5[1-5][0-9]{14}$/',
            'amex' => '/^3[47][0-9]{13}$/',
            'discover' => '/^6(?:011|5[0-9]{2})[0-9]{12}$/'
        ];

        foreach ($patterns as $type => $pattern) {
            if (preg_match($pattern, $number)) {
                return $type;
            }
        }
        return 'unknown';
    }

    /**
     * Parse CSV with quoted fields and embedded commas.
     * Limitation: records are split on "\n", so quoted fields that
     * contain line breaks are not supported.
     */
    public function parseCsv(string $csvText): array
    {
        $lines = explode("\n", trim($csvText));
        $data = [];

        foreach ($lines as $line) {
            // Compare against '' rather than using empty(), which would
            // also skip a line consisting of the single value "0"
            if (trim($line) === '') {
                continue;
            }

            $fields = [];
            $field = '';
            $inQuotes = false;
            $length = strlen($line);

            for ($i = 0; $i < $length; $i++) {
                $char = $line[$i];

                if ($char === '"') {
                    if ($inQuotes && $i + 1 < $length && $line[$i + 1] === '"') {
                        // Escaped quote
                        $field .= '"';
                        $i++;
                    } else {
                        $inQuotes = !$inQuotes;
                    }
                } elseif ($char === ',' && !$inQuotes) {
                    $fields[] = trim($field);
                    $field = '';
                } else {
                    $field .= $char;
                }
            }

            $fields[] = trim($field);
            $data[] = $fields;
        }
        return $data;
    }

    /**
     * Extract email addresses with context
     */
    public function extractEmailsWithContext(string $text, int $contextLength = 50): array
    {
        $pattern = '/\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b/';
        preg_match_all($pattern, $text, $matches, PREG_OFFSET_CAPTURE);

        $emails = [];
        foreach ($matches[0] as $match) {
            $email = $match[0];
            $position = $match[1]; // byte offset of the match

            // Get context before and after, clamped to the text bounds
            $start = max(0, $position - $contextLength);
            $end = min(strlen($text), $position + strlen($email) + $contextLength);

            // mb_strcut() works on byte offsets but never splits a
            // multibyte character at either edge
            $context = mb_strcut($text, $start, $end - $start, 'UTF-8');

            $emails[] = [
                'email' => $email,
                'position' => $position,
                'context' => trim($context)
            ];
        }
        return $emails;
    }
}

// Usage examples
$extractor = new TextExtractor();

// URL extraction
$text = "Check out https://example.com and http://test.org for more info";
$urls = $extractor->extractUrls($text);
print_r($urls);

// CSV parsing
$csv = 'Name,Age,City
"John, Jr.",30,"New York, NY"
Jane,25,"Los Angeles, CA"';
$data = $extractor->parseCsv($csv);
print_r($data);

// Email extraction with context
$text = "Contact john@example.com for support or sales@company.org for inquiries";
$emails = $extractor->extractEmailsWithContext($text);
print_r($emails);
?>

Key Regex Patterns

URL Matching: Use filter_var() after pattern matching to validate extracted URLs

CSV Parsing: Handle quoted fields carefully - prefer the built-in fgetcsv() for files and str_getcsv() for strings, which already handle quoted fields and doubled quotes

Luhn Algorithm: A checksum for card numbers - it catches every single-digit error and most adjacent-digit transpositions, but it does not prove that a card exists
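For the CSV item above, the built-in parser can also be applied to text that is already in memory by writing it to a temporary stream. This is a minimal sketch rather than a drop-in replacement for parseCsv(): parseCsvText() is a hypothetical name, and passing an empty escape argument assumes PHP 7.4 or later.

```php
<?php
// Hypothetical helper: parse CSV text with fgetcsv() via a temp stream.
// Unlike a line-by-line explode(), this copes with quoted fields that
// contain line breaks.
function parseCsvText(string $csv): array
{
    $stream = fopen('php://temp', 'r+');
    fwrite($stream, $csv);
    rewind($stream);

    $rows = [];
    // Length 0 means "no line length limit"; the empty escape string
    // disables backslash escaping so only doubled quotes are special.
    while (($row = fgetcsv($stream, 0, ',', '"', '')) !== false) {
        if ($row === [null]) {
            continue; // fgetcsv() reports a blank line as [null]
        }
        $rows[] = $row;
    }

    fclose($stream);
    return $rows;
}

$rows = parseCsvText("Name,Note\n\"John, Jr.\",\"line one\nline two\"\n");
print_r($rows);
```

Unlike parseCsv(), this version does not trim whitespace around fields, so apply array_map('trim', $row) where that behaviour is wanted.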

Problem 3: Multibyte String Processing and Internationalization

Challenge: You need to handle text processing for international applications, including proper collation, transliteration, string comparison, and length calculations for Asian, Arabic, and European languages.

<?php
/**
 * Multibyte String Processing and Internationalization
 * Problem: Proper handling of international text
 */
class InternationalTextProcessor
{
    /**
     * Safe string truncation for multibyte text
     */
    public function truncate(string $text, int $length, string $suffix = '...'): string
    {
        if (mb_strlen($text, 'UTF-8') <= $length) {
            return $text;
        }

        // Account for suffix length (never below zero)
        $truncateLength = max(0, $length - mb_strlen($suffix, 'UTF-8'));
        $truncated = mb_substr($text, 0, $truncateLength, 'UTF-8');

        // Break at word boundary if possible
        $lastSpace = mb_strrpos($truncated, ' ', 0, 'UTF-8');
        if ($lastSpace !== false && $lastSpace > $length * 0.8) {
            $truncated = mb_substr($truncated, 0, $lastSpace, 'UTF-8');
        }

        return $truncated . $suffix;
    }

    /**
     * Transliterate non-Latin characters to ASCII
     */
    public function transliterate(string $text): string
    {
        // Use intl extension if available
        if (class_exists('Transliterator')) {
            $transliterator = Transliterator::create('Any-Latin; Latin-ASCII');
            if ($transliterator !== null) {
                $result = $transliterator->transliterate($text);
                if ($result !== false) {
                    return $result;
                }
            }
        }

        // Fallback to iconv
        $result = iconv('UTF-8', 'ASCII//TRANSLIT//IGNORE', $text);
        return $result !== false ? $result : '';
    }

    /**
     * Generate URL-friendly slug
     */
    public function generateSlug(string $text): string
    {
        // Transliterate to ASCII
        $slug = $this->transliterate($text);

        // Convert to lowercase
        $slug = mb_strtolower($slug, 'UTF-8');

        // Replace non-alphanumeric with hyphens
        $slug = preg_replace('/[^a-z0-9]+/', '-', $slug);

        // Remove leading/trailing hyphens
        $slug = trim($slug, '-');

        // Collapse multiple hyphens
        $slug = preg_replace('/-+/', '-', $slug);

        return $slug;
    }

    /**
     * Locale-aware string comparison using the system collation.
     * Note: strcoll() is not case-insensitive, and the requested locale
     * must be installed on the system for setlocale() to take effect.
     */
    public function compareStrings(string $a, string $b, string $locale = 'en_US'): int
    {
        // Set locale for string comparison
        $oldLocale = setlocale(LC_COLLATE, '0');
        setlocale(LC_COLLATE, $locale);

        $result = strcoll($a, $b);

        // Restore original locale
        setlocale(LC_COLLATE, $oldLocale);

        return $result;
    }

    /**
     * Count words in multibyte text
     */
    public function wordCount(string $text): int
    {
        // Normalize text; keep the original if normalization fails
        $normalized = normalizer_normalize($text, Normalizer::FORM_C);
        if ($normalized !== false) {
            $text = $normalized;
        }

        // Split on runs of non-letter characters
        $words = preg_split('/\P{L}+/u', $text, -1, PREG_SPLIT_NO_EMPTY);
        return count($words);
    }

    /**
     * Wrap text at specified width considering multibyte characters
     */
    public function wordWrap(string $text, int $width = 75, string $break = "\n"): string
    {
        $lines = [];
        $words = preg_split('/\s+/u', $text, -1, PREG_SPLIT_NO_EMPTY);
        $currentLine = '';

        foreach ($words as $word) {
            // The first word on a line is always accepted, which avoids
            // emitting an empty line when a word is longer than $width
            if ($currentLine === '') {
                $currentLine = $word;
                continue;
            }

            $wordLength = mb_strlen($word, 'UTF-8');
            $lineLength = mb_strlen($currentLine, 'UTF-8');

            if ($lineLength + $wordLength + 1 > $width) {
                $lines[] = $currentLine;
                $currentLine = $word;
            } else {
                $currentLine .= ' ' . $word;
            }
        }

        if ($currentLine !== '') {
            $lines[] = $currentLine;
        }

        return implode($break, $lines);
    }
}

// Usage examples
$processor = new InternationalTextProcessor();

// Truncate multibyte text
$longText = 'This is a very long text that needs to be truncated properly';
echo $processor->truncate($longText, 20) . "\n";

// Generate slug
$title = 'Café & Restaurant: Les Délices Français!';
echo $processor->generateSlug($title) . "\n";

// Word count
$multilingual = 'Hello 世界 مرحبا';
echo "Word count: " . $processor->wordCount($multilingual) . "\n";

// Word wrap
$wrapped = $processor->wordWrap($longText, 20);
echo $wrapped . "\n";
?>
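The compareStrings() method depends on setlocale(), which changes process-wide state and only works for locales installed on the host. Where the intl extension is loaded, its Collator class may be a more predictable option. The following is a sketch under that assumption; sortForLocale() is a hypothetical helper, and the byte-wise fallback is only there so the code still runs without intl.

```php
<?php
// Hypothetical helper: sort strings with ICU collation when the intl
// extension is available, otherwise fall back to a byte-wise sort.
function sortForLocale(array $strings, string $locale = 'en_US'): array
{
    if (class_exists('Collator')) {
        $collator = new Collator($locale);
        $collator->sort($strings); // sorts in place
        return $strings;
    }

    // Fallback: plain byte comparison, which places accented and
    // lowercase letters after uppercase ASCII
    sort($strings, SORT_STRING);
    return $strings;
}

$names = ['zebra', "\u{00C9}mile", 'apple', 'Eve'];
print_r(sortForLocale($names));
```

With intl loaded, the accented name sorts among the other names beginning with "E" instead of after "zebra", because ICU compares base letters before accents.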

String Processing Best Practices

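Modern PHP development requires careful attention to multibyte character handling. Use mb_* functions whenever you count, slice, or change the case of international text, because functions such as strlen() and substr() operate on bytes. Validate encoding before processing, and normalize Unicode input so that visually identical strings compare equal; normalization alone does not stop lookalike-character spoofing. Regular expressions need the /u modifier to treat the pattern and subject as UTF-8. Consider using the intl extension for advanced collation and transliteration needs.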
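The difference between bytes and characters is easy to check directly. This short sketch assumes the mbstring extension is loaded and uses an arbitrary sample word:

```php
<?php
// "naïve" written with a precomposed ï, which takes 2 bytes in UTF-8
$word = "na\u{00EF}ve";

$bytes = strlen($word);             // counts bytes
$chars = mb_strlen($word, 'UTF-8'); // counts code points

// Without /u the dot matches one byte at a time, so five dots cannot
// cover the six bytes; with /u the dot matches one code point.
$byteMatch = preg_match('/^.{5}$/', $word);
$unicodeMatch = preg_match('/^.{5}$/u', $word);

// Multibyte-aware case conversion also handles the accented letter
$upper = mb_strtoupper($word, 'UTF-8');

printf(
    "bytes=%d chars=%d byteMatch=%d unicodeMatch=%d upper=%s\n",
    $bytes,
    $chars,
    $byteMatch,
    $unicodeMatch,
    $upper
);
```

The byte count is 6 and the character count is 5, which is why length limits and truncation in the classes above go through mb_strlen() and mb_substr().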

Mastering PHP String Manipulation

The string manipulation techniques covered in this tutorial form the foundation of text processing in PHP applications. From security-conscious input validation to sophisticated pattern extraction and international text handling, these skills are essential for building robust, globally-aware web applications.

Remember that string operations often become performance bottlenecks in high-traffic applications. Profile your code, cache results when possible, and choose appropriate algorithms for your data volumes. The security implications of string handling - particularly regarding injection attacks and encoding issues - demand constant vigilance and thorough testing with diverse input types.
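As one way to act on the caching advice, a pure string function can be wrapped so that repeated inputs are computed only once. This is an illustrative sketch: memoize() and the simplified slug closure are hypothetical, the cache is unbounded, and hrtime() assumes PHP 7.3 or later.

```php
<?php
// Hypothetical wrapper: cache the results of a pure string function,
// keyed by its input. The cache grows without limit, so this suits
// small, repetitive input sets only.
function memoize(callable $fn): callable
{
    $cache = [];
    return function (string $input) use (&$cache, $fn) {
        if (!array_key_exists($input, $cache)) {
            $cache[$input] = $fn($input);
        }
        return $cache[$input];
    };
}

$calls = 0;
$slugify = memoize(function (string $text) use (&$calls): string {
    $calls++; // counts how often the real work runs
    return trim(preg_replace('/[^a-z0-9]+/', '-', strtolower($text)), '-');
});

// Time two identical calls; the second is served from the cache
$start = hrtime(true);
$first = $slugify('Hello, World!');
$second = $slugify('Hello, World!');
$elapsedMs = (hrtime(true) - $start) / 1e6;

printf("%s (computed %d time(s), %.3f ms)\n", $first, $calls, $elapsedMs);
```

Timing a handful of calls like this is only a rough check; a profiler gives a more reliable picture of where string processing time actually goes.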