---
phase: 01-statlink-autolinking
plan: 01
subsystem: seo
tags: [statlink, guzzle, scraping, cron, seo-linkbuilding]

requires:
  - phase: none
    provides: published articles with wp_post_url
provides:
  - StatLink.pl auto-linking service
  - Cron endpoint for link management
  - Link lifecycle tracking (add → expire → remove)
affects: [admin-panel, monitoring]

tech-stack:
  added: [guzzle-cookiejar, html-scraping]
  patterns: [service-class-per-integration, cron-token-auth, diagnostic-logging]

key-files:
  created:
    - src/Services/StatLinkService.php
    - src/Controllers/StatLinkController.php
    - cron/statlink.php
    - migrations/013_statlink_tracking.sql
  modified:
    - config/routes.php
    - src/Core/Controller.php

key-decisions:
  - "Cookie-based Guzzle session for StatLink (no API available)"
  - "Anchor sanitization: Polish diacritics → ASCII (StatLink restriction)"
  - "MAX_LINKS_PER_RUN=1 to avoid rate limiting"
  - "ilosc_dziennie=0.02, link lifetime 60 days"
  - "json_encode with JSON_INVALID_UTF8_SUBSTITUTE for scraped HTML safety"

patterns-established:
  - "Diagnostic array pattern for debugging external service integrations"
  - "FTP deploy requires OPcache reset for changes to take effect"

duration: ~4h (initial build) + 2h (bugfix session 2026-04-09)
started: 2026-04-08
completed: 2026-04-09T11:15:00Z
---

# Phase 1 Plan 01: StatLink Auto-Linking Summary

**Automated StatLink.pl link management: log in, add links for published articles, track their lifecycle, and remove them after 60 days**

## Performance

| Metric | Value |
|--------|-------|
| Duration | ~6h total (build + bugfix) |
| Started | 2026-04-08 |
| Completed | 2026-04-09 |
| Tasks | 3 completed |
| Files created | 4 |
| Files modified | 2 |

## Acceptance Criteria Results

| Criterion | Status | Notes |
|-----------|--------|-------|
| AC-1: Log in to StatLink | Pass | Guzzle CookieJar, GET homepage + POST login, verified on production |
| AC-2: Add a link | Pass | Form POST with CSRF, anchor sanitization, ID extraction from response |
| AC-3: Remove expired links | Pass | POST with statlink_id + usun, status tracking |
| AC-4: Cron endpoint | Pass | /statlink/token-run with SEO_TRIGGER_TOKEN, also cron/statlink.php |
| AC-5: Tracking in the database | Pass | statlink_links table with full lifecycle tracking |

## Accomplishments

- The StatLink service logs in, adds links, and removes expired links end-to-end
- Diagnostic logging: every step is tracked and errors are surfaced in the JSON response
- Retry mechanism for failed links, with error tracking in the database
- Token-secured HTTP endpoint plus a standalone cron script

## Files Created/Modified

| File | Change | Purpose |
|------|--------|---------|
| `src/Services/StatLinkService.php` | Created | Core service: login, addLink, removeLink, processNewArticles, retryFailedLinks, removeExpiredLinks |
| `src/Controllers/StatLinkController.php` | Created | HTTP endpoints: index (admin view), runByToken (cron trigger) |
| `cron/statlink.php` | Created | Standalone cron script with lock file |
| `migrations/013_statlink_tracking.sql` | Created | statlink_links table schema |
| `config/routes.php` | Modified | Added /statlink routes |
| `src/Core/Controller.php` | Modified | json_encode with JSON_INVALID_UTF8_SUBSTITUTE |

## Decisions Made

| Decision | Rationale | Impact |
|----------|-----------|--------|
| ASCII-only anchors (transliteration) | StatLink rejects Polish diacritics in the anchor field | All anchors are auto-sanitized (ą→a, ś→s, etc.) |
| MAX_LINKS_PER_RUN=1 | Avoid StatLink rate limiting | 1 link per cron run, predictable load |
| Timeout 120s per request | StatLink is slow | connect_timeout=60s, timeout=120s |
| set_time_limit(300) | PHP default of 30s is insufficient | Applied in both the controller and the cron script |
| JSON_INVALID_UTF8_SUBSTITUTE | Scraped StatLink HTML contains broken UTF-8 | Prevents empty JSON responses |
| findLinkIdInHtml before search | The response HTML already contains the new link | Fewer requests, more reliable ID detection |

## Deviations from Plan

### Summary

| Type | Count | Impact |
|------|-------|--------|
| Auto-fixed | 3 | Critical: without these fixes, no links were being added |
| Scope additions | 1 | Retry mechanism (not in original plan) |
| Deferred | 2 | No max retry limit; StatLink cron not integrated into main publish cron |

### Auto-fixed Issues

**1. Anchor encoding: StatLink rejects Polish characters**
- **Found during:** Production testing
- **Issue:** StatLink form validation requires ASCII-only anchors (alphanumeric + `.,+-_?!&\:=` + space)
- **Fix:** Added `sanitizeAnchor()` with a Polish→ASCII transliteration map
- **Files:** `src/Services/StatLinkService.php`
- **Verification:** Links are now added successfully with sanitized anchors

**2. Empty JSON responses from scraped HTML**
- **Found during:** Production debugging
- **Issue:** `json_encode()` returns `false` when the data contains invalid UTF-8 from StatLink HTML, so the endpoint outputs nothing
- **Fix:** Added `JSON_INVALID_UTF8_SUBSTITUTE | JSON_UNESCAPED_UNICODE` flags
- **Files:** `src/Core/Controller.php`
- **Verification:** All endpoints return valid JSON

**3. StatLink ID not detected after successful add**
- **Found during:** Production testing
- **Issue:** `findLinkIdBySearch` made a separate request, and its URL matching was too narrow (no protocol variants, small search region)
- **Fix:** New `findLinkIdInHtml()` extracts the ID directly from the form response HTML, using a wider region and URL variants
- **Files:** `src/Services/StatLinkService.php`
- **Verification:** `statlink_id=2673465` correctly detected

### Deferred Items

- No max retry count for permanently failing links (could block the queue)
- StatLink cron is not integrated into the main publish cron; it needs a separate cron job set up on the server

## Issues Encountered

| Issue | Resolution |
|-------|------------|
| OPcache serving stale files after FTP upload | Manual opcache_reset() via test script; documented in patterns |
| PHP max_execution_time killing the script | Added set_time_limit(300) in controller and cron |
| Login diagnostics missing on failure | Added loginDiagnostic in all error paths (empty credentials, exceptions) |

## Next Phase Readiness

**Ready:**
- StatLink service fully operational; links are being added and tracked
- 37 failed links queued for retry (will auto-process via cron)
- Admin panel view exists at /statlink

**Concerns:**
- No max retry limit: a permanently failing link blocks the queue
- Cron not yet configured on the server (only the manual token URL trigger is available)

**Blockers:**
- None

---

*Phase: 01-statlink-autolinking, Plan: 01*
*Completed: 2026-04-09*
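
The anchor sanitization described under auto-fixed issue 1 could look roughly like the sketch below. This is an illustration only: the function name `sanitizeAnchorSketch` and the exact transliteration map are assumptions, and the real `sanitizeAnchor()` in `src/Services/StatLinkService.php` may differ. It transliterates Polish diacritics, then drops any character outside the set StatLink accepts (alphanumerics, space, and `.,+-_?!&\:=`).

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical sketch of anchor sanitization; not the project's actual code.
 * The source file must be saved as UTF-8 for the map keys to match.
 */
function sanitizeAnchorSketch(string $anchor): string
{
    // Polish diacritics mapped to their closest ASCII letters.
    $map = [
        'ą' => 'a', 'ć' => 'c', 'ę' => 'e', 'ł' => 'l', 'ń' => 'n',
        'ó' => 'o', 'ś' => 's', 'ź' => 'z', 'ż' => 'z',
        'Ą' => 'A', 'Ć' => 'C', 'Ę' => 'E', 'Ł' => 'L', 'Ń' => 'N',
        'Ó' => 'O', 'Ś' => 'S', 'Ź' => 'Z', 'Ż' => 'Z',
    ];
    $ascii = strtr($anchor, $map);

    // Remove everything outside: A-Z a-z 0-9 space . , + - _ ? ! & \ : =
    $ascii = preg_replace('/[^A-Za-z0-9 .,+\-_?!&\\\\:=]/', '', $ascii) ?? '';

    // Collapse repeated spaces left behind by removed characters.
    return trim(preg_replace('/ {2,}/', ' ', $ascii) ?? '');
}

echo sanitizeAnchorSketch('Zażółć gęślą jaźń'), PHP_EOL;
```

Dropping unsupported characters, rather than replacing them with a placeholder, keeps the anchor readable; whether the real service drops or substitutes them is not stated in this summary.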
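
The empty-response problem from auto-fixed issue 2 can be reproduced in isolation. The sketch below uses a made-up payload (the `html` key and the stray `0xB3` byte are assumptions standing in for scraped StatLink markup) to show that plain `json_encode()` fails on malformed UTF-8, while the flags named in the fix substitute U+FFFD and still return a string.

```php
<?php
declare(strict_types=1);

// Hypothetical payload: "\xB3" is not valid UTF-8 on its own,
// similar to a fragment scraped from a mis-encoded page.
$payload = ['html' => "Has\xB3o", 'ok' => true];

// Without flags, encoding fails and returns false,
// so echoing the result would produce an empty response body.
$plain = json_encode($payload);
var_dump($plain === false);
var_dump(json_last_error() === JSON_ERROR_UTF8);

// With the substitute flag, invalid bytes become U+FFFD instead of aborting.
$safe = json_encode(
    $payload,
    JSON_INVALID_UTF8_SUBSTITUTE | JSON_UNESCAPED_UNICODE
);
echo $safe, PHP_EOL;
```

`JSON_INVALID_UTF8_SUBSTITUTE` requires PHP 7.2 or later. The substitution is lossy, which is acceptable for diagnostic output but would not be for data that must round-trip exactly.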