backPRO/.paul/phases/01-statlink-autolinking/01-01-SUMMARY.md
2026-04-09 11:44:45 +02:00

phase: 01-statlink-autolinking
plan: 01
subsystem: seo
tags: statlink, guzzle, scraping, cron, seo-linkbuilding
requires:
  • phase: none
  • provides: published articles with wp_post_url
provides:
  • StatLink.pl auto-linking service
  • Cron endpoint for link management
  • Link lifecycle tracking (add → expire → remove)
affects: admin-panel, monitoring
tech-stack (added / patterns): guzzle-cookiejar, html-scraping, service-class-per-integration, cron-token-auth, diagnostic-logging
key-files:
  • created: src/Services/StatLinkService.php, src/Controllers/StatLinkController.php, cron/statlink.php, migrations/013_statlink_tracking.sql
  • modified: config/routes.php, src/Core/Controller.php
key-decisions:
  • Cookie-based Guzzle session for StatLink (no API available)
  • Anchor sanitization: Polish diacritics → ASCII (StatLink restriction)
  • MAX_LINKS_PER_RUN=1 to avoid rate limiting
  • ilosc_dziennie=0.02, link lifetime 60 days
  • json_encode with JSON_INVALID_UTF8_SUBSTITUTE for scraped HTML safety
patterns-established:
  • Diagnostic array pattern for debugging external service integrations
  • FTP deploy requires OPcache reset for changes to take effect
duration: ~4h (initial build) + 2h (bugfix session 2026-04-09)
started: 2026-04-08
completed: 2026-04-09T11:15:00Z

Phase 1 Plan 01: StatLink Auto-Linking Summary

Automated StatLink.pl link management: login, add links for published articles, track lifecycle, remove after 60 days
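
The 60-day removal rule can be illustrated with a small date check. This is a minimal sketch, not the code in StatLinkService.php: the function name `isLinkExpired` and its parameters are hypothetical, and it assumes the lifetime is counted from the moment the link was added.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: decide whether a link has reached the end of its lifetime.
 * Assumes the lifetime starts when the link was added to StatLink.
 */
function isLinkExpired(
    DateTimeImmutable $addedAt,
    DateTimeImmutable $now,
    int $lifetimeDays = 60
): bool {
    // The link expires once "added + lifetime" is no longer in the future.
    return $addedAt->modify("+{$lifetimeDays} days") <= $now;
}

$utc   = new DateTimeZone('UTC');
$added = new DateTimeImmutable('2026-04-09 00:00:00', $utc);

var_dump(isLinkExpired($added, new DateTimeImmutable('2026-06-07 00:00:00', $utc))); // day 59
var_dump(isLinkExpired($added, new DateTimeImmutable('2026-06-08 00:00:00', $utc))); // day 60
```

A cleanup step built on a check like this would select links whose status is still active and whose add date is at least 60 days old, then call the removal routine for each.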

Performance

| Metric | Value |
|---|---|
| Duration | ~6h total (build + bugfix) |
| Started | 2026-04-08 |
| Completed | 2026-04-09 |
| Tasks | 3 completed |
| Files created | 4 |
| Files modified | 2 |

Acceptance Criteria Results

| Criterion | Status | Notes |
|---|---|---|
| AC-1: Log in to StatLink | Pass | Guzzle CookieJar, GET homepage + POST login, verified on production |
| AC-2: Adding a link | Pass | Form POST with CSRF, anchor sanitization, ID extraction from response |
| AC-3: Removing expired links | Pass | POST with statlink_id + usun, status tracking |
| AC-4: Cron endpoint | Pass | /statlink/token-run with SEO_TRIGGER_TOKEN, also cron/statlink.php |
| AC-5: Tracking in the database | Pass | statlink_links table with full lifecycle tracking |

Accomplishments

  • StatLink service logs in, adds links, removes expired links end-to-end
  • Robust diagnostic logging — every step tracked, errors surfaced in JSON response
  • Retry mechanism for failed links with error tracking in database
  • Token-secured HTTP endpoint + standalone cron script
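
As an illustration of the token-secured endpoint, the sketch below shows one common way to validate a trigger token in PHP. The function name `isValidCronToken` is hypothetical, and how the real controller reads and compares SEO_TRIGGER_TOKEN is not shown in this summary.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical token check for a cron-trigger endpoint.
 * $expected would come from configuration (for example SEO_TRIGGER_TOKEN),
 * $provided from the incoming request.
 */
function isValidCronToken(string $provided, string $expected): bool
{
    // Refuse to run when no token is configured at all.
    if ($expected === '') {
        return false;
    }

    // hash_equals() compares in constant time, which avoids timing leaks.
    return hash_equals($expected, $provided);
}

var_dump(isValidCronToken('s3cret', 's3cret')); // matching token
var_dump(isValidCronToken('guess', 's3cret'));  // wrong token
```

Rejecting an empty configured token matters because an unset environment variable would otherwise match an empty request parameter.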

Files Created/Modified

| File | Change | Purpose |
|---|---|---|
| src/Services/StatLinkService.php | Created | Core service: login, addLink, removeLink, processNewArticles, retryFailedLinks, removeExpiredLinks |
| src/Controllers/StatLinkController.php | Created | HTTP endpoints: index (admin view), runByToken (cron trigger) |
| cron/statlink.php | Created | Standalone cron script with lock file |
| migrations/013_statlink_tracking.sql | Created | statlink_links table schema |
| config/routes.php | Modified | Added /statlink routes |
| src/Core/Controller.php | Modified | json_encode with JSON_INVALID_UTF8_SUBSTITUTE |
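
A lock file keeps two cron runs from overlapping. The sketch below shows one way to do this with `flock()`. The helper names `acquireCronLock` and `releaseCronLock` and the lock path are hypothetical, and cron/statlink.php may implement its lock differently.

```php
<?php
declare(strict_types=1);

/**
 * Try to take an exclusive, non-blocking lock on $path.
 * Returns the open handle on success, or null when the lock is already
 * held or the file cannot be opened.
 */
function acquireCronLock(string $path)
{
    $handle = @fopen($path, 'c'); // create if missing, do not truncate
    if ($handle === false) {
        return null;
    }
    if (!flock($handle, LOCK_EX | LOCK_NB)) {
        fclose($handle);
        return null;
    }
    return $handle;
}

/** Release the lock and close the handle. */
function releaseCronLock($handle): void
{
    flock($handle, LOCK_UN);
    fclose($handle);
}

// Hypothetical lock location, for illustration only.
$lockPath = sys_get_temp_dir() . '/statlink_example.lock';
$lock = acquireCronLock($lockPath);
if ($lock === null) {
    echo "Another run is in progress, exiting.\n";
} else {
    echo "Lock acquired, running job.\n";
    releaseCronLock($lock);
}
```

Because the operating system drops an `flock()` lock when the process exits, a crashed run does not leave a stale lock behind, which is the main advantage over checking for the mere existence of a file.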

Decisions Made

| Decision | Rationale | Impact |
|---|---|---|
| ASCII-only anchors (transliteration) | StatLink rejects Polish diacritics in anchor field | All anchors auto-sanitized ą→a, ś→s etc. |
| MAX_LINKS_PER_RUN=1 | Avoid StatLink rate limiting | 1 link per cron run, predictable load |
| Timeout 120s per request | StatLink is slow | connect_timeout=60s, timeout=120s |
| set_time_limit(300) | PHP default 30s insufficient | Both controller and cron script |
| JSON_INVALID_UTF8_SUBSTITUTE | Scraped StatLink HTML contains broken UTF-8 | Prevents empty JSON responses |
| findLinkIdInHtml before search | Response HTML already contains new link | Reduces requests, more reliable ID detection |

Deviations from Plan

Summary

| Type | Count | Impact |
|---|---|---|
| Auto-fixed | 3 | Critical: without these fixes, no links were being added |
| Scope additions | 1 | Retry mechanism (not in original plan) |
| Deferred | 1 | No max retry limit |

Auto-fixed Issues

1. Anchor encoding — StatLink rejects Polish characters

  • Found during: Production testing
  • Issue: StatLink form validation requires ASCII-only anchors (alphanumeric + .,+-_?!&\:= + space)
  • Fix: Added sanitizeAnchor() with Polish→ASCII transliteration map
  • Files: src/Services/StatLinkService.php
  • Verification: Links now added successfully with sanitized anchors
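
The transliteration step can be sketched as follows. This is an illustrative version under stated assumptions, not the actual sanitizeAnchor() from StatLinkService.php: the function name `sanitizeAnchorSketch` is hypothetical, and the allowed character set is taken literally from the issue description above (including the backslash).

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical anchor sanitizer: transliterate Polish letters to ASCII,
 * then drop every character outside the allowed set.
 */
function sanitizeAnchorSketch(string $anchor): string
{
    // Polish diacritics mapped to their closest ASCII letters.
    $map = [
        'ą' => 'a', 'ć' => 'c', 'ę' => 'e', 'ł' => 'l', 'ń' => 'n',
        'ó' => 'o', 'ś' => 's', 'ź' => 'z', 'ż' => 'z',
        'Ą' => 'A', 'Ć' => 'C', 'Ę' => 'E', 'Ł' => 'L', 'Ń' => 'N',
        'Ó' => 'O', 'Ś' => 'S', 'Ź' => 'Z', 'Ż' => 'Z',
    ];
    $ascii = strtr($anchor, $map);

    // Keep only alphanumerics, space, and . , + - _ ? ! & \ : =
    $ascii = preg_replace('/[^A-Za-z0-9 .,+\-_?!&\\\\:=]/', '', $ascii) ?? '';

    // Collapse runs of spaces left behind by removed characters.
    $ascii = preg_replace('/ {2,}/', ' ', $ascii) ?? '';

    return trim($ascii);
}

echo sanitizeAnchorSketch('Zażółć gęślą jaźń'), "\n"; // Zazolc gesla jazn
```

Characters that are neither in the map nor in the allowed set (for example a trademark sign) are simply removed, so the result is always safe to submit even if it is shorter than the input.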

2. Empty JSON responses from scraped HTML

  • Found during: Production debugging
  • Issue: json_encode() returns false, so the endpoint outputs an empty body, when the data contains invalid UTF-8 from StatLink HTML
  • Fix: Added JSON_INVALID_UTF8_SUBSTITUTE | JSON_UNESCAPED_UNICODE flags
  • Files: src/Core/Controller.php
  • Verification: All endpoints return valid JSON
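
The failure and the fix can be reproduced in isolation. The snippet below is a minimal sketch: the helper name `encodeJsonSafely` and the sample payload are hypothetical, while the flags are the ones named in the fix above.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: encode data as JSON, replacing invalid UTF-8
 * bytes with U+FFFD instead of failing.
 */
function encodeJsonSafely(array $data): string
{
    $json = json_encode($data, JSON_INVALID_UTF8_SUBSTITUTE | JSON_UNESCAPED_UNICODE);
    if ($json === false) {
        // Still possible for other reasons (e.g. nesting too deep).
        throw new RuntimeException('JSON encoding failed: ' . json_last_error_msg());
    }
    return $json;
}

// 0xB1 on its own is not valid UTF-8 (it is "ą" in ISO-8859-2).
$payload = ['html' => "link dodany \xB1"];

// Without flags the call fails and returns false.
var_dump(json_encode($payload));
var_dump(json_last_error() === JSON_ERROR_UTF8);

// With the substitute flag the bad byte becomes the replacement character.
$decoded = json_decode(encodeJsonSafely($payload), true);
var_dump($decoded['html'] === "link dodany \u{FFFD}");
```

Checking for `false` explicitly, rather than echoing the result directly, keeps any remaining encoding failure visible instead of producing a silent empty response.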

3. StatLink ID not detected after successful add

  • Found during: Production testing
  • Issue: findLinkIdBySearch made a separate request, and its URL matching was too narrow (no protocol variants, small search region)
  • Fix: New findLinkIdInHtml() extracts ID directly from form response HTML with wider region and URL variants
  • Files: src/Services/StatLinkService.php
  • Verification: statlink_id=2673465 correctly detected
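
The idea behind this fix (match the URL regardless of protocol, `www.` prefix and trailing slash, then scan a window around the match for an ID) can be sketched as below. Everything about the markup is an assumption: the summary does not show StatLink's HTML, so the `id=<digits>` pattern, the window size and the function name `findLinkIdInHtmlSketch` are hypothetical.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical sketch: locate $url in $html and return the numeric ID
 * closest to it, or null when the URL or an ID cannot be found.
 * ASSUMPTION: IDs appear as "?id=123", "&id=123" or ";id=123" near the URL.
 */
function findLinkIdInHtmlSketch(string $html, string $url, int $window = 2000): ?int
{
    // Strip scheme, "www." and trailing slash so all URL variants match.
    $needle = preg_replace('#^https?://(www\.)?#i', '', rtrim($url, '/')) ?? '';
    if ($needle === '') {
        return null;
    }

    $pos = stripos($html, $needle);
    if ($pos === false) {
        return null;
    }

    // Region of up to $window bytes on each side of the matched URL.
    $start  = max(0, $pos - $window);
    $length = ($pos - $start) + strlen($needle) + $window;
    $region = substr($html, $start, $length);

    if (!preg_match_all('/[?&;]id=(\d+)/', $region, $m, PREG_OFFSET_CAPTURE)) {
        return null;
    }

    // Pick the ID whose position is closest to the URL match.
    $best = null;
    $bestDistance = PHP_INT_MAX;
    foreach ($m[1] as [$id, $offset]) {
        $distance = abs(($start + $offset) - $pos);
        if ($distance < $bestDistance) {
            $bestDistance = $distance;
            $best = (int) $id;
        }
    }
    return $best;
}
```

Searching for the normalized form covers `http://`, `https://` and `www.` variants with a single lookup, and choosing the nearest ID reduces the chance of picking up a neighbouring row when several links sit inside the same window.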

Deferred Items

  • No max retry count for permanently failing links (could block queue)
  • StatLink cron not integrated into main publish cron — needs separate cron job setup on server
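
One possible shape for the deferred retry limit is sketched below. Nothing here exists in the project yet: the `retry_count` field, the cap of 5 and the function name `selectRetryCandidates` are all hypothetical and serve only to show how a cap would stop a permanently failing link from occupying the queue.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical selection of failed links to retry in one run.
 * Links that already reached $maxRetries are skipped; the remaining
 * ones are ordered by fewest attempts first and cut to $limit.
 *
 * @param array<int, array{id:int, retry_count:int}> $failedLinks
 * @return array<int, array{id:int, retry_count:int}>
 */
function selectRetryCandidates(array $failedLinks, int $maxRetries, int $limit): array
{
    $eligible = array_filter(
        $failedLinks,
        fn (array $link): bool => $link['retry_count'] < $maxRetries
    );

    // Least-retried links go first so a single stubborn link cannot starve the rest.
    usort($eligible, fn (array $a, array $b): int => $a['retry_count'] <=> $b['retry_count']);

    return array_slice($eligible, 0, $limit);
}

$failed = [
    ['id' => 1, 'retry_count' => 7],
    ['id' => 2, 'retry_count' => 2],
    ['id' => 3, 'retry_count' => 0],
];

print_r(selectRetryCandidates($failed, 5, 1)); // only link 3 is selected
```

In a database-backed version the same rule would typically be a WHERE clause on the retry counter plus an ORDER BY and LIMIT.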

Issues Encountered

| Issue | Resolution |
|---|---|
| OPcache serving stale files after FTP upload | Manual opcache_reset() via test script; documented in patterns |
| PHP max_execution_time killing script | Added set_time_limit(300) in controller and cron |
| Login diagnostics missing on failure | Added loginDiagnostic in all error paths (empty credentials, exceptions) |

Next Phase Readiness

Ready:

  • StatLink service fully operational, links being added and tracked
  • 37 failed links queued for retry (processed automatically once the cron job is configured; until then only via the manual token URL)
  • Admin panel view exists at /statlink

Concerns:

  • No max retry limit — a permanently failing link blocks the queue
  • Cron not yet configured on server (only manual token URL trigger)

Blockers:

  • None

Phase: 01-statlink-autolinking, Plan: 01
Completed: 2026-04-09