backPRO/.paul/phases/01-statlink-autolinking/01-01-SUMMARY.md
2026-04-09 11:44:45 +02:00


---
phase: 01-statlink-autolinking
plan: 01
subsystem: seo
tags: [statlink, guzzle, scraping, cron, seo-linkbuilding]
requires:
  - phase: none
    provides: published articles with wp_post_url
provides:
  - StatLink.pl auto-linking service
  - Cron endpoint for link management
  - Link lifecycle tracking (add → expire → remove)
affects: [admin-panel, monitoring]
tech-stack:
  added: [guzzle-cookiejar, html-scraping]
  patterns: [service-class-per-integration, cron-token-auth, diagnostic-logging]
key-files:
  created:
    - src/Services/StatLinkService.php
    - src/Controllers/StatLinkController.php
    - cron/statlink.php
    - migrations/013_statlink_tracking.sql
  modified:
    - config/routes.php
    - src/Core/Controller.php
key-decisions:
  - "Cookie-based Guzzle session for StatLink (no API available)"
  - "Anchor sanitization: Polish diacritics → ASCII (StatLink restriction)"
  - "MAX_LINKS_PER_RUN=1 to avoid rate limiting"
  - "ilosc_dziennie=0.02, link lifetime 60 days"
  - "json_encode with JSON_INVALID_UTF8_SUBSTITUTE for scraped HTML safety"
patterns-established:
  - "Diagnostic array pattern for debugging external service integrations"
  - "FTP deploy requires OPcache reset for changes to take effect"
duration: ~4h (initial build) + 2h (bugfix session 2026-04-09)
started: 2026-04-08
completed: 2026-04-09T11:15:00Z
---
# Phase 1 Plan 01: StatLink Auto-Linking Summary
**Automated StatLink.pl link management: login, add links for published articles, track lifecycle, remove after 60 days**
## Performance
| Metric | Value |
|--------|-------|
| Duration | ~6h total (build + bugfix) |
| Started | 2026-04-08 |
| Completed | 2026-04-09 |
| Tasks | 3 completed |
| Files created | 4 |
| Files modified | 2 |
## Acceptance Criteria Results
| Criterion | Status | Notes |
|-----------|--------|-------|
| AC-1: Log in to StatLink | Pass | Guzzle CookieJar, GET homepage + POST login, verified on production |
| AC-2: Add a link | Pass | Form POST with CSRF, anchor sanitization, ID extraction from response |
| AC-3: Remove expired links | Pass | POST with statlink_id + usun, status tracking |
| AC-4: Cron endpoint | Pass | /statlink/token-run with SEO_TRIGGER_TOKEN, also cron/statlink.php |
| AC-5: Database tracking | Pass | statlink_links table with full lifecycle tracking |
## Accomplishments
- The StatLink service logs in, adds links, and removes expired links end to end
- Diagnostic logging at every step, with errors surfaced in the JSON response
- Retry mechanism for failed links with error tracking in database
- Token-secured HTTP endpoint + standalone cron script
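
The token check behind the HTTP trigger could look roughly like the sketch below. It is an illustration only: the helper name `isValidCronToken` is hypothetical, and how the real controller reads the token from the request is not stated in this summary. The caller is assumed to pass the configured `SEO_TRIGGER_TOKEN` value as `$expected` and to answer with HTTP 403 when the check fails.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper (not from the project): compares the token supplied
 * with the request against the configured one.
 */
function isValidCronToken(?string $provided, ?string $expected): bool
{
    // Reject when either side is missing, so an unset config never matches.
    if ($provided === null || $provided === '' || $expected === null || $expected === '') {
        return false;
    }

    // hash_equals() compares in constant time, which avoids leaking
    // how many leading characters of the token were correct.
    return hash_equals($expected, $provided);
}
```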
## Files Created/Modified
| File | Change | Purpose |
|------|--------|---------|
| `src/Services/StatLinkService.php` | Created | Core service: login, addLink, removeLink, processNewArticles, retryFailedLinks, removeExpiredLinks |
| `src/Controllers/StatLinkController.php` | Created | HTTP endpoints: index (admin view), runByToken (cron trigger) |
| `cron/statlink.php` | Created | Standalone cron script with lock file |
| `migrations/013_statlink_tracking.sql` | Created | statlink_links table schema |
| `config/routes.php` | Modified | Added /statlink routes |
| `src/Core/Controller.php` | Modified | json_encode with JSON_INVALID_UTF8_SUBSTITUTE |
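
A lock file for the standalone cron script can be sketched as follows. The summary only says that `cron/statlink.php` uses a lock file; the use of `flock()`, the helper names, and the lock path are assumptions made for illustration.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: tries to take an exclusive, non-blocking lock.
 * Returns the open file handle on success, or null when the file cannot
 * be opened or another run already holds the lock.
 */
function acquireCronLock(string $lockPath)
{
    // Mode 'c' creates the file if missing and does not truncate it.
    $handle = fopen($lockPath, 'c');
    if ($handle === false) {
        return null;
    }

    // LOCK_NB makes flock() return immediately instead of waiting.
    if (!flock($handle, LOCK_EX | LOCK_NB)) {
        fclose($handle);
        return null;
    }

    return $handle;
}

/** Releases the lock taken by acquireCronLock() and closes the handle. */
function releaseCronLock($handle): void
{
    flock($handle, LOCK_UN);
    fclose($handle);
}
```

A cron script built on this sketch would call `acquireCronLock()` first, exit quietly when it returns `null`, and call `releaseCronLock()` once the run finishes.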
## Decisions Made
| Decision | Rationale | Impact |
|----------|-----------|--------|
| ASCII-only anchors (transliteration) | StatLink rejects Polish diacritics in anchor field | All anchors are auto-sanitized (ą→a, ś→s, etc.) |
| MAX_LINKS_PER_RUN=1 | Avoid StatLink rate limiting | 1 link per cron run, predictable load |
| Timeout 120s per request | StatLink is slow | connect_timeout=60s, timeout=120s |
| set_time_limit(300) | PHP default 30s insufficient | Both controller and cron script |
| JSON_INVALID_UTF8_SUBSTITUTE | Scraped StatLink HTML contains broken UTF-8 | Prevents empty JSON responses |
| Try findLinkIdInHtml before findLinkIdBySearch | Response HTML already contains the new link | Fewer requests, more reliable ID detection |
## Deviations from Plan
### Summary
| Type | Count | Impact |
|------|-------|--------|
| Auto-fixed | 3 | Critical: without these fixes, no links were being added |
| Scope additions | 1 | Retry mechanism (not in original plan) |
| Deferred | 2 | No max retry limit; cron job not yet set up on the server |
### Auto-fixed Issues
**1. Anchor encoding — StatLink rejects Polish characters**
- **Found during:** Production testing
- **Issue:** StatLink form validation requires ASCII-only anchors (alphanumeric + `.,+-_?!&\:=` + space)
- **Fix:** Added `sanitizeAnchor()` with Polish→ASCII transliteration map
- **Files:** `src/Services/StatLinkService.php`
- **Verification:** Links now added successfully with sanitized anchors
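
A minimal sketch of this kind of sanitization is shown below. It is not the project's `sanitizeAnchor()`; the function name `sanitizeAnchorSketch`, the exact transliteration map, and the decision to silently drop other disallowed characters and collapse leftover spaces are assumptions. The allowed set follows the rule quoted above.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical sketch: transliterates Polish diacritics to ASCII and
 * removes every character outside the set StatLink is reported to accept.
 */
function sanitizeAnchorSketch(string $anchor): string
{
    // Polish diacritics mapped to the closest ASCII letter.
    $map = [
        'ą' => 'a', 'ć' => 'c', 'ę' => 'e', 'ł' => 'l', 'ń' => 'n',
        'ó' => 'o', 'ś' => 's', 'ź' => 'z', 'ż' => 'z',
        'Ą' => 'A', 'Ć' => 'C', 'Ę' => 'E', 'Ł' => 'L', 'Ń' => 'N',
        'Ó' => 'O', 'Ś' => 'S', 'Ź' => 'Z', 'Ż' => 'Z',
    ];
    $ascii = strtr($anchor, $map);

    // Keep only alphanumerics, space, and . , + - _ ? ! & \ : =
    $ascii = preg_replace('/[^A-Za-z0-9 .,+\-_?!&\\\\:=]/', '', $ascii);

    // Collapse runs of spaces left behind by removed characters.
    return trim(preg_replace('/ {2,}/', ' ', $ascii));
}
```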
**2. Empty JSON responses from scraped HTML**
- **Found during:** Production debugging
- **Issue:** `json_encode()` returns `false` when the data contains invalid UTF-8 from StatLink HTML, so the endpoint sent an empty response body
- **Fix:** Added `JSON_INVALID_UTF8_SUBSTITUTE | JSON_UNESCAPED_UNICODE` flags
- **Files:** `src/Core/Controller.php`
- **Verification:** All endpoints return valid JSON
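
The failure and the fix can be reproduced in isolation. The payload below is made up for the demonstration: it contains a single byte (`0xB3`) that is not valid UTF-8 on its own, standing in for the broken text scraped from StatLink.

```php
<?php
declare(strict_types=1);

// Made-up payload with one invalid UTF-8 byte in the middle of a string.
$payload = ['html' => 'przyk' . "\xB3" . 'ad'];

// Without flags, encoding fails and returns false, so echoing it prints nothing.
$plain = json_encode($payload);

// With the substitute flag, the bad byte is replaced by U+FFFD and encoding succeeds.
$safe = json_encode($payload, JSON_INVALID_UTF8_SUBSTITUTE | JSON_UNESCAPED_UNICODE);
```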
**3. StatLink ID not detected after successful add**
- **Found during:** Production testing
- **Issue:** `findLinkIdBySearch` made a separate request, and its URL matching was too narrow (no protocol variants, small search region)
- **Fix:** New `findLinkIdInHtml()` extracts ID directly from form response HTML with wider region and URL variants
- **Files:** `src/Services/StatLinkService.php`
- **Verification:** `statlink_id=2673465` correctly detected
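
The idea of matching URL variants inside a wider region of the response can be sketched as below. This is not the project's `findLinkIdInHtml()`. The function name `findLinkIdNearUrl`, the radius, and above all the assumption that the ID appears in the HTML as `name="statlink_id" value="<digits>"` are hypothetical, since the summary does not describe StatLink's markup. The sketch also takes the first ID found in the window, which could belong to a neighbouring entry when entries sit close together.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical sketch: finds the URL in the HTML regardless of scheme,
 * "www." prefix, or trailing slash, then looks for an ID near each match.
 */
function findLinkIdNearUrl(string $html, string $url, int $radius = 2000): ?int
{
    // Strip scheme, "www." and trailing slash so one needle covers all variants.
    $needle = preg_replace('#^https?://(www\.)?#i', '', rtrim($url, '/'));
    if ($needle === '') {
        return null;
    }

    $offset = 0;
    while (($pos = stripos($html, $needle, $offset)) !== false) {
        // Inspect a window of $radius bytes on both sides of the match.
        $start = max(0, $pos - $radius);
        $length = ($pos - $start) + strlen($needle) + $radius;
        $region = substr($html, $start, $length);

        // ASSUMPTION: the ID is exposed as a form field in this shape.
        if (preg_match('/name="statlink_id"\s+value="(\d+)"/i', $region, $m)) {
            return (int) $m[1];
        }
        $offset = $pos + strlen($needle);
    }

    return null;
}
```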
### Deferred Items
- No max retry count for permanently failing links (could block the queue)
- StatLink cron is not integrated into the main publish cron; it needs a separate cron job on the server
## Issues Encountered
| Issue | Resolution |
|-------|------------|
| OPcache serving stale files after FTP upload | Manual opcache_reset() via test script; documented in patterns |
| PHP max_execution_time killing script | Added set_time_limit(300) in controller and cron |
| Login diagnostics missing on failure | Added loginDiagnostic in all error paths (empty credentials, exceptions) |
## Next Phase Readiness
**Ready:**
- StatLink service fully operational, links being added and tracked
- 37 failed links queued for retry (processed automatically on each cron run, or via the token URL until cron is configured)
- Admin panel view exists at /statlink
**Concerns:**
- No max retry limit: a permanently failing link can block the queue
- Cron not yet configured on server (only manual token URL trigger)
**Blockers:**
- None
---
*Phase: 01-statlink-autolinking, Plan: 01*
*Completed: 2026-04-09*