---
phase: 01-statlink-autolinking
plan: 01
subsystem: seo
tags: [statlink, guzzle, scraping, cron, seo-linkbuilding]

requires:
  - phase: none
    provides: published articles with wp_post_url

provides:
  - StatLink.pl auto-linking service
  - Cron endpoint for link management
  - Link lifecycle tracking (add → expire → remove)

affects: [admin-panel, monitoring]

tech-stack:
  added: [guzzle-cookiejar, html-scraping]
  patterns: [service-class-per-integration, cron-token-auth, diagnostic-logging]

key-files:
  created:
    - src/Services/StatLinkService.php
    - src/Controllers/StatLinkController.php
    - cron/statlink.php
    - migrations/013_statlink_tracking.sql
  modified:
    - config/routes.php
    - src/Core/Controller.php

key-decisions:
  - "Cookie-based Guzzle session for StatLink (no API available)"
  - "Anchor sanitization: Polish diacritics → ASCII (StatLink restriction)"
  - "MAX_LINKS_PER_RUN=1 to avoid rate limiting"
  - "ilosc_dziennie=0.02, link lifetime 60 days"
  - "json_encode with JSON_INVALID_UTF8_SUBSTITUTE for scraped HTML safety"

patterns-established:
  - "Diagnostic array pattern for debugging external service integrations"
  - "FTP deploy requires OPcache reset for changes to take effect"

duration: ~4h (initial build) + 2h (bugfix session 2026-04-09)
started: 2026-04-08
completed: 2026-04-09T11:15:00Z
---

# Phase 1 Plan 01: StatLink Auto-Linking Summary

**Automated StatLink.pl link management: log in, add links for published articles, track their lifecycle, and remove them after 60 days**

## Performance

| Metric | Value |
|--------|-------|
| Duration | ~6h total (build + bugfix) |
| Started | 2026-04-08 |
| Completed | 2026-04-09 |
| Tasks | 3 completed |
| Files created | 4 |
| Files modified | 2 |

## Acceptance Criteria Results

| Criterion | Status | Notes |
|-----------|--------|-------|
| AC-1: StatLink login | Pass | Guzzle CookieJar; GET homepage, then POST login; verified on production |
| AC-2: Adding a link | Pass | Form POST with CSRF token, anchor sanitization, ID extraction from the response |
| AC-3: Removing expired links | Pass | POST with `statlink_id` + `usun`; status tracking |
| AC-4: Cron endpoint | Pass | `/statlink/token-run` secured with `SEO_TRIGGER_TOKEN`; also available as `cron/statlink.php` |
| AC-5: Database tracking | Pass | `statlink_links` table with full lifecycle tracking |

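The token check behind AC-4 can be sketched as follows. This is a minimal illustration, not the controller's actual code: the helper name `isValidCronToken()` is hypothetical, and how the token reaches the endpoint (query string, header) is an assumption.

```php
<?php
declare(strict_types=1);

/**
 * Returns true only when the supplied token matches the configured one.
 * Hypothetical helper: the real controller may read the token differently.
 */
function isValidCronToken(?string $supplied, ?string $expected): bool
{
    // Refuse to authenticate when no token is configured or none was sent.
    if ($expected === null || $expected === '' || $supplied === null || $supplied === '') {
        return false;
    }

    // hash_equals() compares in constant time, which avoids timing leaks.
    return hash_equals($expected, $supplied);
}
```

Assuming the token arrives as a `token` query parameter and `SEO_TRIGGER_TOKEN` is an environment variable, a call could look like `isValidCronToken($_GET['token'] ?? null, getenv('SEO_TRIGGER_TOKEN') ?: null)`.
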
## Accomplishments

- StatLink service logs in, adds links, and removes expired links end-to-end
- Robust diagnostic logging: every step is tracked, and errors are surfaced in the JSON response
- Retry mechanism for failed links, with error tracking in the database
- Token-secured HTTP endpoint plus a standalone cron script

## Files Created/Modified

| File | Change | Purpose |
|------|--------|---------|
| `src/Services/StatLinkService.php` | Created | Core service: login, addLink, removeLink, processNewArticles, retryFailedLinks, removeExpiredLinks |
| `src/Controllers/StatLinkController.php` | Created | HTTP endpoints: index (admin view), runByToken (cron trigger) |
| `cron/statlink.php` | Created | Standalone cron script with lock file |
| `migrations/013_statlink_tracking.sql` | Created | statlink_links table schema |
| `config/routes.php` | Modified | Added /statlink routes |
| `src/Core/Controller.php` | Modified | json_encode with JSON_INVALID_UTF8_SUBSTITUTE |

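The lock-file guard mentioned for `cron/statlink.php` could look roughly like the sketch below. It assumes an advisory `flock()` lock; the function name `runWithLock()` and the lock path are hypothetical, and the actual script may use a different mechanism.

```php
<?php
declare(strict_types=1);

/**
 * Runs $job only if no other process holds the lock.
 * Returns false when another run is already in progress.
 */
function runWithLock(string $lockPath, callable $job): bool
{
    // Mode 'c' creates the file if missing and never truncates it.
    $handle = fopen($lockPath, 'c');
    if ($handle === false) {
        throw new RuntimeException("Cannot open lock file: $lockPath");
    }

    try {
        // LOCK_NB makes flock() return immediately instead of waiting.
        if (!flock($handle, LOCK_EX | LOCK_NB)) {
            return false;
        }
        $job();
        flock($handle, LOCK_UN);
        return true;
    } finally {
        // Closing the handle also releases the lock if $job threw.
        fclose($handle);
    }
}
```

Because the lock is tied to the open file handle, a crashed run does not leave a stale lock behind, which is the main reason to prefer this over checking whether a file exists.
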
## Decisions Made

| Decision | Rationale | Impact |
|----------|-----------|--------|
| ASCII-only anchors (transliteration) | StatLink rejects Polish diacritics in the anchor field | All anchors are auto-sanitized (ą→a, ś→s, etc.) |
| `MAX_LINKS_PER_RUN=1` | Avoid StatLink rate limiting | 1 link per cron run, predictable load |
| 120 s timeout per request | StatLink responds slowly | `connect_timeout=60`, `timeout=120` (seconds) |
| `set_time_limit(300)` | PHP default of 30 s is insufficient | Applied in both the controller and the cron script |
| `JSON_INVALID_UTF8_SUBSTITUTE` | Scraped StatLink HTML contains broken UTF-8 | Prevents empty JSON responses |
| `findLinkIdInHtml` before search | The response HTML already contains the new link | Fewer requests, more reliable ID detection |

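As a rough illustration of the timeout decisions above, the Guzzle options might be assembled like this. The function name `statlinkClientOptions()` is hypothetical, and `http_errors => false` is an assumption that is not stated in this summary; the other values come from the table.

```php
<?php
declare(strict_types=1);

/**
 * Builds client options matching the decisions table.
 * The array keys are standard Guzzle request options.
 */
function statlinkClientOptions(): array
{
    return [
        'cookies'         => true,  // shared cookie jar keeps the login session
        'connect_timeout' => 60,    // seconds allowed to establish the connection
        'timeout'         => 120,   // seconds allowed for the whole request
        'http_errors'     => false, // ASSUMPTION: inspect error pages instead of throwing
    ];
}

// Raise PHP's own execution limit before any slow request starts.
set_time_limit(300);

// With Guzzle installed via Composer, the options would be used like this:
// $client = new \GuzzleHttp\Client(statlinkClientOptions());
```

Passing `'cookies' => true` is only valid in the client constructor; an explicit `GuzzleHttp\Cookie\CookieJar` instance can be supplied instead when the jar needs to be inspected for diagnostics.
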
## Deviations from Plan

### Summary

| Type | Count | Impact |
|------|-------|--------|
| Auto-fixed | 3 | Critical: without these fixes, no links were being added |
| Scope additions | 1 | Retry mechanism (not in original plan) |
| Deferred | 1 | No max retry limit |

### Auto-fixed Issues

**1. Anchor encoding: StatLink rejects Polish characters**
- **Found during:** Production testing
- **Issue:** StatLink form validation requires ASCII-only anchors (alphanumeric + `.,+-_?!&\:=` + space)
- **Fix:** Added `sanitizeAnchor()` with a Polish→ASCII transliteration map
- **Files:** `src/Services/StatLinkService.php`
- **Verification:** Links are now added successfully with sanitized anchors

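A minimal sketch of what such a sanitizer could look like is shown below. It reuses the name `sanitizeAnchor()` from the fix above, but the body is an assumption, not the code in `StatLinkService.php`: it transliterates Polish letters, then drops every byte outside the allowed set quoted in the issue.

```php
<?php
declare(strict_types=1);

function sanitizeAnchor(string $anchor): string
{
    // Polish diacritics mapped to their closest ASCII letters.
    $map = [
        'ą' => 'a', 'ć' => 'c', 'ę' => 'e', 'ł' => 'l', 'ń' => 'n',
        'ó' => 'o', 'ś' => 's', 'ź' => 'z', 'ż' => 'z',
        'Ą' => 'A', 'Ć' => 'C', 'Ę' => 'E', 'Ł' => 'L', 'Ń' => 'N',
        'Ó' => 'O', 'Ś' => 'S', 'Ź' => 'Z', 'Ż' => 'Z',
    ];
    $ascii = strtr($anchor, $map);

    // Drop anything outside: letters, digits, . , + - _ ? ! & \ : = and space.
    // No /u flag on purpose, so broken UTF-8 bytes are removed, not rejected.
    $ascii = preg_replace('/[^A-Za-z0-9.,+\-_?!&\\\\:= ]/', '', $ascii) ?? '';

    // Collapse runs of spaces left behind by removed characters.
    return trim(preg_replace('/ {2,}/', ' ', $ascii) ?? '');
}

echo sanitizeAnchor('Zażółć gęślą jaźń'), PHP_EOL; // Zazolc gesla jazn
```

Characters with no mapping, such as `%` or typographic quotes, are removed rather than replaced, so an anchor made up entirely of unsupported characters would come back empty and should be handled by the caller.
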
**2. Empty JSON responses from scraped HTML**
- **Found during:** Production debugging
- **Issue:** `json_encode()` returns `false` when the data contains invalid UTF-8 from StatLink HTML, so the response body is empty
- **Fix:** Added `JSON_INVALID_UTF8_SUBSTITUTE | JSON_UNESCAPED_UNICODE` flags
- **Files:** `src/Core/Controller.php`
- **Verification:** All endpoints return valid JSON

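The failure and the fix can be reproduced in isolation. The snippet below is a sketch: `encodeForResponse()` is a hypothetical stand-in for whatever `Controller.php` actually does, and the sample string is invented for illustration.

```php
<?php
declare(strict_types=1);

/** Hypothetical helper: always returns a JSON string, never false. */
function encodeForResponse(array $payload): string
{
    $json = json_encode(
        $payload,
        JSON_INVALID_UTF8_SUBSTITUTE | JSON_UNESCAPED_UNICODE
    );

    // Other failures (for example, nesting too deep) still need a valid body.
    return $json === false ? '{"error":"json_encode failed"}' : $json;
}

// A single ISO-8859-2 byte makes the string invalid UTF-8.
$scraped = "Zaloguj si\xEA";

var_dump(json_encode(['html' => $scraped]));          // bool(false)
echo encodeForResponse(['html' => $scraped]), PHP_EOL; // invalid byte becomes U+FFFD
```

With `JSON_INVALID_UTF8_SUBSTITUTE`, each invalid sequence is replaced by U+FFFD instead of aborting the whole encode; `JSON_INVALID_UTF8_IGNORE` would drop the bytes silently, which makes broken input harder to spot in diagnostics.
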
**3. StatLink ID not detected after successful add**
- **Found during:** Production testing
- **Issue:** `findLinkIdBySearch` made a separate request, and its URL matching was too narrow (no protocol variants, small search region)
- **Fix:** New `findLinkIdInHtml()` extracts the ID directly from the form response HTML, using a wider region and URL variants
- **Files:** `src/Services/StatLinkService.php`
- **Verification:** `statlink_id=2673465` correctly detected

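The idea behind the fix can be sketched as below. The real markup of the StatLink response is not documented here, so the sketch makes a loud assumption: the ID appears as `statlink_id=<digits>` or as a `name="statlink_id" value="<digits>"` field within a fixed window around the link's URL. The helper `bareUrl()` and the window size are hypothetical.

```php
<?php
declare(strict_types=1);

/** Strips scheme, "www." and trailing slash so all URL variants match. */
function bareUrl(string $url): string
{
    return preg_replace('#^https?://(www\.)?#i', '', rtrim($url, '/')) ?? $url;
}

/**
 * Looks for a numeric ID near the first occurrence of the URL.
 * ASSUMPTION: the ID sits within $window bytes of the URL in the HTML.
 */
function findLinkIdInHtml(string $html, string $url, int $window = 2000): ?int
{
    $needle = bareUrl($url);
    $pos = stripos($html, $needle);
    if ($pos === false) {
        return null;
    }

    // Cut a region around the match instead of scanning the whole page.
    $start  = max(0, $pos - $window);
    $end    = $pos + strlen($needle) + $window;
    $region = substr($html, $start, $end - $start);

    $pattern = '/statlink_id["\'\s=]+(?:value=["\']?)?(\d+)/i';
    return preg_match($pattern, $region, $m) === 1 ? (int) $m[1] : null;
}
```

Searching for the scheme-less form covers `http://`, `https://`, and `www.` variants in a single pass. The sketch returns the first ID in the region, so on a page listing many links the window would need to be tightened to avoid picking up a neighbouring row.
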
### Deferred Items

- No max retry count for permanently failing links (could block the queue)
- StatLink cron is not integrated into the main publish cron; it needs a separate cron job set up on the server

## Issues Encountered

| Issue | Resolution |
|-------|------------|
| OPcache serving stale files after FTP upload | Manual `opcache_reset()` via test script; documented in patterns |
| PHP `max_execution_time` killing the script | Added `set_time_limit(300)` in controller and cron |
| Login diagnostics missing on failure | Added `loginDiagnostic` in all error paths (empty credentials, exceptions) |

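For the OPcache issue, a reset script along the following lines is one option. This is a sketch under assumptions: the helper name is hypothetical, and the test script actually used in this phase is not shown in the summary.

```php
<?php
declare(strict_types=1);

/**
 * Clears the opcode cache when the extension is loaded.
 * Returns true if a reset was performed.
 */
function resetOpcacheIfAvailable(): bool
{
    // opcache_reset() exists only when the OPcache extension is loaded.
    if (!function_exists('opcache_reset')) {
        return false;
    }

    // Returns false when OPcache is loaded but disabled for this SAPI.
    return opcache_reset();
}

echo resetOpcacheIfAvailable() ? "OPcache reset\n" : "OPcache not active\n";
```

The script has to be requested through the web server to have any effect on the site, because the CLI keeps a separate cache from the web SAPI. Any such endpoint should be protected or removed after the deploy.
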
## Next Phase Readiness

**Ready:**
- StatLink service is fully operational; links are being added and tracked
- 37 failed links are queued for retry (processed automatically on subsequent cron runs)
- Admin panel view exists at /statlink

**Concerns:**
- No max retry limit, so a permanently failing link blocks the queue
- Cron is not yet configured on the server (only the manual token URL trigger is available)

**Blockers:**
- None

---

*Phase: 01-statlink-autolinking, Plan: 01*
*Completed: 2026-04-09*