update
164
.paul/phases/01-statlink-autolinking/01-01-SUMMARY.md
Normal file
@@ -0,0 +1,164 @@
---
phase: 01-statlink-autolinking
plan: 01
subsystem: seo
tags: [statlink, guzzle, scraping, cron, seo-linkbuilding]

requires:
  - phase: none
    provides: published articles with wp_post_url

provides:
  - StatLink.pl auto-linking service
  - Cron endpoint for link management
  - Link lifecycle tracking (add → expire → remove)

affects: [admin-panel, monitoring]

tech-stack:
  added: [guzzle-cookiejar, html-scraping]
  patterns: [service-class-per-integration, cron-token-auth, diagnostic-logging]

key-files:
  created:
    - src/Services/StatLinkService.php
    - src/Controllers/StatLinkController.php
    - cron/statlink.php
    - migrations/013_statlink_tracking.sql
  modified:
    - config/routes.php
    - src/Core/Controller.php

key-decisions:
  - "Cookie-based Guzzle session for StatLink (no API available)"
  - "Anchor sanitization: Polish diacritics → ASCII (StatLink restriction)"
  - "MAX_LINKS_PER_RUN=1 to avoid rate limiting"
  - "ilosc_dziennie=0.02, link lifetime 60 days"
  - "json_encode with JSON_INVALID_UTF8_SUBSTITUTE for scraped HTML safety"

patterns-established:
  - "Diagnostic array pattern for debugging external service integrations"
  - "FTP deploy requires OPcache reset for changes to take effect"

duration: ~4h (initial build) + 2h (bugfix session 2026-04-09)
started: 2026-04-08
completed: 2026-04-09T11:15:00Z
---

# Phase 1 Plan 01: StatLink Auto-Linking Summary

**Automated StatLink.pl link management: login, add links for published articles, track lifecycle, remove after 60 days**

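The add, track, and remove-after-60-days lifecycle can be illustrated with a small date calculation. This is a minimal sketch: the function names are hypothetical and not taken from `StatLinkService.php`; only the 60-day lifetime comes from this summary.

```php
<?php
// Hypothetical helpers; only the 60-day lifetime is from the summary.
const LINK_LIFETIME_DAYS = 60;

// Returns the moment at which a link added at $addedAt should be removed.
function expiryFor(DateTimeImmutable $addedAt): DateTimeImmutable
{
    return $addedAt->add(new DateInterval('P' . LINK_LIFETIME_DAYS . 'D'));
}

// True once the link has reached or passed its expiry moment.
function isExpired(DateTimeImmutable $addedAt, DateTimeImmutable $now): bool
{
    return $now >= expiryFor($addedAt);
}

$utc = new DateTimeZone('UTC');
$added = new DateTimeImmutable('2026-04-09 11:15:00', $utc);
echo expiryFor($added)->format('Y-m-d'), PHP_EOL; // 2026-06-08
```

Keeping the expiry rule in one place means a removal job only has to compare timestamps, whatever column names the real `statlink_links` table uses.
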
## Performance

| Metric | Value |
|--------|-------|
| Duration | ~6h total (build + bugfix) |
| Started | 2026-04-08 |
| Completed | 2026-04-09 |
| Tasks | 3 completed |
| Files created | 4 |
| Files modified | 2 |

## Acceptance Criteria Results

| Criterion | Status | Notes |
|-----------|--------|-------|
| AC-1: Login to StatLink | Pass | Guzzle CookieJar, GET homepage + POST login, verified on production |
| AC-2: Adding a link | Pass | Form POST with CSRF, anchor sanitization, ID extraction from response |
| AC-3: Removing expired links | Pass | POST with statlink_id + usun, status tracking |
| AC-4: Cron endpoint | Pass | /statlink/token-run with SEO_TRIGGER_TOKEN, also cron/statlink.php |
| AC-5: Tracking in the database | Pass | statlink_links table with full lifecycle tracking |

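The token check behind AC-4 might look roughly like the following. This is a sketch under assumptions: the function name and the way the token reaches the controller are hypothetical; only the `SEO_TRIGGER_TOKEN` name comes from this summary. `hash_equals()` is used so the comparison takes constant time.

```php
<?php
// Hypothetical check; the real StatLinkController may validate differently.
function isValidCronToken(?string $provided, ?string $expected): bool
{
    // Refuse to run when no token is configured or none was supplied,
    // so an empty configuration never matches an empty request value.
    if ($expected === null || $expected === '' || $provided === null) {
        return false;
    }
    // hash_equals() compares in constant time (known string first).
    return hash_equals($expected, $provided);
}

// Example usage (the "token" query parameter name is an assumption):
// $ok = isValidCronToken($_GET['token'] ?? null, getenv('SEO_TRIGGER_TOKEN') ?: null);
var_dump(isValidCronToken('abc', 'abc')); // bool(true)
```
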
## Accomplishments

- StatLink service logs in, adds links, and removes expired links end-to-end
- Robust diagnostic logging: every step is tracked and errors are surfaced in the JSON response
- Retry mechanism for failed links, with error tracking in the database
- Token-secured HTTP endpoint plus a standalone cron script

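The diagnostic logging described above can be pictured as an array that collects one entry per step. The helper below is a sketch only; `runStep()` and the step names are hypothetical and do not come from the codebase.

```php
<?php
// Hypothetical helper illustrating a diagnostic array; not from the codebase.
// Runs $step, records its outcome and duration, and returns its result
// (or null if it threw).
function runStep(array &$diagnostic, string $name, callable $step)
{
    $startedAt = microtime(true);
    $entry = ['step' => $name, 'ok' => true, 'error' => null];
    $result = null;
    try {
        $result = $step();
    } catch (Throwable $e) {
        $entry['ok'] = false;
        $entry['error'] = $e->getMessage();
    }
    $entry['ms'] = (int) round((microtime(true) - $startedAt) * 1000);
    $diagnostic[] = $entry;
    return $result;
}

$diagnostic = [];
runStep($diagnostic, 'login', fn() => true);
runStep($diagnostic, 'add_link', function () {
    throw new RuntimeException('anchor rejected');
});
echo json_encode($diagnostic, JSON_INVALID_UTF8_SUBSTITUTE | JSON_PRETTY_PRINT), PHP_EOL;
```

Because failures are recorded rather than rethrown, the endpoint can always return the full trace, which is what makes a remote scraping integration debuggable from the JSON response alone.
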
## Files Created/Modified

| File | Change | Purpose |
|------|--------|---------|
| `src/Services/StatLinkService.php` | Created | Core service: login, addLink, removeLink, processNewArticles, retryFailedLinks, removeExpiredLinks |
| `src/Controllers/StatLinkController.php` | Created | HTTP endpoints: index (admin view), runByToken (cron trigger) |
| `cron/statlink.php` | Created | Standalone cron script with lock file |
| `migrations/013_statlink_tracking.sql` | Created | statlink_links table schema |
| `config/routes.php` | Modified | Added /statlink routes |
| `src/Core/Controller.php` | Modified | json_encode with JSON_INVALID_UTF8_SUBSTITUTE |

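A lock file like the one used by `cron/statlink.php` is commonly implemented with a non-blocking `flock()`. The sketch below shows that general technique; the function name and lock path are assumptions, and the actual script may lock differently.

```php
<?php
// Hypothetical lock helper; the real cron script may differ.
// Returns an open handle holding the lock, or null if the lock
// could not be taken (for example, another run still holds it).
function acquireCronLock(string $path)
{
    $handle = fopen($path, 'c'); // create if missing, never truncate
    if ($handle === false) {
        return null;
    }
    // LOCK_NB makes flock() return immediately instead of waiting.
    if (!flock($handle, LOCK_EX | LOCK_NB)) {
        fclose($handle);
        return null;
    }
    return $handle;
}

$lockPath = sys_get_temp_dir() . '/statlink-sketch.lock'; // assumed path
$lock = acquireCronLock($lockPath);
echo $lock === null ? "another run is in progress\n" : "lock acquired\n";

if ($lock !== null) {
    // ... do the work, then release the lock.
    flock($lock, LOCK_UN);
    fclose($lock);
}
```

The operating system drops the lock when the process exits, so a crashed run does not leave a stale lock behind the way a plain "file exists" check would.
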
## Decisions Made

| Decision | Rationale | Impact |
|----------|-----------|--------|
| ASCII-only anchors (transliteration) | StatLink rejects Polish diacritics in anchor field | All anchors auto-sanitized (ą→a, ś→s, etc.) |
| MAX_LINKS_PER_RUN=1 | Avoid StatLink rate limiting | 1 link per cron run, predictable load |
| Timeout 120s per request | StatLink is slow | connect_timeout=60s, timeout=120s |
| set_time_limit(300) | PHP default 30s insufficient | Both controller and cron script |
| JSON_INVALID_UTF8_SUBSTITUTE | Scraped StatLink HTML contains broken UTF-8 | Prevents empty JSON responses |
| findLinkIdInHtml before search | Response HTML already contains new link | Reduces requests, more reliable ID detection |

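The timeout decisions in the table translate into a handful of settings. The sketch below collects them in one place; the function and constant names are hypothetical, while `connect_timeout`, `timeout`, and `cookies` are standard Guzzle options and the numeric values are the ones recorded above.

```php
<?php
// Hypothetical names; values are taken from the decisions table.
const SCRIPT_TIME_LIMIT_SECONDS = 300;

// Options intended for the Guzzle client constructor, e.g.
// new \GuzzleHttp\Client(statLinkClientOptions()).
function statLinkClientOptions(): array
{
    return [
        'connect_timeout' => 60,  // seconds allowed to establish the connection
        'timeout'         => 120, // seconds allowed for the whole request
        'cookies'         => true, // keep one shared cookie jar for the session
    ];
}

// Raise PHP's execution limit above the 30s default before any request is made.
set_time_limit(SCRIPT_TIME_LIMIT_SECONDS);

print_r(statLinkClientOptions());
```
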
## Deviations from Plan

### Summary

| Type | Count | Impact |
|------|-------|--------|
| Auto-fixed | 3 | Critical: without these fixes, no links were being added |
| Scope additions | 1 | Retry mechanism (not in original plan) |
| Deferred | 2 | No max retry limit; separate cron job not yet set up on server |

### Auto-fixed Issues

**1. Anchor encoding: StatLink rejects Polish characters**
- **Found during:** Production testing
- **Issue:** StatLink form validation requires ASCII-only anchors (alphanumeric + `.,+-_?!&\:=` + space)
- **Fix:** Added `sanitizeAnchor()` with a Polish→ASCII transliteration map
- **Files:** `src/Services/StatLinkService.php`
- **Verification:** Links are now added successfully with sanitized anchors

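A transliteration step of this kind could be sketched as follows. This is not the project's `sanitizeAnchor()`; the function name, the exact map, and the decision to drop disallowed characters (rather than replace them) are assumptions. The allowed character set is the one listed in the issue above.

```php
<?php
// Sketch only; the real sanitizeAnchor() in StatLinkService.php may differ.
function sanitizeAnchorSketch(string $anchor): string
{
    // Polish diacritics mapped to their closest ASCII letters.
    $map = [
        'ą' => 'a', 'ć' => 'c', 'ę' => 'e', 'ł' => 'l', 'ń' => 'n',
        'ó' => 'o', 'ś' => 's', 'ź' => 'z', 'ż' => 'z',
        'Ą' => 'A', 'Ć' => 'C', 'Ę' => 'E', 'Ł' => 'L', 'Ń' => 'N',
        'Ó' => 'O', 'Ś' => 'S', 'Ź' => 'Z', 'Ż' => 'Z',
    ];
    $ascii = strtr($anchor, $map);

    // Drop every byte outside: alphanumerics, . , + - _ ? ! & \ : = and space.
    $ascii = preg_replace('/[^A-Za-z0-9.,+\-_?!&\\\\:= ]/', '', $ascii);

    // Collapse runs of spaces left behind by removed characters.
    return trim(preg_replace('/ +/', ' ', $ascii));
}

echo sanitizeAnchorSketch('Zażółć gęślą jaźń'), PHP_EOL; // Zazolc gesla jazn
```

Transliterating before filtering matters: filtering first would delete the diacritic letters outright and leave anchors such as `Za gl ja`.
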
**2. Empty JSON responses from scraped HTML**
- **Found during:** Production debugging
- **Issue:** `json_encode()` returns `false` when the data contains invalid UTF-8 from StatLink HTML, so the response body was empty
- **Fix:** Added `JSON_INVALID_UTF8_SUBSTITUTE | JSON_UNESCAPED_UNICODE` flags
- **Files:** `src/Core/Controller.php`
- **Verification:** All endpoints return valid JSON

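The failure and the fix can be reproduced in a few lines. The snippet is a sketch: `safeJson()` and its fallback payload are hypothetical, while the flags are the ones named in the fix.

```php
<?php
// Hypothetical wrapper; the real change lives in src/Core/Controller.php.
function safeJson(array $data): string
{
    $json = json_encode($data, JSON_INVALID_UTF8_SUBSTITUTE | JSON_UNESCAPED_UNICODE);
    // json_encode() can still fail for other reasons (e.g. nesting depth),
    // so never echo a bare false.
    return $json === false ? '{"error":"json_encode failed"}' : $json;
}

// "\xB1" is a lone continuation byte, which is not valid UTF-8.
$payload = ['html' => "scraped \xB1 text"];

var_dump(json_encode($payload)); // bool(false)
echo safeJson($payload), PHP_EOL; // invalid byte replaced by U+FFFD
```

With `JSON_INVALID_UTF8_SUBSTITUTE`, each invalid sequence becomes the replacement character U+FFFD instead of aborting the whole encode.
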
**3. StatLink ID not detected after successful add**
- **Found during:** Production testing
- **Issue:** `findLinkIdBySearch` made a separate request, and its URL matching was too narrow (no protocol variants, small search region)
- **Fix:** New `findLinkIdInHtml()` extracts the ID directly from the form response HTML, using a wider region and URL variants
- **Files:** `src/Services/StatLinkService.php`
- **Verification:** `statlink_id=2673465` correctly detected

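One way to match URL variants and search a wider region is sketched below. This is not the project's `findLinkIdInHtml()`: the function name, the radius, and above all the assumed markup (a `statlink_id` input near the link URL) are guesses for illustration, since StatLink's real HTML is not shown in this summary.

```php
<?php
// Sketch only. Assumes the response HTML contains something like
// <input name="statlink_id" value="123"> near the link URL.
function findLinkIdInHtmlSketch(string $html, string $url, int $radius = 2000): ?int
{
    // Reduce the URL to host + path so that http/https, "www." and
    // trailing-slash variants all match the same needle.
    $needle = preg_replace('#^https?://(www\.)?#i', '', rtrim($url, '/'));

    $pos = stripos($html, $needle);
    if ($pos === false) {
        return null;
    }

    // Look for the ID in a window of $radius bytes on each side of the URL.
    $start = max(0, $pos - $radius);
    $region = substr($html, $start, ($pos - $start) + strlen($needle) + $radius);

    if (preg_match('/name=["\']statlink_id["\']\s+value=["\'](\d+)/i', $region, $m)) {
        return (int) $m[1];
    }
    return null;
}

$html = '<tr><td><a href="http://example.com/post">x</a></td>'
      . '<td><input type="hidden" name="statlink_id" value="2673465"></td></tr>';
var_dump(findLinkIdInHtmlSketch($html, 'https://www.example.com/post/')); // int(2673465)
```

The sketch returns the first ID in the window, which may not be the nearest one when several links are listed close together; a production version would need to handle that case.
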
### Deferred Items

- No max retry count for permanently failing links (could block the queue)
- StatLink cron is not integrated into the main publish cron; it needs a separate cron job set up on the server

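For the deferred retry cap, one possible shape is a selection step that skips links which have already failed too often. Everything in this sketch is hypothetical: the `attempts` field, the cap of 5, and the function name are illustrative and not part of the current schema or service.

```php
<?php
// Hypothetical retry cap; not implemented in the current service.
const MAX_RETRY_ATTEMPTS = 5; // assumed value

// Picks up to $limit failed links that are still under the attempt cap,
// preferring the ones that have been tried least often.
function pickRetryable(array $failedLinks, int $maxAttempts = MAX_RETRY_ATTEMPTS, int $limit = 1): array
{
    $eligible = array_filter(
        $failedLinks,
        fn(array $link) => $link['attempts'] < $maxAttempts
    );
    usort($eligible, fn(array $a, array $b) => $a['attempts'] <=> $b['attempts']);
    return array_slice($eligible, 0, $limit);
}

$failed = [
    ['id' => 1, 'attempts' => 7],
    ['id' => 2, 'attempts' => 2],
    ['id' => 3, 'attempts' => 4],
];
print_r(pickRetryable($failed)); // only link id 2 is selected
```

With a cap in place, a permanently failing link eventually drops out of selection instead of occupying the single slot available per run.
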
## Issues Encountered

| Issue | Resolution |
|-------|------------|
| OPcache serving stale files after FTP upload | Manual opcache_reset() via test script; documented in patterns |
| PHP max_execution_time killing script | Added set_time_limit(300) in controller and cron |
| Login diagnostics missing on failure | Added loginDiagnostic in all error paths (empty credentials, exceptions) |

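A reset script for the OPcache issue can be as small as the sketch below. The function name and return strings are hypothetical; `opcache_reset()` is the standard PHP function. Note that the CLI and the web server keep separate opcode caches, so a script like this has to be requested through the web server to clear the cache that serves the site.

```php
<?php
// Hypothetical helper; the project's own test script is not shown in this summary.
function resetOpcacheIfAvailable(): string
{
    // The extension may not be loaded at all (common on CLI builds).
    if (!function_exists('opcache_reset')) {
        return 'unavailable';
    }
    // opcache_reset() returns false when the opcode cache is disabled.
    return opcache_reset() ? 'reset' : 'disabled';
}

echo resetOpcacheIfAvailable(), PHP_EOL;
```
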
## Next Phase Readiness

**Ready:**
- StatLink service fully operational; links are being added and tracked
- 37 failed links queued for retry (will auto-process via cron)
- Admin panel view exists at /statlink

**Concerns:**
- No max retry limit: a permanently failing link blocks the queue
- Cron not yet configured on server (only manual token URL trigger)

**Blockers:**
- None

---
*Phase: 01-statlink-autolinking, Plan: 01*
*Completed: 2026-04-09*