Lightweight, extensible, container‑friendly visibility into per‑host processes, resources and anomalies.
Features · Quick Start · API · Architecture · Chinese documentation
| Process Table | Alert List / Tree (placeholder) |
|---|---|
Screenshots are placeholders: replace the images in docs/ with real captures (screenshot_process_list.png, screenshot_alerts.png).
This project provides a minimal yet extensible stack for distributed process monitoring:
- Go Agent: periodically samples local process metadata & metrics using gopsutil.
- Go Server: receives batches, evaluates alert rules, offers REST & Prometheus endpoints, optional PostgreSQL persistence.
- React Web UI: lists processes and active alerts (charts and advanced UX are extension points).
- Docker / Compose: single‑command sandbox deployment.
Target use cases: lightweight capacity insight, anomaly detection (runaway CPU / memory), baseline for custom SRE tooling, or educational reference architecture.
Current capabilities (✔ implemented / 🔧 partial / 🚧 planned):
| Area | Status | Notes |
|---|---|---|
| Process snapshot (pid, name, cmdline, user, status, tree) | ✔ | Tree endpoint /processes/tree |
| Resource metrics (CPU %, RSS, Mem %, Threads) | ✔ | gopsutil sampling |
| Extended metrics (FD count, net conns, ports, IO bytes) | ✔ | open_fds, net_conns, ports, read_bytes, write_bytes |
| Configurable scrape interval | ✔ | agent.yaml |
| Advanced filtering | ✔ | name / pid / status / cpu_gt / mem_gt / port |
| Trend series API | ✔ | Bucketed averages /processes/{pid}/series |
| Alert engine (threshold + duration) | ✔ | In‑memory + persisted events |
| Alert rule CRUD | ✔ | Requires persistence enabled |
| Alert event persistence | ✔ | Upsert alert_events |
| Prometheus metrics | ✔ | /metrics endpoint |
| PostgreSQL persistence | ✔ | Toggle in server.yaml |
| Start / exit events | 🚧 | PID diffing queued |
| Auth / RBAC | 🚧 | JWT / OIDC middleware |
| Grafana dashboards | 🚧 | Provide sample JSON |
| Multi‑tenant isolation | 🚧 | Add tenant_id columns |
    Agent (gopsutil) --> batched JSON --> Server (ingest)
                                          |-- In-memory ring buffer
                                          |-- PostgreSQL (optional)
                                          |-- Alert Engine (rules, window eval)
                                          |-- REST API /metrics
                                          |-- Web UI (React)
                                          |-- Prometheus / Grafana

Data path: Agent collects → sends batch → Server buffers & optionally persists → scheduled evaluation updates alerts → consumers query snapshots, history, or aggregated series.
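The "scheduled evaluation updates alerts" step in the data path can be pictured as a threshold-plus-duration check. The following is a minimal sketch, not the project's alert engine: it assumes a rule fires only when every sample in the trailing window breaches the threshold and the breaching streak spans at least the rule's duration. `Sample`, `Rule`, `Firing`, `highCPU`, and `everyFifteenSeconds` are hypothetical names introduced for illustration.

```go
package main

import (
	"fmt"
	"time"
)

// Sample is a hypothetical (timestamp, value) pair for one metric of one process.
type Sample struct {
	At    time.Time
	Value float64
}

// Rule mirrors a subset of the alert rule fields used in this README.
type Rule struct {
	Operator  string // only ">" and "<" are handled in this sketch
	Threshold float64
	Duration  time.Duration
}

// highCPU matches the example rule: cpu_percent > 80 for 60s.
var highCPU = Rule{Operator: ">", Threshold: 80, Duration: 60 * time.Second}

func breaches(r Rule, v float64) bool {
	switch r.Operator {
	case ">":
		return v > r.Threshold
	case "<":
		return v < r.Threshold
	}
	return false
}

// Firing reports whether samples (oldest first) have breached the rule
// continuously for at least r.Duration, ending at the newest sample.
func Firing(r Rule, samples []Sample) bool {
	if len(samples) == 0 {
		return false
	}
	end := samples[len(samples)-1].At
	for i := len(samples) - 1; i >= 0; i-- {
		if !breaches(r, samples[i].Value) {
			return false // streak broken before covering the window
		}
		if end.Sub(samples[i].At) >= r.Duration {
			return true
		}
	}
	return false // not enough history to cover the window
}

// everyFifteenSeconds builds samples spaced 15s apart (an assumed eval cadence).
func everyFifteenSeconds(values ...float64) []Sample {
	base := time.Unix(0, 0)
	out := make([]Sample, 0, len(values))
	for i, v := range values {
		out = append(out, Sample{At: base.Add(time.Duration(i) * 15 * time.Second), Value: v})
	}
	return out
}

func main() {
	fmt.Println(Firing(highCPU, everyFifteenSeconds(90, 95, 85, 92, 88))) // prints true
	fmt.Println(Firing(highCPU, everyFifteenSeconds(90, 95, 10, 92, 88))) // prints false
}
```

Requiring the streak to span the full duration means a short spike never fires, at the cost of a delay of one window before a sustained breach is reported.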
    docker compose up --build

Services:

Test:

    curl 'http://localhost:8080/api/v1/processes?agent_id=agent1'

Server:

    cd backend
    go run ./cmd/server -config ../server.yaml

Separate terminal:

    cd backend
    go run ./cmd/agent -config ../agent.yaml

Web UI:

    cd web
    npm install
    npm run dev

Configure a dev proxy or change the fetch base URL to reach :8080.

agent.yaml

    agent_id: agent1
    server_url: http://server:8080
    interval: 5s

server.yaml

    bind_addr: :8080
    retention: 1h
    eval_interval: 15s
    max_snapshots: 7200
    persistence: false
    db_dsn: postgres://user:pass@postgres:5432/procmon?sslmode=disable

| Method | Endpoint | Purpose |
|---|---|---|
| POST | /api/v1/agents/{agentID}/processes | Agent batch upload |
| GET | /api/v1/processes?agent_id=...&name=&pid=&status=&cpu_gt=&mem_gt=&port= | Current snapshot list |
| GET | /api/v1/processes/{pid}/history?agent_id=...&minutes=10 | Raw history points |
| GET | /api/v1/processes/{pid}/series?agent_id=...&from=&to=&step=10s | Bucketed (averaged) trend |
| GET | /api/v1/processes/tree?agent_id=...&include_zombies=0 | Process tree (sorted) |
| GET | /api/v1/alerts | Active firing alerts |
| GET | /api/v1/alert-rules | List alert rules (persistence) |
| POST | /api/v1/alert-rules | Create / upsert rule |
| PUT | /api/v1/alert-rules/{id} | Update rule |
| DELETE | /api/v1/alert-rules/{id} | Delete rule |
| GET | /metrics | Prometheus metrics |
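For callers that prefer Go over curl, the filter parameters of the snapshot list endpoint can be assembled with the standard library. This is a small sketch under assumptions: `ProcessFilter` and `ListURL` are hypothetical helpers, only a subset of the filters is covered, and zero values are treated as "not set".

```go
package main

import (
	"fmt"
	"net/url"
	"strconv"
)

// ProcessFilter holds a subset of the query parameters from the endpoint table.
// Zero values are treated as "not set" in this sketch.
type ProcessFilter struct {
	AgentID string
	Name    string
	CPUGt   float64
	Port    int
}

// ListURL builds the GET /api/v1/processes URL for a base such as
// "http://localhost:8080". url.Values.Encode sorts parameters by key.
func ListURL(base string, f ProcessFilter) string {
	q := url.Values{}
	q.Set("agent_id", f.AgentID)
	if f.Name != "" {
		q.Set("name", f.Name)
	}
	if f.CPUGt > 0 {
		q.Set("cpu_gt", strconv.FormatFloat(f.CPUGt, 'f', -1, 64))
	}
	if f.Port > 0 {
		q.Set("port", strconv.Itoa(f.Port))
	}
	return base + "/api/v1/processes?" + q.Encode()
}

func main() {
	u := ListURL("http://localhost:8080", ProcessFilter{AgentID: "agent1", CPUGt: 50})
	fmt.Println(u) // http://localhost:8080/api/v1/processes?agent_id=agent1&cpu_gt=50
}
```

Using `url.Values` rather than string concatenation keeps values such as process names with spaces correctly escaped.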
Agent batch upload:

    curl -X POST http://localhost:8080/api/v1/agents/agent1/processes \
      -H 'Content-Type: application/json' \
      -d '{"agent_id":"agent1","interval_s":5,"samples":[{"pid":1234,"ppid":1,"name":"demo","cmdline":"/usr/bin/demo","username":"root","status":"R","cpu_percent":12.5,"memory_rss":2048000,"memory_percent":0.4,"num_threads":5}]}'

Current snapshot list, filtered by CPU:

    curl 'http://localhost:8080/api/v1/processes?agent_id=agent1&cpu_gt=50'

Create / upsert rule:

    curl -X POST http://localhost:8080/api/v1/alert-rules \
      -H 'Content-Type: application/json' \
      -d '{"id":"rule_high_cpu","name":"High CPU","metric":"cpu_percent","operator":">","threshold":80,"duration":"60s","enabled":true}'

Bucketed (averaged) trend. The URL is double-quoted so the shell expands the date substitution (GNU date syntax):

    curl "http://localhost:8080/api/v1/processes/1234/series?agent_id=agent1&step=15s&from=$(date -u -d '5 min ago' +%Y-%m-%dT%H:%M:%SZ)"

Alert rule JSON:

    {
      "id": "rule_high_cpu",
      "name": "High CPU",
      "process_name": "",
      "pid": 1234,
      "metric": "cpu_percent",
      "operator": ">",
      "threshold": 80,
      "duration": "60s",
      "enabled": true
    }

The series endpoint returns averaged bucket points; empty buckets are omitted.

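The bucketing behaviour described for the series endpoint (averaged points, empty buckets omitted) can be sketched as follows. This illustrates the technique and is not the server's code: `Point`, `Bucket`, and `demo` are hypothetical names, and aligning buckets to the `from` timestamp is an assumption.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// Point is a hypothetical (timestamp, value) pair.
type Point struct {
	At    time.Time
	Value float64
}

// Bucket averages points into step-wide buckets aligned to from.
// Points outside [from, to) are ignored and empty buckets are omitted.
func Bucket(points []Point, from, to time.Time, step time.Duration) []Point {
	if step <= 0 {
		return nil
	}
	sums := map[int64]float64{}
	counts := map[int64]int{}
	for _, p := range points {
		if p.At.Before(from) || !p.At.Before(to) {
			continue
		}
		idx := int64(p.At.Sub(from) / step)
		sums[idx] += p.Value
		counts[idx]++
	}
	idxs := make([]int64, 0, len(sums))
	for i := range sums {
		idxs = append(idxs, i)
	}
	sort.Slice(idxs, func(a, b int) bool { return idxs[a] < idxs[b] })
	out := make([]Point, 0, len(idxs))
	for _, i := range idxs {
		out = append(out, Point{
			At:    from.Add(time.Duration(i) * step),
			Value: sums[i] / float64(counts[i]),
		})
	}
	return out
}

// demo buckets samples taken at 0s, 5s, 10s and 40s into 15s steps.
func demo() []Point {
	from := time.Unix(0, 0).UTC()
	pts := []Point{
		{from, 10},
		{from.Add(5 * time.Second), 20},
		{from.Add(10 * time.Second), 30},
		{from.Add(40 * time.Second), 50},
	}
	return Bucket(pts, from, from.Add(time.Minute), 15*time.Second)
}

func main() {
	for _, p := range demo() {
		fmt.Println(p.At.Unix(), p.Value)
	}
}
```

In the demo, the first three samples fall in the bucket starting at 0s and average to 20, the 15s bucket is empty and therefore absent, and the 40s sample lands in the bucket starting at 30s.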
| Goal | Hook / File | Approach |
|---|---|---|
| Add metric | collector/collector.go | Append field to ProcessSnapshot |
| Persist field | sqlstore/sqlstore.go | ALTER TABLE + CopyFrom columns |
| New alert metric | alerts.go | Extend metricValue |
| Auth | api.go router | Add JWT / OIDC middleware |
| Start/exit events | ingestion diff | Track prior PID set per agent |
| Grafana | /metrics or SQL | Export gauges / build dashboards |
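The "Track prior PID set per agent" approach for start/exit events amounts to a set difference between two consecutive batches. A minimal sketch, assuming PIDs are unique within a batch; `DiffPIDs` is a hypothetical helper, not an existing function in the project.

```go
package main

import (
	"fmt"
	"sort"
)

// toSet converts a PID slice into a membership set.
func toSet(pids []int) map[int]bool {
	s := make(map[int]bool, len(pids))
	for _, pid := range pids {
		s[pid] = true
	}
	return s
}

// DiffPIDs compares the previous and current PID sets for one agent and
// returns the PIDs that appeared and the PIDs that disappeared, both sorted.
func DiffPIDs(prev, curr []int) (started, exited []int) {
	prevSet, currSet := toSet(prev), toSet(curr)
	for pid := range currSet {
		if !prevSet[pid] {
			started = append(started, pid)
		}
	}
	for pid := range prevSet {
		if !currSet[pid] {
			exited = append(exited, pid)
		}
	}
	sort.Ints(started)
	sort.Ints(exited)
	return started, exited
}

func main() {
	started, exited := DiffPIDs([]int{1, 2, 3}, []int{2, 3, 4, 5})
	fmt.Println(started, exited) // [4 5] [1]
}
```

A PID that exits and is reused between two samples would go unnoticed by this diff; keying on PID plus process start time is one possible refinement.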
- max_snapshots bounds memory (ring buffer). Reduce it for very high scrape rates.
- CPU percent depends on sampling cadence; a dual-sample measurement is an optional improvement.
- The series endpoint is intentionally simple; integrate a TSDB for richer queries.
- Avoid high-cardinality Prometheus labels (per-PID) unless filtered.
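To illustrate how a ring buffer bounded by max_snapshots keeps memory flat, here is a minimal fixed-capacity buffer that overwrites its oldest entry when full. It is a sketch of the general technique, not the server's implementation; `Ring`, `NewRing`, `Push`, and `Items` are hypothetical names, and the generic type parameter requires Go 1.18 or later.

```go
package main

import "fmt"

// Ring is a fixed-capacity buffer that overwrites the oldest entry when full.
type Ring[T any] struct {
	buf   []T
	start int // index of the oldest element
	size  int // number of stored elements
}

// NewRing allocates a ring with the given capacity (at least 1).
func NewRing[T any](capacity int) *Ring[T] {
	if capacity < 1 {
		capacity = 1
	}
	return &Ring[T]{buf: make([]T, capacity)}
}

// Push appends v, evicting the oldest element once capacity is reached.
func (r *Ring[T]) Push(v T) {
	if r.size < len(r.buf) {
		r.buf[(r.start+r.size)%len(r.buf)] = v
		r.size++
		return
	}
	r.buf[r.start] = v
	r.start = (r.start + 1) % len(r.buf)
}

// Items returns the stored elements from oldest to newest.
func (r *Ring[T]) Items() []T {
	out := make([]T, 0, r.size)
	for i := 0; i < r.size; i++ {
		out = append(out, r.buf[(r.start+i)%len(r.buf)])
	}
	return out
}

func main() {
	r := NewRing[int](3)
	for i := 1; i <= 5; i++ {
		r.Push(i)
	}
	fmt.Println(r.Items()) // [3 4 5]
}
```

Because the backing slice is allocated once, memory use scales with capacity times snapshot size regardless of how long the server runs.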
- Query endpoint for persisted alert events.
- PID start/exit event stream & webhook.
- UI charts & rule management views.
- Advanced rule types (rate, absence, zombie detection).
- Multi-tenant + auth.
MIT (add LICENSE before publishing publicly).
Chinese version: see README.zh-CN.md.