I'm working on a system which I'd like to:
- Act as an enterprise-wide proxy with thousands of concurrent users.
- Log sites accessed by users/machines, along with access times and HTTP response codes at a minimum.
- Maintain a local archive.org-style archive in a space-efficient manner, so that repeat requests for the same resource are served locally until a predefined or sitemap.xml-based timeout expires, limiting the external network load created by client traffic.
- Provide filtering by whitelisted and blacklisted sites, keywords, etc.
- Provide an internal archive.org-style retrieval site for looking back at specific versions of already-retrieved content (e.g. when news articles change), with exclusions to avoid caching social media posts and similar content.
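
To make the caching, versioning, and filtering requirements above a bit more concrete, here is a minimal in-memory sketch in Python. It is only an illustration under assumptions, not a proposed implementation: the names `Archive`, `Snapshot`, and `is_allowed` are hypothetical, the TTL is a single fixed value (a sitemap.xml-derived timeout is not modeled), and a real system would need persistent storage and an actual proxy in front of it. Space efficiency is approximated by storing each distinct response body once, keyed by its SHA-256 digest.

```python
import hashlib
import time
from dataclasses import dataclass, field
from urllib.parse import urlsplit


@dataclass
class Snapshot:
    fetched_at: float  # Unix timestamp of the retrieval
    status: int        # HTTP response code returned by the origin
    digest: str        # SHA-256 of the body; key into the blob store


@dataclass
class Archive:
    ttl_seconds: float = 3600.0                  # assumed fixed freshness window
    blobs: dict = field(default_factory=dict)    # digest -> body bytes
    history: dict = field(default_factory=dict)  # url -> list of Snapshot

    def store(self, url, status, body, now=None):
        """Record a retrieval; identical bodies are stored only once."""
        now = time.time() if now is None else now
        digest = hashlib.sha256(body).hexdigest()
        self.blobs.setdefault(digest, body)
        snapshot = Snapshot(now, status, digest)
        self.history.setdefault(url, []).append(snapshot)
        return snapshot

    def fresh(self, url, now=None):
        """Return the latest body if it is still within the TTL, else None."""
        now = time.time() if now is None else now
        snapshots = self.history.get(url)
        if not snapshots:
            return None
        latest = snapshots[-1]
        if now - latest.fetched_at > self.ttl_seconds:
            return None
        return self.blobs[latest.digest]

    def versions(self, url):
        """List (timestamp, status, digest) for every archived retrieval."""
        return [(s.fetched_at, s.status, s.digest)
                for s in self.history.get(url, [])]


def is_allowed(url, allow_hosts, block_hosts, block_keywords):
    """Whitelist wins, then blacklist, then a keyword check on the URL."""
    host = (urlsplit(url).hostname or "").lower()
    if host in allow_hosts:
        return True
    if host in block_hosts:
        return False
    lowered = url.lower()
    return not any(keyword in lowered for keyword in block_keywords)
```

In this sketch a proxy handler would call `is_allowed` first, then `fresh`, and only contact the origin (followed by `store`) when `fresh` returns `None`. The `versions` list is what a look-back site could be built on.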