Enterprise-wide Web Archiving, Filtering, Logging Proxy




I'm working on a system which I'd like to:

  • Act as an enterprise-wide proxy with thousands of concurrent users.
  • Log the sites accessed by each user/machine, along with access times and HTTP response codes (at a minimum).
  • Maintain a local archive.org-style archive in a space-efficient manner, so that subsequent requests for the same resource are served locally until a predefined or sitemap.xml-based timeout expires, limiting the external network load created by client traffic.
  • Provide filtering based on whitelisted and blacklisted sites, keywords, etc.
  • Provide an internal archive.org-style retrieval site for looking back at specific versions of already-retrieved content (e.g. when news articles change), with exclusions to avoid caching social media posts and similar content.

So far it looks like I can use E2guardian for the filtering and Squid for the proxy itself, but I'm a bit stuck on the archival functionality described above. Is there a pre-existing tool that can be plugged in to provide these archival functions, or something close to them that could be extended? Ideally it would integrate directly with Squid, since Squid already does some caching itself and I'd like to avoid duplicated work.
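
To make the archival requirement more concrete, here is a rough sketch of the behaviour I have in mind. It is not tied to Squid or E2guardian in any way, and every name in it (`Archive`, `Snapshot`, `store`, `fresh`, `as_of`, `max_age`) is hypothetical, made up purely for illustration. It keeps everything in memory, stores each distinct response body once (keyed by its SHA-256 digest) so unchanged pages cost no extra space, and records a timestamped snapshot per retrieval so older versions can be looked up later.

```python
import hashlib
import time
from dataclasses import dataclass, field


@dataclass
class Snapshot:
    fetched_at: float  # Unix timestamp of the retrieval
    status: int        # HTTP response code
    digest: str        # SHA-256 of the body; key into the blob store


@dataclass
class Archive:
    """Hypothetical in-memory archive; a real one would persist to disk."""
    max_age: float = 3600.0                      # freshness timeout in seconds
    blobs: dict = field(default_factory=dict)    # digest -> body bytes
    history: dict = field(default_factory=dict)  # url -> list of Snapshot

    def store(self, url, status, body, now=None):
        """Record one retrieval; identical bodies are stored only once."""
        now = time.time() if now is None else now
        digest = hashlib.sha256(body).hexdigest()
        self.blobs.setdefault(digest, body)
        snap = Snapshot(now, status, digest)
        self.history.setdefault(url, []).append(snap)
        return snap

    def fresh(self, url, now=None):
        """Return the latest body if it is within max_age, else None."""
        now = time.time() if now is None else now
        snaps = self.history.get(url)
        if not snaps or now - snaps[-1].fetched_at > self.max_age:
            return None
        return self.blobs[snaps[-1].digest]

    def as_of(self, url, when):
        """Return the body as archived at or before `when`, else None."""
        older = [s for s in self.history.get(url, []) if s.fetched_at <= when]
        return self.blobs[older[-1].digest] if older else None
```

The proxy would call `fresh()` before going upstream and `store()` after each upstream fetch, while the internal retrieval site would use `as_of()`. The per-site exclusions and the sitemap.xml-based timeouts are left out of this sketch.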