On Tuesday, September 13, 2016 at 8:19:03 AM UTC-7, Ryan Sleevi wrote:
> On Tuesday, September 13, 2016 at 7:56:20 AM UTC-7, Peter Bowen wrote:
> > I would be careful reading too much into server names.
> > mail.[example.com] might host web based email access. For example,
> > I'm typing this into a site called mail.google.com :)
>
> Apologies that the conjunctive and was not clearer, and that it seemed more
> enumerative. My point was that some certificates demonstrate patterns - such
> as *both* names - that offer reasonable signals of use.
>
> I agree that any heuristic approach leaves me profoundly uncomfortable as a
> policy, but I would also suggest that some patterns in the certs are signals
> that perhaps the impact to users, however great, may be overestimated.
>
> Of course, all of this is based on the data we have - I agree, that if
> StartCom were to log its 2015/2016 certs, we'd be in a much better place to
> evaluate viability of minimizing user impact, if such a thing is at all
> possible.

For the further sake of exploring options, I've been looking at non-public sources to see what alternatives exist. One example was looking at the hosts visited by GoogleBot over a 60 day period and seeing if any of the certificates seen for a host matched the certificates logged in CT.

That is, imagine the key as being constructed from [hash of cert] + [hostname from SAN] for certificates from CT, and in the case of GoogleBot crawls, from both [hash of cert] + [hostname from link] and [hash of cert] + [*.hostname minus a label]. So if GoogleBot crawled "www.google.com", it would emit keys for both "*.google.com" and "www.google.com" (to allow it to match a cert for either name, since browsers will accept either name).

While I'm unfortunately unable to share the specific results, even in buckets, it does suggest that if one were to examine the hosts reported in these certificates, along with whether or not they actually use these certificates or are publicly accessible, and further intersect with the Alexa Top 1M, any whitelisting strategy (by host, by domain, or by certificate) could fit in under 50K, with some strategies going below 10K. The reasoning for this is that a number of hosts represented in the certificates don't use the certificate, and instead use one from some other CA. A number have switched, for example, to Let's Encrypt, obviating the need for whitelisting.

Unfortunately, that's not easily publicly reproducible, which I think is an important aspect for consideration here.

So let's again revisit the combined set of WoSign & StartCom certs (which necessarily includes everything GoogleBot has ever seen, but not necessarily any undisclosed and undetected StartCom certs). We know there are 5,769 unique certificate hashes with wildcards in the Alexa Top 1M, across 2,710 distinct eTLD+1s. There are 61,109 certs that contain non-wildcard hosts, across 18,650 distinct eTLD+1s.

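As a rough illustration of the key construction described above (not the actual pipeline run against GoogleBot data), a minimal Python sketch might look like the following. The function names, the choice of SHA-256 over the certificate's DER bytes, and the rule that a wildcard key is only emitted when the remaining name still has at least two labels are all assumptions made for the example.

```python
import hashlib


def cert_hash(der_bytes: bytes) -> str:
    """Hex digest identifying a certificate (SHA-256 is an assumption here)."""
    return hashlib.sha256(der_bytes).hexdigest()


def ct_keys(der_bytes: bytes, san_names: list[str]) -> set[tuple[str, str]]:
    """Keys for a certificate found in CT: one (hash, name) pair per SAN entry."""
    h = cert_hash(der_bytes)
    return {(h, name.lower().rstrip(".")) for name in san_names}


def crawl_keys(der_bytes: bytes, hostname: str) -> set[tuple[str, str]]:
    """Keys for a certificate observed while crawling `hostname`.

    Emits the exact hostname and the wildcard form with the leftmost
    label replaced by "*", so the pair can match either kind of SAN.
    """
    h = cert_hash(der_bytes)
    host = hostname.lower().rstrip(".")
    keys = {(h, host)}
    _, sep, parent = host.partition(".")
    # Assumption: skip wildcards like "*.com"; a fuller version would
    # consult the Public Suffix List instead of counting labels.
    if sep and "." in parent:
        keys.add((h, "*." + parent))
    return keys


def certs_in_use(ct: set[tuple[str, str]],
                 crawled: set[tuple[str, str]]) -> set[tuple[str, str]]:
    """(hash, name) pairs that were both logged in CT and seen in use."""
    return ct & crawled
```

For a host crawled as "www.google.com" that serves a certificate whose SANs are "*.google.com" and "google.com", `certs_in_use` would return only the "*.google.com" pair, which is the match the wildcard key exists to catch.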
Another possibility to explore, then, is to attempt to connect to each of these hosts and see the certificate they provide, since we can't use hosts mined by Google's crawler (oh how I wish we could). If they provide one of these certificates, the eTLD+1 could be whitelisted, along with the generous assumption that all wildcard hosts are using their certificates (I believe there's sufficient evidence this isn't the case, but sure). This may help reduce the overall 18,763 distinct eTLD+1s into an even more compressible set, albeit at the cost of potentially excluding some certificates that were (undetectably) in use.

_______________________________________________
dev-security-policy mailing list
[email protected]
https://lists.mozilla.org/listinfo/dev-security-policy

