> The chances of generating two GUIDs that are the same is astronomically small.
> The odds are 1 in 2^122, which is approximately 1 in 5,000,000,000,000,000,000,000,000,000,000,000,000.
This is true if you only generate two GUIDs, but if you generate very many GUIDs, the chance that any two of them are identical increases. E.g. if you generate 2^61 GUIDs, you have roughly a 1 in 2 chance of a collision, due to the birthday paradox.
2^61 is still a very large number of course, but much more feasible to reach than 2^122 when doing a collision attack. This is the reason that cryptographic hashes are typically 256 bits or more (to make the cost of collision attacks >= 2^128).
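The birthday estimate above can be checked numerically. This is a minimal sketch that assumes 122 uniformly random bits per GUID and uses the standard approximation P ≈ 1 - exp(-n(n-1)/2N); the function names are illustrative, not from the article.

```python
import math

def collision_probability(n: int, bits: int) -> float:
    """Approximate P(at least one collision) among n uniform random `bits`-bit values."""
    space = 2 ** bits
    # Birthday approximation; expm1 keeps precision when the result is tiny.
    return -math.expm1(-n * (n - 1) / (2 * space))

def draws_for_probability(p: float, bits: int) -> float:
    """Approximate number of draws needed to reach collision probability p."""
    return math.sqrt(2 * (2 ** bits) * math.log(1 / (1 - p)))

# Two GUIDs: the "1 in 2^122" case from the quoted article.
print(collision_probability(2, 122))
# 2^61 GUIDs: the order-of-magnitude birthday estimate.
print(collision_probability(2 ** 61, 122))
# Exponent of the draw count at which the chance reaches 50%.
print(math.log2(draws_for_probability(0.5, 122)))
```

Under this approximation 2^61 draws give a bit under 40%, and the 50% point sits slightly above 2^61, so "about 1 in 2" holds as a rough figure.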
Retr0id 7 hours ago [-]
2^61 isn't even that large, well within the compute budget of mere mortals.
mmoskal 3 hours ago [-]
Counting to 2^61 probably is.
To actually find a collision in a 128b cryptographic hash function, it would take closer to 2^65 hashes. Back-of-the-envelope calculations suggest that with Pollard's rho it would cost a few million dollars of CPU time at Hetzner's super-low prices. Not nearly a mere mortal's budget, but not that far off I guess.
vlovich123 5 hours ago [-]
Depends on what “isn’t even that large” means. A modern 6 GHz machine would probably need 12 years of 24/7 operation to count that high. To me that seems like a lot.
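The 12-year figure can be reproduced with a few lines of arithmetic. This sketch assumes one increment per clock cycle on a single core, which is an idealization, and the names are illustrative.

```python
CLOCK_HZ = 6e9                          # assumed: one increment per cycle at 6 GHz
SECONDS_PER_YEAR = 365.25 * 24 * 3600

def years_to_count(bits: int, clock_hz: float = CLOCK_HZ) -> float:
    """Years needed to count from 0 to 2^bits at one increment per cycle."""
    return 2 ** bits / clock_hz / SECONDS_PER_YEAR

years = years_to_count(61)
print(f"{years:.1f} years on one core")
# Number of such cores needed to finish the same count in a single day.
print(f"{years * 365.25:,.0f} cores for one day")
```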
dgrin91 5 hours ago [-]
Yeah, but a nation state server farm can probably cut that down to minutes because their budget can buy a lot of processors. You only need a few hundred to really shrink it down to manageable numbers. And it turns out that nation states aren't the only ones that have this budget.
8organicbits 5 hours ago [-]
What's the threat here?
It's trivial to force a collision. Here's the same UUID twice:
6e197264-d14b-44df-af98-39aac5681791
6e197264-d14b-44df-af98-39aac5681791
Typically, you don't care about UUIDs that aren't in your system and you generate those yourself to avoid maliciously generated collisions. Your system can't handle 2^61 IDs. It doesn't have the processing power, storage, or bandwidth for that to happen. Not to mention traditional rate limiting.
thfuran 4 hours ago [-]
The last several comments were responding to
>2^61 is still a very large number of course, but much more feasible to reach than 2^122 when doing a collision attack. This is the reason that cryptographic hashes are typically 256 bits or more (to make the cost of collision attacks >= 2^128).
8organicbits 3 hours ago [-]
I'm not sure. 6 GHz is around 2^61 CPU cycles in 12 years, i.e. basic CPU instructions: counting, not computing a cryptographic hash. Otherwise, where is the cluster that's brute-forcing ~122-bit cryptographic hash collisions in minutes?
vlovich123 2 hours ago [-]
For what it’s worth, generating a random UUID for the purposes of collision isn’t generally much more complicated than a few arithmetic instructions, which is why I used counting as an example. And as the other poster mentioned, generating a UUID collision isn’t a security problem, since the UUID tends to be generated within your infrastructure, where you can’t really go full blast at generating UUIDs for all sorts of reasons anyway.
For cryptographic applications it is really small because the previous poster is correct that 2^64 is very small for that purpose - a small supercomputing cluster or two could decrypt such a cipher in a reasonable amount of time, which is why symmetric keys are all 256 bits and up to guarantee there’s no way to attack them.
8organicbits 48 minutes ago [-]
I don't think that's quite right. A 128-bit UUIDv4 having a 50% chance of having any collision after 2^61 generations is very different from finding a specific 128-bit symmetric key. The best cryptanalysis of AES-128 is 2^126; nowhere near 2^64. Which is why standards bodies like NIST still recommend AES-128 as a baseline.
PaulHoule 5 hours ago [-]
I think you might have trouble if you tried to assign one to every iron atom in an iron filing.
NoahZuniga 7 hours ago [-]
* not the birthday paradox, but the birthday bound.
8organicbits 7 hours ago [-]
Note that this only considers UUIDv4, the random UUID. Other forms can generate UUIDs that are much closer together. For UUIDv7, UUIDs generated within the same millisecond will have identical 48 bit prefixes (or up to 60 when the monotonic counter from section 6.2 is used).
You need to be generating >100M of them within the same millisecond before even remembering that collisions can theoretically happen.
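To make the shared-prefix point concrete, here is a rough sketch that packs the UUIDv7 fields by hand following the RFC 9562 layout (48-bit millisecond timestamp, 4-bit version, 12 random bits, 2-bit variant, 62 random bits). `make_uuid7` is a hypothetical helper written for illustration, not a standard-library function, and it ignores the optional monotonic counter.

```python
import os
import uuid

def make_uuid7(unix_ts_ms: int, random_bits: int) -> uuid.UUID:
    """Hypothetical helper: pack a UUIDv7 from a ms timestamp and 74 random bits."""
    rand_a = (random_bits >> 62) & 0xFFF           # upper 12 of the 74 random bits
    rand_b = random_bits & ((1 << 62) - 1)         # lower 62 random bits
    value = (unix_ts_ms & ((1 << 48) - 1)) << 80   # 48-bit timestamp prefix
    value |= 0x7 << 76                             # version field = 7
    value |= rand_a << 64
    value |= 0b10 << 62                            # RFC variant bits
    value |= rand_b
    return uuid.UUID(int=value)

def random_74_bits() -> int:
    return int.from_bytes(os.urandom(10), "big") & ((1 << 74) - 1)

same_ms = 1_700_000_000_000                        # arbitrary example timestamp
a = make_uuid7(same_ms, random_74_bits())
b = make_uuid7(same_ms, random_74_bits())
later = make_uuid7(same_ms + 1, random_74_bits())

print(a)
print(b)
print(str(a)[:13] == str(b)[:13])   # first 48 bits (12 hex digits) match
print(a < later and b < later)      # a later millisecond always sorts after
```

Two IDs minted in the same millisecond differ only in the random fields, so within that millisecond only those 74 bits protect against a collision.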
charcircuit 7 hours ago [-]
>You
The entire universe. Else it's not universally unique.
8organicbits 5 hours ago [-]
I like UUIDv7s as database IDs since they sort chronologically, are unique, and are efficient to generate. My system chooses the UUIDs; I don't allow externally generated IDs in. If I did, then an attacker could easily force a collision. As such, I only care about how fast I create IDs. This is a common pattern.
If your system does need to worry about UUIDv7s generated by the rest of the universe, you likely also need to worry about maliciously created IDs, software bugs, clocks that reset to the Unix epoch, etc. I worry about those more than a bona fide collision.
tonyhart7 4 hours ago [-]
Your app must be really popular to have an entire universe's "amount" of users lol
Joke aside, all of this is theoretical. In practice it's so close to impossible to hit that it doesn't matter whether it's possible or not, since you are not at Google scale anyway.
charcircuit 3 hours ago [-]
It's not just your app. It's any other app or data provider that you may now or in the future interact with.
tgv 8 minutes ago [-]
Only if the other side uses your key as theirs, and uses it to store data from many sources. I, personally, feel it's hardly worth considering. A primary key under your own control doesn't cost much, and is a better choice.
nopassrecover 3 hours ago [-]
Reminds me of a problem I ran into once where someone had wanted unique but short codes as identifiers for relatively small counts, and picked a substring of a UUID:
> However, the overall takeaway was: Don’t use the MongoDB Increment value as a Unique Identifier.
However, the overall takeaway should be, as always: don't use MongoDB. Period. Every time I learn something new about it I'm baffled about why people continue to use it.
amingilani 7 hours ago [-]
Instead of picking a target UUID and evaluating new UUIDs against it, a better experiment would be finding duplicates in all the UUIDs you have generated.
This plays nicely with the birthday paradox.
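A scaled-down version of both experiments could look like the sketch below. It assumes IDs truncated to 16 bits (taken from the low bits of a random UUIDv4) so that it finishes quickly; the helper names are made up for illustration and this is not the article's code.

```python
import uuid
from typing import Iterable, Iterator, Optional

BITS = 16  # assumed truncation so the experiment finishes in well under a second

def short_ids() -> Iterator[int]:
    """Endless stream of random UUIDv4s truncated to their low BITS bits."""
    mask = (1 << BITS) - 1
    while True:
        yield uuid.uuid4().int & mask

def draws_until_target(ids: Iterable[int], target: int) -> Optional[int]:
    """Number of draws until one specific value shows up."""
    for count, value in enumerate(ids, start=1):
        if value == target:
            return count
    return None

def draws_until_duplicate(ids: Iterable[int]) -> Optional[int]:
    """Number of draws until any previously seen value repeats."""
    seen = set()
    for count, value in enumerate(ids, start=1):
        if value in seen:
            return count
        seen.add(value)
    return None

target = next(short_ids())
print("draws to hit one fixed target:", draws_until_target(short_ids(), target))
print("draws to see any duplicate:  ", draws_until_duplicate(short_ids()))
```

With 16 bits, hitting a fixed target takes on the order of 2^16 draws, while any duplicate tends to appear after a few hundred, which is the birthday effect in miniature. Note that the duplicate search has to remember every ID it has seen.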
webstrand 7 hours ago [-]
This is the chance that, given a specific GUID, you'll find a collision for it: an utterly minuscule chance. However, the birthday paradox is what matters here: if you generate 2^62.60 GUIDs, the chance that you've generated a collision is around 99%. Still enormously unlikely to happen in practice, but 2^62.60 is way smaller than 2^122.
At a rate of comparing 400,000 guids per second, you have a 99% chance of seeing a collision within the next 553,750 years.
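These figures can be reproduced with the same birthday approximation. The sketch assumes 122 random bits and a constant rate of 400,000 GUIDs per second; the constant names are illustrative.

```python
import math

SPACE_BITS = 122
RATE_PER_SECOND = 400_000
SECONDS_PER_YEAR = 365.25 * 24 * 3600

# Draws needed for a 99% collision chance: sqrt(2 * N * ln(1 / (1 - 0.99))).
n_99 = math.sqrt(2 * 2 ** SPACE_BITS * math.log(100))
years = n_99 / RATE_PER_SECOND / SECONDS_PER_YEAR

print(f"2^{math.log2(n_99):.2f} GUIDs, roughly {years:,.0f} years")
```

The result lands at about 2^62.6 GUIDs and a little over 550,000 years, in line with the numbers above; the small difference comes from rounding the exponent.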
jonathrg 6 hours ago [-]
You would need a little more memory to see/detect that collision.
nesk_ 8 hours ago [-]
Nice experiment. Is the code available somewhere?
RS-232 6 hours ago [-]
UUID > GUID.
Microsoft’s GUID standard is garbage.
lionkor 6 hours ago [-]
Oh, why?
w-ll 5 hours ago [-]
not OP but i already have fields for time ts and what model it is. i want my uuids random.
kaoD 5 hours ago [-]
I think the current Microsoft GUID is just UUIDv7.
I don't think there's a "Microsoft standard" and they just use different versions of UUID in different products over time. No idea why they call it GUID instead of UUID though, but it's easier to speak out loud so I'm not against it.
v7 has a timestamp indeed, but isn't the time making it more collision resistant? You'd have to generate tons of UUIDv7s in the same millisecond, while v4 is more likely to collide due to not being time-constrained and the birthday paradox.
I think both have their uses though. You might need pure random if you want your UUID not to convey any time information and you're not generating tons of them (e.g. a random user id).
What do you mean "model"? Are you referring to UUIDv1 which has time and MAC address?
JdeBP 45 minutes ago [-]
Years ago, Microsoft took the same algorithm that was being used to generate these things for Remote Procedure Calls in the Open Software Foundation's Distributed Computing Environment and used that algorithm to generate IDs for its Component Object Model. This was all happening in the late 1980s, and at a point where none of it was hard and fast.
If you were doing RPC in OSF DCE your IDs were UUIDs, and if you were doing COM in Wintel your IDs were GUIDs; and that was basically the difference, a different name for the same thing when used in a different domain.
Plus the difference in endianism because one was a network-byte-order network thing and the other was an Intel Architecture byte order thing, and only some parts of these IDs were technically multiple-byte integers with byte orders to have.
But by the late 1990s this had already become lost to history, with a sea of people who had made all sorts of inferences and promoted them as gospel truth, from the fact that Microsoft had two programs named GUIDGEN.EXE and UUIDGEN.EXE, from the fact that many generators sprang up and the whole idea spread to Java and databases and this new-fangled WorldWeb thing and all sorts of stuff, from the fact that there sprang up multiple different versions of these IDs and what version an ID was depended on tooling and libraries, and from the fact that at the time Microsoft was less likely to go through formal standards processes and more likely to just write and ship things and sponsor a book and a CD-ROM of doco so if your world was RFCs and the IETF you had one worldview and if your world was Microsoft Press and the MSDN you had another worldview.
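The byte-order difference mentioned above is still visible in today's tooling. As a small illustration, Python's standard `uuid` module exposes both serializations: `bytes` is the network-order form and `bytes_le` stores the first three fields little-endian, as in the Microsoft GUID structure. The example ID is the one quoted earlier in the thread.

```python
import uuid

u = uuid.UUID("6e197264-d14b-44df-af98-39aac5681791")

# Network (big-endian) byte order, as used by the UUID/RFC form.
print(u.bytes.hex())      # 6e197264d14b44dfaf9839aac5681791
# First three fields (4, 2 and 2 bytes) byte-swapped; the last 8 bytes unchanged.
print(u.bytes_le.hex())   # 6472196e4bd1df44af9839aac5681791

# Both encodings round-trip to the same identifier.
print(uuid.UUID(bytes_le=u.bytes_le) == u)
```

Only the multi-byte integer fields are swapped; the trailing eight bytes are treated as a plain byte sequence in both forms.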
Zambyte 4 hours ago [-]
> isn't the time making it more collision resistant?
That seems to depend a whole lot on the pattern your application generates UUIDs in. If you're generating a consistent distribution over time, sure. If you generate a whole lot in bursts, collision seems to be way more likely.
kaoD 4 hours ago [-]
You have to generate 2^37 (137,438,953,472) UUIDv7s in the exact same millisecond to have a 50% chance of collision.
(Not disagreeing with you, just adding perspective.)
8organicbits 4 hours ago [-]
The math is interesting here as you'll probably want to run your system for several years, not just a single millisecond. So it's a repeated trials problem. I spent some time trying to figure out the ID generation rate that would be a "break even point" between UUIDv4 vs UUIDv7, but I didn't trust the answer I got.
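One way to frame the repeated-trials comparison is sketched below. It rests on several assumptions that are not from the thread: a perfectly uniform generation rate, 122 random bits for UUIDv4, 74 random bits per millisecond for UUIDv7 with no monotonic counter, and the birthday approximation throughout. Treat it as a back-of-the-envelope model rather than a settled answer.

```python
import math

V4_RANDOM_BITS = 122
V7_RANDOM_BITS = 74            # assumed: rand_a (12) + rand_b (62), no counter
MS_PER_YEAR = 365.25 * 24 * 3600 * 1000

def birthday(n: float, bits: int) -> float:
    """Approximate collision probability for n uniform draws from 2^bits values."""
    return -math.expm1(-n * (n - 1) / (2 * 2.0 ** bits))

def v4_collision(rate_per_ms: float, total_ms: float) -> float:
    """All IDs generated over the whole run share one 122-bit space."""
    return birthday(rate_per_ms * total_ms, V4_RANDOM_BITS)

def v7_collision(rate_per_ms: float, total_ms: float) -> float:
    """Each millisecond is an independent trial over a 74-bit space."""
    per_ms = birthday(rate_per_ms, V7_RANDOM_BITS)
    # 1 - (1 - per_ms)^total_ms, computed without losing tiny probabilities.
    return -math.expm1(total_ms * math.log1p(-per_ms))

rate = 1_000                   # IDs per millisecond (illustrative)
run_ms = 10 * MS_PER_YEAR      # ten years of continuous generation
p4 = v4_collision(rate, run_ms)
p7 = v7_collision(rate, run_ms)
print(f"v4: {p4:.3e}  v7: {p7:.3e}  ratio v7/v4: {p7 / p4:,.0f}")
print(f"2^48 ms is about {2 ** 48 / MS_PER_YEAR:,.0f} years")
```

In this model the ratio comes out close to 2^48 divided by the run time in milliseconds, regardless of the rate, so the two only break even after about 2^48 ms of uniform generation. Bursty workloads or a counter-based v7 variant would change the picture.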
https://www.rfc-editor.org/rfc/rfc9562.html#monotonicity_cou...
http://mattmitchell.com.au/birthday-problems-friendly-identi...
https://learn.microsoft.com/en-us/dotnet/api/system.guid?vie...
(Agreeing with both parents)