Why Sync Is So Difficult (gigaom.com)
50 points by peter123 on May 10, 2009 | 23 comments


As far as I can tell, Dropbox has actually solved the desktop file-sync problem. It's the first sync product of its kind that I've used that works out of the box.

Yes, sync is hard. Where I think things fail is mainly around user experience. You just have to make smart decisions and let users recover if they see something they don't expect. Dropbox does this quite well.

I learned this when I was at Microsoft on the ActiveSync team -- we actually were trounced by RIM not because our sync was worse (it was probably superior) but because we initially made the mistake of OVER-reporting status (even minor conflicts that normally you would want to just ignore.)

Sync should be a silent, no/very little UI experience -- a utility that just works in the background. Any attempt to make it more than that will cause the product to fail miserably.


Regarding "sync should be silent": in a perfect world it would be so, but this is far from one.

There is no way to automate conflict resolution for binary files when the same file is modified in two places. You can simply pick the most recently modified version, but that is less than optimal because the other modifications are lost. Even Dropbox isn't totally silent: in a conflict such as this, its server keeps revisions so you can pick the correct one, which makes it not silent.

Silent sync (in an automated, no-user-interaction way) has been, is, and always will be a dream (at least for the foreseeable future).


But this is exactly where Dropbox is genius. They just keep both files - they don't scream at you that there's a conflict. In the case that you open up a file and it's not quite right, you're going to look into it. You'll find the alternate file and then copy/paste your changes in. Then you'll re-save and ta-da, you're OK. But there are going to be all sorts of situations where you'll have minor conflicts, and the user will never, ever notice.
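
A minimal sketch of the "keep both files" idea described above, assuming the caller has already detected that both sides changed the same file. The function names and the exact copy-name format are hypothetical, chosen only for illustration; this is not Dropbox's implementation.

```python
from pathlib import PurePosixPath

def conflict_copy_name(path: str, device: str, date: str) -> str:
    """Build a sibling file name for the second version, keeping the extension."""
    p = PurePosixPath(path)
    # Hypothetical naming scheme: "<stem> (<device>'s conflicted copy <date>)<ext>"
    return str(p.with_name(f"{p.stem} ({device}'s conflicted copy {date}){p.suffix}"))

def keep_both(files: dict, path: str, incoming: bytes, device: str, date: str) -> str:
    """Store the incoming version next to the existing one instead of overwriting.

    `files` maps path -> content and stands in for the synced folder.
    Returns the path under which the incoming version was stored.
    """
    copy_path = conflict_copy_name(path, device, date)
    files[copy_path] = incoming   # the existing file at `path` is left untouched
    return copy_path

folder = {"notes/todo.txt": b"buy milk"}
stored = keep_both(folder, "notes/todo.txt", b"buy eggs", "laptop", "2009-05-10")
```

Nothing is lost and nothing is reported; the user only discovers the second file if they go looking for it.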


SugarSync also does just that: the file is duplicated and the user is left to resolve the conflict at a later time. Actually it does that in a really clever way so that if users keep editing over the already conflicting versions, it doesn't create new conflict versions but rather treats the new files as further versions of existing conflict duplicates...


Dropbox would be perfect if I could use my own private server.


If someone could put the Dropbox GUI on the git content tracker backend...


Lotus Notes has been doing multi-master replication for over 20 years. The very architecture is built on it. Replication can be push-pull, push, or just pull, and it can be scheduled or user-initiated. It automatically merges changed documents during replication. When the automatic conflict resolution fails, it saves both copies and requires human intervention. Damien Katz (creator of CouchDB) was a programmer for a core part of Notes until about 5 years ago, and Notes was the major influence on CouchDB.

There are a ton of other good features in Notes (and a few bad ones too).


Sing it, sister.

I implemented my own 2-way syncing for ShoveBox and its new iPhone app (wonderwarp.com/shovebox).

I was very thorough in the way I did it, but there are still small issues that I wasn't able to resolve by release time.

I'm going to write a blog post on this soon.


Wouldn't building a sync on top of something like git simplify the whole process?


The problem is that not every change can be auto-merged, and a "normal" user does not want to deal with resolving merge conflicts. Binary file formats present another problem - imagine trying to merge two different changes to the same image or Word document.


It's actually even worse than that. Even plain text files can cause problems. You make a change to a file over here, then you delete the file over there. What should be the result of the merge? There are an infinite number of such screw cases because the "right answer" depends on the semantics of the data. For example: you add a line to a file over here, then you add the same line to the same file in the same location over there. What should be the result of the merge? Should you end up with one extra line or two? What if one line has an extra trailing space? What if it was an extra leading space and the file contains Python code?
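
A rough sketch of why these cases are hard, assuming a whole-file three-way merge against a common base version. The function `merge_file` is hypothetical and deliberately knows nothing about what the data means, which is exactly the limitation being described.

```python
from typing import Optional, Tuple

def merge_file(base: Optional[str], ours: Optional[str],
               theirs: Optional[str]) -> Tuple[str, Optional[str]]:
    """Whole-file three-way merge. None means the file does not exist on that side."""
    if ours == theirs:
        return ("ok", ours)        # both sides agree (including "both deleted")
    if ours == base:
        return ("ok", theirs)      # only the other side changed
    if theirs == base:
        return ("ok", ours)        # only our side changed
    return ("conflict", None)      # both changed differently: needs semantics

base = "a\n"
edit_vs_delete = merge_file(base, "a\nb\n", None)
same_line_twice = merge_file(base, "a\nb\n", "a\nb\n")
trailing_space = merge_file(base, "a\nb\n", "a\nb \n")
```

Here edit-versus-delete is a conflict, the identical addition silently becomes one line (which may or may not be what the user wanted), and a single trailing space turns the same addition into a conflict.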


But don't these problems exist regardless of your approach? Git and its way of handling trees of old commits is a decent place to start. It clearly isn't a final solution, but building on top of it seems like a worthwhile direction, at least to me.


> But don't these problems exist regardless of your approach?

Yes, of course. That was exactly the point I was trying to make: it's a fundamentally hard problem.

> Git and its way of handling trees of old commits is a decent place to start. It clearly isn't a final solution, but building on top of it seems like a worthwhile direction, at least to me.

That depends on what problem you want to solve. If you want to solve the general data-storage-in-the-cloud problem, then Git is fundamentally flawed because 1) one of the inescapable aspects of the problem is that the solution depends on the semantics of the data and 2) Git by design knows nothing about the semantics of the data.


Yes, there are plenty of concepts from git that are useful (like tracking changes to content, rather than files.)

I don't think using git itself directly would solve many of the hairy platform issues, because they are really outside the scope of what git itself tries to do.

There's also the complication that if you want a zero-knowledge approach to privacy (where server admins can't read your data OR your file/folder names) some new data structures need to be invented.


See "Cryptree: A Folder Tree Structure for Cryptographic File Systems" by Grolimund et al. at http://www.dcg.ethz.ch/publications/srds06.pdf


Interesting. The design is somewhat similar to how SpiderOak works. They were both created at about the same time, it seems.


Interesting idea. I think most of them allow at least some versioning; whether it's accessible by the user is another story.

My thought would be to store some meta-data on devices about the current version of the file, and just check the remote server for a newer version before it's opened.
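
One way that idea might look, as a sketch only. `FileMeta`, `needs_refresh`, and the `remote_versions` mapping are hypothetical stand-ins for whatever the device stores and whatever the server reports; no real service's API is implied.

```python
from dataclasses import dataclass

@dataclass
class FileMeta:
    path: str
    version: int   # last version of the file this device has synced

def needs_refresh(local: FileMeta, remote_versions: dict) -> bool:
    """Return True if the server reports a newer version than the device holds.

    `remote_versions` maps path -> latest version number on the server
    (an assumed response shape). Unknown paths are treated as up to date.
    """
    return remote_versions.get(local.path, local.version) > local.version

meta = FileMeta(path="report.doc", version=3)
stale = needs_refresh(meta, {"report.doc": 5})
fresh = needs_refresh(meta, {"report.doc": 3})
```

This check only helps while the device is online at open time, so it complements conflict handling rather than replacing it.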


SugarSync IMHO takes a backwards approach to syncing. Their original model didn't even include historical versions - wrong way syncs destroyed data! The system works by observing and then replaying the events (file creation, moves, etc.) observed on one device to others.

SpiderOak implements a different approach, having initially built a comprehensive journaling backup. Sync happens as a result of logically combining the journal entries from all available end points. There's no "event replay." The final state for a folder is calculated based on the user's likely intent from the totality of all actions taken in each folder over time. The set of actions to perform locally on any device is the diff between the calculated end state and the local state.
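
To make the last step concrete, here is a small sketch of "actions = diff between the calculated end state and the local state". It assumes both states are given as a mapping from path to content hash, and it says nothing about how the end state is computed from the journals. The names are hypothetical and this is not SpiderOak's code.

```python
def plan_actions(target: dict, local: dict) -> list:
    """Return the actions that turn `local` into `target`.

    Both arguments map relative path -> content hash.
    """
    actions = []
    for path in sorted(target.keys() | local.keys()):
        if path not in target:
            actions.append(("delete", path))   # present locally, gone in end state
        elif path not in local:
            actions.append(("fetch", path))    # in end state, missing locally
        elif target[path] != local[path]:
            actions.append(("update", path))   # present on both, content differs
    return actions

end_state = {"a.txt": "h1", "b.txt": "h2"}
device_state = {"b.txt": "h3", "c.txt": "h4"}
actions = plan_actions(end_state, device_state)
```

Because the plan is derived from states rather than replayed events, running it twice on an already-synced device yields an empty list.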

There are still some good points here. The cross-platform issues mentioned are subtle (and SugarSync doesn't even support Linux). For instance, ":" is a valid character in Mac/Linux filenames but not on Windows. And case sensitivity/insensitivity can create conflicts where they wouldn't otherwise exist.
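
A sketch of how a sync client might flag these two problems before copying a folder to another platform. The function name is hypothetical, and `str.casefold()` is only an approximation of how a case-insensitive filesystem compares names.

```python
# Characters that Windows does not allow in file names.
WINDOWS_RESERVED = set('<>:"/\\|?*')

def portability_problems(names: list) -> list:
    """Report names in one folder that would not survive a cross-platform sync."""
    problems = []
    seen = {}   # casefolded name -> first spelling encountered
    for name in names:
        bad = sorted(set(name) & WINDOWS_RESERVED)
        if bad:
            problems.append(f"{name!r}: not allowed on Windows: {''.join(bad)}")
        key = name.casefold()
        if key in seen and seen[key] != name:
            problems.append(f"{name!r} collides with {seen[key]!r} when case is ignored")
        seen.setdefault(key, name)
    return problems

report = portability_problems(["Notes.txt", "notes.txt", "12:30 call.txt"])
```

Both "Notes.txt" and "notes.txt" are fine on a case-sensitive Linux filesystem, so the conflict only appears once a case-insensitive machine joins the sync set.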

For character encoding, Windows is actually the easiest, with Unicode stored natively. Mac and most Linux distros use UTF-8, but there's nothing stopping users from dumping a bunch of filenames with arbitrary heterogeneous encodings all in the same folder.


The single reason why SugarSync listens to filesystem events is to build the journals and then merge them. I am not sure how that is different, and how that wouldn't match the user's likely intent.

You're right that there weren't historical versions. But that was a year back. They're here now; have you not seen them?

Unicode encoding is very subtle. The difficulty is that there are several different, but equivalent, Unicode sequences for the same string, and different filesystems use different normalizations so that they can compare their strings byte by byte.
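
A short illustration of the normalization point using Python's standard `unicodedata` module. Comparing in NFC is an assumption made for the sketch; a real client would have to match whatever form each filesystem uses.

```python
import unicodedata

composed = "caf\u00e9"       # 'é' as a single code point (NFC form)
decomposed = "cafe\u0301"    # 'e' followed by a combining acute accent (NFD form)

def same_name(a: str, b: str) -> bool:
    """Compare two file names after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

raw_equal = composed == decomposed
bytes_equal = composed.encode("utf-8") == decomposed.encode("utf-8")
normalized_equal = same_name(composed, decomposed)
```

The two names look identical on screen but differ as strings and as UTF-8 bytes, so a byte-by-byte comparison would treat them as two separate files.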


We are working on a syncing product for companies and everything he wrote is spot on.

I would add that file locking is also a huge problem, even when it comes to simple conflict resolution.

Throw in case-sensitivity issues, among others, and yeah, sync is difficult.


Aaaah yes locked files are fun too, I forgot to mention that in the article.


Some day, we'll have IMAP for files and this will all go away.


Why is sync so difficult?
Two-phase commit is all-or-nothing, and asynchronous replication does not use it.



