You should do both. Sanitize your inputs so that it can be safely stored in your...

the8472 · on Feb 27, 2020

Most input by humans is Unicode. Most people do not understand Unicode. Don't try to sanitize it other than checking it's valid UTF-8, which hopefully your programming language's string type/HTTP parameter deserializer/DB engine already does for you.

For example, some people think that stripping ZWNJ, ZWJ or other kinds of spaces is a thing they ought to do because it confuses their markup parsers or can be used to encode hidden information in posts or stuff like that. Guess what, it breaks emoji, Arabic, some Asian languages and a bunch of other things.

If you only sanitize your output and realize you made a mistake, you can easily fix it by changing your output sanitizing algorithm. If you sanitized your input, you threw away data and can't fix your problem.
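
A minimal sketch of the point above, in Python. The helper names `is_valid_utf8` and `strip_zwj` are made up for illustration; the only check kept is "does it decode as UTF-8", and the stripping function shows the damage rather than a recommendation.

```python
import unicodedata

ZWJ = "\u200d"  # ZERO WIDTH JOINER

def is_valid_utf8(raw: bytes) -> bool:
    """Return True if the bytes decode as UTF-8; no other check is applied."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

def strip_zwj(text: str) -> str:
    """The kind of input 'sanitizing' the comment warns against."""
    return text.replace(ZWJ, "")

# Family emoji: MAN + ZWJ + WOMAN + ZWJ + GIRL, drawn as one glyph where supported.
family = "\U0001F468\u200d\U0001F469\u200d\U0001F467"
stripped = strip_zwj(family)

print(len(family), len(stripped))                # code point counts before and after
print([unicodedata.name(c) for c in stripped])   # three separate people remain
```

Once the joiners are gone there is no way to tell whether the user typed one family emoji or three separate ones, which is the "threw away data" problem.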

nine_k · on Feb 27, 2020

See also: wtf-8 encoding, used by Mozilla for potentially broken Unicode input.

klodolph · on Feb 27, 2020

WTF-8 is for a very specific application: encoding, as an 8-bit sequence, a 16-bit sequence which is nominally UTF-16 but may contain unpaired surrogates. It’s not used for weird user inputs; instead, it’s used for e.g. interop with the Win32 API.
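
To make the idea concrete, here is a rough approximation using Python's `surrogatepass` error handler. This is not an implementation of the WTF-8 spec, and the function names are made up; it only shows the round trip of "possibly ill-formed UTF-16" through an 8-bit form. One known gap: the decoding direction would also accept a surrogate pair encoded as two separate 3-byte sequences, which real WTF-8 forbids.

```python
def utf16le_to_wtf8_like(units: bytes) -> bytes:
    """Encode nominally-UTF-16 bytes as 8-bit data, keeping unpaired surrogates."""
    # Valid surrogate pairs are combined by the decoder; lone ones pass through.
    text = units.decode("utf-16-le", errors="surrogatepass")
    return text.encode("utf-8", errors="surrogatepass")

def wtf8_like_to_utf16le(data: bytes) -> bytes:
    """Reverse direction: back to the original 16-bit code units."""
    text = data.decode("utf-8", errors="surrogatepass")
    return text.encode("utf-16-le", errors="surrogatepass")

# A lone high surrogate (0xD800) followed by "A": not valid UTF-16.
lone = b"\x00\xd8" + "A".encode("utf-16-le")
encoded = utf16le_to_wtf8_like(lone)
print(encoded)

# Strict UTF-8 refuses the same input, which is why a separate encoding exists.
try:
    lone.decode("utf-16-le").encode("utf-8")
except UnicodeError as exc:
    print("strict codecs reject it:", type(exc).__name__)
```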

laumars · on Feb 27, 2020

Are you sure that’s what is happening at Reddit? You shouldn’t need to sanitise your inputs for SQL. Parametrised SQL has been a thing in some languages for two decades now. This really is a long solved problem by now.

Output is a different matter though but that’s because of rendering content safely down to HTML, JavaScript or JSON (to name a few examples). SQL shouldn’t come into the equation by this point.

tie_ · on Feb 27, 2020

This. I'm tired of people implying or outright stating that SQL injection is an input validation problem. Why couldn't you have foo' OR 1=1; as a title of your post? It is all good characters as far as text entry is concerned.

SQL injection is really a problem of how you pass parameters to your SQL layer. Parametrized queries are the (easy and widely available) solution. If you are concatenating input to your SQL queries, you're doing it wrong.
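
A small sketch of what that looks like with Python's built-in `sqlite3` module. The `posts` table and its columns are invented for the example; the `?` placeholders are sqlite3's parameter style, and other drivers use different markers.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (author TEXT, title TEXT)")

author = "O'Brien"
title = "foo' OR 1=1;"

# Values travel separately from the SQL text, so quotes need no special handling.
conn.execute("INSERT INTO posts (author, title) VALUES (?, ?)", (author, title))

row = conn.execute(
    "SELECT author, title FROM posts WHERE title = ?", (title,)
).fetchone()
print(row)  # the stored values come back exactly as they went in
```

Nothing was stripped, rejected, or escaped by hand, and the title is still just a title.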

squiggleblaz · on Feb 27, 2020

How many people named O'Brien are told they can't sign up, or passwords get rejected because they contain special characters?

It's crazy.

Even if you're using 1990s technology without parameterised queries, it's not like it's impossible to say `insert into users (name, motto) values ('O\'Brien', 'foo\' OR 1=1;')`.

fwsgonzo · on Feb 27, 2020

Yep. I've stopped counting the number of websites that force me to use weak passwords. It's crazy that this is still a thing in 2020.

I wish the controls on browsers came with a green V that implements best practices (8+ symbols, no filter) so that people who made websites understand that this is what they should conform to. Not their own misconceptions about password security.

lodovic · on Feb 28, 2020

I wish websites would stop enforcing a "password policy". An insecure password should be a choice. If you are so sure you cannot secure your site, leave authentication to a third party provider. All this leads to is zillions of user accounts that are used only once.

hinkley · on Feb 29, 2020

I’m sorry, your password must be 16 characters or less and contain no white space or punctuation.

chaz6 · on Feb 28, 2020

I have heard stories about people in Ireland struggling to get certain services because their name contains a Fada, but some of the identity paperwork they have is missing the Fada due to lack of support by computer systems.

simias · on Feb 27, 2020

I blame PHP. Many webdevs active today started with it, and the standard library's solution to injections was escaping everything half a dozen times just in case. Because PHP being PHP nobody saw any red flags when they implemented a function named "mysql_real_escape_string". Apparently they've deprecated these functions since then, but the damage is done.

ivanhoe · on Feb 27, 2020

But that's not a thing for 15 years or more? PDO was added around 2005, and even before that anyone in their right mind used mysqli extension for prepared statements. Since 2012 you can't even use the mysql extension without getting a deprecation warning.

And yes, in the 90s PHP's security sucked, but that was nothing PHP-specific, it was just the sentiment of that time. Everyone did it, in all languages. I remember using tons of $dbh->do() in Perl's DBI back then, intentionally skipping prepared statements for quick and dirty stuff (and most of the scripts back then were quick and dirty stuff). It's in big part because we were used to building desktop apps and thinking in terms of the security that applied to them, like being careful about your pointers and input string lengths and stack overflows and stuff. The web was still a pretty new thing.

teh_klev · on Feb 27, 2020

> But that's not a thing for 15 years or more? PDO was added around 2005

Ex-shared hosting bod here, who had the joy of managing our PHP environments :(

Sadly in the real world, even after the great big (and pointless) act of deprecating and removing the mysql_* library, naive developers (and experienced ones that should've known better) just moved onto mysqli_* or PDO and still used string concatenation with raw inputs, instead of learning how to parameterise their queries.

Used to drive me flippin' nuts.

ivanhoe · on Feb 28, 2020

> naive developers (and experienced ones that should've known better) just moved onto mysqli_* or PDO and still used string concatenation with raw inputs, instead of learning how to parameterise their queries.

True, I stand corrected, I've just checked and Wordpress still does it just like that: https://github.com/WordPress/WordPress/blob/master/wp-includ...

hinkley · on Feb 29, 2020

The one and only time I argued with someone about PHP, the forum the guy ran got hacked the very next morning and was running a botnet of some sort. I was smug but quiet, and I never really thought about it but the timing makes me wonder if he thought I was involved.

Anyway, there’s a patch release the next day, and somewhere I find the diff. Now I can’t read PHP but I know what string concatenation looks like, especially if someone does a diff on it. I’ll be damned if the diff didn’t fix one SQL string concatenation that was less than five lines from code with the same structure. Scary.

nakkijono · on Feb 27, 2020

PHP and vulnerable example code is an additional thing. Most people just copypaste from tutorials. For example, the first search result for "PHP mysql example" gives you the wrong example first https://www.w3schools.com/php/php_mysql_select.asp

auiya · on Feb 27, 2020

> Are you sure that’s what is happening at Reddit?

"I was the first (paid) employee of reddit" https://www.jedberg.net/

I mean...

laumars · on Feb 28, 2020

That doesn't mean the information supplied was accurate. Which, as it happened, it wasn't and this whole thing was really just a massive misunderstanding.

jedberg · on Feb 27, 2020

When I said "sanitized input" I was simplifying parameterized SQL in the case of reddit. But since I said "for your datastore" I was being less specific since different datastores require different methods.

laumars · on Feb 27, 2020

Parametrised SQL isn’t sanitising your input. It’s injecting those values in at byte code so your values are separate to the query language. Calling that process “sanitisation” is, at best, highly misleading.

Also the methods don’t really change across different SQL databases, at least not conceptually. Sure the RDBMS drivers might change but these days that stuff is usually abstracted away into a single framework for SQL. The real significant change would be switching to a NoSQL database but if you’re doing that then it’s not SQL you need to be “sanitising” anyway.

jedberg · on Feb 27, 2020

> Calling that process “sanitisation” is, at best, highly misleading.

That's a fair criticism. But that is what I've called it for a long time.

> The real significant change would be switching to a NoSQL database but if you’re doing that then it’s not SQL you need to be “sanitising” anyway.

Right, that's why I started off by saying "Sanitize your inputs so that it can be safely stored in your data store", so that it could apply to any data store.

It's just a terminology argument at this point. But my main point was that you need to still think carefully about how you're going to store your data and do it safely, and then also make things safe on the way out, like the article suggests.

shawnz · on Feb 27, 2020

> That's a fair criticism. But that is what I've called it for a long time.

I think many people would agree that "sanitization" is loosely defined and I think that's exactly what led to the misunderstandings that the OP's article is trying to address.

> that's why I started off by saying "Sanitize your inputs so that it can be safely stored in your data store"

That could be interpreted like "escape quotes at the start of the request if you know that you're using a database where quotes have a special meaning" a la PHP magic quotes, which I'm guessing is not what you meant but it is what the OP is criticizing. The key is that the sanitization (or whatever you want to call it) shouldn't happen until you're ready to insert into the DB, otherwise that data will be coupled with database logic through the whole flow of your app

wccrawford · on Feb 27, 2020

You don't "sanitize your inputs" for your datastore. You escape the outputs as you send it to your datastore. (Either through oldschool methods, or parameterization.)

Sanitizing your input means changing them permanently. You don't actually want to do that. You want to store exactly what the original value was, but you want to do it safely. When you retrieve it again, it should be the same as it was originally.

If you "sanitize" the input, you won't necessarily have the original value ever again.

ashearer · on Feb 27, 2020

Yes. It's the word "sanitize" itself that misleads people. It creates the mindset that input from users is dirty and must be made clean, and "clean" is "safe" to use in any context.

(I've seen the line of thought taken one step further: taking the realization that it's impractical to make strings universally safe for any context, since even if you HTML entity-encode it twice a recipient might decode it three times, and concluding that security is hard and we can only approach it asymptotically, so *shrugs* XSS-like bugs are normal and unavoidable given finite time & budget.)

If the mindset is more like converting units, it becomes clearer. You can't concatenate HTML with a general Unicode string without converting the string to HTML first, any more than you can add inches and centimeters directly. "Cleaning" the centimeters would make no sense.
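
One way to picture the unit-conversion mindset in code is a separate type for HTML. The `Html` class and `text_to_html` helper below are hypothetical, a sketch of the idea rather than any real library's API: plain text cannot be joined to markup until it has been converted.

```python
import html
from dataclasses import dataclass

@dataclass(frozen=True)
class Html:
    """A string that is already HTML markup (hypothetical wrapper type)."""
    markup: str

    def __add__(self, other: "Html") -> "Html":
        # Refuse to mix units: plain str must be converted first.
        if not isinstance(other, Html):
            raise TypeError("convert text with text_to_html() before joining it to HTML")
        return Html(self.markup + other.markup)

def text_to_html(text: str) -> Html:
    """Convert a general Unicode string into its HTML representation."""
    return Html(html.escape(text))

name = "<b>J. O'Brien</b>"
page = Html("<p>Hello, ") + text_to_html(name) + Html("</p>")
print(page.markup)

try:
    Html("<p>Hello, ") + name  # inches + centimeters
except TypeError as exc:
    print("rejected:", exc)
```

The original `name` is never modified; only its representation in the HTML "unit" differs.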

nulbyte · on Feb 27, 2020

> But my main point was that you need to still think carefully about how you're going to store your data...

I think this is true, but doesn't quite reach the point GP is making: speaking correctly is also important. Calling parameterization input sanitization communicates the wrong message. And abstracting the wrong solution to apply it to a different problem isn't all that helpful. You could just as easily encode or hash input to fit the underlying data format without losing data (except in the case of truncation), but that isn't input sanitization, either.

Input sanitization is strictly checking front-end input against a ruleset and rejecting anything that does not comply. This is fundamentally different than dealing with anything thrown at you and handling it gracefully.

Izkata · on Feb 27, 2020

> Input sanitization is strictly checking front-end input against a ruleset and rejecting anything that does not comply.

That's validation. Sanitization involves altering the data to make it safe (comes from the word "sanitary").

organsnyder · on Feb 27, 2020

Rejecting input is a valid sanitization strategy, though.

SahAssar · on Feb 27, 2020

That's not how the word is usually used within development in my experience though.

I think most devs think of sanitization as "make X safe", not "see if X is safe, if not reject" since that is usually called validation.

Using hand sanitizer does not remove your hands if they have harmful bacteria. The hands are still there, just cleaned from (some) of the harmful parts.

Sean1708 · on Feb 27, 2020

The article refers to that as escaping output rather than sanitising input:

> And of course use your SQL engine’s parameterized query features so it properly escapes variables when building SQL:

mannykannot · on Feb 27, 2020

We should certainly use parameterization always, and this will allow us to save 'Bobby Tables' type input into our databases, but we should acknowledge that there is then the potential risk that some internal, probably non-public-facing program or script, either now or in the future, will contain a bug that leads to its execution.

The spread of natural language processing into systems and analysis tools might increase the scope for this sort of thing.

laumars · on Feb 27, 2020

I agree you can never say never but you also can't sanitise against a risk that hasn't been defined yet simply because you wouldn't know what needs to be sanitised.

For example what if your NLP is bootstrapped from a shell script and your database content has been stripped of SQL but still contains stuff that might be interpreted as $(sub-shells)? Before long you run into a situation where literally no characters are considered safe (eg even alpha characters in the English alphabet are used as tokens in some programming languages and "what if someone builds a script in one of those languages?").

The only sane way to address the unknown is to treat raw strings as "dirty" and follow best practices when handling them (plus all the usual processes to properly test your code before it's used in production). In which case you're back to no longer needing input sanitisation.

mannykannot · on Feb 27, 2020

My point is not that input sanitization is a solution, but that parameterized database input is not the end of the issue, from a broader, whole-systems security point of view.

Many data types have some concept of well-formedness, and in those cases, there are pragmatic reasons for only accepting well-formed input that go even beyond the security aspect.

laumars · on Feb 27, 2020

I completely agree and I also said this in my original comment you replied to.

This is why I got confused when you said "but we should acknowledge..." (ie thinking you were raising a point other than what myself and others had already acknowledged).

mannykannot · on Feb 27, 2020

It is not clear to me that the post I originally replied to did acknowledge this point.

laumars · on Feb 27, 2020

> You shouldn’t need to sanitise your inputs

> Output is a different matter though

"Output" doesn't have to be front end rendered nor even internet facing. It could also be input for another internal process.

mannykannot · on Feb 27, 2020

Indeed, so there was a point to be made.

laumars · on Feb 28, 2020

I honestly don't understand what the point you're trying to make is then.

If you're saying people should be aware that handling data safely requires more steps than just parametrised SQL then yes I touched on that, as have others, and it's not something anyone is unaware of. Hence why there have been so many high quality posts discussing the different methods of validation, sanitisation and escaping. So it's a rather strange position to assume when you say "we need to acknowledge" given that's what everyone (including me) has been doing. But it never hurts to be categoric about important points like that so your original post is still relevant.

If you're making some other point then I've already had two stabs at deciphering it and failed both times. So it's really not clear what that point is.

However if your intention was just trolling me then fine, I bit and you won.

TeMPOraL · on Feb 27, 2020

> Sanitize your inputs so that it can be safely stored in your data store,

That's not sanitizing input, that's escaping output - if you subdivide concerns appropriately. The data you're saving into database is the output of your program.

Really, the problem is of language translation. User input is an unstructured blob. SQL, or HTML, are structured languages, with their own semantics. Whenever you cross the language barrier, you need to translate data from one language to the other. Parametrized queries are the usual API to SQL drivers, and they do this for you under the hood, producing a valid SQL query string[0]. When going to HTML+JS, you need to invoke some library (or do translation yourself).

(I really don't like the term "escaping". Translating between languages is more than just sticking slashes in front of double quotes.)

This is why "sanitizing inputs" is a nonsense concept. The problem is of language translation, and you can't translate if you don't know the destination language. A blob correctly sanitized for SQL will not be correctly sanitized for HTML, and an input correctly sanitized for both will look bad in either.

SQL injection and XSS are the same bug. Failure to translate between languages. Usually caused by a pretty stupid but somehow very popular idea - building target language expressions by gluing plain strings together[1].

--

[0] - Sometimes. I remember this is how it worked in the past, but not sure if server RDBMS APIs haven't changed since. For comparison, with SQLite, you're passing the query with placeholders to the SQLite functions, and the parameter values are passed as arguments. The SQLite internals turn this into an executable query, but I'm pretty sure this does not involve a query string with actual parameter values in it ever existing in memory.

[1] - A related source of footguns is using templating engines for web pages. HTML is a tree of nodes, not a plain string. Using a template system is a recipe for XSS problems.

gmueckl · on Feb 27, 2020

Any actually decent RDBMS isn't stupid enough to first escape parameters, then parse the query string to find placeholders, then do a bunch of string concatenation and then run the concatenated string through a second parser. It is really simpler and more robust to parse the query string once and grab the actual data value from the parameter array once a placeholder is found.

However, a lot of client side libraries are cheating and embed the parameters into the query string before passing it to the server instead of implementing the proper parts of the protocol. This is about as safe as doing plain string concatenation in the first place. I don't trust the library authors to actually get this right.

benhoyt · on Feb 27, 2020

There's value in doing both in certain cases (as long as you're definitely escaping output), but I'd be careful about the motivation. You say: "Sanitize your inputs so that it can be safely stored in your data store" -- but it's safe to store any string in your database as long as you escape/encode it correctly (i.e., use parameterized queries).

For example, what if someone posts a legitimate comment on reddit helping someone with SQL syntax for deleting tables and includes "DROP TABLE users" -- did you sanitize that away?

jedberg · on Feb 27, 2020

> did you sanitize that away?

No, see my edit above. It was just parameterized.

iainmerrick · on Feb 27, 2020

What does the “basic SQL sanitization on the way in” consist of and what does it do for you?

The article makes sense to me; if I’m just storing strings in a database, I don’t see why they would need to be sanitized at rest, even if they contain malicious SQL code. Only when I actually come to use those strings for some purpose.

bardan · on Feb 27, 2020

But what if at some point somewhere down the line someone forgets to sanitize the output? Surely better procedure to sanitize at both ends. Nobody is perfect.

JimDabell · on Feb 27, 2020

You're thinking about this as if data can be in one of two states – untrusted or sanitised. This is not the case.

When you output arbitrary data, you need to encode it in a way that is suitable for that context. These contexts might be:

- Generating a web page.

- Including in a JSON response from an API.

- Sending an email.

- Storing in an SQL database.

These all use different formats / protocols that use different syntax to encode data. How you correctly encode data for one of them is different to how you correctly encode data for another of them. There is no method of taking untrusted data and "sanitising" it so that it is correct for all of them. What works for one will break for the rest.

If you want to handle arbitrary data correctly and safely, store it as-is and when the time comes to use it, encode it appropriately for the context you are using it in. Where possible, use tools and systems that get it right by default instead of requiring developers to remember to encode correctly, e.g. generate HTML with templating engines that encode data as HTML by default, and use parameterised queries with SQL.
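
As a rough illustration of "store as-is, encode per context", here is the same stored value encoded three different ways with Python's standard library. The sample value is made up; SQL is left out because parameterised queries handle that case without any explicit encoding step.

```python
import html
import json
from urllib.parse import quote, unquote

# Stored exactly as the user typed it.
value = 'Tom & "Jerry" <tj@example.com>'

as_html = html.escape(value)           # for HTML text or attribute values
as_json = json.dumps({"name": value})  # for a JSON API response
as_url = quote(value, safe="")         # for a URL path segment or query value

for label, encoded in [("html", as_html), ("json", as_json), ("url", as_url)]:
    print(f"{label:5} {encoded}")

# Each encoding is reversible, so nothing about the original is lost.
assert html.unescape(as_html) == value
assert json.loads(as_json)["name"] == value
assert unquote(as_url) == value
```

The three outputs are all different, which is why no single "sanitised" form of the value could be correct for every destination.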

TeMPOraL · on Feb 27, 2020

> generate HTML with templating engines that encode data as HTML by default

Don't, unless you're sure the templating engine actually parses the HTML into a tree of nodes before interpolating and re-emitting it. Otherwise it's likely someone will interpolate something in an improper context, e.g. inside <script> or <style> block.
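
A sketch of why the context matters, assuming a page that embeds user data as JSON inside a `<script>` block. The `json_for_script` helper is hypothetical; replacing `<`, `>` and `&` with JSON Unicode escapes is a common technique, but treat this as an illustration, not a vetted encoder.

```python
import html
import json

def json_for_script(value) -> str:
    """Serialize value as JSON that can sit inside a <script> block.

    The characters <, > and & are written as JSON Unicode escapes, so the HTML
    parser never sees a closing script tag, while the decoded value is unchanged.
    """
    return (
        json.dumps(value)
        .replace("<", "\\u003c")
        .replace(">", "\\u003e")
        .replace("&", "\\u0026")
    )

comment = "</script><script>alert(1)</script>"

# HTML-escaping is the wrong conversion here: script content is not
# entity-decoded, so the script would see the literal text "&lt;/script&gt;...".
wrong = html.escape(comment)

right = json_for_script(comment)
print(f"<script>const comment = {right};</script>")
```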

cjfd · on Feb 27, 2020

This is completely the wrong attitude. Code should be correct, not fail-safe. All of the safety hatches that people tend to introduce make the code less predictable and eventual problems tend to arise far away from where they originated, making them difficult to diagnose. What is the result of this sanitization? Now we have some undefined, changing internal string format running around in our application, and possibly multiple undefined and changing internal string formats, where it is also unclear at what point a string is supposed to be in what format. If something is an arbitrary string it should be allowed to be an arbitrary string and the things handling that should escape it the appropriate way. The article is correct.

Scarblac · on Feb 27, 2020

Code should ideally be correct and fail-safe; deal gracefully with bad input, ensure correct output, at all levels. Ideally.

That still doesn't mean you should care about output sanitization on data entry, as you don't know how it will be output yet.

zAy0LfpBZLC8mAC · on Feb 27, 2020

> Code should ideally be correct and fail-safe; deal gracefully with bad input, ensure correct output, at all levels. Ideally.

NO!

If you get bad input, you should fail, loudly. Anything else is a recipe for disaster.

(Not as in "crash the whole system", of course, but as in "reject the request".)

trashbindigger · on Feb 27, 2020

I’m confused because I agree with this comment, but your other comments mentioned:

> "sanitizing input" is plain nonsense

> "Unsafe input" is not a thing.

Have you ever used hand sanitizer? The point of hand sanitizer is to reject infectious diseases outright.... you know, so they don’t get IN your body. You seem to have adopted a narrowly defined sense of sanitization which does not include “mercilessly discard/destroy”.

zAy0LfpBZLC8mAC · on Feb 27, 2020

> The point of hand sanitizer is to reject infectious diseases outright.... you know, so they don’t get IN your body.

There is no such thing as "infectious input data".

> You seem to have adopted a narrowly defined sense of sanitization which does not include “mercilessly discard/destroy”.

None of the dictionaries I just checked support such a definition. They are all about "changing something to be more sane/sanitary/pleasant/acceptable/...".

Also, mind you a hand sanitizer doesn't destroy your hand, it destroys microbes, in order to make your hand sanitary, so as to enable you to continue using your hand instead of rejecting/discarding it. Which is exactly the kind of thing you should not ever do with input data.

eythian · on Feb 27, 2020

This tends to lead to systems that don't work and die on real-world edge cases.

zAy0LfpBZLC8mAC · on Feb 27, 2020

No, it doesn't. That is what "accept anything" programming leads to.

The idea that "accepting all the inputs" somehow gives you an advantage is an illusion: If the semantics of some input are not well-defined, then the only thing you gain by accepting it anyway are hard to debug interoperability problems and vulnerabilities. When some input is not well-defined according to the spec, then your interpretation is just a random guess, and the next developer will make a different random guess as to what that input means, and so an interoperability problem and potential vulnerability is born. If you reject the invalid input, you will notice the error and thus fix the source of the invalid input to produce input for which the semantics are actually well-defined.

ashearer · on Feb 27, 2020

This sounds good in theory, but I'll give a counterexample.

Requirement: Name input box.

Implementation: We'll sanitize the input by rejecting any characters likely to be dangerous if mishandled, like single quotes, or anything else we don't immediately imagine to be useful. If a character turns out to be needed later, that's no problem. We'll just change the list.

Security audit: Passes

Later customer complaint: I can't sign up! — J. O'Brien

Dev team: Sorry, too bad. We'd have to re-audit everything and possibly modify code to allow your last name, because there might be code somewhere that relies on the original sanitization for security. That was the point of sanitizing on input, after all. If you want to sign up, it would be easiest for us if you would just change your name.

zAy0LfpBZLC8mAC · on Feb 27, 2020

I think you misunderstood my point. I am not saying that you should reject valid (that is: semantically meaningful) input, but that if you are confronted with semantically meaningless input, you should reject it rather than garble it so that it gains some random meaning.

So:

Name input field, value "J. O'Brien": accept

JSON parameter, value "{foo:bar}": reject

The context was the idea that you should gracefully accept bad input. If your code considers "J. O'Brien" bad input for a name, then that's the problem, not that it doesn't accept bad input.
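
The two cases above might look like this in code. `BadRequest`, `parse_json_param` and `parse_name` are hypothetical names for the sketch; the point is that malformed input is rejected loudly while meaningful input is passed through untouched.

```python
import json

class BadRequest(Exception):
    """Raised so the caller can answer with an error (e.g. HTTP 400)."""

def parse_json_param(raw: str):
    """Accept well-formed JSON; reject anything else instead of guessing."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError as exc:
        raise BadRequest(f"malformed JSON: {exc.msg}") from exc

def parse_name(raw: str) -> str:
    """Accept any non-blank name and return it unchanged."""
    if not raw.strip():
        raise BadRequest("name must not be blank")
    return raw

print(parse_name("J. O'Brien"))  # accepted, apostrophe and all

try:
    parse_json_param("{foo:bar}")
except BadRequest as exc:
    print("rejected:", exc)
```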

ashearer · on Feb 27, 2020

Yes, I completely agree in the above case. The JSON input has a well-defined format and input validation should reject it outright.

The issue is that when developers hear they should "reject bad input" in order to avoid vulnerabilities, they often interpret it as a call to reject any user input that isn't already known to be good. Since user inputs are often free text, like the name field, they wind up forbidding any input they hadn't specifically imagined, which doesn't align with any particular recipient's actual data requirement. It creates false-negative edge cases while only providing illusory help against vulnerabilities.

zAy0LfpBZLC8mAC · on Feb 27, 2020

I mean, I generally agree, but I think it's already problematic to frame it as "user input that isn't already known to be good". Because "J. O'Brien" is known to be good. The problem is that anyone thinks in the first place that some semantically meaningful input value for some reason is not good.

Scarblac · on Feb 27, 2020

You can't sanitize for output at input time, as the sanitization that needs to be applied is different for HTML, JS and JSON. You don't know that at input time.

XMPPwocky · on Feb 27, 2020

Double-escaping is silly & it\'s just plain incorrect.

laumars · on Feb 27, 2020

You might need to escape strings differently depending on where you’re outputting eg HTML or JSON.

jve · on Feb 27, 2020

Well, use libraries/frameworks that FORCE you to sanitize and make outputting raw content the exceptional case.

Examples. PHP: Using mysql_escape_string is a no-no - you will forget to add it one day. Using parametrized queries you won't write unsafe SQL.

.NET Core - Outputting to HTML by default only outputs those chars to HTML which are in a predefined Unicode range. All other chars will be converted to HTML entities. If you want to output raw, you must explicitly use @Html.Raw https://docs.microsoft.com/en-us/aspnet/core/mvc/views/razor...

thephyber · on March 1, 2020

This doesn't pass the sniff test. If someone can forget to sanitize the output, someone can also forget to sanitize the input. The most important things are to understand where the content is used, use the appropriate output encoding/escaping, have rigorous tests to ensure your expectation correctly escapes nasty strings, and that you keep the output escaping code up to date to protect against novel attacks and new browser/app features.

I worked at a social media company with one of the largest text-based user-content-stores in the world at the time. Some of the features had input-side encoding and some had output-side encoding. I was there ~10 years after the bad practice of input-side encoding started and it very quickly became too cumbersome to know exactly which fields were encoded with what encoding (and I mean both character encoding and htmlentities / specialchars / specific character stripping / etc). We started getting ridiculous bugs like passwords could not contain '&' characters or logins would fail matching what we had in the DB.
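
How the '&' password bug came about isn't spelled out above, so the following is only one hypothetical way it can happen: the signup path HTML-encodes the input before hashing and the login path does not. The fixed salt is for the demo only.

```python
import hashlib
import hmac
import html

def hash_password(password: str, salt: bytes) -> bytes:
    """Derive a password hash with PBKDF2 (parameters chosen for the demo)."""
    return hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, 100_000)

salt = b"fixed-salt-for-demo"  # never use a fixed salt in real code
password = "s3cret&more"

# Signup path applied input-side encoding, so "&" became "&amp;" before hashing.
stored = hash_password(html.escape(password), salt)

# Login path hashes what the user actually typed.
attempt = hash_password(password, salt)

print(hmac.compare_digest(stored, attempt))  # the correct password no longer matches
```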

It's not about being perfect. That will never happen. It's about storing exactly what the user submitted (if it is accepted by the POST submission logic) and to correctly encode the output for the correct security context (HTML, XML, JSON, html entities, html attribute, script tag, styles/stylesheet, urls, uploaded filename / file contents, filesystem injection, command injection, etc). These all have different rules. You can unintentionally open yourself to a vulnerability in one if you only expect the output to be displayed in HTML.

fetbaffe · on Feb 27, 2020

I'm no fan of sanitizing inputs, i.e. transforming unsafe input to safe input and then storing it, because as you say, someone will find a way to circumvent that.

What I always do is to exactly specify what is allowed in any input by parsing and schema validation. If it is HTML, I run an HTML parser to validate accepted tags & attributes. If it is plain text, I validate that there is no HTML in it, etc.

If the input fails the filter then you deny the request.
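
A rough sketch of that approach with Python's built-in `html.parser`. The `ALLOWED` policy and the function names are invented for the example, and this is a validator (report and deny), not a sanitizer; a production system would more likely lean on a dedicated, well-tested library.

```python
from html.parser import HTMLParser

# Hypothetical policy: allowed tags mapped to their allowed attributes.
ALLOWED = {"p": set(), "b": set(), "i": set(), "a": {"href"}}

class AllowlistChecker(HTMLParser):
    """Collects violations instead of rewriting the input."""

    def __init__(self):
        super().__init__()
        self.violations = []

    def handle_starttag(self, tag, attrs):
        if tag not in ALLOWED:
            self.violations.append(f"tag <{tag}> not allowed")
            return
        for name, value in attrs:
            if name not in ALLOWED[tag]:
                self.violations.append(f"attribute {name!r} not allowed on <{tag}>")
            elif name == "href" and not (value or "").startswith(("http://", "https://")):
                self.violations.append("href must be an http(s) URL")

def validate_html(fragment: str) -> list[str]:
    """Return a list of violations; an empty list means the fragment passes."""
    checker = AllowlistChecker()
    checker.feed(fragment)
    checker.close()
    return checker.violations

print(validate_html('<p>Hi <a href="https://example.com">there</a></p>'))
print(validate_html('<p onclick="x()">hi</p><script>alert(1)</script>'))
```

If the returned list is non-empty, the request is denied and the input is never stored.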

This has the advantage that you always know what data structures you are storing in the database and that will make future data migrations much easier.

The drawback is that if your filter is too strict then you deny a valid request; however, it is easier to loosen a filter later than to migrate unwanted/unknown data that you accidentally accepted.

Stored input is also part of your database schema.

And of course, always escape output even if you know the data is "safe".

lucideer · on Feb 27, 2020

Replying to the edit for further clarification:

> Apparently I shouldn't have simplified "paramaterazation of SQL" as "sanitize your input"

It's not a simplification, it's incorrect. An SQL query is output. You are sending data out of your application code, via the SQL library driver.

It may seem like I'm splitting hairs here, but I've seen this distinction misunderstood in this way often enough in situations that severely compromise security.

Some people think of "output" as purely the end product of application flow, and all I/O that happens in between is somehow lumped together as indistinct "input". There's a reason I/O has two separate letters, and the distinction is crucial for securing your application.

TL;DR:

Values passed to an I/O function (like stdout, file.write(), response.write(), db.query(), etc.) are all output; those returned from an I/O function (like stdin, file.read(), db query results or requestObject.getQueryParam()) are input. Sanitize the former, NOT the latter.

Validate the latter, ideally (though I'd say this is more about stability than security).
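As an illustration of treating the query as output, here is a small sketch with Python's built-in `sqlite3` module and an in-memory database. The table and column names are invented for the example; the value is the post title mentioned upthread.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (title TEXT)")

# Input: taken as-is, not "sanitized". It is a perfectly valid title.
title = "foo' OR 1=1;"

# Output to the database: the driver binds the value via the ? placeholder,
# so it is never parsed as part of the SQL text.
conn.execute("INSERT INTO posts (title) VALUES (?)", (title,))

row = conn.execute(
    "SELECT title FROM posts WHERE title = ?", (title,)
).fetchone()
print(row[0])  # the title comes back byte-for-byte unchanged
```

The encoding for the SQL context happens at the `execute()` call, which is the output boundary, and nothing about the stored value had to change.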

bloak · on Feb 27, 2020

> Sanitize your inputs so that it can be safely stored in your data store

What kind of data store can't hold all possible strings of characters? Except perhaps '\0'. I'll make an exception for '\0'.

(EDIT: There is of course the question of what you mean by a character. Sometimes it's an octet. Sometimes it's a Unicode code point.)

jedberg · on Feb 27, 2020

It’s more a question of how you get them in there. For example if your data store is JSON based, you’ll need to escape some strings.

TeMPOraL · on Feb 27, 2020

That sounds like a broken datastore API. In a properly designed API, you don't need to escape anything, because the API implementation ensures your data doesn't get read as code.

dropmann · on Feb 27, 2020

It depends on the perspective. In the case of SQL I would argue that sanitizing the input is the same as escaping the output, because the query you are sending to the database is the output.

As a term, however, "escaping the output" implies you are doing it right, while "sanitizing the input" could also mean you just replace("DROP", "") etc. (My last name is Dropmann, I know what I am talking about.)
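A quick sketch of why that kind of keyword stripping goes wrong in both directions. Both functions are hypothetical stand-ins for the naive filter described above; the case-insensitive variant is an assumption about how such a filter typically gets "improved".

```python
import re


def strip_drop_exact(value):
    # The literal filter from the comment: case-sensitive, single pass.
    return value.replace("DROP", "")


def strip_drop_anycase(value):
    # Assumed "hardened" variant: same idea, but case-insensitive.
    return re.sub("drop", "", value, flags=re.IGNORECASE)


# The exact-match filter misses lowercase SQL entirely.
print(strip_drop_exact("drop table users"))       # unchanged

# The case-insensitive one mangles a legitimate surname...
print(strip_drop_anycase("Dropmann"))             # "mann"

# ...and is still bypassed, because removing the inner match
# splices the outer fragments back into the keyword.
print(strip_drop_anycase("DRDROPOP TABLE users"))  # "DROP TABLE users"
```

So the filter throws away real data and still does not stop the thing it was meant to stop.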

ashearer · on Feb 27, 2020

The difference is where it's done. "Sanitizing the input" implies that it happens when the value is read, so that all uses of the value are stuck with a single result. "Escaping the output", in your example, would happen in the database or its driver, for parameterized queries. HTML output of the same value in the same request would be escaped differently within a function that builds HTML output.

pirate_dev · on Feb 27, 2020

Can you explain to me why reddit feels like it is held together with duct tape? IMHO, it has the most problems with site uptime and basic functionality of any major site on the net. I am always getting search problems, site unavailable, or some other such glitch with it. I can't believe you guys just don't know what you're doing, so what does the present setup offer that is worth this shitty performance?

cerberusss · on Feb 27, 2020

> always getting search problems, site unavailable, or some other such glitch

This sounds like my spouse saying "you always...". It's obviously not "always"; I think you'll get a better reply when you quantify it. For example, in the last month, how many times did you get a search problem (which was it), a site unavailable or something else (what was it).

pirate_dev · on Feb 27, 2020

It's about once per session for me. Seriously, compared to every other major site I know of, it is in a class of its own for flaky UX. At least one of the things I describe above per afternoon, let's say. Often many more than one if it is having serious problems. As for the issues with you and your spouse, I will just say it sounds like there is an opportunity for improved communication there. Best of luck.

robotnikman · on Feb 27, 2020

IIRC I do not think he is part of Reddit's administration team anymore.

Alex3917 · on Feb 27, 2020

> You should do both.

You should really do all three:

1) Sanitize input on the server.

2) Use a CSP to prevent your browser from rendering any asset unless specifically whitelisted.

3) Use your front end framework to sanitize text being displayed.

iamaelephant · on Feb 27, 2020

Are you implying that Reddit is concatenating SQL queries instead of parameterizing properly? Because based on the reliability of the website I'm not surprised by this, but it's still hilarious.

jedberg · on Feb 27, 2020

No, it is not just concatenated.

zAy0LfpBZLC8mAC · on Feb 27, 2020

Now, you have amended your comment to really say something very different (how is parameterized SQL 'basic' anything when it's actually the correct complete solution to the problem?).

But in any case, this still suggests a complete misunderstanding of the point of that blog post. As far as that blog post's point is concerned, the SQL database is an output of your program. And the whole point is that you need to escape/encode all outputs correctly, but you should not ever sanitize anything.

jedberg · on Feb 27, 2020

Because as others have said, parameterized SQL is a long standing solution, so I consider it pretty basic, but I still consider it a form of sanitization that not everyone does, even though it's a long standing solution.

I think it is totally fair to interpret the article as saying that the database is a program output. And if you interpret it that way, what I said doesn't make sense.

That's just not how I interpreted it.

IshKebab · on Feb 27, 2020

If it's even possible for someone to "find a way to hack your sanitisation" then you're doing it wrong.

anonsivalley652 · on Feb 27, 2020

Yeah, I "LOL" at these kinds of One True Way™ proclamatory headlines.

What I don't understand is the lack of using proper escaping functions when generating SQL. Templating SQL without escaping is the surest way to a SQL injection.

--- Bad

    # name is interpolated raw, so a ' in it ends the string literal
    "SELECT * FROM USERS WHERE name = '%s'" % (name)

--- Good

    "SELECT * FROM USERS WHERE name = '%s'" % sq_esc(name)

    # where sq_esc() doesn't add outer ', but escapes anything that needs it

And on defensive coding:

0. Sanitize input. Always, always.

1. Assert pre-condition invariants.*

2. Process.

3. Assert post-condition invariants.*

4. Generate correct output by understanding the output domain.

* Unit tests, smoke tests, integration tests and code-coverage alone are insufficient to cover complex code paths. Fuzzing with asserted invariants is a good way to shake the dust out of hairy code.
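For what "fuzzing with asserted invariants" can look like in miniature, here is a sketch in Python. The `render` function is a hypothetical stand-in for step 4 (it just wraps `html.escape`), and the alphabet and iteration count are arbitrary choices; a real setup would more likely use a proper fuzzer or property-testing library.

```python
import html
import random

# Characters chosen to provoke escaping edge cases (arbitrary selection).
ALPHABET = "abc <>&\"';#x27ampltgt"


def render(value):
    """Hypothetical output step under test: encode text for an HTML context."""
    return html.escape(value, quote=True)


def fuzz(iterations=1000, seed=0):
    rng = random.Random(seed)  # seeded so a failure can be reproduced
    for _ in range(iterations):
        length = rng.randint(0, 40)
        value = "".join(rng.choice(ALPHABET) for _ in range(length))
        out = render(value)
        # Post-condition invariant 1: no raw markup characters survive.
        assert not any(ch in out for ch in "<>\"'"), (value, out)
        # Post-condition invariant 2: encoding loses no information.
        assert html.unescape(out) == value, (value, out)
    return iterations


fuzz()
```

The asserts carry the failing input with them, so a violated invariant points straight at the string that broke it.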

DelightOne · on Feb 27, 2020

Defensive coding makes sense.

What is SQL templating used for in the face of parameterized queries?