

  • There is no comprehensive, reliable functionality in GitHub that will alert you to the exposure of sensitive data.
  • You must publish code that’s free of sensitive information, and you’ve got to do it the hard way.

Pressure leads to shortcuts. Careless commenting is one of those shortcuts, and shortcuts introduce risk. A typical issue with code comments? Confidential data left exposed. Maybe the programmer intended to edit it out – or simply didn’t think of it as consequential.
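To make the risk concrete, here’s a minimal, invented sketch – the URL, key and password below are fabricated for illustration – showing how a “temporary” comment can leak secrets even when the executable code itself is clean:

```python
# Hypothetical example: the code handles credentials safely,
# but a "temporary" comment leaks them anyway.
# (The API URL, key and password below are invented for illustration.)

API_URL = "https://api.example.com/v1/reports"

def build_request_headers(token: str) -> dict:
    """Build auth headers for the reports API."""
    # TODO: remove before commit – tested with live key sk_live_9f8a7b6c5d4e
    # (staging DB password is Hunter2! if the key stops working)
    return {"Authorization": f"Bearer {token}"}
```

Pushed to a public repository, those two comment lines hand an attacker a working key and a password – even though no secret ever appears in the code that actually runs.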

Comments with exposed data are not a huge issue when code is produced by internal teams and kept entirely in-house. That may well be why coders have such a relaxed attitude to what they stuff into code comments.

The thing is, today, programming doesn’t happen in a silo. Applications are developed by global, offshore teams. Code may be unexpectedly posted in public repositories – and its comments go with it. Worse, private repositories may not be as secure as you think.

Confidential data exposed in GitHub – time and time again

Think the chances are slim that your confidential data will leak out through a GitHub exposure? You might be surprised. Take Rogers, a major Canadian telecoms provider. In January 2020, a security researcher discovered two public-facing GitHub accounts containing application source code that revealed private keys belonging to Rogers Communications.

The suspect? A developer who had long since left the Canadian company. The researcher, Jason Coulls, made some recommendations in his report, but the fact remains that this sort of leak can be pretty hard to prevent.

Also beware of relying on a private GitHub account for security. GitHub accounts can be compromised, as illustrated by a March 2020 intrusion into a Microsoft employee’s GitHub account, which exposed several of Microsoft’s private GitHub repositories. As a consequence, any sloppy code in those repositories was also exposed – including comments containing private data.

There’s also the fact that GitHub has vulnerabilities, just like any other tool. In December 2019 alone, the platform announced that it was plugging nine security vulnerabilities. Clearly, you’ve got to be careful with what goes into your GitHub library.

The problem is bigger than it may appear

You might wave the issue off and say that it’s unlikely your code will contain revealing comments. After all, your team has top-notch programmers who take security seriously, right? That may be true, but everyone slips up sometime.

It becomes significantly more difficult to ensure end-to-end control over code quality and security when you deploy a remote team. We think that’s one of the biggest reasons why sensitive information in GitHub is such a commonplace problem.

Don’t just take our word for it. A 2020 analysis by Unit 42, the threat research unit at Palo Alto Networks, uncovered a whole range of private information across the 24,000 public repositories it analyzed: 4,109 config files, 2,464 API keys and 2,328 hard-coded passwords.

It’s worth reading the full report here – even if just for amusement value, given the clear use of common, guessable passwords. But there is a lesson in there: think your code is watertight? Think again – everyone slips up.

GitHub: addressing the problem, but only partially

To be fair, in May 2020 GitHub revealed a security feature to try to tackle the issue. In theory, GitHub’s new tool – an updated version of its token scanning tool – automatically scans private and public code repositories for a limited set of sensitive data. But there are severe limitations on what is scanned and how.

There is no comprehensive, reliable functionality in GitHub that will alert you to the exposure of sensitive data in places like comments, simply because it is very tough to consistently identify sensitive data. Furthermore, merely using a private GitHub repository doesn’t guarantee your safety either – as we illustrated.

It’s up to you and your team

In other words, there’s no easy way out. You must publish code that’s free of sensitive information, and you’ve got to do it the hard way. Here are four key steps:

  • Trusted coders. You need to hire people you can trust to do the right thing day in, day out. It comes down to selecting the right coders, but also to paying them well and maintaining a good relationship with your team.

  • Don’t pressure your coders. It’s all-round good advice: never put programming team members under so much pressure that they start taking shortcuts. If the choice is between sloppy code and getting fired, chances are a coder will give you sloppy code.

  • Training. Some coders might not see the harm in including sensitive information in code. Point them to training materials – CWE-540 is a good starting point if you want your coders to take you seriously.

  • Code review. Time-consuming as it may be, there is no replacement for a manual code review. Verify that neither code nor comments contain sensitive information that shouldn’t be there, and do it every single time you publish.
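As a supplement to manual review – not a replacement – a simple automated scan can flag the most obvious offenders before code is pushed. The sketch below is illustrative only: it assumes Python-style `#` comments, and its handful of patterns is an invented sample, nowhere near as thorough as a dedicated secret scanner.

```python
import re

# Illustrative patterns only – real scanners use far larger rule sets.
SECRET_PATTERNS = [
    re.compile(r"(?i)(password|passwd|pwd)\s*[:=]\s*\S+"),
    re.compile(r"(?i)(api[_-]?key|secret|token)\s*[:=]\s*['\"]?[A-Za-z0-9_\-]{8,}"),
    re.compile(r"AKIA[0-9A-Z]{16}"),  # shape of an AWS access key ID
]

def scan_comments(source: str) -> list:
    """Return (line_number, line) pairs whose comment text matches a secret pattern."""
    findings = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        if "#" not in line:           # only inspect comment text
            continue
        comment = line.split("#", 1)[1]
        if any(p.search(comment) for p in SECRET_PATTERNS):
            findings.append((lineno, line.strip()))
    return findings
```

Wired into a pre-commit hook, a check like this rejects commits whose comments match a pattern. It will miss anything unusual – which is exactly why the manual review step above still matters.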

So, there you have it. Sensitive information in code and comments may not be as secure as you think. Take it out of existing code, train your teams to code without exposing sensitive data – and do a code review every time. Never rely on GitHub security alone for protection.

Polymer is a human-centric data loss prevention (DLP) platform that holistically reduces the risk of data exposure in your SaaS apps and AI tools. In addition to automatically detecting and remediating violations, Polymer coaches your employees to become better data stewards. Try Polymer for free.

