Handle with Care — Using Data Responsibly

Jonathan Geggatt
Published in Build Tonight
7 min read · Aug 15, 2018

We’ve reached an inflection point in our society’s ability to use data intelligently, quickly, and in greater depth. With advances in cloud computing powering an army of data scientists wielding Jupyter notebooks and ensemble models, any hypothesis can be tested and refined faster than you can say “geospatial index”.

With great data processing power, however, comes great responsibility. It’s no longer enough to rely on on-premise databases and VPNs to handle data security, nor is it sufficient to rely solely on a code of conduct to ensure your own employees aren’t using your data in nefarious ways. When you collect information about the way a customer interacts with your app, you owe it to them to safeguard their data and prevent it from getting into the wrong hands.

The consequences of getting this wrong are dire. Whether it’s data breaches that expose the most sensitive data you’re capturing, platform exploits that allow bad actors to harvest customer information, or lawsuits stemming from legislative changes, the threats to businesses are pervasive and ever-changing. Getting ahead of security and enabling a culture of responsibility are crucial to succeeding in today’s data environment.

The Problem

Data-driven companies have always had to battle the trade-off between privacy and usefulness. Classically, the scales have tipped overwhelmingly towards usefulness — store the data first, and worry about security later (if ever). However, spurred by the European Union passing stringent data protection laws in the form of GDPR (General Data Protection Regulation), companies have had to get serious about the systems and processes that hold this critical user data.

At HotelTonight, we believe that we can build an app that serves our customers, without compromising on privacy.

We’ve been finding ways to surface information to our employees and foster a data-driven culture since day one. In the beginning, this mostly took the form of custom tools built into our extranet (we call it HTx), which serves as the main hub for employees and hotels to manage everything from our market structure, to the rates and allotments that hotels are loading. Along the way, we built tools to allow our Supply Team to monitor the performance of their markets and hotels, and the Customer Experience team to gather information about how the customer that they’re engaging with has used our app.

As typically happens when reporting is mixed into the same systems that power a customer-facing app, this often added unnecessary (aka avoidable) pressure on our database and servers, and some sleepless nights for our platform team. We knew we had to reduce that load on our monolithic “main app”, so we added a data warehouse layer in order to power both regular and ad-hoc reporting and analysis. Over time, we’ve moved several critical reports onto business intelligence tools that query against our warehouse, but many operational processes still rely on our employees accessing and interacting with our extranet.

This hybrid architecture (extranet plus data warehouse) meant we had to approach data security on two separate fronts. Regardless of the attack vector, however, the mission was clear: give our employees the ability to understand how our customers use our app, while only exposing the personal details of a customer to employees who need that information to do their job.

Securing the Extranet

We started with the extranet, since it had far more use cases to address than our data warehouse. Even though there was already a system in place to manage roles and access levels, it was obvious that we were going to need finer-grained controls in order to meet our privacy goals. By analyzing usage patterns in our logs and event data, we were able to refine the role definitions so that they covered all of the various use cases, and to identify the people who should be included in each role.
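As a rough illustration of that log analysis, here is a minimal sketch that groups page visits by team to see which pages each candidate role really needs. The log format, team names, and page names are all hypothetical and are not taken from our actual extranet.

```python
from collections import defaultdict

# Hypothetical access-log records: (employee, team, extranet page visited).
ACCESS_LOG = [
    ("ana", "supply", "market_performance"),
    ("ana", "supply", "hotel_rates"),
    ("ben", "supply", "market_performance"),
    ("cho", "customer_experience", "customer_page"),
    ("cho", "customer_experience", "hotel_reviews"),
    ("dev", "marketing", "customer_page"),
]


def pages_by_team(log):
    """Collect the set of pages that each team actually visits."""
    usage = defaultdict(set)
    for _employee, team, page in log:
        usage[team].add(page)
    return dict(usage)


def members_by_team(log):
    """Collect the set of employees observed in each team."""
    members = defaultdict(set)
    for employee, team, _page in log:
        members[team].add(employee)
    return dict(members)


if __name__ == "__main__":
    for team, pages in sorted(pages_by_team(ACCESS_LOG).items()):
        print(team, sorted(pages))
```

The output is only a starting point for a role definition: pages a team never visits are candidates for removal from that role, and each removal still needs a conversation with the team.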

The next step was to identify all of the places that personally identifiable information (PII) could show up in the extranet. Some of these were obvious, like the customer page, where customer info is presented along with their booking history. Others, however, were more ambiguous — for instance, the hotel reviews page, where we can see how people rated their stay at a hotel and any comments they may have left.

Furthermore, there were several cases where a role should be able to access a page in the extranet, but specific parts of that page should still be hidden. For example, our marketing email team needs to be able to view a customer’s email address, but they don’t need to see details about payment methods or transaction history. This type of conditional access required us to build new security primitives into our access control framework, allowing us to be super granular about every piece of data we show on a screen.

Once we had identified all the places where customer information was presented, we went through the effort of hiding that data, and then granting it back on a conditional basis (if the user was associated with a role that needed to access that data).
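The hide-first, grant-back approach can be sketched as a deny-by-default filter applied to each record before it is rendered. This is a simplified illustration, not our actual access control framework; the role names, field names, and the `[hidden]` placeholder are assumptions made for the example.

```python
# Hypothetical mapping of role -> customer fields that the role may see.
ROLE_FIELDS = {
    "marketing_email": {"email"},
    "customer_experience": {"name", "email", "booking_history"},
}

HIDDEN = "[hidden]"  # placeholder shown instead of a value the user may not see


def visible_record(record, roles):
    """Return a copy of record with every field hidden unless a role grants it."""
    allowed = set()
    for role in roles:
        # Unknown roles grant nothing, so the default is always "hidden".
        allowed |= ROLE_FIELDS.get(role, set())
    return {
        field: (value if field in allowed else HIDDEN)
        for field, value in record.items()
    }


if __name__ == "__main__":
    customer = {
        "name": "Pat Example",
        "email": "pat@example.com",
        "payment_method": "card ending 1111",
        "booking_history": ["Hotel Example, 2 nights"],
    }
    print(visible_record(customer, ["marketing_email"]))
```

Because the filter starts from an empty set of allowed fields, a newly added field stays hidden until someone explicitly grants it to a role.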

Rolling this out required extensive communication throughout the company, since we were aiming not to disrupt any of the day-to-day processes that our employees needed to run the business. When we got it wrong, we quickly granted the necessary access back to the employee and used that information to hone our roles and their access points.

The end result is added complexity in our extranet (product managers and engineers now have to consider the implications of data security every time they make a change), but that is a necessary reality in the world today.

Securing the Warehouse

Solving the user security problem in the data infrastructure layer was more straightforward than the extranet, mostly due to the fact that very few critical processes rely on pulling PII from the warehouse. In addition, we benefited from a thoughtful data model that has always been a focus of our data team — hiding a piece of PII like a customer’s email address meant we had to hide it only in our “customers” table, and it would propagate from there throughout the rest of our data warehouse.
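To illustrate why masking in one place is enough, the sketch below uses Python's built-in `sqlite3` module as a stand-in for the warehouse. The table and view names (`customers_raw`, `bookings`, `booking_report`) are hypothetical; the point is only that downstream objects which read from a single `customers` relation inherit whatever masking is applied there.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customers_raw (id INTEGER PRIMARY KEY, email TEXT, city TEXT);
    CREATE TABLE bookings (id INTEGER PRIMARY KEY, customer_id INTEGER, hotel TEXT);
    INSERT INTO customers_raw VALUES (1, 'pat@example.com', 'Austin');
    INSERT INTO bookings VALUES (10, 1, 'Hotel Example');

    -- The only place where the email address is masked.
    CREATE VIEW customers AS
        SELECT id, NULL AS email, city FROM customers_raw;

    -- A downstream report that reads from the view, never from the raw table.
    CREATE VIEW booking_report AS
        SELECT b.id AS booking_id, b.hotel, c.email, c.city
        FROM bookings AS b
        JOIN customers AS c ON c.id = b.customer_id;
    """
)

rows = conn.execute(
    "SELECT booking_id, hotel, email, city FROM booking_report"
).fetchall()

if __name__ == "__main__":
    print(rows)  # the email column comes back as None (SQL NULL)
```

This only works if nothing downstream bypasses the `customers` relation, which is why a consistent data model matters more than the masking mechanism itself.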

Our data infrastructure consists of a data warehouse (powered by Snowflake), with two BI tools sitting on top (Looker and Tableau) that all employees have access to. The vast majority of employee data access happens through the BI tools, whether they’re viewing pre-built dashboards and reports in Tableau, or doing their own exploratory analysis in Looker. For analysts and a handful of employees that possess technical skills, we allow access to query against Snowflake directly.

This means that for 95% of our data access, we could protect user data by simply preventing any PII from ever making its way into the BI tools. For the few processes that did require PII coming from Looker, we moved those to custom queries running directly in Snowflake.

To control PII access in Snowflake, we came up with a process that identifies PII columns across our data warehouse and the people that need access to them. Having a list of problem areas is a good start, but actually implementing a solution is much more challenging. How do you control, at a table/row/column level, the data that someone is able to view?

We came up with an innovative solution that leverages Snowflake’s concept of “schemas” and capitalizes on efforts we’d made to funnel all of our BI tools and analysts into a central “public” schema. We defined an “insecure” version of the public schema (where PII is obfuscated) and a “secure” version, and then built a process to populate and mask data in those schemas every day. The process first checks against the global security levels and roles tables, and then creates views over the data with all necessary fields either visible or hidden, for each user. To someone querying against the public schema, it just works — you’re able to see the values in a column if you have access to it, otherwise you just see NULL. If we ever determine that we got the access levels wrong, or need to hide a new piece of data, we simply need to update some configuration tables and re-run the security process, and the data is covered.
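A heavily simplified sketch of that view-building step is shown below. It again uses `sqlite3` in place of Snowflake, uses role-prefixed view names in place of separate schemas, and keeps the configuration in Python dictionaries rather than tables. All names here are hypothetical and the real process differs in detail.

```python
import sqlite3

# Hypothetical configuration: which columns are PII, and which roles may see them.
PII_COLUMNS = {("customers", "email"), ("customers", "phone")}
ROLE_GRANTS = {
    "analyst": set(),
    "support": {("customers", "email")},
}


def view_sql(table, columns, role):
    """Build a CREATE VIEW statement that NULLs out PII the role cannot see."""
    granted = ROLE_GRANTS.get(role, set())
    select_list = []
    for col in columns:
        key = (table, col)
        if key in PII_COLUMNS and key not in granted:
            select_list.append(f"NULL AS {col}")
        else:
            select_list.append(col)
    # Identifiers come from trusted configuration, never from user input.
    return f"CREATE VIEW {role}_{table} AS SELECT {', '.join(select_list)} FROM {table}"


def rebuild_views(conn, table, columns):
    """Drop and recreate one view per role; safe to re-run after config changes."""
    for role in ROLE_GRANTS:
        conn.execute(f"DROP VIEW IF EXISTS {role}_{table}")
        conn.execute(view_sql(table, columns, role))


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, email TEXT, phone TEXT, city TEXT)")
conn.execute("INSERT INTO customers VALUES (1, 'pat@example.com', '555-0100', 'Austin')")
COLUMNS = ["id", "email", "phone", "city"]
rebuild_views(conn, "customers", COLUMNS)

analyst_row = conn.execute("SELECT * FROM analyst_customers").fetchone()
support_row = conn.execute("SELECT * FROM support_customers").fetchone()

if __name__ == "__main__":
    print(analyst_row)
    print(support_row)
```

Revoking access is then a configuration change followed by another call to `rebuild_views`, which mirrors the "update some configuration tables and re-run" workflow described above.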

Securing the Future

As HotelTonight continues to grow and find new ways to understand and optimize the user experience, it’s imperative that we continue to monitor and evaluate these processes that we’ve put in place. That means auditing attack vectors, ensuring that employees are properly onboarded and offboarded, and treating new data/API/SDK integrations with a skeptical eye regarding their usage of customer data. We’ve begun adding data processing clauses to our vendor agreements, and also scrubbing unnecessary PII out of the data that we send to external partners.
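Scrubbing outbound data can be as simple as an allow-list applied just before a payload leaves our systems. The sketch below is illustrative only; the field names and the idea of a single partner allow-list are assumptions made for the example.

```python
# Hypothetical allow-list of fields that a partner integration actually needs.
PARTNER_ALLOWED_FIELDS = frozenset({"booking_id", "hotel_id", "check_in", "nights"})


def scrub_for_partner(event, allowed=PARTNER_ALLOWED_FIELDS):
    """Return a copy of event containing only the explicitly allowed fields."""
    return {key: value for key, value in event.items() if key in allowed}


if __name__ == "__main__":
    event = {
        "booking_id": 10,
        "hotel_id": 7,
        "check_in": "2018-08-15",
        "nights": 2,
        "email": "pat@example.com",
        "phone": "555-0100",
    }
    print(scrub_for_partner(event))
```

An allow-list is a deliberate choice over a block-list: a new PII field added upstream is dropped by default instead of leaking until someone remembers to block it.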

To that end, it’s important to foster a culture of respect and boundaries when it comes to user data. Keeping user data private has to be a first-order priority, not an afterthought that gets tacked on at the end of a project. Getting this message across to our employees and getting our systems implemented has been a long process, but we are committed to doing everything we can to keep our users’ data private and safe.
