Wednesday, January 8, 2025

Ingest data from Google Analytics 4 and Google Sheets to Amazon Redshift using Amazon AppFlow | Amazon Web Services


Google Analytics 4 (GA4) provides valuable insights into user behavior across websites and apps. But what if you need to combine GA4 data with other sources or perform deeper analysis? That’s where Amazon Redshift and Amazon AppFlow come in. Amazon AppFlow bridges the gap between Google applications and Amazon Redshift, empowering organizations to unlock deeper insights and drive data-informed decisions. In this post, we show you how to establish the data ingestion pipeline between Google Analytics 4, Google Sheets, and an Amazon Redshift Serverless workgroup.

Amazon AppFlow is a fully managed integration service that you can use to securely transfer data from software as a service (SaaS) applications, such as Google BigQuery, Salesforce, SAP, HubSpot, and ServiceNow, to Amazon Web Services (AWS) services such as Amazon Simple Storage Service (Amazon S3) and Amazon Redshift, in just a few clicks. With Amazon AppFlow, you can run data flows at nearly any scale and at the frequency you choose—on a schedule, in response to a business event, or on demand. You can configure data transformation capabilities such as filtering and validation to generate rich, ready-to-use data as part of the flow itself, without additional steps. Amazon AppFlow automatically encrypts data in motion, and allows you to restrict data from flowing over the public internet for SaaS applications that are integrated with AWS PrivateLink, reducing exposure to security threats.

Amazon Redshift is a fast, scalable, and fully managed cloud data warehouse that allows you to process and run your complex SQL analytics workloads on structured and semi-structured data. It also helps you securely access your data in operational databases, data lakes, or third-party datasets with minimal movement or copying of data. Tens of thousands of customers use Amazon Redshift to process large amounts of data, modernize their data analytics workloads, and provide insights for their business users.

Prerequisites

Before starting this walkthrough, you need to have the following prerequisites in place:

  • An AWS account.
  • In your Google Cloud project, you’ve enabled the following APIs:
    • Google Analytics API
    • Google Analytics Admin API
    • Google Analytics Data API
    • Google Sheets API
    • Google Drive API

For more information, refer to Amazon AppFlow support for Google Sheets.

For the steps to enable these APIs, see Enable and disable APIs on the API Console Help for Google Cloud Platform.

Architecture overview

The following architecture shows how Amazon AppFlow can transform and move data from SaaS applications to processing and storage destinations. Three sections appear from left to right in the diagram: Source, Move, and Destination. These sections are described in the following list.

  • Source – The leftmost section on the diagram represents different applications acting as a source, including Google Analytics, Google Sheets, and Google BigQuery.
  • Move – The middle section is labeled Amazon AppFlow. The section contains boxes that represent Amazon AppFlow operations such as Mask Fields, Map Fields, Merge Fields, Filter Data, and others. In this post, we focus on setting up the data movement using Amazon AppFlow and filtering data based on start date. The other transformation operations such as mapping, masking, and merging fields are not covered in this post.
  • Destination – The section on the right of the diagram is labeled Destination and represents targets such as Amazon Redshift and Amazon S3. In this post, we primarily focus on Amazon Redshift as the destination.

This post has two parts. The first part covers integrating data from Google Analytics 4. The second part focuses on connecting with Google Sheets.

Application configuration in Google Cloud Platform

Amazon AppFlow requires OAuth 2.0 for authentication. You need to create an OAuth 2.0 client ID, which Amazon AppFlow uses when requesting an OAuth 2.0 access token. To create an OAuth 2.0 client ID in the Google Cloud Platform console, follow these steps:

  1. On the Google Cloud Platform Console, from the projects list, select a project or create a new one.
  2. If the APIs & Services page isn’t already open, choose the menu icon on the upper left and select APIs & Services.
  3. In the navigation pane, choose Credentials.
  4. Choose CREATE CREDENTIALS, then choose OAuth client ID, as shown in the following screenshot.

  5. Select the application type Web application and enter the name demo-google-aws. For Authorized JavaScript origins, add https://console.aws.amazon.com. For Authorized redirect URIs, add https://us-east-1.console.aws.amazon.com/appflow/oauth. Choose SAVE, as shown in the following screenshot.

  6. The OAuth client ID is now created. Select demo-google-aws.

  7. Under Additional information, as shown in the following screenshot, note down the Client ID and Client secret.

Data ingestion from Google Analytics 4 to Amazon Redshift

In this section, you configure Amazon AppFlow to set up a connection between Google Analytics 4 and Amazon Redshift for data migration. This procedure can be classified into the following steps:

  1. Create a connection to Google Analytics 4 in Amazon AppFlow
  2. Create an IAM role for Amazon AppFlow integration with Amazon Redshift
  3. Set up Amazon AppFlow connection for Amazon Redshift
  4. Set up table and permission in Amazon Redshift
  5. Create data flow in Amazon AppFlow

Create a connection to Google Analytics 4 in Amazon AppFlow

To create a connection to Google Analytics 4 in Amazon AppFlow, follow these steps:

  1. Sign in to the AWS Management Console and open Amazon AppFlow.
  2. In the navigation pane on the left, choose Connections.
  3. On the Manage connections page, for Connectors, choose Google Analytics 4.
  4. Choose Create connection.
  5. In the Connect to Google Analytics 4 window, enter the following information: for Client ID, enter the client ID of the OAuth 2.0 client you created in the previous section, and for Client secret, enter the corresponding client secret.
  6. (Optional) Under Data encryption, choose Customize encryption settings (advanced) if you want to encrypt your data with a customer managed key in AWS Key Management Service (AWS KMS). By default, Amazon AppFlow encrypts your data with an AWS KMS key that AWS creates, uses, and manages for you. Choose this option if you want to encrypt your data with your own AWS KMS key instead.

The following screenshot shows the Connect to Google Analytics 4 window.

Amazon AppFlow encrypts your data during transit and at rest. For more information, see Data protection in Amazon AppFlow.

If you want to use an AWS KMS key from the current AWS account, select this key under Choose an AWS KMS key. If you want to use an AWS KMS key from a different AWS account, enter the Amazon Resource Name (ARN) for that key:

  7. For Connection name, enter a name for your connection.
  8. Choose Continue.
  9. In the window that appears, sign in to your Google account and grant access to Amazon AppFlow.

On the Manage connections page, your new connection appears in the Connections table. When you create a flow that uses Google Analytics 4 as the data source, you can select this connection.
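If you prefer to verify the connection programmatically, the following is a minimal sketch using boto3 (the AWS SDK for Python). It assumes your AWS credentials and Region are already configured; it simply lists the Amazon AppFlow connections in the account so you can confirm the new Google Analytics 4 connection appears.

# Minimal sketch, assuming boto3 is installed and AWS credentials/Region are configured.
# Lists the Amazon AppFlow connections in the account so the new one can be confirmed.
import boto3

appflow = boto3.client("appflow")

response = appflow.describe_connector_profiles()
for profile in response.get("connectorProfileDetails", []):
    print(profile["connectorProfileName"], profile["connectorType"])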

Create an IAM role for Amazon AppFlow integration with Amazon Redshift

You can use Amazon AppFlow to transfer data from supported sources into your Amazon Redshift databases. You need an IAM role because Amazon AppFlow requires authorization to access Amazon Redshift using the Amazon Redshift Data API.

  1. Sign in to the AWS Management Console, preferably as an admin user, and in the navigation pane of the IAM dashboard, choose Policies.
  2. Choose Create policy.
  3. Select the JSON tab and paste in the following policy. Amazon AppFlow needs the following permissions to access and run SQL statements against the Amazon Redshift database.
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DataAPIPermissions",
      "Effect": "Allow",
      "Action": [
        "redshift-data:ExecuteStatement",
        "redshift-data:GetStatementResult",
        "redshift-data:DescribeStatement"
      ],
      "Resource": "*"
    },
    {
      "Sid": "GetCredentialsForAPIUser",
      "Effect": "Allow",
      "Action": "redshift:GetClusterCredentials",
      "Resource": [
        "arn:aws:redshift:*:*:dbname:*/*",
        "arn:aws:redshift:*:*:dbuser:*/*"
      ]
    },
    {
      "Sid": "GetCredentialsForServerless",
      "Effect": "Allow",
      "Action": "redshift-serverless:GetCredentials",
      "Resource": "*"
    },
    {
      "Sid": "DenyCreateAPIUser",
      "Effect": "Deny",
      "Action": "redshift:CreateClusterUser",
      "Resource": [
        "arn:aws:redshift:*:*:dbuser:*/*"
      ]
    },
    {
      "Sid": "ServiceLinkedRole",
      "Effect": "Allow",
      "Action": "iam:CreateServiceLinkedRole",
      "Resource": "arn:aws:iam::*:role/aws-service-role/redshift-data.amazonaws.com/AWSServiceRoleForRedshift",
      "Condition": {
        "StringLike": {
          "iam:AWSServiceName": "redshift-data.amazonaws.com"
        }
      }
    }
  ]
}

  4. Choose Next, enter the Policy name appflow-redshift-policy and the Description appflow redshift policy, and choose Create policy.

  5. In the navigation pane, choose Roles and then Create role. Choose Custom trust policy, paste in the following policy, and choose Next. This trust policy allows Amazon AppFlow to assume the role so it can access and process data.
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "appflow.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}

  6. Search for policy appflow-redshift-policy, check the box next to it, and choose Next.

  7. Enter the role name appflow-redshift-access-role, provide a Description, and choose Create role.
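If you script your infrastructure, the same policy and role can be created with boto3. The following is a sketch, not the only way to do it; it assumes the permissions policy shown above has been saved locally as appflow-redshift-policy.json and that your credentials can create IAM policies and roles.

import json
import boto3

iam = boto3.client("iam")

# Permissions policy shown above, saved locally as appflow-redshift-policy.json (assumption).
with open("appflow-redshift-policy.json") as f:
    policy_document = f.read()

policy = iam.create_policy(
    PolicyName="appflow-redshift-policy",
    Description="appflow redshift policy",
    PolicyDocument=policy_document,
)

# Trust policy that lets Amazon AppFlow assume the role.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "appflow.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }
    ],
}

iam.create_role(
    RoleName="appflow-redshift-access-role",
    AssumeRolePolicyDocument=json.dumps(trust_policy),
    Description="Role assumed by Amazon AppFlow to call the Amazon Redshift Data API",
)

iam.attach_role_policy(
    RoleName="appflow-redshift-access-role",
    PolicyArn=policy["Policy"]["Arn"],
)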

Set up Amazon AppFlow connection for Amazon Redshift

To set up an Amazon AppFlow connection for Amazon Redshift, follow these steps:

  1. On the Amazon AppFlow console, in the navigation pane, choose Connectors, select Amazon Redshift, and choose Create connection.

  2. Enter the connection name appflow-redshift-connection. You can use either an Amazon Redshift provisioned cluster or Amazon Redshift Serverless; in this example we use Amazon Redshift Serverless. Select Amazon Redshift Serverless and enter the workgroup name and database name.
  3. Choose the S3 bucket and enter the bucket prefix.

  4. For Amazon S3 access, select the IAM role that was attached to the Redshift cluster or Serverless namespace when it was created. For the Amazon Redshift Data API, choose the IAM role appflow-redshift-access-role created in the previous section, and then choose Connect.

Set up a table and permission in Amazon Redshift

To set up a table and permission in Amazon Redshift, follow these steps:

  1. On the Amazon Redshift console, choose Query editor v2 in Explorer.
  2. Connect to your existing Redshift cluster or Amazon Redshift Serverless workgroup.
  3. Create a table with the following Data Definition Language (DDL).
create table public.stg_ga4_daily_summary (
    event_date date,
    region varchar(255),
    country varchar(255),
    city varchar(255),
    deviceCategory varchar(255),
    deviceModel varchar(255),
    browser varchar(255),
    active_users integer,
    new_users integer,
    total_revenue numeric(18,2)
);

The following screenshot shows the successful creation of this table in Amazon Redshift:

The following step is only applicable to Amazon Redshift Serverless. If you are using a Redshift provisioned cluster, you can skip this step.

  4. Grant permissions on the table to the database user that Amazon AppFlow uses to load data into Amazon Redshift Serverless, which maps to the IAM role, for example, IAMR:appflow-redshift-access-role.
GRANT INSERT ON TABLE public.stg_ga4_daily_summary TO "IAMR:appflow-redshift-access-role";
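If you would rather not use Query editor v2, the DDL and GRANT can also be submitted through the Amazon Redshift Data API. The following is a minimal boto3 sketch; the workgroup name, database name, and SQL file name are placeholders for your own values.

import time
import boto3

redshift_data = boto3.client("redshift-data")

statements = [
    open("stg_ga4_daily_summary.sql").read(),  # the CREATE TABLE statement above (placeholder file name)
    'GRANT INSERT ON TABLE public.stg_ga4_daily_summary TO "IAMR:appflow-redshift-access-role";',
]

for sql in statements:
    resp = redshift_data.execute_statement(
        WorkgroupName="my-serverless-workgroup",  # placeholder workgroup name
        Database="dev",                           # placeholder database name
        Sql=sql,
    )
    # Poll until the statement reaches a terminal state.
    status = None
    while status not in ("FINISHED", "FAILED", "ABORTED"):
        status = redshift_data.describe_statement(Id=resp["Id"])["Status"]
        time.sleep(1)
    print(status, "-", sql.split()[0])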

Create data flow in Amazon AppFlow

To create a data flow in Amazon AppFlow, follow these steps:

  1. On the Amazon AppFlow console, choose Flows and select Amazon Redshift. Choose Create flow and enter the flow name and the flow description, as shown in the following screenshot.

  2. In Source name, choose Google Analytics 4. Choose the Google Analytics 4 connection.
  3. Select the Google Analytics 4 object, then choose Amazon Redshift as the destination, selecting the public schema and stg_ga4_daily_summary table in your Redshift instance.

  4. For Flow trigger, choose Run on demand and choose Next, as shown in the following screenshot.

You can run the flow on a schedule to pull either a full or an incremental data refresh. For more information, see Schedule-triggered flows.
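For reference, the following is a sketch of the trigger configuration you would pass to the AppFlow CreateFlow or UpdateFlow API to switch the flow from on demand to a daily incremental schedule. The field names follow the boto3 request shape; the schedule expression format is an assumption, so verify it against the Amazon AppFlow documentation before using it.

from datetime import datetime, timedelta, timezone

# Sketch of a Scheduled triggerConfig for create_flow/update_flow (boto3 appflow client).
scheduled_trigger_config = {
    "triggerType": "Scheduled",
    "triggerProperties": {
        "Scheduled": {
            "scheduleExpression": "rate(1days)",  # assumed format; confirm in the AppFlow docs
            "dataPullMode": "Incremental",        # or "Complete" for a full refresh
            "scheduleStartTime": datetime.now(timezone.utc) + timedelta(hours=1),
            "timezone": "UTC",
        }
    },
}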

  5. Select Manually map fields. From the Source field name dropdown menu, select the attribute date, and from the Destination field name, select event_date and choose Map fields, as shown in the following screenshot.

  6. Repeat the previous step (step 5) for the following attributes and then choose Next. The following screenshot shows the mapping.
Dimension:browser --> browser
Dimension:region --> region
Dimension:country --> country
Dimension:city --> city
Dimension:deviceCategory --> devicecategory
Dimension:deviceModel --> devicemodel
Metric:activeUsers --> active_users
Metric:newUsers --> new_users
Metric:totalRevenue --> total_revenue
Dimension:date --> event_date

The Google Analytics API provides various dimensions and metrics for reporting purposes. Refer to API Dimensions & Metrics for details.

  7. In Field name, enter the filter start_end_date and choose Next, as shown in the following screenshot. The Amazon AppFlow date filter supports both a start date (criteria1) and an end date (criteria2) to define the desired date range for the data transfer. We use this date range because the sample data was created for it.

  8. Review the configurations and choose Create flow.
  9. Choose Run flow, as shown in the following screenshot, and wait for the flow run to complete. (A programmatic way to start and monitor the flow is sketched at the end of this section.)

  10. On the Amazon Redshift console, choose Query editor v2 in Explorer.
  11. Connect to your existing Redshift cluster or Amazon Redshift Serverless workgroup.
  12. Enter the following SQL to verify the data in Amazon Redshift.
select * from public.stg_ga4_daily_summary;

The screenshot below shows the results loaded into the stg_ga4_daily_summary table.
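As mentioned in the flow-run step above, you can also start and monitor the flow programmatically. The following is a minimal boto3 sketch; ga4-to-redshift-flow is a placeholder for the flow name you chose, and it assumes the most recent execution is returned first by describe_flow_execution_records.

import time
import boto3

appflow = boto3.client("appflow")

FLOW_NAME = "ga4-to-redshift-flow"  # placeholder for your flow name

start = appflow.start_flow(flowName=FLOW_NAME)
print("Started execution:", start.get("executionId"))

# Poll the most recent execution record until it reaches a terminal state.
while True:
    records = appflow.describe_flow_execution_records(flowName=FLOW_NAME, maxResults=1)
    latest = records["flowExecutions"][0]
    print("Status:", latest["executionStatus"])
    if latest["executionStatus"] in ("Successful", "Error", "Canceled"):
        break
    time.sleep(15)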

Data ingestion from Google Sheets to Amazon Redshift

Ingesting data from Google Sheets to Amazon Redshift using Amazon AppFlow streamlines analytics, enabling seamless transfer and deeper insights. In this section, we demonstrate how business users can maintain their business glossary in Google Sheets, integrate it with Amazon Redshift using Amazon AppFlow, and derive meaningful insights.

For this demo, you can upload the Nation Market segment file to your Google Sheet before proceeding to the next steps. These steps show how to configure Amazon AppFlow to set up a connection between Google Sheets and Amazon Redshift for data migration. This procedure can be classified into the following steps:

  1. Create Google Sheets connection in Amazon AppFlow
  2. Set up table and permission in Amazon Redshift
  3. Create data flow in Amazon AppFlow

Create Google Sheets connection in Amazon AppFlow

To create a Google Sheets connection in Amazon AppFlow, follow these steps:

  1. On the Amazon AppFlow console, choose Connectors, select Google Sheets, then choose Create connection.
  2. In the Connect to Google Sheets window, enter the following information: for Client ID, enter the client ID of the OAuth 2.0 client in your Google Cloud project, and for Client secret, enter the corresponding client secret.
  3. For Connection name, enter a name for your connection.
  4. (Optional) Under Data encryption, choose Customize encryption settings (advanced) if you want to encrypt your data with a customer managed key in AWS KMS. By default, Amazon AppFlow encrypts your data with an AWS KMS key that AWS creates, uses, and manages for you. Choose this option if you want to encrypt your data with your own AWS KMS key instead.
  5. Choose Connect.
  6. In the window that appears, sign in to your Google account and grant access to Amazon AppFlow.

Set up table and permission in Amazon Redshift

To set up a table and permission in Amazon Redshift, follow these steps:

  1. On the Amazon Redshift console, choose Query editor v2 in Explorer.
  2. Connect to your existing Redshift cluster or Amazon Redshift Serverless workgroup.
  3. Create a table with the following DDL.
create table public.stg_nation_market_segment(
n_nationkey int4 not null,
n_name char(25) not null ,
n_regionkey int4 not null,
n_comment varchar(152) not null,
n_marketsegment varchar(255),
Primary Key(N_NATIONKEY)
) distkey(n_nationkey) sortkey(n_nationkey);

The following step is only applicable to Amazon Redshift Serverless. If you are using a Redshift provisioned cluster, you can skip it.

  4. Grant permissions on the table to the database user that Amazon AppFlow uses to load data into Amazon Redshift Serverless, which maps to the IAM role, for example, IAMR:appflow-redshift-access-role.
GRANT INSERT ON TABLE public.stg_nation_market_segment TO "IAMR:appflow-redshift-access-role";

Create data flow in Amazon AppFlow

  1. On the Amazon AppFlow console, choose Flows and select Google Sheets. Choose Create flow, enter the flow name and flow description, and choose Next.
  2. Select Google Sheets in Source name and choose the Google Sheets connection.
  3. Select the Google Sheets object nation_market_segment#Sheet1.
  4. For Destination name, choose Amazon Redshift, then select stg_nation_market_segment as your Amazon Redshift object, as shown in the following screenshot.

  5. For Flow trigger, select On demand and choose Next.

You can run the flow on a schedule to pull a full or incremental data refresh. For more information, see Schedule-triggered flows.

  6. Select Manually map fields. From the Source field name dropdown menu, select Map all fields directly. In the dialog box that appears, choose the respective attribute values and choose Map fields, as shown in the following screenshot. Choose Next.

The following screenshot shows the mapping.

  7. On the Add Filters page, choose Next.
  8. On the Review and create page, choose Create flow.
  9. Choose Run flow and wait for the flow execution to finish.

The screenshot below shows the execution details of the flow job.

  10. On the Amazon Redshift console, choose Query editor v2 in Explorer.
  11. Connect to your existing Redshift cluster or Amazon Redshift Serverless workgroup.
  12. Run the following SQL to verify the data in Amazon Redshift.
select * from public.stg_nation_market_segment;

The screenshot below shows the results loaded into the stg_nation_market_segment table.

  13. Run the following SQL to prepare a sample dataset in Amazon Redshift.
create table public.customer (
c_custkey int8 not null ,
c_name varchar(25) not null,
c_address varchar(40) not null,
c_nationkey int4 not null,
c_phone char(15) not null,
c_acctbal numeric(12,2) not null,
c_mktsegment char(10) not null,
c_comment varchar(117) not null,
Primary Key(C_CUSTKEY)
) distkey(c_custkey) sortkey(c_custkey);

create table public.lineitem (
l_orderkey int8 not null ,
l_partkey int8 not null,
l_suppkey int4 not null,
l_linenumber int4 not null,
l_quantity numeric(12,2) not null,
l_extendedprice numeric(12,2) not null,
l_discount numeric(12,2) not null,
l_tax numeric(12,2) not null,
l_returnflag char(1) not null,
l_linestatus char(1) not null,
l_shipdate date not null ,
l_commitdate date not null,
l_receiptdate date not null,
l_shipinstruct char(25) not null,
l_shipmode char(10) not null,
l_comment varchar(44) not null,
Primary Key(L_ORDERKEY, L_LINENUMBER)
) distkey(l_orderkey) sortkey(l_shipdate,l_orderkey)  ;

create table public.orders (
o_orderkey int8 not null,
o_custkey int8 not null,
o_orderstatus char(1) not null,
o_totalprice numeric(12,2) not null,
o_orderdate date not null,
o_orderpriority char(15) not null,
o_clerk char(15) not null,
o_shippriority int4 not null,
o_comment varchar(79) not null,
Primary Key(O_ORDERKEY)
) distkey(o_orderkey) sortkey(o_orderdate, o_orderkey) ;
copy lineitem from 's3://redshift-downloads/TPC-H/2.18/10GB/lineitem.tbl' iam_role default delimiter '|' region 'us-east-1';
copy orders from 's3://redshift-downloads/TPC-H/2.18/10GB/orders.tbl' iam_role default delimiter '|' region 'us-east-1';
copy customer from 's3://redshift-downloads/TPC-H/2.18/10GB/customer.tbl' iam_role default delimiter '|' region 'us-east-1';

  14. Run the following SQL to analyze the Amazon Redshift dataset using the business data classification maintained in Google Sheets.
select
n_marketsegment,
sum(l_extendedprice * (1 - l_discount)) as revenue
from
public.customer,
public.orders,
public.lineitem,
public.stg_nation_market_segment
where
c_custkey = o_custkey
and l_orderkey = o_orderkey
and c_nationkey = n_nationkey
group by
1
order by
revenue desc;

The screenshot below shows the results from the aggregated query in Amazon Redshift on data loaded using Amazon AppFlow.
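The same aggregation can be run outside Query editor v2 through the Amazon Redshift Data API. The following boto3 sketch submits the query and prints each market segment with its revenue; the workgroup and database names are placeholders.

import time
import boto3

redshift_data = boto3.client("redshift-data")

sql = """
select n_marketsegment,
       sum(l_extendedprice * (1 - l_discount)) as revenue
from public.customer, public.orders, public.lineitem, public.stg_nation_market_segment
where c_custkey = o_custkey
  and l_orderkey = o_orderkey
  and c_nationkey = n_nationkey
group by 1
order by revenue desc;
"""

resp = redshift_data.execute_statement(
    WorkgroupName="my-serverless-workgroup",  # placeholder workgroup name
    Database="dev",                           # placeholder database name
    Sql=sql,
)

# Wait for the query to reach a terminal state.
status = None
while status not in ("FINISHED", "FAILED", "ABORTED"):
    status = redshift_data.describe_statement(Id=resp["Id"])["Status"]
    time.sleep(1)

if status == "FINISHED":
    result = redshift_data.get_statement_result(Id=resp["Id"])
    for row in result["Records"]:
        # Numeric values typically come back as strings in the Data API result set.
        print(row[0].get("stringValue"), row[1].get("stringValue"))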

Clean up

To avoid incurring charges, clean up the resources in your AWS account by completing the following steps:

  1. On the Amazon AppFlow console, in the navigation pane, choose Flows.
  2. From the list of flows, select the flows you created and delete them.
  3. Enter delete to confirm the deletion.
  4. Delete the Amazon Redshift workgroup.
  5. Clean up resources in your Google account by deleting the project that contains the Google Analytics and Google Sheets resources used in this post. Follow the Google Cloud documentation to clean up the Google resources.

Conclusion

In this post, we walked you through the process of using Amazon AppFlow to integrate data from Google Analytics 4 and Google Sheets into Amazon Redshift. We demonstrated how the complexities of data integration are minimized so you can focus on deriving actionable insights from your data. Whether you're archiving historical data, performing complex analytics, or preparing data for machine learning, this connector streamlines the process, making it accessible to a broader range of data professionals.

For more information, refer to Amazon AppFlow support for Google Sheets and Google Analytics 4.


About the authors

Ritesh Kumar Sinha is an Analytics Specialist Solutions Architect based out of San Francisco. He has helped customers build scalable data warehousing and big data solutions for over 16 years. He loves to design and build efficient end-to-end solutions on AWS. In his spare time, he loves reading, walking, and doing yoga.

Tahir Aziz is an Analytics Solution Architect at AWS. He has worked with building data warehouses and big data solutions for over 13 years. He loves to help customers design end-to-end analytics solutions on AWS. Outside of work, he enjoys traveling and cooking.

Raza Hafeez is a Senior Product Manager at Amazon Redshift. He has over 13 years of professional experience building and optimizing enterprise data warehouses and is passionate about enabling customers to realize the power of their data. He specializes in migrating enterprise data warehouses to AWS Modern Data Architecture.

Amit Ghodke is an Analytics Specialist Solutions Architect based out of Austin. He has worked with databases, data warehouses and analytical applications for the past 16 years. He loves to help customers implement analytical solutions at scale to derive maximum business value.
