Frequently Asked Questions

Is the Reporting API secure?

All data stored in BigQuery is encrypted at rest and in transit. This means that when the data is stored on Google's servers, and when it is transmitted between those servers and the client, it is protected by strong encryption.

In addition, BigQuery has built-in access controls that allow you to restrict access to your data based on user roles and permissions. This means that we specify exactly who has access to your data and what actions they are allowed to perform on it.

BigQuery also supports authentication and authorization through standard mechanisms such as OAuth 2.0 and API keys.

Google's infrastructure is designed to protect against common threats such as denial of service attacks, data breaches, and unauthorized access. This is done using various security measures such as firewalls, intrusion detection systems, and regular security audits, and BigQuery API is designed with security in mind and employs a range of measures to ensure that your data is protected at all times.

How often is the data updated?

Daily. The refresh begins at midnight UTC+0 with an max update time of 12 hours. The update covers all data received up to midnight (UTC+0) of the previous day.

How far back does the data go?

All historical data approved for a given organization will be available.

If I am a Direct Access user, how do I connect?

Being in GCP already, it will be seamless. Simply login and attempt to query the relevant tables thru the UI or API.

Insights IAM UI

If I am a non-GCP API user, how do I connect?

CitrusAd will provide you with the relevant credentials in a JSON you can incorporate into your authentication mechanism.

Can I debug queries using the Reporting API only (not BigQuery UI)?

Yes, the BigQuery API will return a code indicating if there was an issue and error messages will also be available.

Given there is a quota in place for non-GCP API users, can I estimate how expensive a query will be?
Yes, the API has a mechanism to get a byte estimate of what the query will scan if executed. You can use that estimate multiplied by how often you call the query to understand how close you will get to the quota.

More info can be found in the GCP documentation.

Dry run query | BigQuery | Google Cloud

What if I exceed the quota limit?

Please refer to your agreement with CitrusAd to understand what your quota is. If not specifically defined, it will default to 10TB of query data scans per month.

Your agreement may also involve a maximum number of API calls per day. If that is not specifically defined, it will default to 100 API calls per day.

In the event that you do exceed your quota (either data scan or number of calls), we will be in touch to understand your use cases. Overage costs may apply depending on your agreement.

In the event of major misuse outside the terms of your agreement (or default limits), we reserve the right to suspend access.

What is an example of using the Reporting API?

Below are some examples using common methods.

Google SDK for Python

This example will:

  1. Connect to BigQuery
  2. Run the query
  3. Spool the result to a csv
import google.cloud.bigquery as bq
import pandas as pd
bq_client = bq.Client.from_service_account_json("<REPLACE>.json")
job_config = bq.QueryJobConfig(allow_large_results=True)
query_job = bq_client.query(
    'SELECT count(1) FROM insight-platform-external-iam.<REPLACE>_insight_reporting.campaign 
    LIMIT 1000', job_config=job_config)
df = query_job.to_dataframe(create_bqstorage_client=False)
df.to_csv(r"C:\Users\<REPLACE>\<REPLACE>.csv", index=False)
print("Run Complete")

Other methods are available for estimating bytes scanned etc before running the query.

Please refer to the BigQuery documentation

BigQuery API | Google Cloud

If not in GCP, can reference a JSON credential file via an environment variable.

Generic API for Python

import csv
import requests
from google.oauth2 import service_account

PROJECT_ID = "insight-platform-external-iam"
DATASET = "<YOUR DATASET HERE>"
END_POINT = f"https://bigquery.googleapis.com/bigquery/v2/projects/{PROJECT_ID}/queries"
QUERY = f"""
SELECT supplier_id, campaign_id, sum(ad_spend) as ad_spend, sum(clicks) as clicks
FROM `{PROJECT_ID}.{DATASET}.realised_ad_agg` 
WHERE ingressed_at BETWEEN '2022-09-01' and '2022-12-31'
group by 1,2
"""

def get_token():
    # With service account
    credentials = service_account.Credentials.from_service_account_file('./secrets/service-account.json')
    scoped_credentials = credentials.with_scopes(['https://www.googleapis.com/auth/cloud-platform'])

    # Do token request
    def req( method, url, headers, body, **kwargs):
        resp = requests.post(url, headers=headers, data=body)
        return type('obj', (object,), {'data' : resp.text, 'status': 200})

    scoped_credentials.refresh(req)
    return scoped_credentials.token

def run_job(token):
    resp = requests.post(
            END_POINT,
            json={
                "query": QUERY,
                "useLegacySql": False
            },
            headers={
                "Content-Type": "application/json",
                "Authorization": f"Bearer {token}"
            }
    )
    return resp.json()['jobReference']['jobId']

def get_query_results(job_id, token):
    status_endpoint = f'{END_POINT}/{job_id}?location=australia-southeast1'
    completed = False
    while not completed:
        response = requests.get(status_endpoint, headers={
                "Content-Type": "application/json",
                "Authorization": f"Bearer {token}"
            })
        completed = response.json()['jobComplete']
    
    data = response.json()
    rows = data['rows']
    columns = [c['name'] for c in data['schema']['fields']]
    
    return rows, columns

def extract():
    token = get_token()
    job_id = run_job(token)
    rows, columns = get_query_results(job_id, token)

    with open('results.csv', 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(columns)

        for row in rows:
            writer.writerow([i['v'] for i in row['f']])

extract()

What if I am in AWS etc and not Google Cloud, can I still authenticate and use the API?

Yes, it will work. We will provide service account credentials and you can reference them in your application. Here is an example.

# TODO(developer): Set key_path to the path to the service account key 
#                  file.
# key_path = "path/to/service_account.json"
 
credentials = service_account.Credentials.from_service_account_file(
    key_path, scopes=["https://www.googleapis.com/auth/cloud-platform"],
)
 
token = credentials.token
 
# use the token to do the API calls
# ...
# headers: Bearer ${token}
# ...

How can I determine what is the location of each dataset being shared with me?

This API call will tell you what location each dataset is in.

GET https://bigquery.googleapis.com/bigquery/v2/projects/insight-platform-external-iam/datasets

{
  "kind": "bigquery#datasetList",
  "etag": "RLU1Ww9C5FdhlcIuRHjW0A==",
  "datasets": [
    {
      "kind": "bigquery#dataset",
      "id": "insight-platform-external-iam:acme_insight_reporting",
      "datasetReference": {
        "datasetId": "acme_insight_reporting",
        "projectId": "insight-platform-external-iam"
      },
      "location": "australia-southeast1"
    },
    {
      "kind": "bigquery#dataset",
      "id": "insight-platform-external-iam:acme_acme_analytics",
      "datasetReference": {
        "datasetId": "acme_acme_analytics",
        "projectId": "insight-platform-external-iam"
      },
      "location": "us-central1"
    }
  ]
}

Any best practice tips?

Generally speaking, if you intend to be a heavy user of the data, particularly if you have access to the unaggregated data (requests/realised ads/orders/enhanced attribution etc), it is best to copy (stage) the tables into your own data warehouse, THEN implement queries for your required business logic on those copies.

Lighter users may just choose to query the tables directly for specific results.

It is important to remain below the quota allowed to ensure smooth functioning.

Also note, that each query can download 1 GB maximum, else an error message will be received. In the event that a very large download is needed, run a number of smaller queries instead (eg subset of data per day or supplier etc).

What if I need help with crafting appropriate SQL statements?

Raise a ticket specifying your attempted query and we can help review it - we’ll get back to you with any comments we may have.

Any tips for using the Pandas package?

Pandas is one of the most popular analytics tools. To make it work, dependencies pandas-gbq and pydata-google-auth need to be installed.

The snippet below is a working example of how to read data from a BigQuery table.

import pandas as pd
from google.oauth2 import service_account

credentials = service_account.Credentials.from_service_account_file('path/to/the/credential/file')

query = 'select * from project.dataset.table'

dat = pd.read_gbq(
    query,
    project_id='project_id',
    credentials=credentials
)

More information about the Pandas function can be found here.

Any tips for using the PySpark package?

Assuming you have a working PySpark environment, you need to provide the correct jar file for the BigQuery connector appropriate to your PySpark version. For instance, PySpark 3.2.* requires spark-3.2-bigquery-0.30.0.jar. The list of the jar files as well as practical code snippets and parameters can be found here.

The code snippet below provides an example of how to run a query.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('BigNumeric').config('spark.jars', 'spark-3.2-bigquery-0.30.0.jar').getOrCreate()

spark.conf.set('credentialsFile', 'path/to/the/credential/file'')

spark.conf.set('viewsEnabled, 'true')
spark.conf.set('materializationProject', 'yourMaterializationProject')
spark.conf.set('materializationDataset', 'yourMaterializationDataset')

query = 'select * from project.dataset.table'

df = spark.read.format('bigquery').option('query', query).load()

df.show()

IMPORTANT: The parameter viewsEnabled must be true.

Data in views is materialised in temporary tables before read by PySpark where bigquery.tables.create permission is needed. Therefore you need to provide the materializationProject and materializionDataset where the user does have write access.

I’m getting an error demanding a filter in the query?

For a partitioned table a filter is mandatory, without which an error message like the one below will be thrown:

Cannot query over table ‘dataset_id.table_id' without a filter over column(s) ‘partitioned_column' that can be used for partition elimination

To resolve the error, simply add a reasonable filter covering the target range, e.g.

-- this query returns all records available since yesterday
select
  * 
from
  dataset_id.table_id
where
  ingressed_at >= date_sub(current_date, interver 1 day)

To find out which column the table is partitioned against (like the ingressed_at in the example above), refer to the given table’s description.

How can I request access?

Process

A ticket must be raised and eligibility criteria agreed to in writing.

We will work with the potential Candidate to identify the required level of access and security settings and determine what quotas and costs may apply.

Eligibility Criteria

A Candidate must comply with the following Criteria to be considered eligible for Reporting API access: -

General

  1. The Candidate can only request access to data for CitrusAd namespaces and teams they otherwise are already members of, or, have general access to. The Candidate must specify which of the following scenarios they are applying for (and provide evidence of existing access):
    1. Environment level (an entire implementation of CitrusAd platform is dedicated to the Candidate).
    2. Namespace level (the Candidate has permission to see all teams, both retailer and supplier, within an individual Namespace or list of Namespaces).
    3. Specific Retailer Team id or group level.
    4. Specific Supplier Team id or group level.
      1. additionally, an Integrator can access a specific Supplier Team id or group level PLUS full Retailer Product Catalogs where agreed to by a Retailer on a case by case basis.
  2. Transactional Fact data can only be provided to Candidates that are eligible for General Criteria 1a or 1b.
  3. Candidates that do not qualify for transactional Fact data will be provided access to pre-aggregated Fact data only. The data will be aggregated into daily summaries (with UTC+0 as the time zone of aggregation).
  4. Candidates eligible for General Criteria 1d only, cannot receive Ad Request data (as opposed to Realised Ad data which will be provided). Product data will be provided for the specific Products being advertised in the Realised Ads by the Supplier EXCEPT for Integrators who may receive Retailer Product Catalogs where agreed to by a Retailer on a case by case basis.
  5. Dimensional data is only guaranteed to include the current versions of the records in question. It is expected that tracking of historical changes be implemented by the Candidate as they require.
  6. Data is updated daily and will be updated no later than 12:00 UTC+0 for data up to and including the previous completed UTC+0 day.
  7. It is understood that the access is read only in nature. The API is not to be used to create objects in our data warehouse for any purpose.
  8. Any blending with other data sources must be done in the Candidate's own environment.
  9. The Candidate must have an SDK (or equivalent) available for accessing Google BigQuery API.
  10. The Candidate has good knowledge of SQL.
  11. The Candidate will be familiar with CitrusAd concepts, and if not, will make arrangements for standard product training to be provided through their Customer Support Manager or Technical Account Manager.
  12. Based on documents provided, the Candidate is expected to develop their own solutions. If an issue is found with an SQL not behaving as expected according to the documentation, a ticket must be raised through regular support channels. The following information must be provided.
    1. The account through which the connection is being made.
    2. The exact SQL being called.
    3. A detailed description of what error messages occur.
  13. If the Candidate is a Retailer, it is required that Impressions/Clicks/Orders be provided to the CitrusAd platform such that a complete picture of the Ad lifecycle can be established.
  14. From time to time, CitrusAd reserve the right to alter the schema. These changes typically involve the addition of new columns to existing tables and views and would be backward compatible. Candidates must structure their SQL to name columns rather than use wildcards etc. In the event that a change involves deprecating a column or table, CitrusAd will give at least 12 weeks notice of the change before it is implemented. Notifications will occur via standard release communications made to platform users.

It is not mandatory for a Candidate to be an existing Google Cloud Platform (GCP) user; however, further Criteria exist depending on whether or not the Candidate is a GCP or non-GCP user.

Non-GCP Candidate

Unless agreed otherwise, CitrusAd will provide credentials for a single service account within our environment to the Candidate.

Unless agreed otherwise, the following default conditions apply:

  1. A maximum of 100 API calls per day.
  2. No more than 10TB of data scans per month (note, the API has a way of estimating the query scan size prior to execution, refer to Google documentation Dry run query | BigQuery | Google Cloud).
  3. If Non GCP Candidate Criteria 1 and/or 2 are exceeded, CitrusAd reserve the right to suspend access at our discretion alone.
  4. No individual API call can download more than 1GB of data at a time.

GCP Candidate

Unless agreed otherwise, the Candidate will provide CitrusAd details of no more than 5 GCP accounts so we can assign the required access.

Note that the account must have BigQuery Job User role (roles/bigquery.jobUser) assigned.

The following restrictions apply:

  1. A maximum of 100 API calls per day.
  2. If GCP Candidate Criteria 1 is exceeded, CitrusAd reserve the right to suspend access at our discretion alone.
  3. No individual API call can download more than 1GB of data.

Glossary

Environment

The name of the physical environment that the CitrusAd platform is deployed in. Each hosts one or more namespaces.

Namespace

A logical grouping of all entities that are part of an implementation of the CitrusAd solution. This includes teams and all objects owned by teams. Typically, a namespace may consist of a retailer (team) and multiple suppliers (teams) together with users for each team and other related config (retailers own catalogs, suppliers configure campaigns etc). Teams (and what they own) belong exclusively to a single namespace (no teams can exist across multiple namespaces).

User

Unique identifier of a user in the CitrusAd system. A single email may have multiple userIds. Each userId is unique per namespace. Each user will have a first name, last name, email, and id. A user can be a member of and access multiple teams in the CitrusAd platform.

Team

A team within the CitrusAd system. Can be a supplier (advertiser) or retailer. Supplier teams will typically create campaigns, retailers review campaigns and perform administrative functions. A user in the CitrusAd system can be a member of many teams or only one. A team will have typically users, campaigns and wallets associated with it.

Supplier

A supplier team within the CitrusAd system. A supplier might typically be a brand parent company or a series of teams per individual brand. Suppliers typically maintain campaigns, administer wallet balances etc.

Retailer

A retailer team within the CitrusAd system. Most namespaces will only have one retailer team. Retailers typically maintain product catalogs, review campaigns etc.

Campaign

A single unique campaign configured with a placement and targeting strategy for a specific selection of products. For example a campaign in the CitrusAd system could be promoting product A and B targeting search terms 'chocolate' and 'chocolates' with a max bid of $0.60. A single team typically has many campaigns.

Catalog

A unique retailer's product catalog in the CitrusAd system. It is typical for a retailer to only synchronise one product catalog with CitrusAd in a single namespace. A catalog will have a list of all products in the retailer's catalog, their name, brand, categories and other relevant attributes that are ingested into the CitrusAd system.

Product

A single unique product in the CitrusAd system. A product will have a unique product code synchronised in the product catalog. A product may have attributes such as category, taxonomy, brand etc.

Wallet

A wallet in the CitrusAd system stores an advertiser's funds for the purpose of making payments (eg paying for realised ads). Each wallet has a single currency code and can only spend against catalogs of that same currency code. A wallet is owned by a team. A team can have any number of wallets. A wallet can be archived. Archiving a wallet will only show/hide it in the platform, an archived wallet can still spend credits.

Ledger

A ledger of events that have resulted in a transaction in the CitrusAd system. This is most commonly ad events such as impressions or clicks for sponsored products or banner ads (resulting in a debit). This can also be top ups and adjustments to balances by a supplier (credits). Each event will have a 'reason' such as Sponsored Products, Banner Ads, Top Up.

Request

A request made to the CitrusAd system for ads. In the request the retailer specifies a placement as well as context such as a customer's sessionId or filters relevant to the request. Depending on the request, CitrusAd will send back ads of a relevant AdType (eg Category or Searchterm) to the retailer to render to the customer.

(Realised) Ad

An ad is a single ad event sent back to a retailer to serve to their customer. It becomes a realised ad when the retailer returns back confirmation that the ad has been impressioned at least (explicit confirmation that the ad was actually used, ie realised) . In the CitrusAd system, every ad will have a unique realisedad id which is a reference for that single unique event.

Category

A category is a page on the retailer's site as part of their website's taxonomy such as 'Bakery' or 'Dairy'. A retailer typically requests ads on a category page and specifies this relevant attribute in their request to CitrusAd. If CitrusAd has active and valid campaign's for the Category, ads will be returned.

SearchTerm

A search term entered by a customer onto the retailer's website. This search term is then sent to CitrusAd to request relevant ads. If CitrusAd has active and valid campaign's for the search term, ads will be returned.

Order

A unique order in the retailer's system synchronised with CitrusAd. A single order can have multiple order items within it (much like a customer's cart can hold multiple items). Once a customer's order is complete, these are sent to CitrusAd to power CitrusAd's attribution. Return on ad spend (ROAS) and other important KPIs can then be provided to retailers and advertisers.

Attribution

Attribution is a process operated in the CitrusAd system that assigns ads served to a customer to an order submitted. A typical customer journey would be that they see an ad (impression), click on it (click), add it to their cart, and purchase that item (conversion). The order is 'attributed' to the unique ad that the customer clicked. For an order to be attributed in the CitrusAd system, the ad needs to be interacted with (either seen or clicked subject to integration), and the customer has then purchased an item relevant to the ad. CitrusAd typically use a 'sessionId' to attribute orders to ads, where the retailer specifies a 'sessionId' in all relevant touchpoints of an ads journey. This is how CitrusAd is able to identify that a single ad, served to a single customer, has resulted in a specific order.

Dates

All data are cast into the UTC+0 timezone if aggregated.

Cap

Implementations of the CitrusAd platform often involve the retailer requesting more ads than would realistically ever be impressioned (realised). From an analytics perspective, this may give an inaccurate impression of how certain metrics are really performing.
For example, if a request was made for 20 ads (AdType=Product), and the platform served 2 ads in response, then that represents a 'fill rate' of 10% on the request (2 out of 20). However, if it is understood that in practice only 4 ads are likely to ever be used (realised), it would be preferable to interpret that as 50% filled (2 out of 4).
Hence the notion of capping within reporting.
Capping is set per retailer, with one cap available for product ads and another for banner ads (since product ad requests typically will request, and use, far more ads than banners).
Returning to the example, if the product cap = 4 for the retailer, then the request metrics would report as follows:-
NumAdRequests = 1
NumAdsRequested = 20
CappedNumAdsRequested = 4
NumAdsServed = 2
CappedNumAdsServed = 2
Note, in the event that 5 ads were served (ie ads served exceeded the cap itself), the last 2 metrics would report as:-
NumAdsServed = 5
CappedNumAdsServed = 4 (trimmed back to the cap)
Caps are not mandatory. In the event they are not specified, then capped and uncapped results would be the same.

Enhanced Attribution

The CitrusAd platform performs attributions as outlined in the Attribution section (see above).
The reporting subsystem can also detect and flag other attribution scenarios depending on the retailer (enhanced attribution).

The scenarios are:

  • Impression View Thru Attribution
    • An Order was attributed to an Ad that was viewed for the same Product in the same session id (ie was an impression and not a click).
  • Halo Click Attribution
    • An Order was attributed to an Ad that was clicked for a Product belonging to the same Halo level in the same session id. The most common halo level is Brand (ie the Product of the Ad and the Product of the Order are different, but belong to the same Brand). Other halo types are possible depending on implementation. For example, the halo may be more specific and require that the Ad and Order be for Products that have a common Category in addition to a common Brand. Retailer taxonomy set per Product in the Catalog is used to define this extra level of detail in the Halo.

Version: 96e63c0