BatCat API
Data Science Environment Setup
- batcat.FileSys(project_name=True)[source]
Establish a data science project file system.
- Parameters:
project_name (bool) – Whether a project name is needed, default True.
- Yields:
A data science project file system.
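A minimal usage sketch; the call scaffolds the project layout in the current working directory:

    import batcat

    # Scaffold the data science project file system; project_name=True
    # means a project name is included, per the parameter above.
    batcat.FileSys(project_name=True)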
- batcat.get_logger(logName, logFile=False)[source]
Get a logger in one step.
Logging is one of Python's most underrated features. Two numbers (5 and 3) to take away from logging in Python: (1) logs carry FIVE levels of importance (debug, info, warning, error, critical); (2) THREE components configure a logger (a logger, a formatter, and at least one handler).
- Parameters:
logName (str) – A logger name to display in log messages.
logFile (bool) – Whether to save logs to a file, default False.
- Returns:
A well-organized logger.
- Return type:
logger (logging.Logger)
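A minimal sketch showing the one-step setup and the five levels; the logger name is hypothetical:

    import batcat

    logger = batcat.get_logger(logName='etl-job', logFile=False)

    # The five levels, least to most severe:
    logger.debug('diagnostic detail')
    logger.info('normal operation')
    logger.warning('something unexpected')
    logger.error('an operation failed')
    logger.critical('the program may not continue')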
Simple Storage Service (S3)
- batcat.read_csv_from_bucket(bucket, key, encoding=None)[source]
Read CSV from AWS S3.
- Parameters:
bucket (str) – Bucket name of S3.
key (str) – Key of S3.
encoding (str, optional) – Encoding of the CSV file, default None.
- Returns:
Dataframe.
- Return type:
df (pandas.DataFrame)
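A minimal sketch; the bucket and key are hypothetical placeholders:

    import batcat

    df = batcat.read_csv_from_bucket(
        bucket='my-data-bucket',
        key='raw/sales.csv',
    )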
- batcat.read_excel_from_bucket(bucket, key, sheet_name=0, header=0)[source]
Read Excel from AWS S3.
- Parameters:
bucket (str) – Bucket name of S3.
key (str) – Key of S3.
sheet_name – The target sheet name of the Excel file, default 0 (the first sheet).
header (int) – Row (0-indexed) to use for the column labels, default 0.
- Returns:
Dataframe.
- Return type:
df (pandas.DataFrame)
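A minimal sketch with a hypothetical bucket and key:

    import batcat

    df = batcat.read_excel_from_bucket(
        bucket='my-data-bucket',
        key='raw/sales.xlsx',
        sheet_name=0,  # first sheet
        header=0,      # first row holds the column names
    )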
- batcat.save_to_bucket(df, bucket, key)[source]
Save DataFrame to AWS S3.
- Parameters:
bucket (str) – Bucket name of S3.
key (str) – Key of S3.
df (pandas.DataFrame) – Dataframe.
- Returns:
HTTP status code.
- Return type:
status (int)
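A minimal round-trip sketch with a hypothetical bucket and key:

    import pandas as pd
    import batcat

    df = pd.DataFrame({'sku': ['A', 'B'], 'qty': [3, 5]})

    # Returns the HTTP status code of the S3 put (200 on success).
    status = batcat.save_to_bucket(df=df, bucket='my-data-bucket', key='clean/sales.csv')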
- batcat.copy_bucket_files(bucket, prefix, suffix, target_bucket, target_prefix, target_suffix, key_sub)[source]
Copy files from a source S3 bucket to a target bucket.
- Parameters:
bucket (str) – Source bucket.
prefix (str) – Prefix of source files.
suffix (str) – Suffix of source files.
target_bucket (str) – Target bucket.
target_prefix (str) – Prefix of target files.
target_suffix (str) – Suffix of target files.
key_sub (tuple) – Information to subtract from the source keys.
- Returns:
None
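A minimal sketch; all names are hypothetical, and the (old, new) reading of key_sub is an assumption based on the parameter description:

    import batcat

    batcat.copy_bucket_files(
        bucket='my-data-bucket',
        prefix='raw/2023',
        suffix='.csv',
        target_bucket='my-archive-bucket',
        target_prefix='backup/2023',
        target_suffix='.csv',
        key_sub=('raw/', ''),  # assumed: substring pair subtracted from each source key
    )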
Redshift
- batcat.get_date_with_delta(delta, format='%Y/%m/%d')[source]
Get the date delta days ago.
- Parameters:
delta (int) – The number of days ago.
format (str) – Output date format, default ‘%Y/%m/%d’.
- Returns:
The date delta days ago, formatted with strftime.
- Return type:
date (str)
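A minimal sketch:

    import batcat

    # The date one day ago, e.g. '2023/05/01' when run on 2023/05/02.
    yesterday = batcat.get_date_with_delta(delta=1)

    # The same date in ISO form, via the format parameter.
    yesterday_iso = batcat.get_date_with_delta(delta=1, format='%Y-%m-%d')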
- batcat.read_data_from_redshift(query, host, password, port=5439, database='dev', user='awsuser', date_start=None, date_end=None)[source]
Read DataFrame from Redshift with host and password.
- Parameters:
query (str) – Query to obtain data from Redshift.
host (str) – Redshift configuration.
password (str) – Redshift configuration.
port (int) – Redshift configuration, default 5439.
database (str) – Redshift configuration.
user (str) – Redshift configuration.
date_start (str) – Date to start, strftime(‘%Y/%m/%d’).
date_end (str) – Date to end, strftime(‘%Y/%m/%d’).
- Returns:
Target dataframe.
- Return type:
df (pandas.DataFrame)
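A minimal sketch with hypothetical connection details; that date_start/date_end are interpolated into the query is an assumption here:

    import batcat

    # Hypothetical query; the {date_start}/{date_end} placeholders assume
    # the function substitutes the two date arguments into the query.
    query = """
    SELECT order_id, order_date, amount
    FROM sales
    WHERE order_date BETWEEN '{date_start}' AND '{date_end}'
    """

    df = batcat.read_data_from_redshift(
        query=query,
        host='[name].[id].[region].redshift.amazonaws.com',
        password='********',
        date_start='2023/01/01',
        date_end='2023/01/31',
    )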
- batcat.save_df_to_redshift(df, host=None, password=None, port=5439, database='dev', user='awsuser', table_name=None, schema='public', if_exists='append', index=True, index_label=None, chunksize=None, dtype=None, method=None)[source]
Save a pandas.DataFrame to Redshift with host and password. Refer to pandas.DataFrame.to_sql for more information.
- Parameters:
df (pandas.DataFrame) – Target dataframe.
host (str) – In the form [name].[id].[region].redshift.amazonaws.com.
password (str) – Redshift configuration.
port (int) – Redshift port, usually 5439.
database (str) – Redshift configuration.
user (str) – Redshift configuration.
table_name (str) – Target table name.
schema (str) – Specify the schema (if database flavor supports this). If None, use default schema.
if_exists (str) – How to behave if the table already exists, {‘fail’, ‘replace’, ‘append’}, default ‘append’. (1) fail: Raise a ValueError. (2) replace: Drop the table before inserting new values. (3) append: Insert new values to the existing table.
index (bool) – Write DataFrame index as a column, default True. Uses index_label as the column name in the table.
index_label (str or sequence) – Column label for index column(s), default None. If None is given (default) and index is True, then the index names are used. A sequence should be given if the DataFrame uses MultiIndex.
chunksize (int, optional) – Specify the number of rows in each batch to be written at a time. By default, all rows will be written at once.
dtype (dict or scalar, optional) – Specifying the datatype for columns. If a dictionary is used, the keys should be the column names and the values should be the SQLAlchemy types or strings for the sqlite3 legacy mode. If a scalar is provided, it will be applied to all columns.
method (str, optional) – Controls the SQL insertion clause used: (1) None: uses a standard SQL INSERT clause (one per row). (2) ‘multi’: passes multiple values in a single INSERT clause. (3) callable with signature (pd_table, conn, keys, data_iter): details and a sample callable implementation can be found in the pandas “insert method” section.
- Returns:
None
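A minimal sketch with hypothetical connection details, appending to a staging table:

    import pandas as pd
    import batcat

    df = pd.DataFrame({'sku': ['A', 'B'], 'qty': [3, 5]})

    batcat.save_df_to_redshift(
        df=df,
        host='[name].[id].[region].redshift.amazonaws.com',
        password='********',
        table_name='sales_stage',  # hypothetical table
        schema='public',
        if_exists='append',
        index=False,
    )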
- batcat.read_data_from_redshift_by_secret(secret_name=None, region=None, query=None, date_start=None, date_end=None, delay=100)[source]
Read DataFrame from Redshift with AWS Secrets Manager.
- Parameters:
secret_name (str) – The name of AWS Secrets Manager.
region (str) – AWS region name.
query (str) – Query to obtain data from Redshift.
date_start (str) – Date to start, strftime(‘%Y/%m/%d’).
date_end (str) – Date to end, strftime(‘%Y/%m/%d’).
delay (int) – Time to wait for the query.
- Returns:
Target dataframe.
- Return type:
df (pandas.DataFrame)
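A minimal sketch; the secret name and region are hypothetical, and credentials are resolved from AWS Secrets Manager rather than passed directly:

    import batcat

    df = batcat.read_data_from_redshift_by_secret(
        secret_name='redshift/analytics',
        region='cn-northwest-1',
        query='SELECT COUNT(*) FROM sales',
        delay=100,  # time to wait for the query
    )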
Athena
- batcat.read_data_from_athena(query, region, s3_staging_dir, date_start=None, date_end=None)[source]
Read data as DataFrame from AWS Athena.
- Parameters:
query (str) – Query to obtain data from Athena.
region (str) – Region of the AWS environment, e.g. “cn-northwest-1”.
s3_staging_dir (str) – S3 staging directory, e.g. “s3://#####-###-###-queryresult/ATHENA_QUERY”.
date_start (str) – Date to start, strftime(‘%Y/%m/%d’).
date_end (str) – Date to end, strftime(‘%Y/%m/%d’).
- Returns:
Dataframe.
- Return type:
df (pandas.DataFrame)
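A minimal sketch with a hypothetical query and staging directory:

    import batcat

    df = batcat.read_data_from_athena(
        query='SELECT * FROM sales LIMIT 10',
        region='cn-northwest-1',
        s3_staging_dir='s3://my-query-results/ATHENA_QUERY',
    )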
SageMaker
- batcat.deploy_model(model, model_name='model', bucket='[bucket]')[source]
Deploy a scikit-learn model to a SageMaker endpoint.
- Parameters:
model – A scikit-learn model.
model_name (str) – The model name.
bucket (str) – The bucket to store model, which is also the project name in BatCat convention.
- Returns:
The model, endpoint configuration, and endpoint information.
- Return type:
response (dict)
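A minimal sketch: train a toy scikit-learn model, then deploy it; the model name and bucket are hypothetical:

    from sklearn.linear_model import LogisticRegression
    import batcat

    model = LogisticRegression().fit([[0.0], [1.0]], [0, 1])

    # The bucket doubles as the project name, per BatCat convention.
    response = batcat.deploy_model(model, model_name='churn', bucket='my-ml-project')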
- batcat.invoke(endpoint_name, input_data)[source]
Invoke a SageMaker endpoint with input data.
- Parameters:
endpoint_name (str) – The name of the SageMaker endpoint.
input_data (list) – The input data to send to the endpoint.
- Returns:
The response from the SageMaker endpoint.
- Return type:
result (list)
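A minimal sketch with a hypothetical endpoint; input_data is a list of feature rows:

    import batcat

    result = batcat.invoke(
        endpoint_name='churn-endpoint',
        input_data=[[0.2], [0.9]],
    )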
Lambda
Elastic Container Registry (ECR)
- batcat.template_docker(project='[project]', uri_suffix='amazonaws.com.cn', pip_image=True, python_version='3.7-slim-buster')[source]
Build a Docker image and push it to AWS ECR for a machine learning project.
- Parameters:
project (str) – Used as the name of the AWS ECR repository to be set up.
uri_suffix (str) – Suffix of the URI, default ‘amazonaws.com.cn’.
pip_image (bool) – Whether a pip mirror image is needed, default True, in which case the Douban mirror is used.
python_version (str) – Python version, default ‘3.7-slim-buster’.
- Yields:
A template Docker setup Bash file and a template requirements file to the current directory.
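A minimal sketch; the project name is hypothetical, and the call writes the template files into the current directory:

    import batcat

    batcat.template_docker(
        project='churn',
        uri_suffix='amazonaws.com.cn',
        pip_image=True,
        python_version='3.7-slim-buster',
    )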
SageMaker Processing
- batcat.processing_output_path(purpose, timestamp=True, local=False)[source]
Set up a result path within the container.
- Parameters:
purpose (str) – A purpose under a project.
timestamp (bool) – Whether a timestamp in file name is needed.
local (bool) – Whether to set the path to local for testing.
- Returns:
A CSV path for later usage.
- Return type:
path (str)
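A minimal sketch for a hypothetical ‘forecast’ purpose:

    import batcat

    # Path inside the processing container, with a timestamp in the
    # file name; set local=True when testing outside SageMaker.
    path = batcat.processing_output_path(purpose='forecast', timestamp=True)
    # e.g. write results with df.to_csv(path)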
- batcat.setup_workflow(project='[project]', purpose='[purpose]', workflow_execution_role='arn:aws-cn:iam::[account-id]:role/[role-name]', instance_type='ml.t3.medium', ecr_uri_suffix='amazonaws.com.cn', ecr_tag=':latest', network_config=None, enable_network_isolation=False, security_group_ids=None, subnets=None)[source]
Set up everything needed for a Step Functions workflow with SageMaker.
- Parameters:
project (str) – Project name under SageMaker.
purpose (str) – Subproject.
workflow_execution_role (str) – ARN of the role that executes Step Functions.
instance_type (str) – Instance type for the processing job, default ‘ml.t3.medium’; for better performance, try ‘ml.m5.4xlarge’.
ecr_uri_suffix (str) – ECR URI suffix, default ‘amazonaws.com.cn’.
ecr_tag (str) – ECR tag, default ‘:latest’.
network_config (sagemaker.network.NetworkConfig) – Network configuration for the processing job.
enable_network_isolation (bool) – Whether to enable network isolation.
security_group_ids (list) – Security group IDs for the processing job.
subnets (list) – Subnets for the processing job.
- Returns:
A workflow instance.
- Return type:
workflow (stepfunctions.workflow.Workflow)
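A minimal sketch, using the docs' own bracket placeholders:

    import batcat

    workflow = batcat.setup_workflow(
        project='[project]',
        purpose='[purpose]',
        workflow_execution_role='arn:aws-cn:iam::[account-id]:role/[role-name]',
        instance_type='ml.t3.medium',
    )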
- batcat.test_workflow(workflow, project='[project]', purpose='[purpose]', result_s3_bucket='[s3-bucket]')[source]
Test a Step Functions workflow.
- Parameters:
workflow (stepfunctions.workflow.Workflow) – A workflow instance.
project (str) – Project name under SageMaker.
purpose (str) – Subproject.
result_s3_bucket (str) – S3 bucket for saving results.
- Returns:
None
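A minimal sketch, continuing from setup_workflow above:

    import batcat

    workflow = batcat.setup_workflow(
        project='[project]',
        purpose='[purpose]',
        workflow_execution_role='arn:aws-cn:iam::[account-id]:role/[role-name]',
    )

    # Run a test execution and write results to the given bucket.
    batcat.test_workflow(
        workflow,
        project='[project]',
        purpose='[purpose]',
        result_s3_bucket='[s3-bucket]',
    )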
Step Functions
- batcat.template_stepfunctions(project='[project]', purpose='[purpose]', result_s3_bucket='[s3-bucket]', workflow_execution_role='arn:[partition]:iam::[account-id]:role/[role-name]')[source]
Generate a template Python script for setting up Step Functions.
- Parameters:
project (str) – Project name under SageMaker.
purpose (str) – Subproject.
result_s3_bucket (str) – S3 bucket for saving results.
workflow_execution_role (str) – Execution role ARN.
- Yields:
A template AWS Step Functions setup file to the current directory.
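A minimal sketch; the call writes a template Step Functions setup script to the current directory, with placeholder values in the docs' bracket style:

    import batcat

    batcat.template_stepfunctions(
        project='[project]',
        purpose='[purpose]',
        result_s3_bucket='[s3-bucket]',
        workflow_execution_role='arn:[partition]:iam::[account-id]:role/[role-name]',
    )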
- batcat.template_lambda(project='[project]', purpose='[purpose]', result_s3_bucket='[s3-bucket]', partition='aws-cn')[source]
Generate a template Python script for setting up Lambda.
- Parameters:
project (str) – Project name under SageMaker.
purpose (str) – Subproject.
result_s3_bucket (str) – S3 bucket for saving results.
partition (str) – The partition in which the resource is located; a partition is a group of AWS Regions. Default ‘aws-cn’.
- Yields:
A template Lambda Functions file to the current directory.
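A minimal sketch; the call writes a template Lambda file to the current directory:

    import batcat

    batcat.template_lambda(
        project='[project]',
        purpose='[purpose]',
        result_s3_bucket='[s3-bucket]',
        partition='aws-cn',
    )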