BatCat API

Data Science Environment Setup

batcat.FileSys(project_name=True)[source]

Establish a data science project file system.

Parameters:

project_name (bool) – Whether a project name is needed, default True.

Yields:

A data science project file system.
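
A minimal usage sketch, assuming it is run from the root of a new project directory:

>>> import batcat
>>> batcat.FileSys()   # create the project file system (project_name=True by default)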

batcat.get_logger(logName, logFile=False)[source]

Get a logger in one step.

Logging is one of the most underrated features. Two things (5 & 3) to take away from logging in Python: (1) FIVE levels of importance that logs can carry (debug, info, warning, error, critical); (2) THREE components to configure a logger in Python (a logger, a formatter, and at least one handler).

Parameters:
  • logName (str) – A logger name to display in log messages.

  • logFile (bool) – Whether to save logs to a file, default False.

Returns:

A well organized logger.

Return type:

logger (logging.Logger)
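
A minimal sketch of creating and using a logger; the name ‘etl’ is only an illustration:

>>> import batcat
>>> logger = batcat.get_logger('etl')   # logFile=False: log to the console only
>>> logger.info('job started')          # standard logging.Logger methods apply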

batcat.get_config(file='config.json')[source]

Get configurations.

Parameters:

file (str) – Configuration file, default ‘config.json’.

Returns:

A configuration dictionary.

Return type:

config (dict)
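
A minimal sketch, assuming a config.json exists in the working directory; the ‘bucket’ key shown is hypothetical and depends on your file:

>>> import batcat
>>> config = batcat.get_config('config.json')
>>> bucket = config['bucket']   # hypothetical key, for illustration only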

Simple Storage Service (S3)

batcat.read_csv_from_bucket(bucket, key, encoding=None)[source]

Read CSV from AWS S3.

Parameters:
  • bucket (str) – Bucket name of S3.

  • key (str) – Key of S3.

  • encoding (str) – File encoding, default None.

Returns:

Dataframe.

Return type:

df (pandas.DataFrame)
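
A minimal sketch; the bucket name and key below are placeholders:

>>> import batcat
>>> df = batcat.read_csv_from_bucket(bucket='my-bucket', key='data/input.csv')
>>> df.head()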

batcat.read_excel_from_bucket(bucket, key, sheet_name=0, header=0)[source]

Read Excel from AWS S3.

Parameters:
  • bucket (str) – Bucket name of S3.

  • key (str) – Key of S3.

  • sheet_name (str or int) – The target sheet name of the Excel file, default 0.

  • header (int) – Row number to use as the column names, default 0.

Returns:

Dataframe.

Return type:

df (pandas.DataFrame)
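
Usage mirrors read_csv_from_bucket; the bucket, key, and sheet name are placeholders:

>>> import batcat
>>> df = batcat.read_excel_from_bucket(bucket='my-bucket', key='data/input.xlsx',
...                                    sheet_name='Sheet1')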

batcat.save_to_bucket(df, bucket, key)[source]

Save DataFrame to AWS S3.

Parameters:
  • bucket (str) – Bucket name of S3.

  • key (str) – Key of S3.

  • df (pandas.DataFrame) – Dataframe.

Returns:

HTTP status code.

Return type:

status (int)
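
A minimal sketch; the dataframe contents, bucket, and key are placeholders:

>>> import batcat
>>> import pandas as pd
>>> df = pd.DataFrame({'a': [1, 2]})
>>> status = batcat.save_to_bucket(df, bucket='my-bucket', key='results/output.csv')
>>> status   # HTTP status code, e.g. 200 on success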

batcat.copy_bucket_files(bucket, prefix, suffix, target_bucket, target_prefix, target_suffix, key_sub)[source]

Copy files from a source S3 bucket to a target bucket.

Parameters:
  • bucket (str) – Source bucket.

  • prefix (str) – Prefix of source files.

  • suffix (str) – Suffix of source files.

  • target_bucket (str) – Target bucket.

  • target_prefix (str) – Prefix of target files.

  • target_suffix (str) – Suffix of target files.

  • key_sub (tuple) – Information to subtract from source keys.

Returns:

None
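
A minimal sketch; all bucket names, prefixes, and suffixes are placeholders, and the key_sub value is an assumption about how keys are rewritten:

>>> import batcat
>>> batcat.copy_bucket_files(
...     bucket='source-bucket', prefix='raw/', suffix='.csv',
...     target_bucket='target-bucket', target_prefix='archive/', target_suffix='.csv',
...     key_sub=('raw/', ''))   # placeholder; exact key_sub semantics are an assumption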

batcat.SuccessSignal(bucket, key='.success')[source]

Send a success signal file to a target bucket.

Parameters:
  • bucket (str) – Target bucket to receive a signal.

  • key (str) – Signal file.

Returns:

HTTP status code.

Return type:

status (int)

Redshift

batcat.get_date_with_delta(delta, format='%Y/%m/%d')[source]

Get the date delta days ago.

Parameters:

  • delta (int) – The number of days ago.

  • format (str) – Date format, default ‘%Y/%m/%d’.

Returns:

A date string in the given format, default ‘%Y/%m/%d’.

Return type:

date (str)
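
A minimal sketch; the dates in the comments are illustrative:

>>> import batcat
>>> batcat.get_date_with_delta(7)                      # e.g. '2024/01/01'
>>> batcat.get_date_with_delta(7, format='%Y-%m-%d')   # e.g. '2024-01-01'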

batcat.read_data_from_redshift(query, host, password, port=5439, database='dev', user='awsuser', date_start=None, date_end=None)[source]

Read DataFrame from Redshift with host and password.

Parameters:
  • query (str) – Query to obtain data from Redshift.

  • host (str) – Redshift configuration.

  • password (str) – Redshift configuration.

  • port (int) – Redshift configuration, default 5439.

  • database (str) – Redshift configuration.

  • user (str) – Redshift configuration.

  • date_start (str) – Date to start, strftime(‘%Y/%m/%d’).

  • date_end (str) – Date to end, strftime(‘%Y/%m/%d’).

Returns:

Target dataframe.

Return type:

df (pandas.DataFrame)
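
A minimal sketch using the default port, database, and user; the host, password, and query are placeholders:

>>> import batcat
>>> df = batcat.read_data_from_redshift(
...     query='SELECT * FROM public.orders LIMIT 10',          # placeholder query
...     host='name.id.region.redshift.amazonaws.com',          # placeholder host
...     password='********')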

batcat.save_df_to_redshift(df, host=None, password=None, port=5439, database='dev', user='awsuser', table_name=None, schema='public', if_exists='append', index=True, index_label=None, chunksize=None, dtype=None, method=None)[source]

Save pd.DataFrame to Redshift with host and password. Refer to pandas.to_sql for more information.

Parameters:
  • df (pandas.DataFrame) – Target dataframe.

  • host (str) – In the form [name].[id].[region].redshift.amazonaws.com.

  • password (str) – Redshift configuration.

  • port (int) – Usually 5439.

  • database (str) – Redshift configuration.

  • user (str) – Redshift configuration.

  • table_name (str) – Target table name.

  • schema (str) – Specify the schema (if database flavor supports this). If None, use default schema.

  • if_exists (str) – How to behave if the table already exists, {‘fail’, ‘replace’, ‘append’}, default ‘append’. (1) fail: Raise a ValueError. (2) replace: Drop the table before inserting new values. (3) append: Insert new values to the existing table.

  • index (bool) – Write DataFrame index as a column, default True. Uses index_label as the column name in the table.

  • index_label (str or sequence) – Column label for index column(s), default None. If None is given (default) and index is True, then the index names are used. A sequence should be given if the DataFrame uses MultiIndex.

  • chunksize (int, optional) – Specify the number of rows in each batch to be written at a time. By default, all rows will be written at once.

  • dtype (dict or scalar, optional) – Specifying the datatype for columns. If a dictionary is used, the keys should be the column names and the values should be the SQLAlchemy types or strings for the sqlite3 legacy mode. If a scalar is provided, it will be applied to all columns.

  • method (str) – Controls the SQL insertion clause used: (1) None: Uses standard SQL INSERT clause (one per row). (2) ‘multi’: Pass multiple values in a single INSERT clause. (3) A callable with signature (pd_table, conn, keys, data_iter). Details and a sample callable implementation can be found in the pandas ‘insert method’ documentation.

Returns:

None
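
A minimal sketch; the host, password, and table name are placeholders:

>>> import batcat
>>> import pandas as pd
>>> df = pd.DataFrame({'a': [1, 2]})
>>> batcat.save_df_to_redshift(
...     df,
...     host='name.id.region.redshift.amazonaws.com',   # placeholder host
...     password='********',
...     table_name='my_table',                          # placeholder table
...     if_exists='append')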

batcat.read_data_from_redshift_by_secret(secret_name=None, region=None, query=None, date_start=None, date_end=None, delay=100)[source]

Read DataFrame from Redshift with AWS Secrets Manager.

Parameters:
  • secret_name (str) – The name of AWS Secrets Manager.

  • region (str) – AWS region name.

  • query (str) – Query to obtain data from Redshift.

  • date_start (str) – Date to start, strftime(‘%Y/%m/%d’).

  • date_end (str) – Date to end, strftime(‘%Y/%m/%d’).

  • delay (int) – Time to wait for the query.

Returns:

Target dataframe.

Return type:

df (pandas.DataFrame)
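
A minimal sketch; the secret name, region, and query are placeholders:

>>> import batcat
>>> df = batcat.read_data_from_redshift_by_secret(
...     secret_name='redshift/analytics',               # placeholder secret name
...     region='cn-northwest-1',
...     query='SELECT * FROM public.orders LIMIT 10')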

batcat.get_secret(secret_name, region)[source]

Get configurations from AWS Secrets Manager.

Parameters:
  • secret_name (str) – A secret name set up in AWS Secrets Manager.

  • region (str) – The region name of AWS.

Returns:

The secret configurations.

Return type:

secret (dict)
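
A minimal sketch; the secret name is a placeholder, and the ‘host’ key shown is only an assumption about what the secret contains:

>>> import batcat
>>> secret = batcat.get_secret('redshift/analytics', 'cn-northwest-1')
>>> secret['host']   # hypothetical key; depends on how the secret was defined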

Athena

batcat.read_data_from_athena(query, region, s3_staging_dir, date_start=None, date_end=None)[source]

Read data as DataFrame from AWS Athena.

Parameters:
  • query (str) – Query to obtain data from Athena.

  • region (str) – Region of the AWS environment, e.g. “cn-northwest-1”.

  • s3_staging_dir (str) – S3 staging directory, e.g. “s3://#####-###-###-queryresult/ATHENA_QUERY”.

  • date_start (str) – Date to start, strftime(‘%Y/%m/%d’).

  • date_end (str) – Date to end, strftime(‘%Y/%m/%d’).

Returns:

Dataframe.

Return type:

df (pandas.DataFrame)
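
A minimal sketch; the query, region, and staging directory are placeholders:

>>> import batcat
>>> df = batcat.read_data_from_athena(
...     query='SELECT * FROM my_db.my_table LIMIT 10',
...     region='cn-northwest-1',
...     s3_staging_dir='s3://my-bucket-queryresult/ATHENA_QUERY')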

SageMaker

batcat.deploy_model(model, model_name='model', bucket='[bucket]')[source]

Deploy a scikit-learn model to a SageMaker endpoint.

Parameters:
  • model – A scikit-learn model.

  • model_name (str) – The model name.

  • bucket (str) – The bucket to store the model, which is also the project name in BatCat convention.

Returns:

The model, endpoint configuration, and endpoint information.

Return type:

response (dict)
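
A minimal sketch with a toy scikit-learn model; the model name and bucket are placeholders:

>>> import batcat
>>> from sklearn.linear_model import LinearRegression
>>> model = LinearRegression().fit([[0], [1]], [0, 1])   # toy model for illustration
>>> response = batcat.deploy_model(model, model_name='demo-model', bucket='my-project')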

batcat.invoke(endpoint_name, input_data)[source]

Invoke a SageMaker endpoint with input data.

Parameters:
  • endpoint_name (str) – The name of the SageMaker endpoint.

  • input_data (list) – The input data to send to the endpoint.

Returns:

The response from the SageMaker endpoint.

Return type:

result (list)
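
A minimal sketch; the endpoint name and input rows are placeholders:

>>> import batcat
>>> result = batcat.invoke(endpoint_name='demo-model-endpoint',
...                        input_data=[[0.1, 0.2]])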

Lambda

batcat.print_event(event)[source]

Print Lambda event name.

Parameters:

event (dict) – S3 trigger event.

Returns:

None

batcat.get_bucket_key(event)[source]

Get the S3 bucket and key from the event.

Parameters:

event (dict) – S3 trigger event.

Returns:

Bucket and key of the event.

Return type:

bucket, key
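
A minimal Lambda handler sketch combining print_event and get_bucket_key; the handler name and the returned payload are illustrative:

>>> import batcat
>>> def handler(event, context):              # standard Lambda entry point
...     batcat.print_event(event)             # log the trigger event
...     bucket, key = batcat.get_bucket_key(event)
...     return {'bucket': bucket, 'key': key}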

Elastic Container Registry (ECR)

batcat.template_docker(project='[project]', uri_suffix='amazonaws.com.cn', pip_image=True, python_version='3.7-slim-buster')[source]

Generate templates for building a Docker image and pushing it to AWS ECR for a machine learning project.

Parameters:
  • project (str) – Used as the name of the AWS ECR repository to be set up.

  • uri_suffix (str) – Suffix of the ECR URI, default ‘amazonaws.com.cn’.

  • pip_image (bool) – Whether a pip mirror is needed, default True, which uses the Douban mirror.

  • python_version (str) – Python version, default ‘3.7-slim-buster’.

Yields:

A template Docker setup Bash file and a template requirements file to the current directory.
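
A minimal sketch; the project name is a placeholder:

>>> import batcat
>>> batcat.template_docker(project='churn-prediction')   # writes template files to the current directory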

SageMaker Processing

batcat.processing_output_path(purpose, timestamp=True, local=False)[source]

Set up a result path within the container.

Parameters:
  • purpose (str) – A purpose under a project.

  • timestamp (bool) – Whether a timestamp in file name is needed.

  • local (bool) – Whether to set the path to a local directory for testing.

Returns:

A CSV path for later usage.

Return type:

path (str)
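
A minimal sketch; the purpose string is a placeholder:

>>> import batcat
>>> path = batcat.processing_output_path('forecast', local=True)   # local path for testing
>>> # df.to_csv(path) once results are ready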

batcat.setup_workflow(project='[project]', purpose='[purpose]', workflow_execution_role='arn:aws-cn:iam::[account-id]:role/[role-name]', instance_type='ml.t3.medium', ecr_uri_suffix='amazonaws.com.cn', ecr_tag=':latest', network_config=None, enable_network_isolation=False, security_group_ids=None, subnets=None)[source]

Set up everything needed for a Step Functions workflow with SageMaker.

Parameters:
  • project (str) – Project name under SageMaker.

  • purpose (str) – Subproject.

  • workflow_execution_role (str) – ARN of the role used to execute Step Functions.

  • instance_type (str) – Instance type for the processing job, default ‘ml.t3.medium’; for better performance, try ‘ml.m5.4xlarge’.

  • ecr_uri_suffix (str) – ECR URI suffix, default ‘amazonaws.com.cn’.

  • ecr_tag (str) – ECR tag, default ‘:latest’.

  • network_config (sagemaker.network.NetworkConfig) – Network configuration for the processing job.

  • enable_network_isolation (bool) – Whether to enable network isolation.

  • security_group_ids (list) – Security group IDs for the processing job.

  • subnets (list) – Subnets for the processing job.

Returns:

A workflow instance.

Return type:

workflow (stepfunctions.workflow.Workflow)
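
A minimal sketch; the project, purpose, and role ARN are placeholders:

>>> import batcat
>>> workflow = batcat.setup_workflow(
...     project='churn-prediction',
...     purpose='training',
...     workflow_execution_role='arn:aws-cn:iam::123456789012:role/StepFunctionsRole')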

batcat.test_workflow(workflow, project='[project]', purpose='[purpose]', result_s3_bucket='[s3-bucket]')[source]

Test a Step Functions workflow.

Parameters:
  • workflow (stepfunctions.workflow.Workflow) – A workflow instance.

  • project (str) – Project name under SageMaker.

  • purpose (str) – Subproject.

  • result_s3_bucket (str) – S3 bucket for saving results.

Returns:

None
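
A minimal sketch continuing from the setup_workflow example above; the bucket name is a placeholder:

>>> batcat.test_workflow(workflow,
...                      project='churn-prediction',
...                      purpose='training',
...                      result_s3_bucket='my-results-bucket')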

Step Functions

batcat.template_stepfunctions(project='[project]', purpose='[purpose]', result_s3_bucket='[s3-bucket]', workflow_execution_role='arn:[partition]:iam::[account-id]:role/[role-name]')[source]

Generate a template Python script for setting up Step Functions.

Parameters:
  • project (str) – Project name under SageMaker.

  • purpose (str) – Subproject.

  • result_s3_bucket (str) – S3 bucket for saving results.

  • workflow_execution_role (str) – Execution role ARN.

Yields:

A template AWS Step Functions setup file to the current directory.
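
A minimal sketch; all arguments are placeholders:

>>> import batcat
>>> batcat.template_stepfunctions(
...     project='churn-prediction', purpose='training',
...     result_s3_bucket='my-results-bucket',
...     workflow_execution_role='arn:aws-cn:iam::123456789012:role/StepFunctionsRole')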

batcat.template_lambda(project='[project]', purpose='[purpose]', result_s3_bucket='[s3-bucket]', partition='aws-cn')[source]

Generate a template Python script for setting up Lambda.

Parameters:
  • project (str) – Project name under SageMaker.

  • purpose (str) – Subproject.

  • result_s3_bucket (str) – S3 bucket for saving results.

  • partition (str) – The partition in which the resource is located. A partition is a group of Amazon Regions. Default as ‘aws-cn’.

Yields:

A template Lambda function setup file to the current directory.
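
A minimal sketch; the arguments are placeholders:

>>> import batcat
>>> batcat.template_lambda(
...     project='churn-prediction', purpose='training',
...     result_s3_bucket='my-results-bucket', partition='aws-cn')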