BatCat API
Data Science Environment Setup
- batcat.FileSys(project_name=True)[source]
Establish a data science project file system.
- Parameters:
project_name (bool) – Whether a project name is needed, default True.
- Yields:
A data science project file system.
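A minimal usage sketch; the call scaffolds the project layout in the current working directory:

    import batcat

    # Scaffold the data science project file system; project_name=True
    # means a project name is included, per the parameter above.
    batcat.FileSys(project_name=True)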
- batcat.get_logger(logName, logFile=False)[source]
Get a logger in one step.
Logging is one of Python's most underrated features. Two numbers (5 and 3) to take away from logging in Python: (1) logs carry FIVE levels of importance (debug, info, warning, error, critical); (2) THREE components configure a logger (a logger, a formatter, and at least one handler).
- Parameters:
logName (str) – A logger name to display in log messages.
logFile (bool) – Whether to save logs to a file, default False.
- Returns:
A well-organized logger.
- Return type:
logger (logging.Logger)
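A minimal sketch showing the one-step setup and the five levels; the logger name is hypothetical:

    import batcat

    logger = batcat.get_logger(logName='etl-job', logFile=False)

    # The five levels, least to most severe:
    logger.debug('diagnostic detail')
    logger.info('normal operation')
    logger.warning('something unexpected')
    logger.error('an operation failed')
    logger.critical('the program may not continue')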
Simple Storage Service (S3)
- batcat.read_csv_from_bucket(bucket, key, encoding=None)[source]
Read CSV from AWS S3.
- Parameters:
bucket (str) – Bucket name of S3.
key (str) – Key of S3.
encoding (str, optional) – Encoding of the CSV file, default None.
- Returns:
Dataframe.
- Return type:
df (pandas.DataFrame)
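A minimal sketch; the bucket and key are hypothetical placeholders:

    import batcat

    df = batcat.read_csv_from_bucket(
        bucket='my-data-bucket',
        key='raw/sales.csv',
    )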
- batcat.read_excel_from_bucket(bucket, key, sheet_name=0, header=0)[source]
Read Excel from AWS S3.
- Parameters:
bucket (str) – Bucket name of S3.
key (str) – Key of S3.
sheet_name – The target sheet name of the Excel file, default 0 (the first sheet).
header (int) – Row (0-indexed) to use for the column labels, default 0.
- Returns:
Dataframe.
- Return type:
df (pandas.DataFrame)
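A minimal sketch with a hypothetical bucket and key:

    import batcat

    df = batcat.read_excel_from_bucket(
        bucket='my-data-bucket',
        key='raw/sales.xlsx',
        sheet_name=0,  # first sheet
        header=0,      # first row holds the column names
    )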
- batcat.save_to_bucket(df, bucket, key)[source]
Save DataFrame to AWS S3.
- Parameters:
bucket (str) – Bucket name of S3.
key (str) – Key of S3.
df (pandas.DataFrame) – Dataframe.
- Returns:
HTTP status code.
- Return type:
status (int)
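A minimal round-trip sketch with a hypothetical bucket and key:

    import pandas as pd
    import batcat

    df = pd.DataFrame({'sku': ['A', 'B'], 'qty': [3, 5]})

    # Returns the HTTP status code of the S3 put (200 on success).
    status = batcat.save_to_bucket(df=df, bucket='my-data-bucket', key='clean/sales.csv')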
- batcat.copy_bucket_files(bucket, prefix, suffix, target_bucket, target_prefix, target_suffix, key_sub)[source]
Copy files from a source S3 bucket to a target bucket.
- Parameters:
bucket (str) – Source bucket.
prefix (str) – Prefix of source files.
suffix (str) – Suffix of source files.
target_bucket (str) – Target bucket.
target_prefix (str) – Prefix of target files.
target_suffix (str) – Suffix of target files.
key_sub (tuple) – Information to subtract from the source keys.
- Returns:
None
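A minimal sketch; all names are hypothetical, and the (old, new) reading of key_sub is an assumption based on the parameter description:

    import batcat

    batcat.copy_bucket_files(
        bucket='my-data-bucket',
        prefix='raw/2023',
        suffix='.csv',
        target_bucket='my-archive-bucket',
        target_prefix='backup/2023',
        target_suffix='.csv',
        key_sub=('raw/', ''),  # assumed: substring pair subtracted from each source key
    )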
Redshift
- batcat.get_date_with_delta(delta, format='%Y/%m/%d')[source]
Get the date delta days ago.
- Parameters:
delta (int) – The number of days ago.
format (str) – Output date format, default ‘%Y/%m/%d’.
- Returns:
The date delta days ago, formatted with strftime.
- Return type:
date (str)
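A minimal sketch:

    import batcat

    # The date one day ago, e.g. '2023/05/01' when run on 2023/05/02.
    yesterday = batcat.get_date_with_delta(delta=1)

    # The same date in ISO form, via the format parameter.
    yesterday_iso = batcat.get_date_with_delta(delta=1, format='%Y-%m-%d')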
- batcat.read_data_from_redshift(query, host, password, port=5439, database='dev', user='awsuser', date_start=None, date_end=None)[source]
Read DataFrame from Redshift with host and password.
- Parameters:
query (str) – Query to obtain data from Redshift.
host (str) – Redshift configuration.
password (str) – Redshift configuration.
port (int) – Redshift configuration, default 5439.
database (str) – Redshift configuration.
user (str) – Redshift configuration.
date_start (str) – Date to start, strftime(‘%Y/%m/%d’).
date_end (str) – Date to end, strftime(‘%Y/%m/%d’).
- Returns:
Target dataframe.
- Return type:
df (pandas.DataFrame)
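A minimal sketch with hypothetical connection details; that date_start/date_end are interpolated into the query is an assumption here:

    import batcat

    # Hypothetical query; the {date_start}/{date_end} placeholders assume
    # the function substitutes the two date arguments into the query.
    query = """
    SELECT order_id, order_date, amount
    FROM sales
    WHERE order_date BETWEEN '{date_start}' AND '{date_end}'
    """

    df = batcat.read_data_from_redshift(
        query=query,
        host='[name].[id].[region].redshift.amazonaws.com',
        password='********',
        date_start='2023/01/01',
        date_end='2023/01/31',
    )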
- batcat.save_df_to_redshift(df, host=None, password=None, port=5439, database='dev', user='awsuser', table_name=None, schema='public', if_exists='append', index=True, index_label=None, chunksize=None, dtype=None, method=None)[source]
Save a pandas.DataFrame to Redshift with host and password. Refer to pandas.DataFrame.to_sql for more information.
- Parameters:
df (pandas.DataFrame) – Target dataframe.
host (str) – In the form [name].[id].[region].redshift.amazonaws.com.
password (str) – Redshift configuration.
port (int) – Redshift port, usually 5439.
database (str) – Redshift configuration.
user (str) – Redshift configuration.
table_name (str) – Target table name.
schema (str) – Specify the schema (if database flavor supports this). If None, use default schema.
if_exists (str) – How to behave if the table already exists, {‘fail’, ‘replace’, ‘append’}, default ‘append’. (1) fail: Raise a ValueError. (2) replace: Drop the table before inserting new values. (3) append: Insert new values to the existing table.
index (bool) – Write DataFrame index as a column, default True. Uses index_label as the column name in the table.
index_label (str or sequence) – Column label for index column(s), default None. If None is given (default) and index is True, then the index names are used. A sequence should be given if the DataFrame uses MultiIndex.
chunksize (int, optional) – Specify the number of rows in each batch to be written at a time. By default, all rows will be written at once.
dtype (dict or scalar, optional) – Specifying the datatype for columns. If a dictionary is used, the keys should be the column names and the values should be the SQLAlchemy types or strings for the sqlite3 legacy mode. If a scalar is provided, it will be applied to all columns.
method (str, optional) – Controls the SQL insertion clause used: (1) None: uses a standard SQL INSERT clause (one per row). (2) ‘multi’: passes multiple values in a single INSERT clause. (3) callable with signature (pd_table, conn, keys, data_iter): details and a sample callable implementation can be found in the pandas “insert method” section.
- Returns:
None
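A minimal sketch with hypothetical connection details, appending to a staging table:

    import pandas as pd
    import batcat

    df = pd.DataFrame({'sku': ['A', 'B'], 'qty': [3, 5]})

    batcat.save_df_to_redshift(
        df=df,
        host='[name].[id].[region].redshift.amazonaws.com',
        password='********',
        table_name='sales_stage',  # hypothetical table
        schema='public',
        if_exists='append',
        index=False,
    )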
- batcat.read_data_from_redshift_by_secret(secret_name=None, region=None, query=None, date_start=None, date_end=None, delay=100)[source]
Read DataFrame from Redshift with AWS Secrets Manager.
- Parameters:
secret_name (str) – The name of AWS Secrets Manager.
region (str) – AWS region name.
query (str) – Query to obtain data from Redshift.
date_start (str) – Date to start, strftime(‘%Y/%m/%d’).
date_end (str) – Date to end, strftime(‘%Y/%m/%d’).
delay (int) – Time to wait for the query.
- Returns:
Target dataframe.
- Return type:
df (pandas.DataFrame)
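A minimal sketch; the secret name and region are hypothetical, and credentials are resolved from AWS Secrets Manager rather than passed directly:

    import batcat

    df = batcat.read_data_from_redshift_by_secret(
        secret_name='redshift/analytics',
        region='cn-northwest-1',
        query='SELECT COUNT(*) FROM sales',
        delay=100,  # time to wait for the query
    )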
Athena
- batcat.read_data_from_athena(query, region, s3_staging_dir, date_start=None, date_end=None)[source]
Read data as DataFrame from AWS Athena.
- Parameters:
query (str) – Query to obtain data from Athena.
region (str) – Region of the AWS environment, e.g. “cn-northwest-1”.
s3_staging_dir (str) – S3 staging directory, e.g. “s3://#####-###-###-queryresult/ATHENA_QUERY”.
date_start (str) – Date to start, strftime(‘%Y/%m/%d’).
date_end (str) – Date to end, strftime(‘%Y/%m/%d’).
- Returns:
Dataframe.
- Return type:
df (pandas.DataFrame)
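A minimal sketch with a hypothetical query and staging directory:

    import batcat

    df = batcat.read_data_from_athena(
        query='SELECT * FROM sales LIMIT 10',
        region='cn-northwest-1',
        s3_staging_dir='s3://my-query-results/ATHENA_QUERY',
    )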
SageMaker
- batcat.deploy_model(model, model_name='model', bucket='[bucket]')[source]
Deploy a scikit-learn model to a SageMaker endpoint.
- Parameters:
model – A scikit-learn model.
model_name (str) – The model name.
bucket (str) – The bucket to store model, which is also the project name in BatCat convention.
- Returns:
The model, endpoint configuration, and endpoint information.
- Return type:
response (dict)
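A minimal sketch: train a toy scikit-learn model, then deploy it; the model name and bucket are hypothetical:

    from sklearn.linear_model import LogisticRegression
    import batcat

    model = LogisticRegression().fit([[0.0], [1.0]], [0, 1])

    # The bucket doubles as the project name, per BatCat convention.
    response = batcat.deploy_model(model, model_name='churn', bucket='my-ml-project')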
- batcat.invoke(endpoint_name, input_data)[source]
Invoke a SageMaker endpoint with input data.
- Parameters:
endpoint_name (str) – The name of the SageMaker endpoint.
input_data (list) – The input data to send to the endpoint.
- Returns:
The response from the SageMaker endpoint.
- Return type:
result (list)
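A minimal sketch with a hypothetical endpoint; input_data is a list of feature rows:

    import batcat

    result = batcat.invoke(
        endpoint_name='churn-endpoint',
        input_data=[[0.2], [0.9]],
    )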
Lambda
Elastic Container Registry (ECR)
- batcat.template_docker(project='[project]', uri_suffix='amazonaws.com.cn', pip_image=True, python_version='3.7-slim-buster')[source]
Build a Docker image and push it to AWS ECR for a machine learning project.
- Parameters:
project (str) – Used as the name of the AWS ECR repository to be set up.
uri_suffix (str) – Suffix of the URI, default ‘amazonaws.com.cn’.
pip_image (bool) – Whether a pip mirror image is needed, default True, in which case the Douban mirror is used.
python_version (str) – Python version, default ‘3.7-slim-buster’.
- Yields:
A template Docker setup Bash file and a template requirements file to the current directory.
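A minimal sketch; the project name is hypothetical, and the call writes the template files into the current directory:

    import batcat

    batcat.template_docker(
        project='churn',
        uri_suffix='amazonaws.com.cn',
        pip_image=True,
        python_version='3.7-slim-buster',
    )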
SageMaker Processing
- batcat.processing_output_path(purpose, timestamp=True, local=False)[source]
Set up a result path within the container.
- Parameters:
purpose (str) – A purpose under a project.
timestamp (bool) – Whether a timestamp in file name is needed.
local (bool) – Whether to set the path to local for testing.
- Returns:
A CSV path for later usage.
- Return type:
path (str)
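A minimal sketch for a hypothetical ‘forecast’ purpose:

    import batcat

    # Path inside the processing container, with a timestamp in the
    # file name; set local=True when testing outside SageMaker.
    path = batcat.processing_output_path(purpose='forecast', timestamp=True)
    # e.g. write results with df.to_csv(path)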
- batcat.setup_workflow(project='[project]', purpose='[purpose]', workflow_execution_role='arn:aws-cn:iam::[account-id]:role/[role-name]', instance_type='ml.t3.medium', ecr_uri_suffix='amazonaws.com.cn', ecr_tag=':latest', network_config=None, enable_network_isolation=False, security_group_ids=None, subnets=None)[source]
Set up everything needed for a Step Functions workflow with SageMaker.
- Parameters:
project (str) – Project name under SageMaker.
purpose (str) – Subproject.
workflow_execution_role (str) – ARN of the role that executes Step Functions.
instance_type (str) – Instance type for the processing job, default ‘ml.t3.medium’; for better performance, try ‘ml.m5.4xlarge’.
ecr_uri_suffix (str) – ECR URI suffix, default ‘amazonaws.com.cn’.
ecr_tag (str) – ECR tag, default ‘:latest’.
network_config (sagemaker.network.NetworkConfig) – Network configuration for the processing job.
enable_network_isolation (bool) – Whether to enable network isolation.
security_group_ids (list) – Security group IDs for the processing job.
subnets (list) – Subnets for the processing job.
- Returns:
A workflow instance.
- Return type:
workflow (stepfunctions.workflow.Workflow)
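A minimal sketch, using the docs' own bracket placeholders:

    import batcat

    workflow = batcat.setup_workflow(
        project='[project]',
        purpose='[purpose]',
        workflow_execution_role='arn:aws-cn:iam::[account-id]:role/[role-name]',
        instance_type='ml.t3.medium',
    )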
- batcat.test_workflow(workflow, project='[project]', purpose='[purpose]', result_s3_bucket='[s3-bucket]')[source]
Test a Step Functions workflow.
- Parameters:
workflow (stepfunctions.workflow.Workflow) – A workflow instance.
project (str) – Project name under SageMaker.
purpose (str) – Subproject.
result_s3_bucket (str) – S3 bucket for saving results.
- Returns:
None
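A minimal sketch, continuing from setup_workflow above:

    import batcat

    workflow = batcat.setup_workflow(
        project='[project]',
        purpose='[purpose]',
        workflow_execution_role='arn:aws-cn:iam::[account-id]:role/[role-name]',
    )

    # Run a test execution and write results to the given bucket.
    batcat.test_workflow(
        workflow,
        project='[project]',
        purpose='[purpose]',
        result_s3_bucket='[s3-bucket]',
    )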
Step Functions
- batcat.template_stepfunctions(project='[project]', purpose='[purpose]', result_s3_bucket='[s3-bucket]', workflow_execution_role='arn:[partition]:iam::[account-id]:role/[role-name]')[source]
Generate a template Python script for setting up Step Functions.
- Parameters:
project (str) – Project name under SageMaker.
purpose (str) – Subproject.
result_s3_bucket (str) – S3 bucket for saving results.
workflow_execution_role (str) – Execution role ARN.
- Yields:
A template AWS Step Functions setup file to the current directory.
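A minimal sketch; the call writes a template Step Functions setup script to the current directory, with placeholder values in the docs' bracket style:

    import batcat

    batcat.template_stepfunctions(
        project='[project]',
        purpose='[purpose]',
        result_s3_bucket='[s3-bucket]',
        workflow_execution_role='arn:[partition]:iam::[account-id]:role/[role-name]',
    )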
- batcat.template_lambda(project='[project]', purpose='[purpose]', result_s3_bucket='[s3-bucket]', partition='aws-cn')[source]
Generate a template Python script for setting up Lambda.
- Parameters:
project (str) – Project name under SageMaker.
purpose (str) – Subproject.
result_s3_bucket (str) – S3 bucket for saving results.
partition (str) – The partition in which the resource is located; a partition is a group of AWS Regions. Default ‘aws-cn’.
- Yields:
A template Lambda Functions file to the current directory.
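A minimal sketch; the call writes a template Lambda file to the current directory:

    import batcat

    batcat.template_lambda(
        project='[project]',
        purpose='[purpose]',
        result_s3_bucket='[s3-bucket]',
        partition='aws-cn',
    )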