codeflare_sdk.ray.cluster package

Submodules

codeflare_sdk.ray.cluster.cluster module

The cluster sub-module contains the definition of the Cluster object, which represents the resources requested by the user. It also contains functions for checking the cluster setup queue, listing all existing clusters, and retrieving the user’s working namespace.

class codeflare_sdk.ray.cluster.cluster.Cluster(config: ClusterConfiguration)[source]

Bases: object

An object for requesting, bringing up, and taking down resources. It can also be used to inspect the cluster’s status and details.

Note that currently, the underlying implementation is a Ray cluster.

cluster_dashboard_uri() str[source]

Returns a string containing the cluster’s dashboard URI.

cluster_uri() str[source]

Returns a string containing the cluster’s URI.

create_app_wrapper()[source]

Called upon cluster object creation, creates an AppWrapper yaml based on the specifications of the ClusterConfiguration.

details(print_to_console: bool = True) RayCluster[source]
down()[source]

Deletes the AppWrapper yaml, scaling down and deleting all resources associated with the cluster.

from_k8_cluster_object(appwrapper=True, write_to_file=False, verify_tls=True)[source]
is_dashboard_ready() bool[source]
property job_client
job_logs(job_id: str) str[source]

This method accesses the Ray head node in your cluster and returns the logs for the provided job id.

job_status(job_id: str) str[source]

This method accesses the Ray head node in your cluster and returns the job status for the provided job id.

list_jobs() List[source]

This method accesses the Ray head node in your cluster and lists the running jobs.
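Taken together, the job methods above wrap Ray’s job submission client, which is exposed directly through the job_client property. A minimal sketch, assuming a cluster named raytest is already up and an entry point script train.py exists (both names are hypothetical):

```python
from codeflare_sdk.ray.cluster.cluster import Cluster
from codeflare_sdk.ray.cluster.config import ClusterConfiguration

cluster = Cluster(ClusterConfiguration(name="raytest", namespace="default"))

# The underlying Ray JobSubmissionClient is exposed via the job_client property.
client = cluster.job_client
job_id = client.submit_job(entrypoint="python train.py")

print(cluster.job_status(job_id))  # current state of the job on the head node
print(cluster.job_logs(job_id))    # logs captured so far for this job id
print(cluster.list_jobs())         # all jobs known to the head node
```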

local_client_url()[source]
status(print_to_console: bool = True) Tuple[CodeFlareClusterStatus, bool][source]

Returns the requested cluster’s status, as well as whether or not it is ready for use.

up()[source]

Applies the Cluster yaml, pushing the resource request onto the Kueue localqueue.

wait_ready(timeout: int | None = None, dashboard_check: bool = True)[source]

Waits for the requested cluster to be ready, up to an optional timeout in seconds. Checks every five seconds.
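The methods above combine into a typical request-use-teardown lifecycle. A minimal sketch, assuming a Kueue-enabled namespace and placeholder names:

```python
from codeflare_sdk.ray.cluster.cluster import Cluster
from codeflare_sdk.ray.cluster.config import ClusterConfiguration

# Creating the Cluster object generates the cluster yaml
# (and an AppWrapper, if configured) from the ClusterConfiguration.
cluster = Cluster(ClusterConfiguration(name="raytest", num_workers=2))

cluster.up()                     # push the resource request onto the Kueue localqueue
cluster.wait_ready(timeout=300)  # block until ready, polling every five seconds

status, ready = cluster.status(print_to_console=False)
print(cluster.cluster_dashboard_uri())

cluster.down()                   # scale down and delete all associated resources
```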

codeflare_sdk.ray.cluster.cluster.get_cluster(cluster_name: str, namespace: str = 'default', write_to_file: bool = False, verify_tls: bool = True)[source]
codeflare_sdk.ray.cluster.cluster.get_current_namespace()[source]
codeflare_sdk.ray.cluster.cluster.list_all_clusters(namespace: str, print_to_console: bool = True)[source]

Returns (and prints by default) a list of all clusters in a given namespace.

codeflare_sdk.ray.cluster.cluster.list_all_queued(namespace: str, print_to_console: bool = True, appwrapper: bool = False)[source]

Returns (and prints by default) a list of all currently queued-up Ray Clusters in a given namespace.
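A short sketch of the module-level helpers, assuming clusters already exist in the current namespace (the cluster name is a placeholder):

```python
from codeflare_sdk.ray.cluster.cluster import (
    get_cluster,
    get_current_namespace,
    list_all_clusters,
    list_all_queued,
)

ns = get_current_namespace()
clusters = list_all_clusters(ns, print_to_console=False)
queued = list_all_queued(ns, print_to_console=False)

# Reconstruct a Cluster object for an existing RayCluster by name.
cluster = get_cluster("raytest", namespace=ns)
```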

codeflare_sdk.ray.cluster.config module

The config sub-module contains the definition of the ClusterConfiguration dataclass, which is used to specify resource requirements and other details when creating a Cluster object.

class codeflare_sdk.ray.cluster.config.ClusterConfiguration(name: str, namespace: str | None = None, head_info: ~typing.List[str] = <factory>, head_cpu_requests: int | str = 2, head_cpu_limits: int | str = 2, head_cpus: int | str | None = None, head_memory_requests: int | str = 8, head_memory_limits: int | str = 8, head_memory: int | str | None = None, head_gpus: int | None = None, head_extended_resource_requests: ~typing.Dict[str, str | int] = <factory>, machine_types: ~typing.List[str] = <factory>, worker_cpu_requests: int | str = 1, worker_cpu_limits: int | str = 1, min_cpus: int | str | None = None, max_cpus: int | str | None = None, num_workers: int = 1, worker_memory_requests: int | str = 2, worker_memory_limits: int | str = 2, min_memory: int | str | None = None, max_memory: int | str | None = None, num_gpus: int | None = None, template: str = '/home/runner/work/codeflare-sdk/codeflare-sdk/src/codeflare_sdk/ray/templates/base-template.yaml', appwrapper: bool = False, envs: ~typing.Dict[str, str] = <factory>, image: str = '', image_pull_secrets: ~typing.List[str] = <factory>, write_to_file: bool = False, verify_tls: bool = True, labels: ~typing.Dict[str, str] = <factory>, worker_extended_resource_requests: ~typing.Dict[str, str | int] = <factory>, extended_resource_mapping: ~typing.Dict[str, str] = <factory>, overwrite_default_resource_mapping: bool = False, local_queue: str | None = None)[source]

Bases: object

This dataclass is used to specify resource requirements and other details, and is passed in as an argument when creating a Cluster object.

Attributes:

- name: The name of the cluster.
- namespace: The namespace in which the cluster should be created.
- head_info: A list of strings containing information about the head node.
- head_cpus: The number of CPUs to allocate to the head node.
- head_memory: The amount of memory to allocate to the head node.
- head_gpus: The number of GPUs to allocate to the head node. (Deprecated, use head_extended_resource_requests)
- head_extended_resource_requests: A dictionary of extended resource requests for the head node. ex: {"nvidia.com/gpu": 1}
- machine_types: A list of machine types to use for the cluster.
- min_cpus: The minimum number of CPUs to allocate to each worker.
- max_cpus: The maximum number of CPUs to allocate to each worker.
- num_workers: The number of workers to create.
- min_memory: The minimum amount of memory to allocate to each worker.
- max_memory: The maximum amount of memory to allocate to each worker.
- num_gpus: The number of GPUs to allocate to each worker. (Deprecated, use worker_extended_resource_requests)
- template: The path to the template file to use for the cluster.
- appwrapper: A boolean indicating whether to use an AppWrapper.
- envs: A dictionary of environment variables to set for the cluster.
- image: The image to use for the cluster.
- image_pull_secrets: A list of image pull secrets to use for the cluster.
- write_to_file: A boolean indicating whether to write the cluster configuration to a file.
- verify_tls: A boolean indicating whether to verify TLS when connecting to the cluster.
- labels: A dictionary of labels to apply to the cluster.
- worker_extended_resource_requests: A dictionary of extended resource requests for each worker. ex: {"nvidia.com/gpu": 1}
- extended_resource_mapping: A dictionary of custom resource mappings to map extended resource requests to RayCluster resource names.
- overwrite_default_resource_mapping: A boolean indicating whether to overwrite the default resource mapping.
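For example, a configuration requesting two GPU-backed workers might look like the following (all values, including the image tag, are illustrative):

```python
from codeflare_sdk.ray.cluster.config import ClusterConfiguration

config = ClusterConfiguration(
    name="raytest",
    namespace="default",
    num_workers=2,
    worker_cpu_requests=1,
    worker_cpu_limits=2,
    worker_memory_requests=4,
    worker_memory_limits=6,
    # One NVIDIA GPU per worker, requested as an extended resource.
    worker_extended_resource_requests={"nvidia.com/gpu": 1},
    image="quay.io/project-codeflare/ray:latest-py39-cu118",  # placeholder image
    write_to_file=False,
)
```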

appwrapper: bool = False
envs: Dict[str, str]
extended_resource_mapping: Dict[str, str]
head_cpu_limits: int | str = 2
head_cpu_requests: int | str = 2
head_cpus: int | str | None = None
head_extended_resource_requests: Dict[str, str | int]
head_gpus: int | None = None
head_info: List[str]
head_memory: int | str | None = None
head_memory_limits: int | str = 8
head_memory_requests: int | str = 8
image: str = ''
image_pull_secrets: List[str]
labels: Dict[str, str]
local_queue: str | None = None
machine_types: List[str]
max_cpus: int | str | None = None
max_memory: int | str | None = None
min_cpus: int | str | None = None
min_memory: int | str | None = None
name: str
namespace: str | None = None
num_gpus: int | None = None
num_workers: int = 1
overwrite_default_resource_mapping: bool = False
template: str = '/home/runner/work/codeflare-sdk/codeflare-sdk/src/codeflare_sdk/ray/templates/base-template.yaml'
verify_tls: bool = True
worker_cpu_limits: int | str = 1
worker_cpu_requests: int | str = 1
worker_extended_resource_requests: Dict[str, str | int]
worker_memory_limits: int | str = 2
worker_memory_requests: int | str = 2
write_to_file: bool = False

codeflare_sdk.ray.cluster.generate_yaml module

This sub-module exists primarily to be used internally by the Cluster object (in the cluster sub-module) for AppWrapper generation.

codeflare_sdk.ray.cluster.generate_yaml.augment_labels(item: dict, labels: dict)[source]
codeflare_sdk.ray.cluster.generate_yaml.del_from_list_by_name(l: list, target: List[str]) list[source]
codeflare_sdk.ray.cluster.generate_yaml.gen_names(name)[source]
codeflare_sdk.ray.cluster.generate_yaml.generate_appwrapper(cluster: Cluster)[source]
codeflare_sdk.ray.cluster.generate_yaml.head_worker_gpu_count_from_cluster(cluster: Cluster) Tuple[int, int][source]
codeflare_sdk.ray.cluster.generate_yaml.head_worker_resources_from_cluster(cluster: Cluster) Tuple[dict, dict][source]
codeflare_sdk.ray.cluster.generate_yaml.is_kind_cluster()[source]
codeflare_sdk.ray.cluster.generate_yaml.is_openshift_cluster()[source]
codeflare_sdk.ray.cluster.generate_yaml.notebook_annotations(item: dict)[source]
codeflare_sdk.ray.cluster.generate_yaml.read_template(template)[source]
codeflare_sdk.ray.cluster.generate_yaml.update_env(spec, env)[source]
codeflare_sdk.ray.cluster.generate_yaml.update_image(spec, image)[source]
codeflare_sdk.ray.cluster.generate_yaml.update_image_pull_secrets(spec, image_pull_secrets)[source]
codeflare_sdk.ray.cluster.generate_yaml.update_names(cluster_yaml: dict, cluster: Cluster)[source]
codeflare_sdk.ray.cluster.generate_yaml.update_nodes(ray_cluster_dict: dict, cluster: Cluster)[source]
codeflare_sdk.ray.cluster.generate_yaml.update_resources(spec, cpu_requests, cpu_limits, memory_requests, memory_limits, custom_resources)[source]
codeflare_sdk.ray.cluster.generate_yaml.wrap_cluster(cluster_yaml: dict, appwrapper_name: str, namespace: str)[source]
codeflare_sdk.ray.cluster.generate_yaml.write_user_yaml(user_yaml, output_file_name)[source]

codeflare_sdk.ray.cluster.pretty_print module

This sub-module exists primarily to be used internally by the Cluster object (in the cluster sub-module) for pretty-printing cluster status and details.

codeflare_sdk.ray.cluster.pretty_print.print_app_wrappers_status(app_wrappers: List[AppWrapper], starting: bool = False)[source]
codeflare_sdk.ray.cluster.pretty_print.print_cluster_status(cluster: RayCluster)[source]

Pretty prints the status of a passed-in cluster

codeflare_sdk.ray.cluster.pretty_print.print_clusters(clusters: List[RayCluster])[source]
codeflare_sdk.ray.cluster.pretty_print.print_no_resources_found()[source]
codeflare_sdk.ray.cluster.pretty_print.print_ray_clusters_status(app_wrappers: List[AppWrapper], starting: bool = False)[source]

codeflare_sdk.ray.cluster.status module

The status sub-module defines Enums containing information for Ray cluster states and CodeFlare cluster states, as well as dataclasses to store information for Ray clusters.

class codeflare_sdk.ray.cluster.status.CodeFlareClusterStatus(value)[source]

Bases: Enum

Defines the possible reportable states of a CodeFlare cluster.

FAILED = 5
QUEUED = 3
QUEUEING = 4
READY = 1
STARTING = 2
SUSPENDED = 7
UNKNOWN = 6
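Because the members carry integer values, callers should compare against the enum members rather than raw numbers. The standalone sketch below re-declares a subset of the documented values locally (it does not import the SDK) to show the pattern:

```python
from enum import Enum

# Local stand-in mirroring a subset of the documented
# CodeFlareClusterStatus values.
class CodeFlareClusterStatus(Enum):
    READY = 1
    STARTING = 2
    QUEUED = 3

def is_pending(status: CodeFlareClusterStatus) -> bool:
    # A starting or queued cluster has been requested but is not yet usable.
    return status in (CodeFlareClusterStatus.STARTING, CodeFlareClusterStatus.QUEUED)

print(is_pending(CodeFlareClusterStatus.QUEUED))  # True
print(is_pending(CodeFlareClusterStatus.READY))   # False
```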
class codeflare_sdk.ray.cluster.status.RayCluster(name: str, status: ~codeflare_sdk.ray.cluster.status.RayClusterStatus, head_cpu_requests: int, head_cpu_limits: int, head_mem_requests: str, head_mem_limits: str, num_workers: int, worker_mem_requests: str, worker_mem_limits: str, worker_cpu_requests: int | str, worker_cpu_limits: int | str, namespace: str, dashboard: str, worker_extended_resources: ~typing.Dict[str, int] = <factory>, head_extended_resources: ~typing.Dict[str, int] = <factory>)[source]

Bases: object

For storing information about a Ray cluster.

dashboard: str
head_cpu_limits: int
head_cpu_requests: int
head_extended_resources: Dict[str, int]
head_mem_limits: str
head_mem_requests: str
name: str
namespace: str
num_workers: int
status: RayClusterStatus
worker_cpu_limits: int | str
worker_cpu_requests: int | str
worker_extended_resources: Dict[str, int]
worker_mem_limits: str
worker_mem_requests: str
class codeflare_sdk.ray.cluster.status.RayClusterStatus(value)[source]

Bases: Enum

Defines the possible reportable states of a Ray cluster.

FAILED = 'failed'
READY = 'ready'
SUSPENDED = 'suspended'
UNHEALTHY = 'unhealthy'
UNKNOWN = 'unknown'

Module contents