optimizer
zeus.optimizer.pipeline_frequency.optimizer
Pipeline frequency optimizer implementation.
The PipelineFrequencyOptimizer is to be integrated into the training framework. It is responsible for communicating with the PFO server and managing the FrequencyController instance, which controls the frequency of the GPU that the current process manages.
PipelineFrequencyOptimizer
Bases: Callback
Pipeline frequency optimizer.
Source code in zeus/optimizer/pipeline_frequency/optimizer.py
__init__
__init__(
rank,
dp_rank,
pp_rank,
tp_rank,
device_id,
dp_degree,
pp_degree,
tp_degree,
world_size,
server_url,
job_metadata=None,
)
Assumptions
- torch.distributed has been initialized.
- torch.cuda.set_device has been called with device_id. This is needed to broadcast the job ID to all ranks.
The master process (rank 0) will register the job with the Perseus server and retrieve the job ID of this job. Then, each rank will report itself to the PFO server with the job ID.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
rank | int | Global rank of the current process. | required |
dp_rank | int | Rank in the data parallel group. | required |
pp_rank | int | Rank in the pipeline parallel group. | required |
tp_rank | int | Rank in the tensor parallel group. | required |
device_id | int | CUDA device ID that the current process manages. | required |
dp_degree | int | Size of the data parallel group. | required |
pp_degree | int | Size of the pipeline parallel group. | required |
tp_degree | int | Size of the tensor parallel group. | required |
world_size | int | Total number of ranks that participate in training. | required |
server_url | str | URL of the PFO server. | required |
job_metadata | str \| None | An optional arbitrary string that describes the job. This will be appended to the job ID if given. Typically for logging purposes. | None |
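The rank and degree parameters are usually related: in a typical 3D-parallel job the three degrees multiply to world_size, and each global rank corresponds to one (dp_rank, pp_rank, tp_rank) triple. The sketch below derives that triple from a global rank. It assumes a layout in which tensor parallel ranks vary fastest, then pipeline parallel, then data parallel. Both the layout and the helper name decompose_rank are hypothetical; the training framework in use defines the real mapping, and its values are what should be passed to __init__.

```python
from typing import NamedTuple


class ParallelRanks(NamedTuple):
    """Per-group ranks of one process (hypothetical helper type)."""

    dp_rank: int
    pp_rank: int
    tp_rank: int


def decompose_rank(
    rank: int, dp_degree: int, pp_degree: int, tp_degree: int
) -> ParallelRanks:
    """Split a global rank into (dp_rank, pp_rank, tp_rank).

    Assumed layout: tensor parallel ranks are contiguous, pipeline stages
    come next, and data parallel replicas are outermost.
    """
    world_size = dp_degree * pp_degree * tp_degree
    if not 0 <= rank < world_size:
        raise ValueError(f"rank {rank} is outside [0, {world_size})")

    tp_rank = rank % tp_degree
    pp_rank = (rank // tp_degree) % pp_degree
    dp_rank = rank // (tp_degree * pp_degree)
    return ParallelRanks(dp_rank=dp_rank, pp_rank=pp_rank, tp_rank=tp_rank)


# With 2-way data, 2-way pipeline, and 2-way tensor parallelism (8 ranks),
# global rank 5 falls in the second data parallel replica.
ranks = decompose_rank(5, dp_degree=2, pp_degree=2, tp_degree=2)
```

If the framework already exposes these per-group ranks, prefer its values over a hand-rolled computation like this one.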
_get_frequency_schedule
_get_frequency_schedule()
Get the frequency schedule from the PFO server.
on_step_begin
on_step_begin()
Mark the beginning of a step.
TODO(jaywonchung): InstructionProfiler iteration start mark.
on_step_end
on_step_end()
Mark the end of a step.
TODO(jaywonchung): InstructionProfiler iteration end mark. Also report the profiling result to the PFO server after N iterations.
on_instruction_begin
on_instruction_begin(name)
Mark the beginning of an instruction, such as forward or backward.
Retrieve the next frequency from the schedule, check whether the next expected instruction matches the name of the instruction, and set the frequency accordingly.
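The three steps described above (fetch the next schedule entry, check the instruction name, apply the frequency) can be sketched without any GPU or server. This is a minimal illustration, not the library's implementation. The classes ScheduleFollower and FakeFrequencyController are hypothetical stand-ins, and representing the schedule as a list of (instruction name, frequency) pairs is an assumption.

```python
from typing import Iterator, List, Tuple


class FakeFrequencyController:
    """Hypothetical stand-in that records requested frequencies."""

    def __init__(self) -> None:
        self.applied: List[int] = []

    def set_frequency(self, frequency: int) -> None:
        # A real controller would change the GPU frequency here.
        self.applied.append(frequency)


class ScheduleFollower:
    """Walks an assumed schedule of (instruction name, frequency) pairs."""

    def __init__(
        self,
        schedule: List[Tuple[str, int]],
        controller: FakeFrequencyController,
    ) -> None:
        self._schedule_iter: Iterator[Tuple[str, int]] = iter(schedule)
        self._controller = controller

    def on_instruction_begin(self, name: str) -> None:
        # 1. Retrieve the next entry from the schedule.
        item = next(self._schedule_iter, None)
        if item is None:
            raise RuntimeError("schedule has no entry left for this instruction")

        # 2. Check that the expected instruction matches the actual one.
        expected_name, frequency = item
        if expected_name != name:
            raise RuntimeError(
                f"expected instruction {expected_name!r}, got {name!r}"
            )

        # 3. Apply the frequency for this instruction.
        self._controller.set_frequency(frequency)


controller = FakeFrequencyController()
follower = ScheduleFollower([("forward", 1410), ("backward", 1200)], controller)
follower.on_instruction_begin("forward")
follower.on_instruction_begin("backward")
```

Raising on a name mismatch, rather than silently applying the frequency, surfaces a schedule that has drifted out of sync with the instructions the framework actually runs.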
on_instruction_end
on_instruction_end(name)
Mark the end of an instruction, such as forward or backward.
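Taken together, the four hooks suggest a call order inside a training loop: step begin, then a begin/end pair around each instruction, then step end. The sketch below shows that ordering with a hypothetical RecordingCallback that only logs calls. The run_step function and the instruction names are assumptions for illustration; a real integration would pass a PipelineFrequencyOptimizer instance and run the actual forward and backward work between the hooks.

```python
from typing import List


class RecordingCallback:
    """Hypothetical stand-in exposing the same four hook names."""

    def __init__(self) -> None:
        self.events: List[str] = []

    def on_step_begin(self) -> None:
        self.events.append("step_begin")

    def on_step_end(self) -> None:
        self.events.append("step_end")

    def on_instruction_begin(self, name: str) -> None:
        self.events.append(f"begin:{name}")

    def on_instruction_end(self, name: str) -> None:
        self.events.append(f"end:{name}")


def run_step(callback: RecordingCallback, instructions: List[str]) -> None:
    """Run one training step, wrapping each instruction with its hooks."""
    callback.on_step_begin()
    for name in instructions:
        callback.on_instruction_begin(name)
        # The framework would execute the instruction (e.g. forward) here.
        callback.on_instruction_end(name)
    callback.on_step_end()


callback = RecordingCallback()
run_step(callback, ["forward", "backward"])
```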