common
zeus.optimizer.pipeline_frequency.common
Shared constants and models between the server and the client (optimizer).
PFOServerSettings
Bases: BaseSettings
PFO server settings, configurable via environment variables.
For instance, setting ZEUS_PFO_LOG_LEVEL=INFO will automatically set
the log_level variable to "INFO".
Attributes:
| Name | Type | Description |
|---|---|---|
| scheduler | PyObject | Name of the |
| scheduler_args | dict[str, Any] | Any extra arguments required by |
| log_level | str | Log level, e.g. "debug", "info". |
| dump_data | bool | Whether the scheduler should dump internal state to the filesystem (for future inspection purposes). |
| dump_dir | str | Directory to dump state in (if enabled). |
| max_job_idle_time | int | Maximum time in seconds that a job can be idle for before its states are automatically deleted from the server. |
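The environment-variable behavior described for PFOServerSettings can be pictured with a small stdlib-only sketch. The real class is built on pydantic's BaseSettings; the `load_settings` helper and the default values below are assumptions made for illustration. Only the `ZEUS_PFO_` prefix (from the ZEUS_PFO_LOG_LEVEL example) and the field names come from this page.

```python
import os

ENV_PREFIX = "ZEUS_PFO_"  # taken from the ZEUS_PFO_LOG_LEVEL example

# Hypothetical defaults, for illustration only; the real defaults may differ.
DEFAULTS = {"log_level": "DEBUG", "dump_dir": "./dump"}


def load_settings(environ=None):
    """Overlay ZEUS_PFO_* environment variables on top of the defaults."""
    environ = os.environ if environ is None else environ
    settings = dict(DEFAULTS)
    for key, value in environ.items():
        if key.startswith(ENV_PREFIX):
            # ZEUS_PFO_LOG_LEVEL -> log_level
            field = key[len(ENV_PREFIX):].lower()
            if field in settings:
                settings[field] = value
    return settings


settings = load_settings({"ZEUS_PFO_LOG_LEVEL": "INFO"})
print(settings["log_level"])  # INFO
```

Variables without the prefix, or naming an unknown field, are ignored in this sketch; pydantic's own handling of unknown fields may differ.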
Source code in zeus/optimizer/pipeline_frequency/common.py
Config
Configuration class read by pydantic.
_fix_scheduler_import_path
_fix_scheduler_import_path(value)
Prepend zeus.optimizer.pipeline_frequency.server.scheduler. to the scheduler type name.
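As a minimal sketch of the prefixing described above (the standalone function and the `MyScheduler` name are hypothetical; the real validator is a method on the settings class and may do more):

```python
SCHEDULER_MODULE = "zeus.optimizer.pipeline_frequency.server.scheduler"


def fix_scheduler_import_path(value: str) -> str:
    """Turn a bare scheduler type name into a fully qualified import path."""
    return f"{SCHEDULER_MODULE}.{value}"


# "MyScheduler" is a made-up name used only for illustration.
print(fix_scheduler_import_path("MyScheduler"))
```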
_validate_scheduler_args
_validate_scheduler_args(args, values)
Check whether args are as expected by the scheduler's constructor.
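One plausible way to perform such a check is to bind the arguments against the constructor's signature with `inspect.signature`. The sketch below assumes that approach; `ToyScheduler` and `validate_scheduler_args` are hypothetical names, and the actual validator may differ, for example by injecting arguments that the server supplies itself.

```python
import inspect
from typing import Any


class ToyScheduler:
    """Hypothetical scheduler used only for this example; not part of Zeus."""

    def __init__(self, power_limit: int, warmup_iters: int = 2) -> None:
        self.power_limit = power_limit
        self.warmup_iters = warmup_iters


def validate_scheduler_args(scheduler: type, args: dict[str, Any]) -> dict[str, Any]:
    """Raise ValueError if args cannot be bound to the scheduler's constructor."""
    try:
        # Signature.bind raises TypeError on missing or unexpected arguments.
        inspect.signature(scheduler).bind(**args)
    except TypeError as err:
        raise ValueError(f"Invalid scheduler args: {err}") from None
    return args


validate_scheduler_args(ToyScheduler, {"power_limit": 200})  # accepted
```

Binding against the signature catches both missing required arguments and unknown keyword arguments without instantiating the scheduler.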
JobInfo
Bases: BaseModel
Training job information reported to the server.
Attributes:
| Name | Type | Description |
|---|---|---|
| job_id | str | Globally unique ID of the training job, generated by the server. This field should be an empty string when sent to the server. |
| pp_degree | int | Pipeline parallel degree. |
| dp_degree | int | Data parallel degree. |
| tp_degree | int | Tensor parallel degree. |
| world_size | int | World size of the training job. |
| job_metadata | Optional[str] | An optional arbitrary string that describes the job. This will be appended to the job ID if given. Typically for logging purposes. |
_check_world_size
_check_world_size(world_size, values)
Check that the product of the PP, DP, and TP degrees is identical to the world size.
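The invariant can be sketched as a plain function. This is an illustration only: the real check is a validator on the model, and the function name and error message here are assumptions.

```python
def check_world_size(world_size: int, pp_degree: int, dp_degree: int, tp_degree: int) -> int:
    """Return world_size if it equals pp_degree * dp_degree * tp_degree."""
    expected = pp_degree * dp_degree * tp_degree
    if world_size != expected:
        raise ValueError(
            f"world_size ({world_size}) must equal pp * dp * tp ({expected})"
        )
    return world_size


# 4 pipeline stages x 2 data-parallel replicas x 1 tensor-parallel shard = 8 ranks.
print(check_world_size(8, pp_degree=4, dp_degree=2, tp_degree=1))  # 8
```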
set_job_id
set_job_id(scheduler_name)
Generate and set the job ID.
RankInfo
Bases: BaseModel
Information passed to the server from each rank.
Attributes:
| Name | Type | Description |
|---|---|---|
| rank | int | Global rank of the reporting process. |
| dp_rank | int | Data parallel rank of the reporting process. |
| pp_rank | int | Pipeline parallel rank of the reporting process. |
| tp_rank | int | Tensor parallel rank of the reporting process. |
| available_frequencies | list[int] | List of available frequencies for the rank's GPU. |
FrequencySchedule
Bases: BaseModel
Frequency schedule for one iteration.
frequencies is a list of tuples, where the first element is the name of the
instruction and the second element is the frequency to use for that instruction.
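To make the shape of frequencies concrete, here is a stand-in built with a dataclass rather than the real pydantic model. The class name `ToyFrequencySchedule` and the frequency values are hypothetical, and the real model may carry additional fields.

```python
from dataclasses import dataclass, field


@dataclass
class ToyFrequencySchedule:
    """Stand-in for illustration; not the real FrequencySchedule model."""

    # Each entry is (instruction name, frequency to use for that instruction).
    frequencies: list[tuple[str, int]] = field(default_factory=list)


# Made-up frequency values, in schedule order.
schedule = ToyFrequencySchedule(frequencies=[("forward", 1410), ("backward", 1200)])

for instruction, frequency in schedule.frequencies:
    print(f"{instruction}: {frequency}")
```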
ProfilingResult
Bases: BaseModel
Profiling results for a FrequencySchedule of a rank.
Attributes:
| Name | Type | Description |
|---|---|---|
| rank | int | Global rank of the reporting client. |
| iter_time | list[float] | Latency of each iteration within the profiling window, in seconds. |
| iter_energy | list[float] | Energy consumption of each iteration within the profiling window, in Joules. |
| time_breakdown | dict[str, list[list[float]]] | Duration of each operation across multiple iterations. e.g. |
| energy_breakdown | dict[str, list[list[float]]] | Energy consumption of each operation across multiple iterations. Value has the same structure as |
OfflineProfilingResult
Bases: BaseModel
Profiling results generated from offline profiling each instruction.
Attributes:
| Name | Type | Description |
|---|---|---|
| rank | int | Global rank of the reporting client. |
| dp_rank | int | Data parallel rank of the reporting process. |
| pp_rank | int | Pipeline parallel rank of the reporting process. |
| tp_rank | int | Tensor parallel rank of the reporting process. |
| forward_time | dict[int, float] | Dict that maps frequency to average forward computation time. |
| forward_energy | dict[int, float] | Dict that maps frequency to average forward computation energy. |
| backward_time | dict[int, float] | Dict that maps frequency to average backward computation time. |
| backward_energy | dict[int, float] | Dict that maps frequency to average backward computation energy. |
InstructionProfilingResult
Bases: BaseModel
Time and energy profiling results for each instruction in each stage.
to_csv
to_csv(filepath)
Serialize and save this object into a CSV file.
Columns: rank, dp_rank, pp_rank, tp_rank, stage, instruction, frequency, time, energy
Notes
- rank is the global rank of the process.
- pp_rank and stage are always the same, for backwards compatibility.
- All ranks and stage are zero-indexed.
- instruction is either "forward" or "backward".
- time and energy are already averaged over profiling iterations.
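The column layout listed above can be illustrated with the standard `csv` module. This is a sketch under assumptions: the `write_rows` helper, its `results` argument, the row ordering, and the sample numbers are all hypothetical, and the real to_csv writes to a file path rather than a stream.

```python
import csv
import io

COLUMNS = ["rank", "dp_rank", "pp_rank", "tp_rank", "stage",
           "instruction", "frequency", "time", "energy"]


def write_rows(out, rank, dp_rank, pp_rank, tp_rank, results):
    """Write one row per (instruction, frequency) pair.

    `results` maps instruction -> {frequency: (time, energy)}.
    """
    writer = csv.writer(out)
    writer.writerow(COLUMNS)
    for instruction, per_frequency in results.items():
        for frequency, (time_value, energy_value) in sorted(per_frequency.items()):
            # `stage` repeats pp_rank, as the notes above describe.
            writer.writerow([rank, dp_rank, pp_rank, tp_rank, pp_rank,
                             instruction, frequency, time_value, energy_value])


buffer = io.StringIO()
write_rows(
    buffer, rank=0, dp_rank=0, pp_rank=0, tp_rank=0,
    results={"forward": {1410: (0.1, 25.0)}, "backward": {1410: (0.2, 50.0)}},
)
print(buffer.getvalue())
```

Writing to an in-memory buffer keeps the example free of filesystem side effects; passing an open file object instead would produce the same rows on disk.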
save_prof
async
save_prof(data, directory, schedule_num)
Save a list of ProfilingResults in the designated directory.
load_prof
load_prof(directory, schedule_num)
Load a list of ProfilingResults saved in the designated directory.
save_sched
async
save_sched(data, directory, schedule_num)
Save a list of FrequencySchedules in the designated directory.
load_sched
load_sched(directory, schedule_num)
Load a list of FrequencySchedules saved in the designated directory.
save_ranks
async
save_ranks(data, directory)
Save a list of RankInfos in the designated directory.
load_ranks
load_ranks(directory)
Load a list of RankInfos saved in the designated directory.
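The save and load helpers come in pairs, with an asynchronous save and a synchronous load. A stdlib-only sketch of such a round trip is shown below. The file name `ranks.json`, the JSON encoding, the `_sketch` function names, and the use of plain dicts in place of RankInfo objects are all assumptions; the real helpers may use a different file layout and I/O library.

```python
import asyncio
import json
import tempfile
from pathlib import Path

RANKS_FILE = "ranks.json"  # hypothetical file name


async def save_ranks_sketch(data: list[dict], directory: str) -> None:
    """Serialize rank records to JSON without blocking the event loop."""
    path = Path(directory) / RANKS_FILE
    path.parent.mkdir(parents=True, exist_ok=True)
    # Run the blocking write in a worker thread (Python 3.9+).
    await asyncio.to_thread(path.write_text, json.dumps(data))


def load_ranks_sketch(directory: str) -> list[dict]:
    """Read back the records written by save_ranks_sketch."""
    return json.loads((Path(directory) / RANKS_FILE).read_text())


ranks = [{"rank": 0, "dp_rank": 0, "pp_rank": 0, "tp_rank": 0,
          "available_frequencies": [1410, 1200]}]

with tempfile.TemporaryDirectory() as tmp:
    asyncio.run(save_ranks_sketch(ranks, tmp))
    loaded = load_ranks_sketch(tmp)

print(loaded == ranks)  # True
```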