common
zeus.optimizer.pipeline_frequency.common
Shared constants and models between the server and the client (optimizer).
PFOServerSettings
Bases: BaseSettings
PFO server settings, configurable via environment variables.
For instance, setting ZEUS_PFO_LOG_LEVEL=INFO will automatically set the log_level variable to "INFO".
Attributes:
Name | Type | Description |
---|---|---|
scheduler | PyObject | Name of the |
scheduler_args | dict[str, Any] | Any extra arguments required by |
log_level | str | Log level, e.g. "debug", "info". |
dump_data | bool | Whether the scheduler should dump internal state to the filesystem (for future inspection purposes). |
dump_dir | str | Directory to dump state in (if enabled). |
max_job_idle_time | int | Maximum time in seconds that a job can be idle for before its states are automatically deleted from the server. |
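The environment-variable behavior described above can be illustrated with a stdlib-only sketch. This is not the pydantic-based implementation: `SettingsSketch`, `from_env`, and all default values are hypothetical, and the `ZEUS_PFO_` prefix is inferred from the `ZEUS_PFO_LOG_LEVEL` example.

```python
import os
from dataclasses import dataclass, fields

# Assumption: every setting shares the prefix seen in ZEUS_PFO_LOG_LEVEL.
ENV_PREFIX = "ZEUS_PFO_"


@dataclass
class SettingsSketch:
    """Hypothetical stand-in for PFOServerSettings; defaults are illustrative only."""

    log_level: str = "DEBUG"
    dump_data: bool = True
    dump_dir: str = "./dump"
    max_job_idle_time: int = 3600

    @classmethod
    def from_env(cls, environ=None):
        """Build settings, overriding defaults with prefixed environment variables."""
        environ = os.environ if environ is None else environ
        overrides = {}
        for field in fields(cls):
            raw = environ.get(ENV_PREFIX + field.name.upper())
            if raw is None:
                continue  # variable not set: keep the default
            if isinstance(field.default, bool):
                overrides[field.name] = raw.strip().lower() in {"1", "true", "yes", "on"}
            else:
                # Cast the raw string to the type of the default (str or int here).
                overrides[field.name] = type(field.default)(raw)
        return cls(**overrides)


settings = SettingsSketch.from_env({"ZEUS_PFO_LOG_LEVEL": "INFO"})
print(settings.log_level)
```

Passing a plain dict instead of reading `os.environ` directly keeps the sketch easy to try without touching the real environment.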
Source code in zeus/optimizer/pipeline_frequency/common.py
Config
Configuration class read by pydantic.
Source code in zeus/optimizer/pipeline_frequency/common.py
_fix_scheduler_import_path
_fix_scheduler_import_path(value)
Prepend zeus.optimizer.pipeline_frequency.server.scheduler. to the scheduler type name.
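As a rough illustration of that prefixing step, here is a minimal sketch. The function name `fix_scheduler_import_path` (without the leading underscore) and the scheduler name `MyScheduler` are hypothetical, and the sketch does not guess how the real validator treats a value that is already fully qualified.

```python
SCHEDULER_MODULE_PREFIX = "zeus.optimizer.pipeline_frequency.server.scheduler."


def fix_scheduler_import_path(value: str) -> str:
    """Turn a bare scheduler type name into a full import path."""
    return SCHEDULER_MODULE_PREFIX + value


# "MyScheduler" is a made-up name used only for illustration.
print(fix_scheduler_import_path("MyScheduler"))
```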
Source code in zeus/optimizer/pipeline_frequency/common.py
_validate_scheduler_args
_validate_scheduler_args(args, values)
Check whether args are as expected by the scheduler's constructor.
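One way such a check could work is to compare the supplied arguments against the constructor's signature. The sketch below assumes that approach; `ExampleScheduler`, its parameters, and `validate_scheduler_args` are hypothetical and do not reflect the actual validator or any real scheduler class.

```python
import inspect


class ExampleScheduler:
    """Hypothetical scheduler; the constructor parameters are made up."""

    def __init__(self, step_size: int, verbose: bool = False):
        self.step_size = step_size
        self.verbose = verbose


def validate_scheduler_args(scheduler_cls, args: dict) -> dict:
    """Raise ValueError if args do not match the constructor of scheduler_cls."""
    params = inspect.signature(scheduler_cls).parameters
    unknown = sorted(set(args) - set(params))
    if unknown:
        raise ValueError(f"Unexpected scheduler arguments: {unknown}")
    named_kinds = (
        inspect.Parameter.POSITIONAL_OR_KEYWORD,
        inspect.Parameter.KEYWORD_ONLY,
    )
    missing = [
        name
        for name, param in params.items()
        if param.kind in named_kinds
        and param.default is inspect.Parameter.empty
        and name not in args
    ]
    if missing:
        raise ValueError(f"Missing scheduler arguments: {missing}")
    return args


validate_scheduler_args(ExampleScheduler, {"step_size": 2})  # passes silently
```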
Source code in zeus/optimizer/pipeline_frequency/common.py
JobInfo
Bases: BaseModel
Training job information reported to the server.
Attributes:
Name | Type | Description |
---|---|---|
job_id | str | Globally unique ID of the training job, generated by the server. This field should be an empty string when sent to the server. |
pp_degree | int | Pipeline parallel degree. |
dp_degree | int | Data parallel degree. |
tp_degree | int | Tensor parallel degree. |
world_size | int | World size of the training job. |
job_metadata | Optional[str] | An optional arbitrary string that describes the job. This will be appended to the job ID if given. Typically for logging purposes. |
Source code in zeus/optimizer/pipeline_frequency/common.py
_check_world_size
_check_world_size(world_size, values)
The product of the PP, DP, and TP degrees must be identical to the world size.
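The constraint can be sketched as a plain function. This is only an illustration of the arithmetic; the name `check_world_size` and the error message are assumptions, and the real validator works on pydantic field values rather than positional arguments.

```python
def check_world_size(world_size: int, pp_degree: int, dp_degree: int, tp_degree: int) -> int:
    """Return world_size if it equals pp_degree * dp_degree * tp_degree."""
    expected = pp_degree * dp_degree * tp_degree
    if world_size != expected:
        raise ValueError(
            f"World size {world_size} does not match "
            f"PP x DP x TP = {pp_degree} x {dp_degree} x {tp_degree} = {expected}"
        )
    return world_size


# 2 pipeline stages x 4 data parallel replicas x 2 tensor parallel shards = 16 ranks.
check_world_size(16, pp_degree=2, dp_degree=4, tp_degree=2)
```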
Source code in zeus/optimizer/pipeline_frequency/common.py
set_job_id
set_job_id(scheduler_name)
Generate and set the job ID.
Source code in zeus/optimizer/pipeline_frequency/common.py
RankInfo
Bases: BaseModel
Information passed to the server from each rank.
Attributes:
Name | Type | Description |
---|---|---|
rank | int | Global rank of the reporting process. |
dp_rank | int | Data parallel rank of the reporting process. |
pp_rank | int | Pipeline parallel rank of the reporting process. |
tp_rank | int | Tensor parallel rank of the reporting process. |
available_frequencies | list[int] | List of available frequencies for the rank's GPU. |
Source code in zeus/optimizer/pipeline_frequency/common.py
FrequencySchedule
Bases: BaseModel
Frequency schedule for one iteration.
frequencies is a list of tuples, where the first element is the name of the instruction and the second element is the frequency to use for that instruction.
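To make the shape of frequencies concrete, here is a small sketch of such a list and a loop over it. The instruction names are borrowed from the to_csv notes on this page; the frequency values and the MHz unit are assumptions for illustration, and `describe` is a hypothetical helper.

```python
# One (instruction name, frequency) pair per instruction in the iteration.
frequencies: list[tuple[str, int]] = [
    ("forward", 1410),
    ("forward", 1410),
    ("backward", 1200),
    ("backward", 1050),
]


def describe(schedule: list[tuple[str, int]]) -> list[str]:
    """Render each step of the schedule as a readable string."""
    # Assumption: frequencies are expressed in MHz.
    return [f"{instruction} @ {frequency} MHz" for instruction, frequency in schedule]


for line in describe(frequencies):
    print(line)
```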
Source code in zeus/optimizer/pipeline_frequency/common.py
ProfilingResult
Bases: BaseModel
Profiling results for a FrequencySchedule of a rank.
Attributes:
Name | Type | Description |
---|---|---|
rank | int | Global rank of the reporting client. |
iter_time | list[float] | List of latency of all iterations within the profiling window in seconds. |
iter_energy | list[float] | List of energy consumption of all iterations within the profiling window in Joules. |
time_breakdown | dict[str, list[list[float]]] | Duration of each operation across multiple iterations. e.g. |
energy_breakdown | dict[str, list[list[float]]] | Energy consumption of each operation across multiple iterations. Value has the same structure as |
Source code in zeus/optimizer/pipeline_frequency/common.py
OfflineProfilingResult
Bases: BaseModel
Profiling results generated from offline profiling each instruction.
Attributes:
Name | Type | Description |
---|---|---|
rank | int | Global rank of the reporting client. |
dp_rank | int | Data parallel rank of the reporting process. |
pp_rank | int | Pipeline parallel rank of the reporting process. |
tp_rank | int | Tensor parallel rank of the reporting process. |
forward_time | dict[int, float] | Dict that maps frequency to average forward computation time. |
forward_energy | dict[int, float] | Dict that maps frequency to average forward computation energy. |
backward_time | dict[int, float] | Dict that maps frequency to average backward computation time. |
backward_energy | dict[int, float] | Dict that maps frequency to average backward computation energy. |
Source code in zeus/optimizer/pipeline_frequency/common.py
InstructionProfilingResult
Bases: BaseModel
Time and energy profiling results for each instruction in each stage.
Source code in zeus/optimizer/pipeline_frequency/common.py
to_csv
to_csv(filepath)
Serialize and save this object into a CSV file.
Columns: rank, dp_rank, pp_rank, tp_rank, stage, instruction, frequency, time, energy
Notes
- rank is the global rank of the process.
- pp_rank and stage are always the same, for backwards compatibility.
- All ranks and stage are zero-indexed.
- instruction is either "forward" or "backward".
- time and energy are already averaged over profiling iterations.
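The column layout above could be produced along the following lines. This sketch assumes the per-rank data looks like the fields of OfflineProfilingResult (dicts mapping frequency to averaged time and energy), which this page does not confirm for InstructionProfilingResult; `profile_rows` and `write_profile_csv` are hypothetical names, and the row ordering is an arbitrary choice.

```python
import csv

COLUMNS = [
    "rank", "dp_rank", "pp_rank", "tp_rank", "stage",
    "instruction", "frequency", "time", "energy",
]


def profile_rows(results: list[dict]) -> list[list]:
    """Flatten per-rank results into one row per (instruction, frequency)."""
    rows = []
    for res in results:
        for instruction in ("forward", "backward"):
            times = res[f"{instruction}_time"]
            energies = res[f"{instruction}_energy"]
            for freq in sorted(times, reverse=True):
                rows.append([
                    res["rank"], res["dp_rank"], res["pp_rank"], res["tp_rank"],
                    res["pp_rank"],  # stage mirrors pp_rank, per the notes above
                    instruction, freq, times[freq], energies[freq],
                ])
    return rows


def write_profile_csv(filepath: str, results: list[dict]) -> None:
    """Write the header and all flattened rows to filepath."""
    with open(filepath, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(COLUMNS)
        writer.writerows(profile_rows(results))
```

Keeping the row flattening separate from the file write makes the layout easy to inspect without touching the filesystem.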
Source code in zeus/optimizer/pipeline_frequency/common.py
save_prof
async save_prof(data, directory, schedule_num)
Save a list of ProfilingResult objects in the designated directory.
Source code in zeus/optimizer/pipeline_frequency/common.py
load_prof
load_prof(directory, schedule_num)
Load a list of ProfilingResult objects saved in the designated directory.
Source code in zeus/optimizer/pipeline_frequency/common.py
save_sched
async save_sched(data, directory, schedule_num)
Save a list of FrequencySchedule objects in the designated directory.
Source code in zeus/optimizer/pipeline_frequency/common.py
load_sched
load_sched(directory, schedule_num)
Load a list of FrequencySchedule objects saved in the designated directory.
Source code in zeus/optimizer/pipeline_frequency/common.py
save_ranks
async save_ranks(data, directory)
Save a list of RankInfo objects in the designated directory.
Source code in zeus/optimizer/pipeline_frequency/common.py
load_ranks
load_ranks(directory)
Load a list of RankInfo objects saved in the designated directory.
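The save/load pairing (asynchronous save, synchronous load) can be sketched as a JSON round trip. Everything here is an assumption for illustration: the file name `ranks.json`, the JSON format, the use of plain dicts in place of RankInfo models, and the `_sketch` function names. The actual functions may use a different file layout and I/O library.

```python
import asyncio
import json
from pathlib import Path

RANKS_FILE = "ranks.json"  # hypothetical file name


async def save_ranks_sketch(data: list[dict], directory: str) -> None:
    """Serialize rank records to JSON without blocking the event loop."""
    Path(directory).mkdir(parents=True, exist_ok=True)
    payload = json.dumps(data)
    # Run the blocking write in a worker thread (Python 3.9+).
    await asyncio.to_thread(Path(directory, RANKS_FILE).write_text, payload)


def load_ranks_sketch(directory: str) -> list[dict]:
    """Read back the rank records written by save_ranks_sketch."""
    return json.loads(Path(directory, RANKS_FILE).read_text())
```

A caller inside a running event loop would `await save_ranks_sketch(...)`; from synchronous code, `asyncio.run(save_ranks_sketch(...))` does the same.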
Source code in zeus/optimizer/pipeline_frequency/common.py