common
zeus.optimizer.perseus.common
Shared constants and models between the Perseus server and the client (optimizer).
PerseusSettings
Bases: BaseSettings
Perseus settings, configurable via environment variables.
For instance, setting PERSEUS_SCHEDULER=AllMaxFrequency will automatically import zeus.optimizer.perseus.server.scheduler.AllMaxFrequency, and the scheduler variable will hold a reference to the class.
Attributes:
Name | Type | Description |
---|---|---|
scheduler | PyObject | Name of the scheduler class to use, e.g. AllMaxFrequency. |
scheduler_args | dict[str, Any] | Any extra arguments required by the scheduler's constructor. |
log_level | str | Log level, e.g. "debug", "info". |
dump_data | bool | Whether the scheduler should dump internal state to the filesystem (for future inspection purposes). |
dump_dir | str | Directory to dump state in (if enabled). |
max_job_idle_time | int | Maximum time in seconds that a job can be idle before its state is automatically deleted from the server. |
Source code in zeus/optimizer/perseus/common.py
Config
Configuration class read by pydantic.
_fix_scheduler_import_path
_fix_scheduler_import_path(value)
Prepend zeus.optimizer.perseus.server.scheduler. to the scheduler type name.
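The prepend-then-import behavior described above can be sketched with the standard library alone. This is a minimal illustration, not the Perseus implementation: `fix_import_path` and `import_object` are hypothetical helper names, and the sketch resolves a stand-in class from `collections` so that it runs without Zeus installed.

```python
import importlib
import os

# Prefix named in the documentation above.
PERSEUS_PREFIX = "zeus.optimizer.perseus.server.scheduler."


def fix_import_path(value: str, prefix: str = PERSEUS_PREFIX) -> str:
    """Prepend the package prefix to a bare scheduler class name."""
    return f"{prefix}{value}"


def import_object(dotted_path: str):
    """Import `package.module.Name` and return the `Name` attribute."""
    module_name, _, attr = dotted_path.rpartition(".")
    return getattr(importlib.import_module(module_name), attr)


# Building the full path only needs string handling.
print(fix_import_path("AllMaxFrequency"))

# Assumption for the demo: a stand-in prefix and class from the standard
# library, so the import step can run without Zeus being installed.
os.environ["PERSEUS_SCHEDULER"] = "OrderedDict"
scheduler = import_object(
    fix_import_path(os.environ["PERSEUS_SCHEDULER"], prefix="collections.")
)
print(scheduler)
```

After the last line, `scheduler` holds a reference to the class itself rather than its name, which mirrors what the settings description says about the scheduler variable.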
_validate_scheduler_args
_validate_scheduler_args(args, values)
Check whether args are as expected by the scheduler's constructor.
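One way to check arguments against a constructor is `inspect.signature`. The sketch below is an assumption about how such a check could look, not the validator's actual code; `DummyScheduler` and its `power_limit` parameter are made up for illustration.

```python
import inspect
from typing import Any


class DummyScheduler:
    """Hypothetical scheduler; real scheduler constructors may differ."""

    def __init__(self, power_limit: int = 300) -> None:
        self.power_limit = power_limit


def validate_scheduler_args(scheduler: type, args: dict[str, Any]) -> dict[str, Any]:
    """Return `args` unchanged if the constructor accepts them, else raise."""
    signature = inspect.signature(scheduler)  # signature of __init__, minus self
    try:
        signature.bind(**args)
    except TypeError as err:
        raise ValueError(f"Invalid scheduler_args for {scheduler.__name__}: {err}") from err
    return args


print(validate_scheduler_args(DummyScheduler, {"power_limit": 250}))
```

Passing an unknown keyword such as `{"bogus": 1}` makes `bind` raise `TypeError`, which the sketch converts into a `ValueError` so that the mistake surfaces when settings are loaded rather than when the scheduler is first constructed.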
JobInfo
Bases: BaseModel
Training job information reported to the server.
Attributes:
Name | Type | Description |
---|---|---|
job_id | str | Globally unique ID of the training job, generated by the server. This field should be an empty string when sent to the server. |
pp_degree | int | Pipeline parallel degree. |
dp_degree | int | Data parallel degree. |
tp_degree | int | Tensor parallel degree. |
world_size | int | World size of the training job. |
job_metadata | Optional[str] | An optional arbitrary string that describes the job. This will be appended to the job ID if given. Typically for logging purposes. |
_check_world_size
_check_world_size(world_size, values)
The product of the PP, DP, and TP degrees must equal the world size.
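The world-size rule can be illustrated with a plain dataclass. This is a sketch of the invariant only; `JobInfoSketch` is a hypothetical stand-in and does not reproduce the pydantic model or its other fields.

```python
from dataclasses import dataclass


@dataclass
class JobInfoSketch:
    """Hypothetical stand-in holding only the parallelism fields."""

    pp_degree: int
    dp_degree: int
    tp_degree: int
    world_size: int

    def __post_init__(self) -> None:
        # World size must equal PP degree x DP degree x TP degree.
        expected = self.pp_degree * self.dp_degree * self.tp_degree
        if self.world_size != expected:
            raise ValueError(
                f"world_size={self.world_size} does not match "
                f"pp * dp * tp = {expected}"
            )


job = JobInfoSketch(pp_degree=4, dp_degree=2, tp_degree=1, world_size=8)
print(job)
```

For example, 4 pipeline stages with 2 data-parallel replicas and no tensor parallelism require exactly 8 processes; any other `world_size` is rejected at construction time.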
set_job_id
set_job_id(scheduler_name)
Generate and set the job ID.
RankInfo
Bases: BaseModel
Information passed to the server from each rank.
Attributes:
Name | Type | Description |
---|---|---|
rank | int | Global rank of the reporting process. |
dp_rank | int | Data parallel rank of the reporting process. |
pp_rank | int | Pipeline parallel rank of the reporting process. |
tp_rank | int | Tensor parallel rank of the reporting process. |
available_frequencies | list[int] | List of available frequencies for the rank's GPU. |
FrequencySchedule
Bases: BaseModel
Frequency schedule for one iteration.
frequencies is a list of tuples, where the first element is the name of the instruction and the second element is the frequency to use for that instruction.
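As an illustration of the tuple layout described above, the sketch below builds a made-up schedule and groups its frequencies by instruction. The frequency values are invented, and `group_by_instruction` is a hypothetical helper rather than part of the module.

```python
from collections import defaultdict

# Hypothetical schedule for one iteration: (instruction name, frequency).
# The "forward"/"backward" names follow the convention used on this page;
# the frequency values are made up.
frequencies: list[tuple[str, int]] = [
    ("forward", 1740),
    ("forward", 1740),
    ("backward", 1410),
    ("backward", 1200),
]


def group_by_instruction(schedule: list[tuple[str, int]]) -> dict[str, list[int]]:
    """Collect the frequencies of each instruction, preserving order."""
    grouped: dict[str, list[int]] = defaultdict(list)
    for instruction, frequency in schedule:
        grouped[instruction].append(frequency)
    return dict(grouped)


print(group_by_instruction(frequencies))
```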
ProfilingResult
Bases: BaseModel
Profiling results for a FrequencySchedule of a rank.
Attributes:
Name | Type | Description |
---|---|---|
rank | int | Global rank of the reporting client. |
iter_time | list[float] | Latency of each iteration within the profiling window, in seconds. |
iter_energy | list[float] | Energy consumption of each iteration within the profiling window, in Joules. |
time_breakdown | dict[str, list[list[float]]] | Duration of each operation across multiple iterations. |
energy_breakdown | dict[str, list[list[float]]] | Energy consumption of each operation across multiple iterations. Value has the same structure as time_breakdown. |
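A short sketch of how the per-iteration lists might be summarized follows. The numbers are invented, and `summarize` is a hypothetical helper that uses only the `iter_time` and `iter_energy` fields listed in the table.

```python
from statistics import mean


def summarize(iter_time: list[float], iter_energy: list[float]) -> dict[str, float]:
    """Average time (s), average energy (J), and average power (W)."""
    if len(iter_time) != len(iter_energy):
        raise ValueError("iter_time and iter_energy must have the same length")
    return {
        "avg_time_s": mean(iter_time),
        "avg_energy_j": mean(iter_energy),
        # Average power is total energy divided by total time.
        "avg_power_w": sum(iter_energy) / sum(iter_time),
    }


# Made-up measurements for three iterations.
summary = summarize(
    iter_time=[0.5, 0.75, 0.25],
    iter_energy=[100.0, 150.0, 50.0],
)
print(summary)
```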
OfflineProfilingResult
Bases: BaseModel
Profiling results generated from offline profiling each instruction.
Attributes:
Name | Type | Description |
---|---|---|
rank | int | Global rank of the reporting client. |
dp_rank | int | Data parallel rank of the reporting process. |
pp_rank | int | Pipeline parallel rank of the reporting process. |
tp_rank | int | Tensor parallel rank of the reporting process. |
forward_time | dict[int, float] | Dict that maps frequency to average forward computation time. |
forward_energy | dict[int, float] | Dict that maps frequency to average forward computation energy. |
backward_time | dict[int, float] | Dict that maps frequency to average backward computation time. |
backward_energy | dict[int, float] | Dict that maps frequency to average backward computation energy. |
InstructionProfilingResult
Bases: BaseModel
Time and energy profiling results for each instruction in each stage.
to_csv
to_csv(filepath)
Serialize and save this object into a CSV file.
Columns: rank, dp_rank, pp_rank, tp_rank, stage, instruction, frequency, time, energy
Notes
- rank is the global rank of the process.
- pp_rank and stage are always the same, for backwards compatibility.
- All ranks and stage are zero-indexed.
- instruction is either "forward" or "backward".
- time and energy are already averaged over profiling iterations.
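The column layout above can be sketched with the standard `csv` module. This is an assumption about the shape of the output, not the method's actual code: `rows_for_rank` and `to_csv_text` are hypothetical helpers, the input dicts mirror the frequency-to-value mappings of OfflineProfilingResult, and the row ordering and measurements are invented.

```python
import csv
import io

COLUMNS = [
    "rank", "dp_rank", "pp_rank", "tp_rank", "stage",
    "instruction", "frequency", "time", "energy",
]


def rows_for_rank(rank, dp_rank, pp_rank, tp_rank,
                  forward_time, forward_energy, backward_time, backward_energy):
    """Yield one row per (instruction, frequency) pair for a single rank."""
    measurements = {
        "forward": (forward_time, forward_energy),
        "backward": (backward_time, backward_energy),
    }
    for instruction, (times, energies) in measurements.items():
        for frequency in sorted(times, reverse=True):
            # `stage` repeats pp_rank, as stated in the notes above.
            yield [rank, dp_rank, pp_rank, tp_rank, pp_rank,
                   instruction, frequency, times[frequency], energies[frequency]]


def to_csv_text(rows) -> str:
    """Serialize the header and rows into CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(COLUMNS)
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up measurements for one rank at two frequencies.
text = to_csv_text(rows_for_rank(
    rank=2, dp_rank=0, pp_rank=1, tp_rank=0,
    forward_time={1740: 0.1, 1410: 0.12},
    forward_energy={1740: 20.0, 1410: 18.0},
    backward_time={1740: 0.2, 1410: 0.24},
    backward_energy={1740: 40.0, 1410: 36.0},
))
print(text)
```

To write to a file instead, the same writer calls can target a handle opened with `open(filepath, "w", newline="")`.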
save_prof
async
save_prof(data, directory, schedule_num)
Save a list of ProfilingResults in the designated directory.
load_prof
load_prof(directory, schedule_num)
Load a list of ProfilingResults saved in the designated directory.
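The asynchronous save paired with a synchronous load can be sketched as a JSON round trip. Everything specific here is an assumption made for illustration: the `_sketch` function names, the `<schedule_num>.prof.json` file name, the use of plain dicts in place of ProfilingResult objects, and the use of `asyncio.to_thread` for the write.

```python
import asyncio
import json
import tempfile
from pathlib import Path


def _prof_path(directory: str, schedule_num: int) -> Path:
    # Hypothetical naming scheme; the real file name may differ.
    return Path(directory) / f"{schedule_num}.prof.json"


async def save_prof_sketch(data: list[dict], directory: str, schedule_num: int) -> None:
    """Write results without blocking the event loop."""
    path = _prof_path(directory, schedule_num)
    await asyncio.to_thread(path.write_text, json.dumps(data))


def load_prof_sketch(directory: str, schedule_num: int) -> list[dict]:
    """Read back the results saved for one schedule."""
    return json.loads(_prof_path(directory, schedule_num).read_text())


with tempfile.TemporaryDirectory() as tmp:
    # Made-up profiling results for two ranks.
    results = [
        {"rank": 0, "iter_time": [0.5], "iter_energy": [100.0]},
        {"rank": 1, "iter_time": [0.6], "iter_energy": [110.0]},
    ]
    asyncio.run(save_prof_sketch(results, tmp, schedule_num=0))
    loaded = load_prof_sketch(tmp, schedule_num=0)

print(loaded)
```

Keeping the write asynchronous lets a server continue handling requests while state is dumped, while loading can stay synchronous because it typically happens during later offline inspection.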
save_sched
async
save_sched(data, directory, schedule_num)
Save a list of FrequencySchedules in the designated directory.
load_sched
load_sched(directory, schedule_num)
Load a list of FrequencySchedules saved in the designated directory.
save_ranks
async
save_ranks(data, directory)
Save a list of RankInfos in the designated directory.
load_ranks
load_ranks(directory)
Load a list of RankInfos saved in the designated directory.