common
zeus.optimizer.perseus.common
Shared constants and models between the Perseus server and the client (optimizer).
PerseusSettings
Bases: BaseSettings
Perseus settings, configurable via environment variables.
For instance, setting PERSEUS_SCHEDULER=AllMaxFrequency will automatically import zeus.optimizer.perseus.server.scheduler.AllMaxFrequency, and the scheduler variable will hold a reference to the class.
Attributes:
Name | Type | Description |
---|---|---|
scheduler | PyObject | Name of the |
scheduler_args | dict[str, Any] | Any extra arguments required by |
log_level | str | Log level, e.g. "debug", "info". |
dump_data | bool | Whether the scheduler should dump internal state to the filesystem (for future inspection purposes). |
dump_dir | str | Directory to dump state in (if enabled). |
max_job_idle_time | int | Maximum time in seconds that a job can be idle for before its states are automatically deleted from the server. |
Source code in zeus/optimizer/perseus/common.py
Config
Configuration class read by pydantic.
_fix_scheduler_import_path
Prepend zeus.optimizer.perseus.server.scheduler. to the scheduler type name.
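The prefix-then-import behavior described above can be sketched with the standard library alone. This is a minimal illustration, not the Zeus implementation: the helper names `fix_import_path` and `resolve` are hypothetical, and the prefix `collections.` stands in for `zeus.optimizer.perseus.server.scheduler.` so that the example runs without Zeus installed.

```python
import importlib

# Stand-in prefix (assumption) so the sketch runs without Zeus installed.
# The documented prefix is "zeus.optimizer.perseus.server.scheduler.".
PREFIX = "collections."


def fix_import_path(name: str, prefix: str = PREFIX) -> str:
    """Prepend the module prefix to a bare class name."""
    return prefix + name


def resolve(dotted_path: str):
    """Import the module part of the path and return the named attribute."""
    module_name, _, attr = dotted_path.rpartition(".")
    return getattr(importlib.import_module(module_name), attr)


# A bare name such as the value of an environment variable becomes a class reference.
scheduler_cls = resolve(fix_import_path("OrderedDict"))
```

With the real prefix, a value such as AllMaxFrequency would resolve to the scheduler class in the same way.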
_validate_scheduler_args
Check whether args are as expected by the scheduler's constructor.
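One way to perform such a check, shown here only as a sketch, is to bind the argument dictionary against the constructor's signature with `inspect.signature`. The `DummyScheduler` class and the `validate_args` helper are hypothetical and are not part of Zeus; the actual validator may apply different rules.

```python
import inspect
from typing import Any


class DummyScheduler:
    """Hypothetical scheduler used only for this sketch."""

    def __init__(self, solution_path: str, verbose: bool = False) -> None:
        self.solution_path = solution_path
        self.verbose = verbose


def validate_args(cls: type, args: dict[str, Any]) -> dict[str, Any]:
    """Raise ValueError if args cannot be passed to the constructor of cls."""
    try:
        # signature(cls) describes the constructor without `self`;
        # bind() raises TypeError on missing or unexpected arguments.
        inspect.signature(cls).bind(**args)
    except TypeError as err:
        raise ValueError(f"Invalid arguments for {cls.__name__}: {err}") from err
    return args
```

Binding catches both missing required arguments and unknown keyword arguments before the scheduler is ever constructed.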
JobInfo
Bases: BaseModel
Training job information reported to the server.
Attributes:
Name | Type | Description |
---|---|---|
job_id | str | Globally unique ID of the training job, generated by the server. This field should be an empty string when sent to the server. |
pp_degree | int | Pipeline parallel degree. |
dp_degree | int | Data parallel degree. |
tp_degree | int | Tensor parallel degree. |
world_size | int | World size of the training job. |
job_metadata | Optional[str] | An optional arbitrary string that describes the job. This will be appended to the job ID if given. Typically for logging purposes. |
_check_world_size
The product of the PP, DP, and TP degrees must be identical to the world size.
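The consistency rule can be illustrated with a plain dataclass. This is a sketch under the assumption that a mismatch is rejected with an error; `JobInfoSketch` is a hypothetical stand-in that carries only the four relevant fields and does not use pydantic.

```python
from dataclasses import dataclass


@dataclass
class JobInfoSketch:
    """Hypothetical stand-in for JobInfo with only the parallelism fields."""

    pp_degree: int
    dp_degree: int
    tp_degree: int
    world_size: int

    def __post_init__(self) -> None:
        # The three parallel degrees must multiply to the world size.
        expected = self.pp_degree * self.dp_degree * self.tp_degree
        if expected != self.world_size:
            raise ValueError(
                f"world_size={self.world_size} does not match "
                f"pp * dp * tp = {expected}"
            )
```

For example, a job with pipeline degree 4, data degree 2, and tensor degree 1 is only consistent with a world size of 8.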
set_job_id
Generate and set the job ID.
RankInfo
Bases: BaseModel
Information passed to the server from each rank.
Attributes:
Name | Type | Description |
---|---|---|
rank | int | Global rank of the reporting process. |
dp_rank | int | Data parallel rank of the reporting process. |
pp_rank | int | Pipeline parallel rank of the reporting process. |
tp_rank | int | Tensor parallel rank of the reporting process. |
available_frequencies | list[int] | List of available frequencies for the rank's GPU. |
FrequencySchedule
Bases: BaseModel
Frequency schedule for one iteration.
frequencies is a list of tuples, where the first element is the name of the instruction and the second element is the frequency to use for that instruction.
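As an illustration of that shape, the sketch below builds a plain list of (instruction name, frequency) tuples and iterates over it. The frequency values are hypothetical, and the `describe` helper is not part of Zeus.

```python
# Each entry pairs an instruction name with the frequency to use for it.
# The frequency values below are hypothetical.
frequencies: list[tuple[str, int]] = [
    ("forward", 1410),
    ("backward", 1200),
    ("forward", 1410),
    ("backward", 1005),
]


def describe(schedule: list[tuple[str, int]]) -> list[str]:
    """Render one human-readable line per scheduled instruction."""
    return [f"{instruction} @ {frequency}" for instruction, frequency in schedule]


lines = describe(frequencies)
```

Because the schedule is a list rather than a mapping, the same instruction name can appear several times with different frequencies.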
ProfilingResult
Bases: BaseModel
Profiling results for a FrequencySchedule of a rank.
Attributes:
Name | Type | Description |
---|---|---|
rank | int | Global rank of the reporting client. |
iter_time | list[float] | Latency of each iteration within the profiling window, in seconds. |
iter_energy | list[float] | Energy consumption of each iteration within the profiling window, in joules. |
time_breakdown | dict[str, list[list[float]]] | Duration of each operation across multiple iterations. e.g. |
energy_breakdown | dict[str, list[list[float]]] | Energy consumption of each operation across multiple iterations. Value has the same structure as |
OfflineProfilingResult
Bases: BaseModel
Profiling results generated from offline profiling each instruction.
Attributes:
Name | Type | Description |
---|---|---|
rank | int | Global rank of the reporting client. |
dp_rank | int | Data parallel rank of the reporting process. |
pp_rank | int | Pipeline parallel rank of the reporting process. |
tp_rank | int | Tensor parallel rank of the reporting process. |
forward_time | dict[int, float] | Dict that maps frequency to average forward computation time. |
forward_energy | dict[int, float] | Dict that maps frequency to average forward computation energy. |
backward_time | dict[int, float] | Dict that maps frequency to average backward computation time. |
backward_energy | dict[int, float] | Dict that maps frequency to average backward computation energy. |
InstructionProfilingResult
Bases: BaseModel
Time and energy profiling results for each instruction in each stage.
to_csv
Serialize and save this object into a CSV file.
Columns: rank, dp_rank, pp_rank, tp_rank, stage, instruction, frequency, time, energy
Notes
- rank is the global rank of the process.
- pp_rank and stage are always the same, for backwards compatibility.
- All ranks and stage are zero-indexed.
- instruction is either "forward" or "backward".
- time and energy are already averaged over profiling iterations.
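The column layout above can be sketched with the standard `csv` module. This is not the Zeus implementation: the input is assumed to be a list of per-rank dictionaries that map frequency to averaged time and energy, the `to_csv_text` helper is hypothetical, and the sketch returns the CSV text instead of writing a file.

```python
import csv
import io

COLUMNS = [
    "rank", "dp_rank", "pp_rank", "tp_rank", "stage",
    "instruction", "frequency", "time", "energy",
]


def to_csv_text(results: list[dict]) -> str:
    """Emit one row per (rank, instruction, frequency) combination."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(COLUMNS)
    for r in results:
        for instruction in ("forward", "backward"):
            times = r[f"{instruction}_time"]
            energies = r[f"{instruction}_energy"]
            for freq in sorted(times):
                writer.writerow([
                    r["rank"], r["dp_rank"], r["pp_rank"], r["tp_rank"],
                    r["pp_rank"],  # stage mirrors pp_rank, per the notes above
                    instruction, freq, times[freq], energies[freq],
                ])
    return buf.getvalue()


# Hypothetical measurements for a single rank at a single frequency.
example = [{
    "rank": 0, "dp_rank": 0, "pp_rank": 0, "tp_rank": 0,
    "forward_time": {1200: 0.5}, "forward_energy": {1200: 60.0},
    "backward_time": {1200: 1.0}, "backward_energy": {1200: 120.0},
}]
csv_text = to_csv_text(example)
```

Keeping one row per frequency makes the file easy to load into a dataframe and filter by rank or instruction.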
save_prof
async
Save a list of ProfilingResult objects in the designated directory.
load_prof
Load a list of ProfilingResult objects saved in the designated directory.
save_sched
async
Save a list of FrequencySchedule objects in the designated directory.
load_sched
Load a list of FrequencySchedule objects saved in the designated directory.
save_ranks
async
Save a list of RankInfo objects in the designated directory.
load_ranks
Load a list of RankInfo objects saved in the designated directory.