MachO File Structure Analysis

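The previous article described how to build your own crash platform, which locates crash symbols by parsing the structure of the system libraries (MachO). This article looks at the MachO file structure in detail.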

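On iOS you have certainly come across MachO files: static libraries (.a), dSYM files (yourAppName.dSYM), system dynamic libraries (/usr/lib/libobjc.A.dylib), executables, and so on. The specific file types are covered below.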

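Whether a MachO binary is Fat can be determined from the magic number in its first four bytes. A Fat file contains one or more architectures behind a Fat_Header, and each architecture is itself a complete MachO file. As an analogy, think of compressing one or more files into a zip archive: the zip is the Fat file and each file inside it is a Thin file. Every Thin file has the same internal structure.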
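As a rough illustration of the magic-number check, the sketch below classifies a file from its first four bytes. It is written in Python with the standard `struct` module so that it runs without Apple's headers; the function name `classify_magic` is made up for this example. It assumes the Fat magic is 0xcafebabe stored big-endian, while Thin files may be stored in either byte order.

```python
import struct

FAT_MAGIC = 0xCAFEBABE    # Fat file; the fat header is stored big-endian
MH_MAGIC = 0xFEEDFACE     # Thin, 32-bit
MH_MAGIC_64 = 0xFEEDFACF  # Thin, 64-bit

def classify_magic(first4: bytes) -> str:
    """Return 'fat', 'thin32', 'thin64' or 'unknown' (hypothetical helper)."""
    if len(first4) < 4:
        return "unknown"
    (be,) = struct.unpack(">I", first4[:4])  # bytes read as big-endian
    (le,) = struct.unpack("<I", first4[:4])  # bytes read as little-endian
    if be == FAT_MAGIC:
        return "fat"
    if MH_MAGIC in (be, le):
        return "thin32"
    if MH_MAGIC_64 in (be, le):
        return "thin64"
    return "unknown"
```

In practice this would be called with the result of `open(path, "rb").read(4)`.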

Fat


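Compared with a Thin file, a Fat file has an extra Fat_Header, which records the number of architectures and basic information about each one.
A Fat file can be split into Thin files with the lipo -thin command, and Thin files can be merged back into a Fat file.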

# Split
lipo BICrashAnalyzeDemo -thin arm64 -output crashAnalyzeDemoARM64
lipo BICrashAnalyzeDemo -thin armv7 -output crashAnalyzeDemoARMV7
# Merge
lipo crashAnalyzeDemoARM64 crashAnalyzeDemoARMV7 -create -output BICrashAnalyzeDemo
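To make the Fat_Header concrete, here is a minimal sketch of reading it by hand. It assumes the layout from Apple's fat.h: a big-endian `magic` and `nfat_arch`, followed by `nfat_arch` records of five 32-bit fields (`cputype`, `cpusubtype`, `offset`, `size`, `align`). The helper names `parse_fat_archs` and `thin_slice` are hypothetical.

```python
import struct

FAT_MAGIC = 0xCAFEBABE  # fat_header.magic; fat headers are stored big-endian

def parse_fat_archs(data: bytes) -> list:
    """Return one dict per fat_arch record found behind the fat_header."""
    magic, nfat_arch = struct.unpack_from(">II", data, 0)
    if magic != FAT_MAGIC:
        raise ValueError("not a Fat MachO file")
    archs = []
    for i in range(nfat_arch):
        # Each fat_arch record is 20 bytes and starts right after the 8-byte header.
        cputype, cpusubtype, offset, size, align = struct.unpack_from(
            ">5I", data, 8 + i * 20)
        archs.append({"cputype": cputype, "cpusubtype": cpusubtype,
                      "offset": offset, "size": size, "align": align})
    return archs

def thin_slice(data: bytes, arch: dict) -> bytes:
    """Return the bytes of one architecture, which form a Thin MachO file."""
    return data[arch["offset"]:arch["offset"] + arch["size"]]
```

Slicing by `offset` and `size` is conceptually what the split step does for one architecture, though this sketch does not claim to reproduce lipo's exact behaviour.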

Thin


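So we only need to understand the internal structure of a Thin MachO file.
Its overall layout is: a header, the load commands, and then the data.

The header holds general information about the MachO file: the file type, whether it is 32-bit or 64-bit (MH_MAGIC_64), the architecture, and the number and total size of the load commands (lc).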

struct mach_header {
uint32_t magic; /* mach magic number identifier */
cpu_type_t cputype; /* cpu specifier */
cpu_subtype_t cpusubtype; /* machine specifier */
uint32_t filetype; /* type of file */
uint32_t ncmds; /* number of load commands */
uint32_t sizeofcmds; /* the size of all the load commands */
uint32_t flags; /* flags */
};
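As a sketch of how this header might be read from a file, the Python snippet below unpacks the seven fields shown above. The function name `parse_mach_header` and the returned dict keys are illustrative. It relies on one fact not shown in the excerpt: `mach_header_64` has the same seven fields plus a trailing `uint32_t reserved`, which makes it 32 bytes instead of 28.

```python
import struct

MH_MAGIC, MH_MAGIC_64 = 0xFEEDFACE, 0xFEEDFACF

def parse_mach_header(data: bytes) -> dict:
    """Parse the mach_header or mach_header_64 at the start of a Thin file."""
    (magic_le,) = struct.unpack_from("<I", data, 0)
    (magic_be,) = struct.unpack_from(">I", data, 0)
    if magic_le in (MH_MAGIC, MH_MAGIC_64):
        endian, magic = "<", magic_le
    elif magic_be in (MH_MAGIC, MH_MAGIC_64):
        endian, magic = ">", magic_be
    else:
        raise ValueError("not a Thin MachO file")
    names = ("magic", "cputype", "cpusubtype", "filetype",
             "ncmds", "sizeofcmds", "flags")
    header = dict(zip(names, struct.unpack_from(endian + "7I", data, 0)))
    header["is_64"] = magic == MH_MAGIC_64
    # The load commands start right after the header: 32 bytes for 64-bit, 28 for 32-bit.
    header["header_size"] = 32 if header["is_64"] else 28
    return header
```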

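  • magic: the MachO magic number, which tells whether the file is 64-bit (0xfeedfacf, MH_MAGIC_64) or 32-bit (0xfeedface, MH_MAGIC). In this example they correspond to the arm64 and armv7 headers respectively.
  • cputype and cpusubtype: the CPU type and its subtype, defined as follows: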
    #define CPU_TYPE_ARM ((cpu_type_t)12)
    #define CPU_SUBTYPE_ARM_V6 ((cpu_subtype_t)6)
    #define CPU_SUBTYPE_ARM_V7 ((cpu_subtype_t)9)
    #define CPU_SUBTYPE_ARM_V7S ((cpu_subtype_t)11)

    #define CPU_TYPE_ARM64 ((cpu_type_t)16777228)
    #define CPU_SUBTYPE_ARM64_ALL ((cpu_subtype_t)0)

    #define CPU_TYPE_I386 ((cpu_type_t)7)
    #define CPU_SUBTYPE_X86_ALL ((cpu_subtype_t)3)

  • filetype: the file type mentioned above; in this example it is a dSYM.
    Apple's header lists every possible type. The value is stored in the filetype field of the MachO mach_header(_64).

    /*
    * Constants for the filetype field of the mach_header
    */
    #define MH_OBJECT 0x1 /* relocatable object file: the intermediate result of compiling source code, e.g. the output of gcc -c */
    #define MH_EXECUTE 0x2 /* demand paged executable file: the binary built for an application */
    #define MH_FVMLIB 0x3 /* fixed VM shared library file */
    #define MH_CORE 0x4 /* core file */
    #define MH_PRELOAD 0x5 /* preloaded executable file */
    #define MH_DYLIB 0x6 /* dynamically bound shared library: a dynamic library */
    #define MH_DYLINKER 0x7 /* dynamic link editor: the dynamic linker (dyld) */
    #define MH_BUNDLE 0x8 /* dynamically bound bundle file */
    #define MH_DYLIB_STUB 0x9 /* shared library stub for static */
    /* linking only, no section contents */
    #define MH_DSYM 0xa /* companion file with only debug */
    /* sections: a dSYM file, from a build with gcc -g */
    #define MH_KEXT_BUNDLE 0xb /* x86_64 kexts: 64-bit kernel extensions */
  • ncmds: the number of load commands
  • sizeofcmds: the total size in bytes of all load commands
  • flags: settings that affect linking and execution, defined as follows:
    #define	MH_NOUNDEFS	0x1		/* the object file has no undefined
    references */
    #define MH_INCRLINK 0x2 /* the object file is the output of an
    incremental link against a base file
    and can't be link edited again */
    #define MH_DYLDLINK 0x4 /* the object file is input for the
    dynamic linker and can't be staticly
    link edited again */
    #define MH_BINDATLOAD 0x8 /* the object file's undefined
    references are bound by the dynamic
    linker when loaded. */
    #define MH_PREBOUND 0x10 /* the file has its dynamic undefined
    references prebound. */
    #define MH_SPLIT_SEGS 0x20 /* the file has its read-only and
    read-write segments split */
    #define MH_LAZY_INIT 0x40 /* the shared library init routine is
    to be run lazily via catching memory
    faults to its writeable segments
    (obsolete) */
    #define MH_TWOLEVEL 0x80 /* the image is using two-level name
    space bindings */
    #define MH_FORCE_FLAT 0x100 /* the executable is forcing all images
    to use flat name space bindings */
    #define MH_NOMULTIDEFS 0x200 /* this umbrella guarantees no multiple
    defintions of symbols in its
    sub-images so the two-level namespace
    hints can always be used. */
    #define MH_NOFIXPREBINDING 0x400 /* do not have dyld notify the
    prebinding agent about this
    executable */
    #define MH_PREBINDABLE 0x800 /* the binary is not prebound but can
    have its prebinding redone. only used
    when MH_PREBOUND is not set. */
    #define MH_ALLMODSBOUND 0x1000 /* indicates that this binary binds to
    all two-level namespace modules of
    its dependent libraries. only used
    when MH_PREBINDABLE and MH_TWOLEVEL
    are both set. */
    #define MH_SUBSECTIONS_VIA_SYMBOLS 0x2000/* safe to divide up the sections into
    sub-sections via symbols for dead
    code stripping */
    #define MH_CANONICAL 0x4000 /* the binary has been canonicalized
    via the unprebind operation */
    #define MH_WEAK_DEFINES 0x8000 /* the final linked image contains
    external weak symbols */
    #define MH_BINDS_TO_WEAK 0x10000 /* the final linked image uses
    weak symbols */

    #define MH_ALLOW_STACK_EXECUTION 0x20000/* When this bit is set, all stacks
    in the task will be given stack
    execution privilege. Only used in
    MH_EXECUTE filetypes. */
    #define MH_ROOT_SAFE 0x40000 /* When this bit is set, the binary
    declares it is safe for use in
    processes with uid zero */

    #define MH_SETUID_SAFE 0x80000 /* When this bit is set, the binary
    declares it is safe for use in
    processes when issetugid() is true */

    #define MH_NO_REEXPORTED_DYLIBS 0x100000 /* When this bit is set on a dylib,
    the static linker does not need to
    examine dependent dylibs to see
    if any are re-exported */
    #define MH_PIE 0x200000 /* When this bit is set, the OS will
    load the main executable at a
    random address. Only used in
    MH_EXECUTE filetypes. */
    #define MH_DEAD_STRIPPABLE_DYLIB 0x400000 /* Only for use on dylibs. When
    linking against a dylib that
    has this bit set, the static linker
    will automatically not create a
    LC_LOAD_DYLIB load command to the
    dylib if no symbols are being
    referenced from the dylib. */
    #define MH_HAS_TLV_DESCRIPTORS 0x800000 /* Contains a section of type
    S_THREAD_LOCAL_VARIABLES */

    #define MH_NO_HEAP_EXECUTION 0x1000000 /* When this bit is set, the OS will
    run the main executable with
    a non-executable heap even on
    platforms (e.g. i386) that don't
    require it. Only used in MH_EXECUTE
    filetypes. */
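Because flags is a bit mask, a value read from a header is easier to interpret once it is split into named bits. The sketch below does that for a handful of the constants listed above; the table `MH_FLAGS` is deliberately incomplete and `decode_flags` is a hypothetical helper, so bits missing from the table are returned as a single hex remainder.

```python
# A subset of the flag constants listed above (bit value -> name).
MH_FLAGS = {
    0x1: "MH_NOUNDEFS",
    0x4: "MH_DYLDLINK",
    0x80: "MH_TWOLEVEL",
    0x8000: "MH_WEAK_DEFINES",
    0x10000: "MH_BINDS_TO_WEAK",
    0x200000: "MH_PIE",
}

def decode_flags(flags: int) -> list:
    """Return the names of the set bits; unknown bits come back as one hex string."""
    names = []
    for bit, name in MH_FLAGS.items():
        if flags & bit:
            names.append(name)
            flags &= ~bit  # clear the bit so only unknown bits remain at the end
    if flags:
        names.append(hex(flags))
    return names
```

For example, a flags value of 0x200085 decodes to MH_NOUNDEFS, MH_DYLDLINK, MH_TWOLEVEL and MH_PIE.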

Load Commands

Load commands tell the loader how to set up and load the binary data.

// Basic information
struct load_command {
uint32_t cmd; /* type of load command */
uint32_t cmdsize; /* total size of command in bytes */
};
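Since every load command begins with these two fields, the whole list can be walked without knowing each command's full layout: read cmd and cmdsize, then jump ahead by cmdsize. A minimal sketch, assuming the caller already knows the header size (28 or 32 bytes), ncmds, and the byte order; `iter_load_commands` is a made-up name.

```python
import struct

def iter_load_commands(data: bytes, header_size: int, ncmds: int, endian: str = "<"):
    """Yield (cmd, cmdsize, offset) for each load command after the header."""
    offset = header_size
    for _ in range(ncmds):
        cmd, cmdsize = struct.unpack_from(endian + "II", data, offset)
        if cmdsize < 8:
            # A command is at least as large as struct load_command itself.
            raise ValueError("corrupt load command at offset %d" % offset)
        yield cmd, cmdsize, offset
        offset += cmdsize  # cmdsize covers the whole command, including cmd and cmdsize
```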

There are many cmd types; see the definitions in loader.h:
/*
* After MacOS X 10.1 when a new load command is added that is required to be
* understood by the dynamic linker for the image to execute properly the
* LC_REQ_DYLD bit will be or'ed into the load command constant. If the dynamic
* linker sees such a load command it it does not understand will issue a
* "unknown load command required for execution" error and refuse to use the
* image. Other load commands without this bit that are not understood will
* simply be ignored.
*/
#define LC_REQ_DYLD 0x80000000

/* Constants for the cmd field of all load commands, the type */
#define LC_SEGMENT 0x1 /* segment of this file to be mapped */
#define LC_SYMTAB 0x2 /* link-edit stab symbol table info */
#define LC_SYMSEG 0x3 /* link-edit gdb symbol table info (obsolete) */
#define LC_THREAD 0x4 /* thread */
#define LC_UNIXTHREAD 0x5 /* unix thread (includes a stack) */
#define LC_LOADFVMLIB 0x6 /* load a specified fixed VM shared library */
#define LC_IDFVMLIB 0x7 /* fixed VM shared library identification */
#define LC_IDENT 0x8 /* object identification info (obsolete) */
#define LC_FVMFILE 0x9 /* fixed VM file inclusion (internal use) */
#define LC_PREPAGE 0xa /* prepage command (internal use) */
#define LC_DYSYMTAB 0xb /* dynamic link-edit symbol table info */
#define LC_LOAD_DYLIB 0xc /* load a dynamically linked shared library */
#define LC_ID_DYLIB 0xd /* dynamically linked shared lib ident */
#define LC_LOAD_DYLINKER 0xe /* load a dynamic linker */
#define LC_ID_DYLINKER 0xf /* dynamic linker identification */
#define LC_PREBOUND_DYLIB 0x10 /* modules prebound for a dynamically */
/* linked shared library */
#define LC_ROUTINES 0x11 /* image routines */
#define LC_SUB_FRAMEWORK 0x12 /* sub framework */
#define LC_SUB_UMBRELLA 0x13 /* sub umbrella */
#define LC_SUB_CLIENT 0x14 /* sub client */
#define LC_SUB_LIBRARY 0x15 /* sub library */
#define LC_TWOLEVEL_HINTS 0x16 /* two-level namespace lookup hints */
#define LC_PREBIND_CKSUM 0x17 /* prebind checksum */

/*
* load a dynamically linked shared library that is allowed to be missing
* (all symbols are weak imported).
*/
#define LC_LOAD_WEAK_DYLIB (0x18 | LC_REQ_DYLD)

#define LC_SEGMENT_64 0x19 /* 64-bit segment of this file to be
mapped */
#define LC_ROUTINES_64 0x1a /* 64-bit image routines */
#define LC_UUID 0x1b /* the uuid */
#define LC_RPATH (0x1c | LC_REQ_DYLD) /* runpath additions */
#define LC_CODE_SIGNATURE 0x1d /* local of code signature */
#define LC_SEGMENT_SPLIT_INFO 0x1e /* local of info to split segments */
#define LC_REEXPORT_DYLIB (0x1f | LC_REQ_DYLD) /* load and re-export dylib */
#define LC_LAZY_LOAD_DYLIB 0x20 /* delay load of dylib until first use */
#define LC_ENCRYPTION_INFO 0x21 /* encrypted segment information */
#define LC_DYLD_INFO 0x22 /* compressed dyld information */
#define LC_DYLD_INFO_ONLY (0x22|LC_REQ_DYLD) /* compressed dyld information only */
#define LC_LOAD_UPWARD_DYLIB (0x23 | LC_REQ_DYLD) /* load upward dylib */
#define LC_VERSION_MIN_MACOSX 0x24 /* build for MacOSX min OS version */
#define LC_VERSION_MIN_IPHONEOS 0x25 /* build for iPhoneOS min OS version */
#define LC_FUNCTION_STARTS 0x26 /* compressed table of function start addresses */
#define LC_DYLD_ENVIRONMENT 0x27 /* string for dyld to treat
like environment variable */
#define LC_MAIN (0x28|LC_REQ_DYLD) /* replacement for LC_UNIXTHREAD */
#define LC_DATA_IN_CODE 0x29 /* table of non-instructions in __text */
#define LC_SOURCE_VERSION 0x2A /* source version used to build binary */
#define LC_DYLIB_CODE_SIGN_DRS 0x2B /* Code signing DRs copied from linked dylibs */

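Commonly used commands:
LC_UUID: the unique identifier of the file. The same UUID appears in the Binary Images part of a crash report, so it is used to check that a dSYM matches the crash file and that the correct system library was found.
LC_SEGMENT_64: maps this (64-bit) segment into the process address space
LC_SEGMENT: maps this (32-bit) segment into the process address space
LC_SYMTAB: gives the location of the symbol table, through which a crash address can be resolved to a symbol (method).
LC_DYSYMTAB: gives the location of the dynamic symbol table
LC_LOAD_DYLINKER: names the dynamic linker dyld (/usr/lib/dyld), which loads the dynamic libraries
LC_VERSION_MIN_MACOSX/LC_VERSION_MIN_IPHONEOS: the minimum OS version the binary requires
LC_SOURCE_VERSION: the version of the source code used to build the binary
LC_MAIN: sets the entry point and stack size of the program's main thread; present only in executables.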
/*
* The entry_point_command is a replacement for thread_command.
* It is used for main executables to specify the location (file offset)
* of main(). If -stack_size was used at link time, the stacksize
* field will contain the stack size need for the main thread.
*/
struct entry_point_command {
uint32_t cmd; /* LC_MAIN only used in MH_EXECUTE filetypes */
uint32_t cmdsize; /* 24 */
uint64_t entryoff; /* file (__TEXT) offset of main() */
uint64_t stacksize;/* if not zero, initial stack size */
};
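The struct above is 24 bytes: two 32-bit fields followed by two 64-bit fields. As an illustrative sketch, it can be unpacked like this, where `parse_entry_point` is a hypothetical helper and `offset` is assumed to point at an LC_MAIN command found while walking the load commands:

```python
import struct

LC_MAIN = 0x28 | 0x80000000  # 0x28 | LC_REQ_DYLD

def parse_entry_point(data: bytes, offset: int, endian: str = "<") -> tuple:
    """Return (entryoff, stacksize) from the entry_point_command at offset."""
    cmd, cmdsize, entryoff, stacksize = struct.unpack_from(
        endian + "IIQQ", data, offset)
    if cmd != LC_MAIN or cmdsize != 24:
        raise ValueError("not an LC_MAIN command")
    return entryoff, stacksize
```

A stacksize of 0 means no explicit stack size was requested at link time.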

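LC_ENCRYPTION_INFO_64: describes the encrypted part of the binary
LC_LOAD_DYLIB: names a dynamic library to load, including third-party libraries
LC_FUNCTION_STARTS: a table of function start addresses, which lets debuggers and other tools easily tell whether an address falls inside a function
LC_DATA_IN_CODE: a table of non-instruction data located inside the code segment
LC_ID_DYLIB: present only in dylibs; specifies the dylib's ID, version, and compatibility version
LC_CODE_SIGNATURE: locates the app's code signature
LC_DYLD_INFO_ONLY: the information dyld needs to load the image (file offsets and sizes of the rebase, bind, weak bind, lazy bind, and export info)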
/*
* The dyld_info_command contains the file offsets and sizes of
* the new compressed form of the information dyld needs to
* load the image. This information is used by dyld on Mac OS X
* 10.6 and later. All information pointed to by this command
* is encoded using byte streams, so no endian swapping is needed
* to interpret it.
*/
struct dyld_info_command {
uint32_t cmd; /* LC_DYLD_INFO or LC_DYLD_INFO_ONLY */
uint32_t cmdsize; /* sizeof(struct dyld_info_command) */

uint32_t rebase_off; /* file offset to rebase info */
uint32_t rebase_size; /* size of rebase info */

uint32_t bind_off; /* file offset to binding info */
uint32_t bind_size; /* size of binding info */

uint32_t weak_bind_off; /* file offset to weak binding info */
uint32_t weak_bind_size; /* size of weak binding info */

uint32_t lazy_bind_off; /* file offset to lazy binding info */
uint32_t lazy_bind_size; /* size of lazy binding infs */

uint32_t export_off; /* file offset to export info */
uint32_t export_size; /* size of export info */
};
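Of the commands listed above, LC_UUID is the one a crash platform checks first, because it decides whether a dSYM or system library matches the image in a crash report. The sketch below scans the load commands for it. It assumes the `uuid_command` layout from loader.h (cmd, cmdsize, then 16 raw UUID bytes); `find_uuid` is a hypothetical helper, and the caller supplies the header size and ncmds.

```python
import struct
import uuid

LC_UUID = 0x1B

def find_uuid(data: bytes, header_size: int, ncmds: int, endian: str = "<"):
    """Return the image UUID as an upper-case string, or None if there is no LC_UUID."""
    offset = header_size
    for _ in range(ncmds):
        cmd, cmdsize = struct.unpack_from(endian + "II", data, offset)
        if cmd == LC_UUID:
            # The 16 UUID bytes follow directly after cmd and cmdsize.
            raw = data[offset + 8:offset + 24]
            return str(uuid.UUID(bytes=raw)).upper()
        offset += cmdsize
    return None
```

The returned string can then be compared, ignoring dashes and case, with the UUID shown for the same image in the crash report.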

Data

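This region holds the contents of the segments and sections; where each one sits in the file and how large it is is recorded in the corresponding load commands. Together they describe everything in the executable.

It stores the actual data: code, string constants, classes, methods, and so on.
A file can have multiple segments, and each segment can have zero or more sections. Each segment is mapped to a range of virtual addresses in the process address space.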

segment

/*
* The segment load command indicates that a part of this file is to be
* mapped into the task's address space. The size of this segment in memory,
* vmsize, maybe equal to or larger than the amount to map from this file,
* filesize. The file is mapped starting at fileoff to the beginning of
* the segment in memory, vmaddr. The rest of the memory of the segment,
* if any, is allocated zero fill on demand. The segment's maximum virtual
* memory protection and initial virtual memory protection are specified
* by the maxprot and initprot fields. If the segment has sections then the
* section structures directly follow the segment command and their size is
* reflected in cmdsize.
*/
struct segment_command { /* for 32-bit architectures */
uint32_t cmd; /* LC_SEGMENT */
uint32_t cmdsize; /* includes sizeof section structs */
char segname[16]; /* segment name */
uint32_t vmaddr; /* memory address of this segment */
uint32_t vmsize; /* memory size of this segment */
uint32_t fileoff; /* file offset of this segment */
uint32_t filesize; /* amount to map from the file */
vm_prot_t maxprot; /* maximum VM protection */
vm_prot_t initprot; /* initial VM protection */
uint32_t nsects; /* number of sections in segment */
uint32_t flags; /* flags */
};

/*
* The 64-bit segment load command indicates that a part of this file is to be
* mapped into a 64-bit task's address space. If the 64-bit segment has
* sections then section_64 structures directly follow the 64-bit segment
* command and their size is reflected in cmdsize.
*/
struct segment_command_64 { /* for 64-bit architectures */
uint32_t cmd; /* LC_SEGMENT_64 */
uint32_t cmdsize; /* includes sizeof section_64 structs */
char segname[16]; /* segment name */
uint64_t vmaddr; /* memory address of this segment (its virtual memory address) */
uint64_t vmsize; /* memory size of this segment (virtual memory allocated for it) */
uint64_t fileoff; /* file offset of this segment (where it starts in the file) */
uint64_t filesize; /* amount to map from the file (size of the segment in the file) */
vm_prot_t maxprot; /* maximum VM protection */
vm_prot_t initprot; /* initial VM protection */
uint32_t nsects; /* number of sections in segment */
uint32_t flags; /* flags */
};

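For each segment, the corresponding content of the file is loaded into memory: filesize bytes starting at file offset fileoff are mapped into the vmsize bytes starting at virtual address vmaddr. The pages of each segment are initialized according to initprot, which sets the initial page protection through the read/write/execute bits. A segment's protection can be changed dynamically, but it can never exceed what maxprot allows (on iOS, +x and +w are mutually exclusive).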
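The fileoff/vmaddr relationship is also what lets a tool go the other way, from an address in a crash report back to a position in the file. The following sketch parses a little-endian `segment_command_64` (72 bytes) and translates a virtual address into a file offset. The names `parse_segment_64` and `vmaddr_to_fileoff` are hypothetical, and the address is assumed to have any load-time slide already removed.

```python
import struct

SEGMENT_64 = "<II16sQQQQiiII"  # little-endian segment_command_64, 72 bytes

def parse_segment_64(data: bytes, offset: int) -> dict:
    """Unpack one segment_command_64 located at offset."""
    (cmd, cmdsize, segname, vmaddr, vmsize, fileoff, filesize,
     maxprot, initprot, nsects, flags) = struct.unpack_from(SEGMENT_64, data, offset)
    return {"segname": segname.split(b"\0", 1)[0].decode("ascii"),
            "vmaddr": vmaddr, "vmsize": vmsize,
            "fileoff": fileoff, "filesize": filesize,
            "maxprot": maxprot, "initprot": initprot, "nsects": nsects}

def vmaddr_to_fileoff(segments: list, addr: int):
    """Map a virtual address to a file offset, or None if no file bytes back it."""
    for seg in segments:
        delta = addr - seg["vmaddr"]
        # Only the first filesize bytes of a segment come from the file;
        # the rest (up to vmsize) is zero-filled memory.
        if 0 <= delta < seg["filesize"]:
            return seg["fileoff"] + delta
    return None
```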

section
Sections fall into two groups, according to the segment they belong to:

__TEXT (code) and __DATA (data)

/*
* A segment is made up of zero or more sections.
*/
struct section { /* for 32-bit architectures */
char sectname[16]; /* name of this section */
char segname[16]; /* segment this section goes in */
uint32_t addr; /* memory address of this section */
uint32_t size; /* size in bytes of this section */
uint32_t offset; /* file offset of this section */
uint32_t align; /* section alignment (power of 2) */
uint32_t reloff; /* file offset of relocation entries */
uint32_t nreloc; /* number of relocation entries */
uint32_t flags; /* flags (section type and attributes)*/
uint32_t reserved1; /* reserved (for offset or index) */
uint32_t reserved2; /* reserved (for count or sizeof) */
};

struct section_64 { /* for 64-bit architectures */
char sectname[16]; /* name of this section */
char segname[16]; /* segment this section goes in */
uint64_t addr; /* memory address of this section */
uint64_t size; /* size in bytes of this section */
uint32_t offset; /* file offset of this section */
uint32_t align; /* section alignment (power of 2) */
uint32_t reloff; /* file offset of relocation entries */
uint32_t nreloc; /* number of relocation entries */
uint32_t flags; /* flags (section type and attributes)*/
uint32_t reserved1; /* reserved (for offset or index) */
uint32_t reserved2; /* reserved (for count or sizeof) */
uint32_t reserved3; /* reserved */
};
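As noted in the segment comment earlier, the section structs sit directly after their segment command. A rough sketch of reading them for the 64-bit case follows, where a `section_64` record is 80 bytes and the segment command before it is 72 bytes. `parse_sections_64` is a hypothetical helper and the data is assumed to be little-endian.

```python
import struct

SEGMENT_64_SIZE = 72        # sizeof(struct segment_command_64)
SECTION_64 = "<16s16sQQ8I"  # little-endian section_64, 80 bytes

def _name(raw: bytes) -> str:
    """Turn a fixed 16-byte, NUL-padded name field into a string."""
    return raw.split(b"\0", 1)[0].decode("ascii")

def parse_sections_64(data: bytes, seg_offset: int, nsects: int) -> list:
    """Return the section_64 records that follow the segment command at seg_offset."""
    size = struct.calcsize(SECTION_64)
    sections = []
    for i in range(nsects):
        f = struct.unpack_from(SECTION_64, data,
                               seg_offset + SEGMENT_64_SIZE + i * size)
        sections.append({"sectname": _name(f[0]), "segname": _name(f[1]),
                         "addr": f[2], "size": f[3],
                         "offset": f[4], "align": f[5], "flags": f[8]})
    return sections
```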

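What the common sections do

  • The start of each section plus its size gives the start of the next section (before any alignment padding).
  • cstring holds all hard-coded strings from Objective-C code, including the strings passed to NSLog, but not strings that contain Chinese characters. The strings are stored as plain text in the Data region, and identical strings are stored only once.
  • ustring holds the hard-coded strings that contain Chinese characters; otherwise it behaves like cstring.
  • __stubs looks up a function's entry address in __la_symbol_ptr in the __DATA segment.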
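As a small illustration of the cstring layout, the sketch below splits the raw bytes of such a section into individual strings. It assumes the section is a run of NUL-terminated strings and that the caller has already sliced the section's bytes out of the file using its offset and size; `split_cstrings` is a made-up helper name.

```python
def split_cstrings(section_bytes: bytes) -> list:
    """Split the raw bytes of a cstring section into its strings."""
    return [chunk.decode("utf-8", "replace")
            for chunk in section_bytes.split(b"\0")
            if chunk]  # skip empty pieces left by padding or the final terminator
```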

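Loader Info (link information)

A complete user-level MachO file ends with a series of link information. It contains what the dynamic loader needs to link the executable or its dependencies: the symbol table, the string table, and the dynamic loader info (rebase and bind).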
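To tie this back to crash analysis, here is a hedged sketch of how the symbol table and string table can turn an address into a symbol name. It assumes 64-bit little-endian `nlist_64` entries (16 bytes: n_strx, n_type, n_sect, n_desc, n_value) and that symoff, nsyms and stroff were already read from the LC_SYMTAB command. `read_symbols` and `symbolicate` are hypothetical names, and the lookup simply picks the nearest symbol at or below the address, which is an approximation because the symbol table does not record where a function ends.

```python
import bisect
import struct

N_STAB, N_TYPE, N_SECT = 0xE0, 0x0E, 0x0E  # masks and type value from nlist.h

def read_symbols(data: bytes, symoff: int, nsyms: int, stroff: int) -> list:
    """Return a sorted list of (address, name) for symbols defined in a section."""
    symbols = []
    for i in range(nsyms):
        n_strx, n_type, n_sect, n_desc, n_value = struct.unpack_from(
            "<IBBHQ", data, symoff + i * 16)
        # Skip debugging (stab) entries and anything not defined in a section.
        if n_type & N_STAB or (n_type & N_TYPE) != N_SECT:
            continue
        start = stroff + n_strx
        end = data.index(b"\0", start)  # names are NUL-terminated in the string table
        symbols.append((n_value, data[start:end].decode("utf-8", "replace")))
    return sorted(symbols)

def symbolicate(symbols: list, addr: int):
    """Return (name, offset into symbol) for an unslid address, or None."""
    starts = [start for start, _ in symbols]
    i = bisect.bisect_right(starts, addr) - 1
    if i < 0:
        return None
    start, name = symbols[i]
    return name, addr - start
```

The address passed to `symbolicate` is assumed to be in the file's own address space, that is, with the image's load address slide already subtracted.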