Indexable File Formats

This document lists the file formats that the Google Search Appliance can crawl, index, and search.

Overview

The following sections list word processing, spreadsheet, database, presentation, and other formats that the Google Search Appliance can crawl, index, and search. Please note the following:

  • The Google Search Appliance can directly crawl the file formats listed in this document. Other file formats can be indexed and searched by using a content feed or a connector, including, but not limited to, VSD; multimedia content types such as WAV, MP3, and MP4; and dynamic content types such as CGI, PHP, ASP, and ASPX.
  • Text embedded in graphics is not indexed.

    The Google Search Appliance cannot index text contained in graphic file formats, such as JPEG, GIF, or TIFF. When a file in a graphic format is submitted for indexing, text embedded in the graphic is not indexed. However, the file name is indexed. If any metadata is associated with the graphic in HTML meta tags, that metadata is indexed. If a JPEG file has Exchangeable Image Format (EXIF) data, the data is indexed as metadata.

  • Encrypted, viewable PDF documents are converted to HTML for indexing, but the cached HTML is not displayed.
  • Encrypted Excel spreadsheets (xls format) cannot be indexed or searched. If the search appliance attempts to crawl and index an encrypted Excel spreadsheet, you see the following error on the Crawl Diagnostics page:
    Crawled with empty body: Conversion error

    To make Excel spreadsheets indexable, disable encryption on the Excel Tools > Options > Security tab and resave any affected spreadsheets.

  • PDF files created by scanning with optical character recognition (OCR) software are supported.
  • If you are using the Google Search Appliance, metadata can be fed from a database and then indexed.
  • Files in XML format are crawled and indexed as plain text. Links are not extracted or followed, and XML tags are converted to their escaped HTML counterparts.
  • The search appliance supports extraction and indexing of compressed content from the formats listed in Indexable Archive Formats.
  • Microsoft Office 2007 files, which have file extensions of .docx, .pptx, .xlsx, and so on, consist of ZIP archives of many XML files. The search appliance indexes most Microsoft Office 2007 files correctly. However, if the uncompressed file size is larger than 30 MB, the search appliance cannot index the file. In these cases, you see a Conversion error message on the Index > Diagnostics > Index Diagnostics page.
  • The search appliance attempts to determine the type of file it is crawling by first examining the Content-Type HTTP header and then by examining the file extension. Provided that the Content-Type header is present at crawl time, the search appliance crawls and indexes files where the content type does not match the file extension. For example, an HTML file saved with a PDF extension is correctly crawled and indexed as an HTML file.
  • If the search appliance crawls or is fed a document with a MIME type that it does not know how to interpret, it associates that document with text/other and indexes it. In this case, when the document is searched for, the search appliance guesses the actual MIME type from the file extension. If it cannot determine the MIME type from the file extension, then the MIME type returned is application/octet-stream.
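
The last two notes describe a two-stage type decision. The following is a minimal sketch of that logic, not the appliance's actual implementation. The function names and the extension table are hypothetical, and the behavior when no Content-Type header is present and the extension is unknown is an assumption, because the text does not specify it.

```python
import posixpath
from urllib.parse import urlparse

# Hypothetical extension-to-MIME table, for illustration only.
EXTENSION_TYPES = {
    ".html": "text/html",
    ".htm": "text/html",
    ".pdf": "application/pdf",
    ".txt": "text/plain",
}
KNOWN_TYPES = set(EXTENSION_TYPES.values())


def _extension(url):
    """Return the lowercased file extension of the URL path, including the dot."""
    return posixpath.splitext(urlparse(url).path)[1].lower()


def crawl_time_type(url, content_type_header=None):
    """Examine the Content-Type header first, then the file extension."""
    if content_type_header:
        # Drop parameters such as "; charset=utf-8".
        mime = content_type_header.split(";", 1)[0].strip().lower()
        # A type that cannot be interpreted is stored as text/other.
        return mime if mime in KNOWN_TYPES else "text/other"
    # Assumption: without a header, an unknown extension also maps to text/other.
    return EXTENSION_TYPES.get(_extension(url), "text/other")


def serve_time_type(url, stored_type):
    """At search time, guess the type of a text/other document from its extension."""
    if stored_type != "text/other":
        return stored_type
    return EXTENSION_TYPES.get(_extension(url), "application/octet-stream")
```

For example, under these assumptions an HTML file served with a .pdf extension and a `text/html` Content-Type header is treated as HTML, which matches the behavior described above.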

How the Google Search Appliance Determines the Document Title


The Google Search Appliance analyzes documents during the indexing process to determine which text is the document title and which is the body text. How the search appliance makes the determination varies by the document type.

If you want titles extracted from document metadata, do not use a value for the title metadata that is the same as the file name.

The search appliance ignores the title tag in a web page if it has fewer than three characters.
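
The two title rules above can be checked before content is published. The following is a small sketch with hypothetical helper names. Trimming whitespace and comparing names without regard to case are assumptions, not documented appliance behavior.

```python
import posixpath
from urllib.parse import urlparse


def usable_html_title(title_tag_text):
    """Return the title text, or None if it has fewer than three characters."""
    # Assumption: surrounding whitespace is not counted.
    text = (title_tag_text or "").strip()
    return text if len(text) >= 3 else None


def metadata_title_matches_file_name(metadata_title, url):
    """Return True if the title metadata is the same as the file name in the URL."""
    file_name = posixpath.basename(urlparse(url).path)
    # Assumption: the comparison ignores case.
    return metadata_title.strip().lower() == file_name.lower()
```

A document whose title metadata is `report.pdf` and whose URL ends in `/report.pdf` would be flagged by the second helper, so its metadata title should be changed if titles are to be extracted from metadata.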

PDF Documents

The search appliance uses the PDF document title property as the title in the search index. The values of the metadata fields are indexed as part of the document content.

Only documents without copyright protection (documents with printing, copying, and editing enabled) will show cached versions and document previews.

XLS Documents

The search appliance uses the Properties > Title property as the title in the search index. If the search appliance is unable to do this, it uses the name of the first worksheet.

Extracted document properties become metatags in the HTML representation of an XLS document. For example:


<meta http-equiv="Content-Type" content="text/html; charset=Latin1">
<meta name="Producer" content="Acrobat Distiller 4.05 for Windows">
<meta name="ModDate" content="D:20011129112148-06'00'">
<meta name="Author" content="Charles Dickens">
<meta name="CreationDate" content="D:20011129112114">
<meta name="Creator" content="Microsoft Word 9.0">
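
To illustrate how a set of extracted properties maps to meta tags like those above, here is a minimal sketch. The function name is hypothetical, and the escaping shown is an assumption about reasonable HTML generation, not a description of the appliance's converter.

```python
from html import escape


def properties_to_meta_tags(properties):
    """Render a mapping of property names to values as HTML meta tags."""
    return "\n".join(
        '<meta name="{}" content="{}">'.format(
            escape(name, quote=True), escape(str(value), quote=True)
        )
        for name, value in properties.items()
    )


print(properties_to_meta_tags({"Author": "Charles Dickens",
                               "Creator": "Microsoft Word 9.0"}))
```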

Text Documents

Text documents do not have titles associated with the document. The search appliance uses the first 70 bytes of the document as the title when it serves search results.
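
As a rough sketch of this rule, the following hypothetical helper takes the first 70 bytes of a text file as its title. The UTF-8 default and the handling of a multibyte character cut at the 70-byte boundary are assumptions.

```python
def text_document_title(raw_bytes, encoding="utf-8"):
    """Return the first 70 bytes of a text document, decoded for display."""
    # Assumption: a character split by the 70-byte cut is silently dropped.
    return raw_bytes[:70].decode(encoding, errors="ignore")
```

Because the limit is counted in bytes, a title in a multibyte encoding can show fewer than 70 characters.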

Indexable Word Processing Formats


The following table lists supported word processing formats.

| Format | Extension | Versions Supported |
| --- | --- | --- |
| Adobe FrameMaker | mif | Versions 3.0-6.0 |
| Adobe Illustrator PostScript | ppd | Level 2 |
| Ami | sam |  |
| Ami Pro for OS2 | sam |  |
| Ami Pro for Windows | sam | Versions 2.0, 3.0 |
| ANSI Text (7 & 8 bit) | ans | All versions |
| ASCII Text (7 & 8 bit) | txt | All versions |
| DEC DX | dx | Versions through 4.0 |
| DEC DX Plus | wpl | Versions 4.0, 4.1 |
| DisplayWrite | rft, dca | Versions 2.0-5.0 |
| DOS character set |  |  |
| EBCDIC |  |  |
| Enable | wpf | Versions 3.0-4.5 |
| First Choice | pfc | Versions 1.0, 3.0 |
| Framework | net | Version 3.0 |
| Hangul | hwp | Versions 97-2007 |
| HTML | html, htm | Versions 1.0-4.0 (some limitations) |
| IBM DCA/FFT | fft | All versions |
| IBM DCA/Revisable Form Text | rft | All versions |
| IBM Writing Assistant | iwa | Version 1.01 |
| JustSystems Ichitaro | jaw, jbw, jtd | Versions 5.0, 6.0, 8.0-13.0, 2004, and 2010 |
| JustWrite | jw | Versions through 3.0 |
| Kingsoft WPS Writer | wps | Version 2010 |
| Legacy | leg | Version 1.1 |
| Lotus Manuscript | doc | Versions through 2.0 |
| Lotus WordPro | lwp | Versions 9.7, 96-Millennium 9.6 |
| Lotus WordPro (non Win32) | lwp | Versions 97-Millennium 9.6 |
| Macintosh character set |  |  |
| MacWrite II | mcw, mw, mwii | Version 1.1 |
| MASS11 | m11 | Versions through 8.0 |
| Microsoft Publisher (file ID only) | pub | Versions 2003-2007 |
| Microsoft Rich Text Format | rtf | All versions |
| Microsoft Word for DOS | doc | Versions 4.0-6.0 |
| Microsoft Word for Macintosh | doc | Versions 4.0-6.0, 98-2008 |
| Microsoft Word for Windows | doc | Versions 1.0-2010 |
| Microsoft Word for Windows | doc | Version 2003 XML (text only via XML filter) |
| Microsoft Word for Windows | doc | Version 98-J |
| Microsoft WordPad | rtf, doc | All versions |
| Microsoft Works for DOS | wks, wps | Version 2.0 |
| Microsoft Works for Macintosh | wks, wps | Version 2.0 |
| Microsoft Works for Windows | wks, wpf | Versions 3.0, 4.0 |
| Microsoft Write for Windows | wri | Versions 1.0-3.0 |
| MultiMate | dox | Versions through 4.0 |
| MultiMate Advantage | dox | Version 2.0 |
| Navy DIF | dif | All versions |
| Nota Bene | nb | Version 3.0 |
| Novell PerfectWorks | wpw | Version 2.0 |
| Novell WordPerfect for DOS | wpd | Version 4.2 |
| Novell WordPerfect for Mac | wpd | Versions 1.02-3.1 |
| Novell WordPerfect for Windows | wpd | Versions 5.1-X4 |
| Office Writer | ow4 | Versions 4.0-6.0 |
| OpenOffice Writer | odt, ott | Versions 1.1-3.0 |
| Oracle Open Office Writer | odt, ott, sxw, stw | Versions 3.x |
| PC File Doc |  | Version 5.0 |
| PFS: Write | pfb | Versions A, B |
| Professional Write for DOS | pw | Versions 1.0, 2.0 |
| Professional Write Plus for Windows | pw, pwp | Version 1.0 |
| Q&A Write for Windows | dtf | Versions 2.0, 3.0 |
| Samna Word IV | sam, sm | Versions 1.0-3.0 |
| Samna Word IV+ | sam, sm |  |
| Samsung Jungum Global (file ID only) | gul |  |
| Signature | sig | Version 1.0 |
| SmartWare II | smt | Version 1.02 |
| Sprint | spr | Version 1.0 |
| StarOffice Writer | sxw, odt | Versions 5.2-9.0 |
| Total Word | tw | Version 1.2 |
| Unicode Text | txt | Versions 3.0, 4.0 |
| UTF-8 | utf |  |
| Volkswriter 3 & 4 | vw | Versions through 1.0 |
| Wang IWP | iwp | Versions through 2.6 |
| Wireless Markup Language | wml | All versions |
| WordMarc | wmc | Versions through Composer Plus |
| WordPerfect for DOS | wpd | Version 4.2 |
| WordPerfect for Macintosh | wpd | Versions 1.02-3.1 |
| WordPerfect for Windows | wpd | Versions 5.1-X4 |
| WordStar 2000 for DOS | ws1, ws2, ws3 | Versions 1.0-3.0 |
| WordStar for DOS | ws | Versions 3.0-7.0 |
| WordStar for Windows | ws, wst, wsd | Version 1.0 |
| XML (text only) | xml |  |
| XHTML (file ID only) | xhtml | Version 1.0 |
| XyWrite | xy3, xyp, xyw | Versions through III Plus |

Indexable Spreadsheet Formats


The following table lists supported spreadsheet formats.

| Format | Extension | Versions Supported |
| --- | --- | --- |
| Enable | 300, wpf, ssf, dbf | Versions 3.0-4.5 |
| First Choice | ss, fol | Versions through 3.0 |
| Framework | fw3 | Version 3.0 |
| Kingsoft WPS Spreadsheets | wps | Version 2010 |
| Lotus 1-2-3 | wku, wk1, wk2, wk3, wk4, wk5, wki, wks | Versions through Millennium 9.6 |
| Lotus 1-2-3 Charts (DOS & Windows) | wku, wk1, wk2, wk3, wk4, wk5, wki, wks | Versions through 5.0 |
| Lotus 1-2-3 (OS/2) | wku, wk1, wk2 | Versions through 2.0 |
| Lotus Symphony | wr1 | Versions 1.x through 2.0 |
| Microsoft Excel Charts | xlc | Versions 2.x through 7.0 |
| Microsoft Excel for Macintosh | xls | Versions 98-2008 |
| Microsoft Excel for Windows | xls, xlw | Versions 3.0 through 2010 (2007 with extensions xlsx and xlsm) |
| Microsoft Excel for Windows | xlsb | Versions 2007-2010 (binary) |
| Microsoft Excel for Windows | xml | Version 2003 XML (text only via XML filter) |
| Microsoft Works (DOS) | wps, wks, wdb, wcm | Version 2.0 |
| Microsoft Works (Windows) | wps, wks | Versions 3.0, 4.0 |
| Microsoft Works (Macintosh) | wps, wks, wdb, wcm | Version 2.0 |
| Multiplan | col, cod, mod | Version 4.0 |
| Novell PerfectWorks | wpw | Version 2.0 |
| OpenOffice Calc | odc, sdc | Versions 1.1-3.0 |
| Oracle Open Office Calc | ods, ots, sxc, stc | Versions 3.x |
| PFS: Plan | tid | Version 1.0 |
| QuattroPro (DOS) | wkq, wq1 | Versions through 5.0 |
| QuattroPro (Windows) | wb1, wb2, wk3 | Versions through X4 |
| SmartWare II | ws | Version 1.02 |
| SmartWare Spreadsheet | ws |  |
| StarOffice Calc (Windows and UNIX) | sdc, sxc, ods, ots | StarOffice versions 5.2-9.0, and OpenOffice version 1.1 (text only) |
| SuperCalc 5 | cal | Version 5.0 |
| VP-Planner | np | Version 1.0 |

Indexable Database Formats


The following table lists supported database formats.

| Format | Extension | Versions Supported |
| --- | --- | --- |
| DataEase | dba, dbm, dql | Version 4.x |
| DBASE | dbf | Versions III, IV, V |
| First Choice | pfc | Version 3.0 |
| Framework | fwk, fw, fw2, fw3 | Version 3.0 |
| Microsoft Access | mdb | Versions 1.0, 2.0 |
| Microsoft Access Report Snapshot (file ID only) | mdb | Versions 2000-2003 |
| Microsoft Works (DOS) | wdb, wks | Versions 1.0, 2.0 |
| Microsoft Works (Macintosh) | wdb, wks | Version 2.0 |
| Microsoft Works (Windows) | wdb, wks, dbf | Versions 3.0, 4.0 |
| Paradox (DOS) | fsl, db, px | Versions 2.0-4.0 |
| Paradox (Windows) | fsl, db, px | Version 1.0 |
| Q&A | qa, qw, dtf | Versions through 2.0 |
| R:Base 5000 | rbf, dbf | R:Base 5000 |
| R:Base System V | rbf | R:Base System V |
| Reflex | r2d | Version 2.0 |
| SmartWare II | db | Version 1.02 |

Indexable Graphics Formats


The following table lists supported graphics formats. Note that text that is part of a graphic is not indexed. Only file names and metadata are indexed.

| Format | Extension | Versions Supported |
| --- | --- | --- |
| Adobe FrameMaker Graphics | fmv | Vector/raster 3.0-5.0 |
| Adobe Illustrator File Format | ai | Versions 4.0-7.0, 9.0 |
| Adobe Illustrator | xmp | Versions 11-13 (CS 1-3) |
| Adobe InDesign | xmp | Versions 3-5 (CS 1-3) |
| Adobe InDesign Interchange | xmp |  |
| Adobe Photoshop File Format | psd | Versions 8.0-10.0 (CS 1-3) |
| Adobe Photoshop | psd | Version 4.0 |
| Adobe Portable Document Format | pdf | Versions 1.0-1.7 (Acrobat Versions 1-9, including Japanese PDF) |
| Adobe Portable Document Format Package, Portfolio | pdf | Version 1.7 (Acrobat Versions 8-9) |
| Ami Draw | sdw |  |
| AutoCAD Drawing | dwg | Versions 2.5, 2.6 |
| AutoCAD Drawing | dwg | Versions 9.0-14.0 |
| AutoCAD Drawing | dwg | Versions 2000i-2010 |
| AutoShade Rendering | rnd | Version 2 |
| CALS Raster Format | gp4 | Type I and Type II |
| Computer Graphics Metafile | cgm | ANSI, CALS NIST Version 3.0 |
| Corel Draw | cdr | Versions 2.0-9.0 |
| Corel Draw Clipart | cmx | Versions 5.0, 7.0 |
| Encapsulated PostScript | eps | TIFF header only |
| Enhanced Metafile | emf |  |
| Escher graphics |  |  |
| GEM File (vector) | gem |  |
| GEM Image (bitmap) | img | No specific version |
| Graphics Environment Manager | gem | Bitmap and vector |
| Graphics Interchange Format | gif | No specific version |
| Hewlett Packard Graphics Language | hpgl | Version 2 |
| IBM Graphics Data Format | gdf | Version 1.0 |
| IBM Picture Interchange Format | pif | Version 1.0 |
| IGES Drawing | igs | Versions 5.1-5.3 |
| JBIG2 | jb2 | (Graphic embeddings in PDF) |
| JFIF (JPEG not in TIFF format) | jfif | All versions |
| JPEG (including EXIF) | jpeg | All versions |
| JPEG 2000 | jpeg | JP2 |
| Kodak FlashPix | fpx |  |
| Kodak Photo CD | pcd | Version 1.0 |
| Lotus PIC | pic | All versions |
| Lotus Snapshot | snp | All versions |
| Macintosh PICT and PICT2 | pict | Bitmap only |
| MacPaint | pntg | No specific version |
| Micrografx Designer | drw | Versions through 3.1 |
| Micrografx Designer | dsf | Version 6.0 |
| Micrografx Draw | drw | Versions through 4.0 |
| Microsoft Windows Bitmap | bmp |  |
| Microsoft Windows Cursor | cur |  |
| Microsoft Windows Icon | ico |  |
| Microsoft XPS (text only) | xps |  |
| Novell PerfectWorks | draw | Version 2.0 |
| OpenOffice Draw | sda, odg, otg | Versions 1.1-3.0 |
| Oracle Open Office Draw | odg, otg, sxd, std | Versions 3.x |
| OS/2 Bitmap | bmp, ico, ptr |  |
| OS/2 Warp Bitmap | bmp |  |
| Paint Shop Pro 6 (Win32) | psp | Versions 5.0, 6.0 |
| PC Paintbrush | pcx, dcx | All versions |
| Portable Bitmap | pbm | All versions |
| Portable Graymap | pgm | No specific version |
| Portable Network Graphics | png | Version 1.0 |
| Portable Pixmap | ppm | No specific version |
| PostScript | ps | Level 2 |
| Progressive JPEG | jpeg | No specific version |
| StarOffice Draw | sxd | Versions 6.x-9.0 |
| Sun Raster | srs | No specific version |
| TIFF Group 5 & 6 | tiff | Versions through 6 |
| TIFF CCITT Group 3 & 4 | tiff | Versions through 6 |
| TrueVision TGA | targa | Version 2.0 |
| Visio (Page Preview mode) | wmf, emf | Version 4 |
| Visio | vsd | Versions 5.0-2007 |
| Visio (file ID only) | xml, vsx | Version 2007 |
| WBMP wireless graphics format | wbmp | No specific version |
| Windows Enhanced Metafile | emf | No specific version |
| Windows Metafile | wmf | No specific version |
| WordPerfect Graphics | wpg, wpg2 | Versions 1.0, 2.0-10.0 |
| X-Windows Bitmap | xbm | x10 compatible |
| X-Windows Dump | xdm | x10 compatible |
| X-Windows Pixmap | xpm | x10 compatible |

Indexable Presentation Formats


The following table lists supported presentation formats.

| Format | Extension | Versions Supported |
| --- | --- | --- |
| Harvard Graphics Chart (DOS) | hgs, cht, ch3, prs | Versions 2.0-3.0 |
| Harvard Graphics (Windows) | hgs, cht, ch3, prs | Windows versions |
| IBM Lotus Symphony Presentations | odp | Version 1.x |
| Kingsoft WPS Presentation | wps | Version 2010 |
| Lotus Freelance | pre | Versions 1.0-Millennium 9.6 |
| Lotus Freelance for OS/2 | pre | Version 2 |
| Lotus Freelance (Windows) | flw, shw, drw, pre | Versions 95, 97 |
| Microsoft PowerPoint for Windows | pptm, pptx | Versions 3.0-2010 |
| Microsoft PowerPoint for Macintosh | ppt, pptx | Versions 4.0-2008 |
| Microsoft PowerPoint for Windows Slideshow | pps, ppsx | Versions 2007-2010 |
| Novell Presentations | shw | Versions 3.0, 7.0 |
| OpenOffice Impress | odp | Versions 1.1, 3.0 |
| Oracle Open Office Impress | odp, odg, otp, sxi | Version 3.x |
| StarOffice Impress (Windows and UNIX) | text only | StarOffice versions 5.2-9.0 and OpenOffice version 1.1 (text only) |
| WordPerfect Presentations | wpd | Versions 5.1-X4 |

Indexable Email Formats


The following table lists supported email formats.

| Format | Extension | Versions Supported |
| --- | --- | --- |
| Apple Mail Message | emlx | Version 2.0 |
| Encoded mail messages | mht, multipart (alternative, digest, mixed, newsgroup, signed), tnef |  |
| IBM Lotus Notes Domino XML Language DXL | dxl | Version 8.5 |
| IBM Lotus Notes NSF (file ID only) | nsf | Versions 7.x, 8.x |
| IBM Lotus Notes NSF (Windows, Linux x86-32 and Oracle Solaris 32-bit only with Notes Client or Domino Server) | nsf | Version 8.x |
| MBOX Mailbox | mbox | RFC 822 |
| Microsoft Outlook Message (MSG) | msg | Versions 97-2007 |
| Microsoft Outlook Express (EML) | eml |  |
| Microsoft Outlook Forms Template (OFT) | oft | Versions 97-2007 |
| Microsoft Outlook OST | ost | Versions 97-2007 |

Indexable Multimedia Formats


The following table lists supported multimedia formats.

| Format | Extension | Versions Supported |
| --- | --- | --- |
| AVI (metadata extraction only) | avi |  |
| Flash (text extraction only) | swf | Versions 6.x, 7.x, Lite |
| Flash (file ID only) | swf | Versions 9, 10 |
| Real Media (file ID only) | rm |  |
| MP3 (ID3 metadata only) | id3 |  |
| MPEG-1 Audio layer 3 V ID3 (file ID only) | mp3 | Versions 1, 2 |
| MPEG-1 Video (file ID only) | mpg | Versions 2, 3 |
| MPEG-2 Audio (file ID only) | mpg |  |
| MPEG-4 (metadata extraction only) | mp4 |  |
| MPEG-7 (metadata extraction only) | mp7 |  |
| QuickTime (metadata extraction only) | mov, qt |  |
| Windows Media ASF (metadata extraction only) | wma, wmv |  |
| Windows Media DVR-MS (metadata extraction only) | dvr-ms |  |
| Windows Media Audio WMA (metadata extraction only) | wma |  |
| Windows Media Playlist (file ID only) | wpl |  |
| Windows Media Video WMV (metadata extraction only) | wmv |  |
| WAV (metadata extraction only) | wav |  |

Indexable Archive Formats


The following table lists supported archive formats.

| Format | Extension | Versions Supported |
| --- | --- | --- |
| 7z (BZIP2 and split archives not supported) | 7z |  |
| 7z Self Extracting .exe (BZIP2 and split archives not supported) | exe |  |
| LZA Self Extracting Compress |  |  |
| LZH Compress | lzh |  |
| Microsoft Office Binder | obt | Versions 95-97 |
| Microsoft Cabinet | cab |  |
| RAR | rar | Versions 1.5, 2.0, 2.9 |
| Self-extracting .exe | exe |  |
| UNIX Compress | z |  |
| UNIX GZip | gz, tgz |  |
| UNIX tar | tar |  |
| Uuencode | uue |  |
| Zip | zip | PKZip |
| Zip | zip | WinZip |

To enable the search appliance to crawl these types of compressed files, comment out these file types under Do Not Follow Patterns on the Content Sources > Web Crawl > Start and Block URLs page.
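
As an illustration of what commenting out these file types could look like, the sketch below rewrites a block of patterns. It assumes that each pattern sits on its own line, that archive patterns have the form `.zip$`, and that a leading `#` marks a comment. None of this is stated in this document, so check the patterns shown on your own Start and Block URLs page before relying on it. The function name is hypothetical.

```python
def comment_out_patterns(patterns_text, extensions):
    """Prefix '#' to every active pattern line that ends in '.<ext>$'."""
    # Assumed pattern form: ".zip$", ".tar$", and so on.
    suffixes = tuple(".{}$".format(ext.lower()) for ext in extensions)
    result = []
    for line in patterns_text.splitlines():
        stripped = line.strip()
        is_active = bool(stripped) and not stripped.startswith("#")
        if is_active and stripped.lower().endswith(suffixes):
            result.append("#" + line)
        else:
            result.append(line)
    return "\n".join(result)


print(comment_out_patterns(".zip$\n.exe$\n.tar$", ["zip", "tar"]))
```

Lines that are already commented out, and patterns for other file types, are left unchanged.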

Other Indexable Formats


The following table lists other supported formats.

| Format | Extension | Versions Supported |
| --- | --- | --- |
| AOL Messenger (file ID only) | aim | Version 7.3 |
| Microsoft InfoPath (file ID only) | xsn | Version 2007 |
| Microsoft Live Messenger (via XML filter) | eml | Version 2010 |
| Microsoft OneNote (file ID only) | one | Version 2007 |
| Microsoft Outlook Message | msg | Versions 97 through 2007 |
| Microsoft Project (table view only) | mpp | Versions 98-2003, 2007, 2010 |
| Microsoft Windows Compiled Help (file ID only) | chm |  |
| Microsoft DLL | dll |  |
| Microsoft Executable | exe |  |
| Microsoft Windows Explorer Command (file ID only) | scf |  |
| Microsoft Windows Help (file ID only) | hlp |  |
| Microsoft Windows Shortcut (file ID only) | lnk, url |  |
| Trillian Text Log File (via text filter) | txt | Version 4.2 |
| Trillian Text Log File (file ID only) | txt | Version 4.2 |
| TrueType Font (file ID only) | ttf, ttc |  |
| vCalendar | vcs | Version 2.1 |
| vCard | vcf, vcard | Version 2.1 |
| Yahoo Messenger | log | Versions 6.x-8 |
