1/16/10
Original Poster
sook65

SEO strategy for PDF files

I have read the FAQs and checked for similar issues: YES
My site's URL is: www.hpisum.com
Description (including timeline of any changes made):
 
The content from the PDF files on our site is not contributing at all to our search engine traffic. Though they are indexed, no one is able to find them. So, we are exploring options to make this content more available. We are currently developing a massive restructure of the website.
 
Well over half of the content that we have on our website is in the form of PDF files. This is because our company has published a printed magazine for about 20 years, and we have put some of those articles on the website. Many of those PDFs can be found here: http://www.hpisum.com/page.aspx?page_id=12
 
We know that it is possible to create PDFs in an SEO-friendly way (e.g., using headings, titles, file names, etc.). However, looking back, none of those PDFs are optimized, and repairing them is just too daunting a task.
 
So, at this point, we are thinking about providing SEO-friendly HTML versions, summaries, or excerpts of the PDF files. (This would also be a service to our customers, since they wouldn't have to go through the hassle of downloading a file.) It would also be easier to link to and from the magazine articles, and it would give more fluidity to our site, because the content could be viewed without people leaving it.
 
However, the biggest problem here is probably duplicate content. Will we get docked for providing both HTML and PDF versions of content as a service to surfers?
Should we use robots.txt to block Google from our PDFs and just use the HTML versions? It just sounds ironic that we would block Google from the very content that we are trying to make more available...
 
What is the best way to deal with getting content in PDF files to be found on Google?
 
I have done online research on SEO and PDFs, but I just haven't found anything definitive.
 
Community content may not be verified or up-to-date. Learn more.
All Replies (9)
Google user
1/16/10
Well, if you are not going to improve the SEO aspect of the PDFs, the only other option is to generate the content in a different format and promote that.
(No idea how much "SEO" you can do on a PDF, though.)

And if you aren't going to do the SEO work on the PDFs, you might as well block them (or send Google a noindex directive in the HTTP response headers) to avoid duplicate-content issues.

Google user
1/16/10
Agree with Autocrat that a different format would be a good option. Search engines apart, many visitors don't appreciate PDFs, and the content in yours is a natural part of your site.

As an alternative to removing them, you could put 301 redirects in place. It is only a few hours' work and well worth the time: it will clear up the situation quicker and is likely to carry forward any value that is there.
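
One way to keep the 301s manageable is a simple lookup table from each old PDF URL to its HTML replacement. Below is a minimal sketch in Python, purely for illustration: the paths and the `redirect_for` helper are hypothetical, and on a real site the same mapping would normally live in the web server or CMS configuration.

```python
from typing import Dict, Optional, Tuple

# Hypothetical mapping from old PDF paths to the new HTML pages.
REDIRECTS: Dict[str, str] = {
    "/pdf/issue-42-site-planning.pdf": "/articles/issue-42-site-planning.html",
}


def redirect_for(path: str) -> Optional[Tuple[str, str]]:
    """Return (status line, target URL) for a moved PDF, or None if the path is not mapped."""
    target = REDIRECTS.get(path)
    if target is None:
        return None
    # 301 signals a permanent move, as opposed to a temporary 302.
    return ("301 Moved Permanently", target)
```

A request handler would send the returned status line together with a `Location` header set to the target URL.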
1/16/10
Original Poster
sook65
I like those suggestions. It just makes more sense to provide HTML versions of the documents. We will probably start converting the files, and releasing them in phases.
JohnMu
1/18/10
Hi sook65

As the others mentioned, using HTML versions (or even the HTML summaries you mentioned) would be a great idea in my opinion, if most of your content is in the PDF files at the moment. While PDFs are certainly useful if you need to keep everything looking exactly like you have it in the magazine, web-users may prefer to be able to read the content before opting to download and view a PDF file.

If you opt to focus on your PDF files, I would recommend making sure that the content is available in a textual form within the PDF file and that it is in the proper character-sets (a simple test is to select some of the text and to try to copy it into a text-editor) -- from looking at your PDFs, it appears that your files are fine in that regard at the moment. By keeping the textual content accessible like that, you're making it easier for search engines to extract the relevant information from your PDF files.

You generally do not need to worry about duplicate content in a situation like this, even if you decide to mirror the content of your PDFs on HTML pages. If we recognize the URLs as containing duplicate content, we'll just show one of them to users when they search; your site generally wouldn't have any disadvantage by doing this.

Also, one thing which I noticed when looking at the URL that you posted is that the PDFs have somewhat strange file names. While this generally won't stop us from finding them, it may make it difficult for users to link to your content directly, which in turn, may make it hard for other people to discover your content. To help in that respect, I would suggest using a simple system for file names, perhaps based on the issue number or the date of publication, that also excludes "special" characters like spaces and punctuation marks.
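
A naming scheme like the one suggested above could be generated automatically. The following is a rough sketch in Python, not anything the site actually uses: the function name `clean_file_name`, the `YYYY-MM-issue-N-title.pdf` pattern, and the sample title are all assumptions made for illustration.

```python
import re
import unicodedata
from datetime import date
from typing import Optional


def clean_file_name(title: str, published: date, issue: Optional[int] = None) -> str:
    """Build a lowercase, hyphen-separated PDF file name without spaces or punctuation."""
    # Strip accents so that only plain ASCII letters and digits remain.
    ascii_title = (
        unicodedata.normalize("NFKD", title).encode("ascii", "ignore").decode("ascii")
    )
    # Collapse every run of non-alphanumeric characters into a single hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", ascii_title.lower()).strip("-")

    parts = [published.strftime("%Y-%m")]  # publication date first, so files sort by date
    if issue is not None:
        parts.append(f"issue-{issue}")
    if slug:
        parts.append(slug)
    return "-".join(parts) + ".pdf"


# Example with a made-up article title:
print(clean_file_name("Site Planning, Part 2", date(2009, 4, 1), issue=42))
```

Putting the date or issue number first keeps the names predictable, and restricting them to letters, digits, and hyphens avoids URL-encoding problems when people link to the files.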

Hope it helps!
Cheers
John
1/18/10
Original Poster
sook65
Thanks for all the time and thought that you put into answering this question. I really appreciate the insight. JohnMu -- thanks for giving a good answer about whether or not we would be penalized for having duplicate content in both PDF and HTML format.

We'll definitely create a unique web page for each PDF article, so we can provide the content in HTML. We will also optimize the pages with sensible file names, title/description tags, etc. We would also provide an optional link for readers to download the article in PDF format, so they can see how it was laid out in the printed magazine. Another advantage is that this would make it easier for surfers to link to and bookmark these pages.
 
Thanks again
hebaatef
1/21/10
John Mu: thanks for your post about duplicate content. What is the case with duplicate content across domains? Would Google ban a domain because it is forwarded to another domain? Right now I have a website with content that is not yet optimized for search engines. I am re-branding the business and would like to go with a new domain name, and I was thinking of forwarding the current domain to the new one. Would this be interpreted as an attempt to spam the search engine? H.A.
Google user
1/22/10
...hebaatef...
There are more than a few topics covering your question; searching could possibly save you posting.
Failing that, if you have a question/issue, you should create your own topic and post all relevant info in that, rather than posting in someone else's topic.
Those points aside, a very big "Thank you" for at least looking for a related topic - very much appreciated :D


Yura
1/27/10
John, when you say "we'll just show one of them to users when they search", that's the problem, because I'd rather have my visitors land on the HTML page, not the PDF.

Would you recommend blocking the PDFs in this case, or will you try to have visitors land on the HTML pages rather than the PDFs anyway if I leave them unblocked?
 
Thanks.
JohnMu
1/27/10
Hi Yura!

If you have the same content in PDF as in HTML pages, in most cases we'll probably show the HTML versions above (or in place of) the PDF versions. If this is a problem for your specific situation, I'd consider using robots.txt or the X-Robots-Tag HTTP header to prevent the PDF files from getting indexed. I imagine for most sites this is not really a problem, so I wouldn't suggest blocking indexing of PDF files without confirming that it's really necessary.
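
For the X-Robots-Tag route, the header has to be sent with each PDF response. As a rough illustration only, here is a small Python WSGI wrapper that does this; the name `noindex_pdfs` is made up, and on most sites the same effect would more likely be achieved in the web server configuration. Note that a crawler can only see this header on URLs it is allowed to fetch, so it would not be combined with a robots.txt block on the same files.

```python
def noindex_pdfs(app):
    """Wrap a WSGI app so that responses for *.pdf paths carry X-Robots-Tag: noindex."""

    def middleware(environ, start_response):
        # Decide from the request path whether this response is a PDF file.
        is_pdf = environ.get("PATH_INFO", "").lower().endswith(".pdf")

        def patched_start_response(status, headers, exc_info=None):
            if is_pdf:
                # Add the header without mutating the list owned by the wrapped app.
                headers = list(headers) + [("X-Robots-Tag", "noindex")]
            return start_response(status, headers, exc_info)

        return app(environ, patched_start_response)

    return middleware
```

HTML pages pass through unchanged, so only the PDF copies are kept out of the index.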

The only situation where I would consider doing something in advance is when the CMS automatically creates PDF-copies of normal HTML pages. Generally speaking, this shouldn't cause any problems, but those PDF versions are likely not compelling enough to merit getting indexed separately (and crawling them will possibly put a load on your server that you could avoid). Ultimately, it's up to you to determine which content you wish to have crawled and indexed :-) -- if you feel that PDF-copies of your content are compelling enough for users who search for your content, feel free to make them available.

Cheers
John