Jul 1, 2022

What is the process for getting clean URLs (without subdomain and extension) indexed?

I prefer clean URLs... so my internal links lack the .html or .php extension, and I'm omitting the www. subdomain. The page on the server might be my-page.html or my-page.php, and the clean URLs are resolved by .htaccess rules so the browser finds the page and shows example.com/my-page. This makes me smile.

For consistency, I entered the clean URLs as canonical links and in the sitemap (as https://example.com/my-page) that I submitted to Google. My hope was to see these clean URLs used everywhere consistently.

Unfortunately, I now see a mess in Search Console. Some indexed pages still have .html. Some still have www. Some pages were not indexed.

I'm attempting to clean up the mess by resubmitting individual URLs, but I have dozens to fix. It seems each link has to be submitted separately. It takes about 30 seconds to submit each one, and I hit my submission cap each day. This will take forever.

There must be a better way to clean up the mess. Any suggestions to make it easier?
Also, did I make the wrong choice with the canonicals and sitemap? If so, what should I have done?

Here's what I have in .htaccess:

Options +FollowSymLinks 
RewriteEngine on 

RewriteCond %{HTTPS} off [OR]
RewriteCond %{HTTP_HOST} ^www\. [NC]
RewriteCond %{HTTP_HOST} ^(www\.)?(.+)$
RewriteRule (.*) https://%2%{REQUEST_URI} [R=301,NE,L] 

# if x.php is a file, add .php to x 
RewriteCond %{REQUEST_FILENAME}\.php -f 
RewriteRule !.*\.php$ %{REQUEST_URI}.php [NC,QSA,L]

# if x.html is a file, add .html to x 
RewriteCond %{REQUEST_FILENAME}\.html -f 
RewriteRule !.*\.html$ %{REQUEST_URI}.html [NC,QSA,L]

# if x ends in / and index.html exists in that directory, add index.html to x
RewriteCond %{REQUEST_FILENAME}index.html -f
RewriteRule !.*index\.html$ %{REQUEST_URI}index.html [NC,QSA,L]

This question is locked and replying has been disabled.
Last edited Jul 2, 2022
Recommended Answer
Jul 1, 2022
301 redirects from the non-canonical URL to the canonical URL will clean it up automatically.
But since each URL needs to be crawled again, it could take time to sort out, especially if the old URL is orphaned (not linked from anywhere).
 
If you're impatient, you could try creating an HTML or XML sitemap containing the incorrect URLs and a correct <lastmod> date to help Googlebot find the redirecting pages. Even then, expect it to take time.
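
For what it's worth, here is a minimal sketch of how such a temporary sitemap could be generated, assuming Python is available. The function name build_sitemap, the example URLs, and the date are all hypothetical; the <lastmod> value should be whatever date the redirects actually went live.

```python
from datetime import date
from xml.sax.saxutils import escape


def build_sitemap(urls, lastmod):
    """Return a sitemap XML string listing each URL with the given lastmod date."""
    lines = [
        '<?xml version="1.0" encoding="UTF-8"?>',
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">',
    ]
    for url in urls:
        lines.append("  <url>")
        # escape() protects characters such as & that are not allowed raw in XML
        lines.append(f"    <loc>{escape(url)}</loc>")
        lines.append(f"    <lastmod>{lastmod.isoformat()}</lastmod>")
        lines.append("  </url>")
    lines.append("</urlset>")
    return "\n".join(lines) + "\n"


# Hypothetical old URLs that now 301-redirect to the clean versions.
old_urls = [
    "https://www.example.com/my-page.html",
    "https://example.com/my-page.html",
    "https://www.example.com/my-page",
]

sitemap_xml = build_sitemap(old_urls, date(2022, 7, 2))
print(sitemap_xml)
```

The output can be saved as a separate file (say, old-urls.xml, a made-up name) and submitted alongside the real sitemap, then removed once the old URLs have dropped out of the index.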
 
Original Poster ChillyWilly2 marked this as an answer
All Replies (4)
Jul 1, 2022
@OptimistPrime - Thank you so much for your reply, and I love your Avatar name. Please help me understand how 301 redirects to the canonical would not create a loop given the .htaccess rules I shared above. Also, since the files themselves have the .html and .php extensions, will pointing 301 redirects to the names without extension even work? I'm very confused.
Last edited Jul 1, 2022
Jul 2, 2022
I had a brief chat with someone who hinted at the existence of internal and external redirects. It seems the rules I posted work internally, and a 301 redirect is external. So if they are working in separate universes then I can see how they might not cause a loop.
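
To convince myself, I sketched the idea as a toy model in Python. This is not how Apache works internally, just an illustration under simplifying assumptions: handle() stands in for the server processing one request, fetch() stands in for a browser or crawler following redirects, and the file names are made up.

```python
def handle(path, files):
    """Toy model of one request: returns (status, target)."""
    # External redirect: the browser asked for an extension, so it is told to ask again.
    for ext in (".php", ".html"):
        if path.endswith(ext):
            return ("301", path[: -len(ext)])
    # Internal rewrite: the clean path is mapped to a file without telling the browser.
    for ext in (".php", ".html"):
        candidate = path.lstrip("/") + ext
        if candidate in files:
            return ("200", candidate)
    return ("404", None)


def fetch(path, files, max_hops=5):
    """Follow 301s the way a browser or crawler would, recording each hop."""
    hops = []
    for _ in range(max_hops):
        status, target = handle(path, files)
        hops.append((path, status, target))
        if status != "301":
            break
        path = target
    return hops


files = {"my-page.html", "contact.php"}  # hypothetical files on disk
print(fetch("/my-page.html", files))
print(fetch("/contact", files))
```

In this model the old URL takes exactly one 301 hop and then resolves, because the redirect check only ever looks at the path the browser sent, never at the file name chosen by the internal mapping.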

I still need to figure out how to sequence the code to accomplish my goal. I'm thinking of adding something like this somewhere in the chain:

# 301 redirect all .html extensioned requests to extensionless
RewriteCond "$1.html" -f
RewriteRule "^(.*)\.html$" "$1" [NE,R=301]

# 301 redirect all .php extensioned requests to extensionless
RewriteCond "$1.php" -f
RewriteRule "^(.*)\.php$" "$1" [NE,R=301]

Jul 2, 2022
Here is the solution I came up with after researching for many hours...

Options +FollowSymLinks -MultiViews
RewriteEngine On
RewriteBase /

RewriteCond %{HTTPS} off [OR]
RewriteCond %{HTTP_HOST} ^www\. [NC]
RewriteCond %{HTTP_HOST} ^(www\.)?(.+)$
RewriteRule (.*) https://%2%{REQUEST_URI} [R=301,NE,QSA,L]

# 301 external redirect all .php extensioned requests to extensionless
RewriteCond %{THE_REQUEST} ^[A-Z]{3,}\s([^.]+)\.php [NC]
RewriteRule ^ %1 [R=301,NC,NE,QSA,L]

# 301 external redirect all .html extensioned requests to extensionless
RewriteCond %{THE_REQUEST} ^[A-Z]{3,}\s([^.]+)\.html [NC]
RewriteRule ^ %1 [R=301,NC,NE,QSA,L]

# if x.php is a file, internally add .php to x
RewriteCond %{REQUEST_FILENAME}\.php -f
RewriteRule !.*\.php$ %{REQUEST_URI}.php [NC,QSA,L]

# if x.html is a file, internally add .html to x
RewriteCond %{REQUEST_FILENAME}\.html -f
RewriteRule !.*\.html$ %{REQUEST_URI}.html [NC,QSA,L]

# if x ends in / and index.html exists in that directory, internally add index.html to x
RewriteCond %{REQUEST_FILENAME}index.html -f
RewriteRule !.*index\.html$ %{REQUEST_URI}index.html [NC,QSA,L]
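
To sanity-check which raw request lines the two THE_REQUEST conditions would catch, the pattern can be exercised outside Apache. Here is a rough sketch in Python, assuming Python's re module treats this particular pattern the same way Apache's regex engine does (I believe that holds for the constructs used here). The helper name redirect_target and the sample request lines are made up for illustration.

```python
import re


def redirect_target(request_line, ext):
    """Return the path captured as %1 if the condition matches, else None."""
    # Same pattern as the RewriteCond lines above, with the extension as a parameter.
    pattern = re.compile(r"^[A-Z]{3,}\s([^.]+)\." + re.escape(ext), re.IGNORECASE)
    match = pattern.match(request_line)
    return match.group(1) if match else None


samples = [
    "GET /my-page.html HTTP/1.1",
    "GET /my-page HTTP/1.1",
    "GET /blog/post.php?id=7 HTTP/1.1",
    "GET /v1.2/page.html HTTP/1.1",
]
for line in samples:
    print(line, "->", redirect_target(line, "html"), redirect_target(line, "php"))
```

The second sample is the important one: an extensionless request never matches. Since %{THE_REQUEST} holds the original request line sent by the browser, the internal rewrite to .html or .php further down should not re-trigger the 301. The last sample shows a limitation of [^.]+: a path with a dot before the extension (such as /v1.2/page.html) is not captured, so it would not be redirected by these rules.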
