HTTPS redirect does not work for crawlers/robots

My domain is:

broadband4europe(dot)com

My web server is (include version):

Apache2

The operating system my web server runs on is (include version):

Ubuntu 20.04

I can login to a root shell on my machine (yes or no, or I don't know):

Yes

I'm using a control panel to manage my site (no, or provide the name and version of the control panel):

No

The version of my client is (e.g. output of certbot --version or certbot-auto --version if you're using Certbot):

0.40.0


I chose the forced redirect when setting up the cert. It works fine for users, but robots see the http version. For example, Google is indexing some http pages, and the Redirect Checker tool (and other tools) shows 200 when given an http link, no matter which user agent is chosen.

I am using Full SSL in Cloudflare and have also tried Flexible. I have set the site URL to https://www in WordPress.

The .conf in sites-available looks like this. If I uncomment the four Rewrite lines (lines 20-23 of the file), I get an http > https > http redirect chain.

# Added to mitigate CVE-2017-8295 vulnerability
UseCanonicalName On

<VirtualHost *:80>
        ServerAdmin X
        
        ServerName broadband4europe(d0t)com
        ServerAlias www.broadband4europe(d0t)com
        
        DocumentRoot /var/www/broadband4europe(d0t)com/public_html

        <Directory /var/www/broadband4europe(d0t)com/public_html/>
            Options FollowSymLinks
            AllowOverride All
            Require all granted
        </Directory>

        ErrorLog ${APACHE_LOG_DIR}/error.log
        CustomLog ${APACHE_LOG_DIR}/access.log combined
#RewriteEngine on
#RewriteCond %{SERVER_NAME} =broadband4europe(d0t)com [OR]
#RewriteCond %{SERVER_NAME} =www.broadband4europe(d0t)com
#RewriteRule ^ https://%{SERVER_NAME}%{REQUEST_URI} [END,NE,R=permanent]
</VirtualHost>

I have other sites running on this server with no issues.

The site is behind Cloudflare CDN:

Name:      broadband4europe(dot)com
Addresses: 2606:4700:3034::ac43:bb59
           2606:4700:3033::6815:38ae
           172.67.187.89
           104.21.56.174

Correct.

Given that Apache's redirects are commented out, any redirect is probably being done by Cloudflare. Do you have cache rules or something similar set up that would serve robots.txt from there?

In any event, server redirect strategy is not much of a Let's Encrypt issue. I think asking about this on the Cloudflare forum will help you more directly.


You show the HTTP vhost.
That should not be in use, as CF should redirect all HTTP to HTTPS.

Then you should set CF to redirect to HTTPS.
And be sure you are accessing your site through CF [not directly].


That suggests your HTTPS virtualhost has some redirect too. Please also show the HTTPS virtualhost.

Also, what's your Cloudflare "setting" with regard to HTTPS? "Flex"? "Full"? Something else?


CF SSL/TLS is already set to "Full" and I have already tried "Flexible" as mentioned in the OP. This ("Full" plus Let's Encrypt) is the same setup as I have on all of my other sites including those running on this server.

There are no specific caching rules set up in Cloudflare. Robots.txt looks the same as the other sites I have set up that are not experiencing this issue.

I have tried turning off CF SSL; this results in a redirect loop when accessing either the http or the https version.
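
A redirect loop or chain like the ones described here can be made visible hop by hop. The following is only a minimal sketch, not something posted in the thread: redirect_chain is a hypothetical helper name, and it assumes curl and a POSIX awk are installed. The function itself only parses header text from stdin, so it can be fed the output of curl -sIL.

```shell
#!/bin/sh
# Hypothetical helper: reads one or more HTTP response header blocks
# from stdin and prints "<status> <Location or ->" for each hop.
redirect_chain() {
    tr -d '\r' | awk '
        # A new status line starts a new hop; flush the previous one first.
        /^HTTP\// { if (status != "") print status, loc; status = $2; loc = "-"; next }
        # Header names are case-insensitive (HTTP/2 sends them lowercase).
        tolower($1) == "location:" { loc = $2 }
        END { if (status != "") print status, loc }
    '
}

# Usage (the URL is a placeholder; --max-redirs stops curl if the chain loops):
#   curl -sIL --max-redirs 10 "http://www.example.com/" | redirect_chain
```

A chain that alternates between http:// and https:// Location values points at two layers (for example the CDN and the application) each redirecting in the opposite direction.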

I have tried adding this to .htaccess and restarting Apache to no avail.

RewriteEngine On
RewriteBase /
RewriteCond %{HTTPS} =on
RewriteCond %{HTTP_HOST} ^www.broadband4europe(d0t)com

sites-available le-ssl.conf looks like this:

<IfModule mod_ssl.c>
<VirtualHost *:443>
        ServerAdmin X
        
        ServerName broadband4europe(d0t)com
        ServerAlias www.broadband4europe(d0t)com
        
        DocumentRoot /var/www/broadband4europe(d0t)com/public_html

        <Directory /var/www/broadband4europe(d0t)com/public_html/>
            Options FollowSymLinks
            AllowOverride All
            Require all granted
        </Directory>

        ErrorLog ${APACHE_LOG_DIR}/error.log
        CustomLog ${APACHE_LOG_DIR}/access.log combined

Include /etc/letsencrypt/options-ssl-apache.conf
SSLCertificateFile /etc/letsencrypt/live/broadband4europe(d0t)com/fullchain.pem
SSLCertificateKeyFile /etc/letsencrypt/live/broadband4europe(d0t)com/privkey.pem
</VirtualHost>
</IfModule>

What is the exact user agent string of a bot that fails to access your site? You can test specific user agents with curl:

curl -A "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" https://www.broadband4europe.com

Please show:
sudo apachectl -t -D DUMP_VHOSTS


Bots are not having issues accessing the site. The problem is that they are not being redirected to https when requesting http URLs.

Googlebot (whichever version is currently used to crawl a URL submitted through GSC) and Toolbot (which the Redirect Checker tool uses) are two examples. The same issue occurs with the other user agents that the site lets you select. I am not sure how to extract the exact user agent string of these two bots.

sudo apachectl -t -D DUMP_VHOSTS returns:

AH00558: apache2: Could not reliably determine the server's fully qualified domain name, using 10.16.0.5. Set the 'ServerName' directive globally to suppress this message
VirtualHost configuration:
*:443                  is a NameVirtualHost
         default server othersite1(d0t)com (/etc/apache2/sites-enabled/000-default-le-ssl.conf:2)
         port 443 namevhost othersite1(d0t)com (/etc/apache2/sites-enabled/000-default-le-ssl.conf:2)
                 alias www.othersite1(d0t)com
         port 443 namevhost broadband4europe(d0t)com (/etc/apache2/sites-enabled/broadband4europe(d0t)com-le-ssl.conf:2)
                 alias www.broadband4europe(d0t)com
*:80                   is a NameVirtualHost
         default server othersite1(d0t)com (/etc/apache2/sites-enabled/000-default.conf:4)
         port 80 namevhost othersite1(d0t)com (/etc/apache2/sites-enabled/000-default.conf:4)
                 alias www.othersite1(d0t)com
         port 80 namevhost broadband4europe(d0t)com (/etc/apache2/sites-enabled/broadband4europe(d0t)com.conf:4)
                 alias www.broadband4europe(d0t)com
         port 80 namevhost othersite2(d0t)com (/etc/apache2/sites-enabled/othersite2(d0t)com.conf:4)
                 alias www.othersite2(d0t)com
         port 80 namevhost othersite3(d0t)com (/etc/apache2/sites-enabled/othersite3(d0t)com.conf:4)
                 alias www.$domain



Also note that curl won't follow redirects if you don't ask it to (-L)
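
To illustrate the point about -L: without it, curl reports only the first response, which is exactly what a redirect checker needs to see. This is a rough sketch under assumptions; first_hop is a made-up helper name, and the URL in the usage comment is a placeholder. It reads headers from stdin and prints the status code and Location of the first response only.

```shell
#!/bin/sh
# Hypothetical helper: prints "<status> <Location or ->" for the FIRST
# response in a header dump read from stdin, ignoring anything after it.
first_hop() {
    tr -d '\r' | awk '
        /^HTTP\// && !seen { status = $2; seen = 1; next }
        seen && tolower($1) == "location:" { loc = $2 }
        # The blank line ends the first header block; stop reading there.
        seen && /^$/ { exit }
        END { print status, (loc == "" ? "-" : loc) }
    '
}

# Usage (no -L, so curl does not follow the redirect):
#   curl -sI -A "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" \
#        "http://www.example.com/" | first_hop
```

A working forced redirect would show a 301 with an https:// Location for the http URL; a plain 200 means the http request was answered directly.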


I would remove all the redirection rules from it, as its location is served by both HTTP and HTTPS.


I see, so is X-Redirect-By: WordPress not relevant? I don't use WordPress, but I'd risk the assumption that the redirect is being issued by WordPress (not Apache or Cloudflare).


Currently using the default WordPress .htaccess (the one from the "htaccess" page of the WordPress.org documentation).

Could X-Redirect-By be contributing to this? If so, I can try to disable this response header. That said, I've never had to do that before on WordPress, and I haven't done anything with this install that should affect redirects.
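
For what it's worth, X-Redirect-By only labels which component issued a redirect; it does not cause one. A small sketch for pulling that label out of a response follows. The helper name redirect_by is invented for illustration, and the URL in the usage comment is a placeholder.

```shell
#!/bin/sh
# Hypothetical helper: prints the value of the X-Redirect-By header from a
# header dump on stdin, or a note if the header is absent.
redirect_by() {
    tr -d '\r' | awk -F': *' '
        tolower($1) == "x-redirect-by" { print $2; found = 1; exit }
        END { if (!found) print "(no X-Redirect-By header)" }
    '
}

# Usage:
#   curl -sI "https://www.example.com/" | redirect_by
```

If this prints WordPress, the redirect came from the application layer rather than from the Apache vhost or from Cloudflare.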

Turns out it was a Let's Encrypt issue. Replacing the certificate fixed the problem.

I highly doubt that it was the certificate. Certificates don't influence redirects themselves.


Replacing the certificate probably also involved restarting/reloading the service.
