How can I create a txt sitemap of a website?

How can I automatically generate a sitemap of a website as a text file? A sitemap.txt file that only contains links.
It can be done manually, but maybe a shell or other script could do it? There's nothing in ports for this.

An automatically created XML sitemap could be useful too, but I'm more interested in a txt sitemap generated from the FreeBSD command line. Since I temporarily forgot what I already had, and that is still available, maybe this isn't so urgent for me anymore.

I copied and adjusted a php file on a website; when I type that web address in, it makes a txt sitemap in that directory. I guess a php file could be used from my computer as well. I'm more familiar with Perl scripts from ports doing similar tasks.
<?php
set_time_limit(0);
$page_root = '/directory'; // path to the site's document root, change this

$website = "https://mywebsite.net"; //your website address, change this
$textfile="sitemap.txt"; //your sitemap text file name, no need change


// list of file filters

$filefilters[]=".php";
$filefilters[]=".shtml";

// list of disallowed directories
$disallow_dir[] = "unused";



// list of disallowed file types
$disallow_file[] = ".bat";
$disallow_file[] = ".asp";
$disallow_file[] = ".old";
$disallow_file[] = ".save";
$disallow_file[] = ".txt";
$disallow_file[] = ".js";
$disallow_file[] = "~";
$disallow_file[] = ".LCK";
$disallow_file[] = ".zip";
$disallow_file[] = ".ZIP";
$disallow_file[] = ".CSV";
$disallow_file[] = ".csv";
$disallow_file[] = ".css";
$disallow_file[] = ".class";
$disallow_file[] = ".jar";
$disallow_file[] = ".mno";
$disallow_file[] = ".bak";
$disallow_file[] = ".lck";
$disallow_file[] = ".BAK";
$disallow_file[] = ".bk";
$disallow_file[] = "mksitemap";


// simple compare function: true if $key equals any element of $array
function ar_contains($key, $array) {
    foreach ($array as $val) {
        if ($key == $val) {
            return true;
        }
    }
    return false;
}

// substring compare function: true if any element of $array occurs within $key
function fl_contains($key, $array) {
    foreach ($array as $val) {
        // strict comparison, since strpos() returns 0 for a match at position 0
        if (strpos($key, $val) === FALSE) continue;
        return true;
    }
    return false;
}

// replace the substring $old_offset with $offset in each array element
function changeOffset($array, $old_offset, $offset) {
    $res = array();
    foreach ($array as $val) {
        $res[] = str_replace($old_offset, $offset, $val);
    }
    return $res;
}

// this walks recursively through all directories starting at $page_root and
// adds all files that fit the filter criteria
// taken from Lasse Dalegaard, http://php.net/opendir
function getFiles($directory, $directory_orig = "", $directory_offset = "") {
    global $disallow_dir, $disallow_file, $filefilters;
    if ($directory_orig == "") $directory_orig = $directory;

    if (!($dir = opendir($directory))) {
        return array(); // unreadable directory: return an empty list
    }

    // Create an array for all files found
    $tmp = array();

    // Add the files
    while ($file = readdir($dir)) {
        // skip "." and ".." and any other dotfile
        if ($file == "." || $file == ".." || $file[0] == '.') continue;

        if (is_dir($directory . "/" . $file)) {
            // it's a directory: skip it if disallowed, otherwise list all files within it
            $disallowed_abs = fl_contains($directory . "/" . $file, $disallow_dir); // directories given with paths
            $disallowed = ar_contains($file, $disallow_dir);                        // bare directory names
            if ($disallowed || $disallowed_abs) continue;

            $tmp2 = changeOffset(getFiles($directory . "/" . $file, $directory_orig, $directory_offset), $directory_orig, $directory_offset);
            if (is_array($tmp2)) {
                $tmp = array_merge($tmp, $tmp2);
            }
        } else { // files
            if (fl_contains($file, $filefilters) == false) continue; // keep only wanted extensions
            if (fl_contains($file, $disallow_file)) continue;        // drop disallowed file types
            array_push($tmp, str_replace($directory_orig, $directory_offset, $directory . "/" . $file));
        }
    }

    // Finish off the function
    closedir($dir);
    return $tmp;
}

// write $somecontent to $filename (defined here but not called below)
function WriteTextFile($filename, $somecontent)
{
    if (is_writable($filename)) {

        // open $filename in write mode; the file is truncated,
        // so $somecontent replaces whatever was there before
        if (!$handle = fopen($filename, 'w')) {
            echo "Cannot open file ($filename)<br>";
            return false;
        }

        // Write $somecontent to our opened file.
        if (fwrite($handle, $somecontent) === FALSE) {
            echo "Cannot write to file ($filename)<br>";
            return false;
        }

        //echo "Success, wrote ($somecontent) to file ($filename)";

        fclose($handle);

    } else {
        echo "The file $filename is not writable<br>";
    }

    return true;
}

// escape '&' as '&amp;' (XML/HTML escaping; not needed for the text sitemap, not called below)
function replacespecailstr($srcstr)
{
    $srcstr = str_replace('&', '&amp;', $srcstr);
    return $srcstr;
}

echo 'Getting File List...<br>';

$a = getFiles($page_root);

echo 'Creating Text Sitemap...<br>';

$texthandle = fopen($textfile, 'w');
if (!$texthandle) {
    die("Cannot open $textfile for writing<br>");
}

$count = 0;
foreach ($a as $file) {
    // build the URL: drop the ".php" extension, encode the path, then restore the "/" separators
    $text = str_replace("%2F", "/", $website . rawurlencode(str_replace(".php", "", $file)));

    if (!fwrite($texthandle, $text . "\r\n")) {
        echo 'cannot write ' . $text . '<br>';
        break;
    }

    $count++;
}

fclose($texthandle);

echo 'create ok, count=' . $count . '<br>';
?>
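
I guess running it from my computer with the PHP command-line interpreter would look roughly like this (just a sketch; the package name, the paths, and the idea of pointing $page_root at a local copy of the site are my assumptions):

# assumption: any recent PHP port/package, e.g. lang/php83, provides the php CLI binary
pkg install php83                  # as root, or via doas/sudo
cd /usr/local/www/mysite           # hypothetical local copy of the site's document root
php /path/to/mksitemap.php         # the script writes sitemap.txt into the current directory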
 
Scrapy has a sitemap spider module.


You could first create a regular sitemap using one of the free tools available and then use Scrapy to extract the URLs.

Maybe Scrapy is capable of generating a sitemap too; I don't know.
 
Most site crawlers do xml sitemaps. I couldn't find one that does txt sitemaps, because they're less popular. It was difficult enough finding the php script above to copy to a website and use.

I wanted a Perl script as a port on my computer, even though I don't know how to use Perl. There's nothing in ports for that, except for specific types of sites.
 
Do not concern yourself with sitemaps for small sites. The purpose is to allow search engines to find paths and pages that they may not discover on their own. This is more likely on large corporate sites but completely unnecessary for you and me.
 
Your script checks for files and excludes some of them by extension, so e.g. dynamic URLs won't be included. And are deep links wanted or not? What about the server and document root setup - does it handle e.g. different domains differently? And much more…

Generating a sitemap is never something that can simply be copied from one webspace to another - it's always an individual task; a "generic script" may show the desired results, but I wouldn't rely on it.

But if you already have it as XML, it is certainly simple to extract the wanted text file from it.
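
On FreeBSD that last step could look roughly like this - only a sketch: it assumes the usual <loc>…</loc> elements, and sitemap.xml / sitemap.txt are just example file names:

grep -o '<loc>[^<]*</loc>' sitemap.xml \
    | sed -e 's|^<loc>||' -e 's|</loc>$||' -e 's|&amp;|\&|g' > sitemap.txt

The last sed expression undoes the &amp; escaping that XML sitemaps use, so the text file ends up with plain URLs.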
 
I want a text sitemap to copy to my computer and run through a validator, like so: Thread webpage-html-validator-checker-using-freebsd.71553. For any little thing, I can edit it. Doing it all by hand is too tedious and may take over an hour. A text sitemap would also be good just to have on the server, so I wouldn't need an XML sitemap that covers every page with extra markup and additional data (optional for entries) like timestamps.

I'm not a fan of Javascript. Java itself may be great, but its biggest problem is that its name is part of the name Javascript.

I've taken an xml sitemap generated online, and converted it into a text sitemap before. That's not what I'm looking for.

The main question is how it would be done on the FreeBSD terminal.
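
For example, if a copy of the document root were on my FreeBSD machine, is something like this rough sketch the way to do it (the paths, domain, excluded directory, and the .php-to-URL mapping are just placeholders to adjust)?

find /usr/local/www/mysite -type f \( -name '*.php' -o -name '*.shtml' \) \
    ! -path '*/unused/*' \
    | sed -e 's|^/usr/local/www/mysite|https://mywebsite.net|' -e 's|\.php$||' \
    | sort > sitemap.txt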
 
so I wouldn't need an XML sitemap to cover every page with extra markup and additional data
You shouldn't be making a sitemap for every page. That's not how you use sitemaps. As I explained earlier, these are used to help search engines discover the layout of your site when it may be difficult otherwise, such as whole sections that are not linked to from the rest of the site.
 
The validator on my FreeBSD computer uses sitemap.txt, or a copy of that sitemap (as a plain text file) on my computer, so yes. This was answered above, and in the other thread, which validates links in a text file entirely on my FreeBSD computer. The contents of validator.sh on my FreeBSD machine and the ports it uses are in that other thread.

sitemap.txt is just a text file with an html address on every line, without other information. It serves a dual purpose: as a regular sitemap that doesn't need additional data, and as something to copy to a text file on my computer to run through the validator from that thread.
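
For example, something like this (placeholder URLs matching the $website setting in the script above):

https://mywebsite.net/
https://mywebsite.net/contact
https://mywebsite.net/gallery/photos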

How many pages you got?
I missed this question earlier. About 100.


To clear up what I wrote above about most site crawlers doing xml sitemaps: I meant that most tools which generate sitemaps do so for .xml, and hardly any (or none) do so for .txt. xml sitemaps are more popular overall.
 