The URLs you submit for crawling are recorded.


What is "Page changing frequency"? The change frequency is a hint to search engine spiders about when and how often to visit your site's pages. It takes one of seven values: always, hourly, daily, weekly, monthly, yearly, never. The value tells the search engines how often each page is updated, where an update means an actual change to the HTML code or text of the page.
What is "Last modification date"? This parameter takes one of three values. Server's response: set the last modification date of the file from the server's response headers. This value tells crawlers not to recrawl documents that have not changed, and we recommend keeping this setting. Current time: set the last modification date of the file to the current date and time. None: do not set a last modification date for the files.
What is "Page priority"? The priority is a number between zero and one. If no number is assigned, the priority defaults to 0.5. The number sets the priority of a particular URL relative to other pages on the same site. A high-priority page may be indexed more often and may appear above other pages from the same site in search results. Automatic priority lowers the priority of a page as its depth level increases.
What is "Depth level"? The depth level of a page is the number of clicks it takes to reach that page from the homepage.
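The exact rule behind automatic priority is not given here, so the following is only a sketch of how a depth-based priority could be computed. The starting value of 1.0, the step of 0.2 per level and the floor of 0.1 are assumptions chosen for illustration, and priority_for_depth is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical rule: the homepage (depth 0) gets 1.0, each further depth
# level subtracts 0.2, and the result never drops below 0.1.
# The real tool's formula is not documented in this text.
priority_for_depth() {
  awk -v d="$1" 'BEGIN {
    p = 1.0 - 0.2 * d
    if (p < 0.1) p = 0.1
    printf "%.1f\n", p
  }'
}

for depth in 0 1 2 3 7; do
  echo "depth $depth -> priority $(priority_for_depth "$depth")"
done
```

Any monotonically decreasing rule would serve the same purpose; the point is only that deeper pages end up with lower values.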
What is "Exclude extensions"? Files with these extensions found in your website's pages are neither crawled nor included in the sitemap. Separate the values with spaces.
What is "Do not parse extensions"? Files with these extensions are not fetched, which saves bandwidth: they are not HTML files and contain no embedded links. They are still included in the sitemap. Separate the values with spaces.
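To make the difference between the two extension settings concrete, here is a small sketch that sorts a URL into one of three groups. The extension lists, the function name classify_url and the labels it prints are all assumptions made up for this example, not the tool's actual behaviour.

```shell
#!/bin/sh
# Hypothetical settings, space-separated as in the form fields above.
EXCLUDE_EXT="css js ico"
NO_PARSE_EXT="pdf jpg png zip"

# Print "skip", "list-only" or "crawl" for one URL.
classify_url() {
  path=${1%%[?#]*}      # drop any query string or fragment
  file=${path##*/}      # keep the last path segment
  case "$file" in
    *.*) ext=${file##*.} ;;
    *)   ext="" ;;
  esac
  for e in $EXCLUDE_EXT; do
    [ "$ext" = "$e" ] && { echo skip; return; }
  done
  for e in $NO_PARSE_EXT; do
    [ "$ext" = "$e" ] && { echo list-only; return; }
  done
  echo crawl
}

classify_url "http://localhost/mywebsite/style.css"
classify_url "http://localhost/mywebsite/docs/manual.pdf"
classify_url "http://localhost/mywebsite/about.html"
```

Here "skip" corresponds to an excluded extension, "list-only" to a file that is listed in the sitemap without being fetched, and "crawl" to everything else.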
What is "Session IDs"? If URLs on your site contain session IDs, you must remove them. Leaving session IDs in URLs may result in incomplete and redundant crawling of your site. Common session ID parameters: PHPSESSID, sid, osCsid. Separate the values with spaces.
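As a rough illustration of what removing session IDs means, the sketch below strips the three parameters named above from the query string of each URL. The function name strip_session_ids and the sample URLs are made up for the example, and it assumes the IDs appear as ordinary query parameters.

```shell
#!/bin/sh
# Remove PHPSESSID, sid and osCsid query parameters from URLs on stdin.
# The loop repeats the substitution until no session parameter is left,
# then a dangling "?" or "&" at the end of the URL is dropped.
strip_session_ids() {
  sed -E -e ':again' \
         -e 's/([?&])(PHPSESSID|sid|osCsid)=[^&#]*&?/\1/' \
         -e 't again' \
         -e 's/[?&]$//'
}

printf '%s\n' \
  'http://localhost/mywebsite/shop.php?PHPSESSID=abc123&item=4' \
  'http://localhost/mywebsite/shop.php?item=4&sid=xyz' \
  'http://localhost/mywebsite/cart.php?osCsid=q1w2' |
  strip_session_ids | sort -u
```

The first two sample URLs collapse into the same address once the session ID is gone, which is exactly the redundancy the setting is meant to avoid.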
What is "Exclude URLs"? URLs that contain these strings are not included in the sitemap. Enter one value per line.
e.g. 1: Use the string component/ to exclude all pages whose URL contains it.
e.g. 2: If you have any of the following websites, you may exclude these strings (copy and paste them into the Exclude URLs box).
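The same kind of exclusion can be sketched on the command line with grep. This is only an illustration: exclude_urls is a hypothetical helper, component/ comes from the first example above, and print=1 is an invented second string.

```shell
#!/bin/sh
# Drop every URL on stdin that contains any of the fixed strings listed,
# one per line, in the file given as the first argument.
# Note: an empty line in that file would match (and drop) every URL.
exclude_urls() {
  grep -v -F -f "$1"
}

patterns=$(mktemp)
printf '%s\n' 'component/' 'print=1' > "$patterns"

printf '%s\n' \
  'http://localhost/mywebsite/index.html' \
  'http://localhost/mywebsite/component/search/' \
  'http://localhost/mywebsite/news.html?print=1' |
  exclude_urls "$patterns"

rm -f "$patterns"
```

The -F option makes grep treat each line as a plain string rather than a regular expression, which matches the "URLs that contain these strings" wording.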

XML Sitemap via wget & shell script

Posted on: Wednesday - 11/30/2016 - 15:59:45

A site map or sitemap is a file that lists the pages of a website that are accessible to users and search engines. It can be in any format, as long as whoever reads it understands that format. Two formats are mainly used when creating a sitemap: XML and HTML. Every website should ideally have at least one form of sitemap, especially for search engine optimization (SEO). An XML sitemap is usually preferred for SEO because it carries relevant metadata for the URLs. Almost all search engines can read a properly formatted sitemap, which they then use to index the pages of the website.
If you build websites with a framework such as WordPress or Drupal, built-in functionality (or a plugin) can already generate the relevant sitemaps automatically. But if you develop websites with other web technologies such as HTML, CSS, JavaScript or PHP, without the aid of website building software or a platform, you will need to create the sitemap manually. That is not too bad for a small website with just a few pages. Once the website has even as few as 30 or 40 pages, creating the XML sitemap by hand becomes a nightmare. It is also an ongoing maintenance issue if new pages are regularly added and old ones deleted. It is quite easy to make silly errors, such as spelling mistakes or missed pages.
We will develop a simple shell script that crawls the website and generates a simple, workable XML sitemap. That makes it very easy to regenerate the sitemap regularly. We will use the wget utility on Linux to crawl the website. I will take a step-by-step approach so that you can better understand how the script is built. If that is not of much interest to you, just scroll to the bottom for the complete script. We will assume that your website is running locally on localhost, which is usually (though not always) the case while you are developing the site. We first crawl the website using the simplest wget command.
$ wget http://localhost/mywebsite/
We do not really care about saving the content of the web pages locally. We also need to recurse into the website hierarchy rather than fetch just the home page. Let's add the recursive and spider options to wget.
$ wget --spider --recursive http://localhost/mywebsite/
By default, wget only crawls to a depth of 5. We want to crawl the entire website no matter how deep it is, so we set the depth to infinity. You can change this to whatever depth level you want.
$ wget --spider --recursive --level=inf http://localhost/mywebsite/
We will store the output in a local file that we can manipulate later. We will also use the --no-verbose option to reduce the logging: we just need the URL of each page being downloaded and nothing else, and a small log is easier to parse. So the command now looks like this:
$ wget --spider --recursive --level=inf --no-verbose --output-file=/home/tom/temp/linklist.txt http://localhost/mywebsite/

The file linklist.txt now contains all the URLs on the website, but not in the exact format we want. We will strip out only the URL part of this log file using grep and awk. The log messages in the output file should look something like the example below.
2015-08-07 15:21:59 URL:http://localhost/mywebsite/keynotes/feed/ [769] -> "localhost/mywebsite/keynotes/feed/" [1]
The URL we are interested in is the text just after the string URL:, up to the next white space character. We will extract that text by piping several awk commands in steps, each stripping a little more from the line.
First, we get all the lines that we want from the file.
$ grep -i URL /home/tom/temp/linklist.txt
Next, we split the line and keep the part after the string URL: using awk. That should be simple enough:
awk -F 'URL:' '{print $2}'
Now we trim the line, as it might contain some leading spaces.
awk '{$1=$1};1'
The next step is to keep just the URL, which is the first part of the string up to the first white space.
awk '{print $1}'
You can probably combine all of the above into a single awk command, but this keeps it easy enough to understand. Now we can sort the URLs and remove the duplicates using the sort utility. We will also remove any blank lines using sed.
sort -u | sed '/^$/d'
Putting it all together, the entire command looks something like this:
$ grep -i URL /home/tom/temp/linklist.txt | awk -F 'URL:' '{print $2}' | awk '{$1=$1};1' | awk '{print $1}' | sort -u | sed '/^$/d' > sortedlinks.txt
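For readers who prefer the shorter form, the grep and awk steps could be collapsed into a single awk call roughly as follows. This is a sketch rather than a drop-in replacement: extract_urls is a hypothetical function name, the second and third input lines are illustrative sample data, and the match on URL: is case-sensitive, unlike grep -i.

```shell
#!/bin/sh
# Read a wget --no-verbose log on stdin and print the unique URLs.
# Lines without "URL:" have only one field and are skipped; for the rest,
# the text after "URL:" is split on white space and its first word printed.
extract_urls() {
  awk -F 'URL:' 'NF > 1 && split($2, parts, " ") > 0 { print parts[1] }' | sort -u
}

# Sample input: the log line shown earlier, a repeat of it, and an unrelated line.
extract_urls <<'EOF'
2015-08-07 15:21:59 URL:http://localhost/mywebsite/keynotes/feed/ [769] -> "localhost/mywebsite/keynotes/feed/" [1]
2015-08-07 15:22:03 URL:http://localhost/mywebsite/keynotes/feed/ [769] -> "localhost/mywebsite/keynotes/feed/" [1]
FINISHED --2015-08-07 15:22:04--
EOF
```

Because lines that yield no URL are never printed, the separate sed step for blank lines is not needed in this variant.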
There are several other ways to do exactly the same thing using sed or even tr, but I think the above set is simple and modular enough that you can customize it further to match your requirements. You could replace the domain name of the URL if needed with a simple sed command.
The next step is to generate the sitemap XML file. We will generate the sitemap from a boilerplate template with preset values. We will create a very simple sitemap, suitable for simple static websites. We are not going to add any extra fields, as more sophisticated systems do, or deal with images. First we loop through the links in the file and insert a url tag for each URL we want. We only consider URLs that end with a slash (/) or with the extension html or htm. We add just one XML tag, for the location, to each of these URLs. Here is the entire bash shell script for the process.
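#!/bin/bash
# Site to crawl; the walkthrough above uses a local development site.
sitedomain=http://localhost/mywebsite/

# Crawl the site without saving pages and log every URL visited.
wget --spider --recursive --level=inf --no-verbose --output-file=/home/tom/temp/linklist.txt "$sitedomain"
# Extract the unique URLs from the log.
grep -i URL /home/tom/temp/linklist.txt | awk -F 'URL:' '{print $2}' | awk '{$1=$1};1' | awk '{print $1}' | sort -u | sed '/^$/d' > /home/tom/temp/sortedurls.txt

header='<?xml version="1.0" encoding="UTF-8"?><urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
echo "$header" > sitemap.xml
while read -r p; do
  case "$p" in
  */ | *.html | *.htm)
    echo "<url><loc>$p</loc></url>" >> sitemap.xml
    ;;
  esac
done < /home/tom/temp/sortedurls.txt
echo "</urlset>" >> sitemap.xml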
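Since the crawl runs against localhost, the URLs in the list point at the development machine. The domain replacement mentioned earlier could be sketched like this and applied to the URL list before the loop writes the sitemap. Both prefixes are assumptions: the local one is the address used in this walkthrough and https://www.example.com is only a placeholder for the public site.

```shell
#!/bin/sh
# Hypothetical values: the development prefix used in this article and a
# placeholder public address.
LOCAL_PREFIX="http://localhost/mywebsite"
PUBLIC_PREFIX="https://www.example.com"

# Rewrite the leading development prefix of each URL on stdin.
# "|" is used as the sed delimiter because the URLs contain slashes.
to_public_urls() {
  sed "s|^$LOCAL_PREFIX|$PUBLIC_PREFIX|"
}

printf '%s\n' \
  'http://localhost/mywebsite/' \
  'http://localhost/mywebsite/about.html' |
  to_public_urls
```

In practice this would sit in the pipeline just before the sorted list is written to the file.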

You can add additional fields such as the last modification date (lastmod), changefreq or priority as needed. I have kept it simple for the most part. Your requirements will vary, and you can adapt and develop this script further to add more fields. While this is a good method for generating a sitemap for local websites, there are other ways to generate an XML sitemap, such as online sitemap generators. A note of caution: it is not a good idea to create XML files directly from shell scripts, especially if they are complex and large. You might be better off using a Perl or Python library to create more sophisticated XML sitemap files. You can also use the intermediate files generated by this script as your input...
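If you do stay with the shell for a while, one way the optional fields might be added is sketched below. The helper names xml_escape and url_entry are hypothetical, and the field values are supplied by the caller rather than detected. The escaping step matters because a URL containing & is not valid XML as it stands.

```shell
#!/bin/sh
# Escape the characters that are special in XML text.
xml_escape() {
  printf '%s' "$1" | sed -e 's/&/\&amp;/g' -e 's/</\&lt;/g' -e 's/>/\&gt;/g' \
                         -e 's/"/\&quot;/g' -e "s/'/\&apos;/g"
}

# Usage: url_entry LOC [LASTMOD] [CHANGEFREQ] [PRIORITY]
# Only the fields that are passed (and non-empty) are written.
url_entry() {
  entry="<url><loc>$(xml_escape "$1")</loc>"
  [ -n "${2:-}" ] && entry="$entry<lastmod>$2</lastmod>"
  [ -n "${3:-}" ] && entry="$entry<changefreq>$3</changefreq>"
  [ -n "${4:-}" ] && entry="$entry<priority>$4</priority>"
  printf '%s</url>\n' "$entry"
}

url_entry "http://localhost/mywebsite/list.html?a=1&b=2" "2016-11-30" "weekly" "0.8"
url_entry "http://localhost/mywebsite/"
```

Inside the loop of the script above, the echo line could then be swapped for a call such as url_entry "$p".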