tools/centos-mirror-tools/dl_other_from_centos_repo.sh
Davlet Panech ac49ff342c use curl + avoid partial downloads
Mirror scripts sometimes leave corrupted/partial files behind.

Problems
========

1) wget is called with the -O flag, and the server returns an HTTP
error for the requested URL (404 etc). Wget leaves a zero-length file
behind. This doesn't seem to happen without the -O flag.

2) wget starts the download which stalls & times out half-way; wget
gives up and requests the same file with a byte offset of the form
"Range: bytes=1234-", and the web server doesn't support open-ended
ranges. In this case wget prints out a warning, leaves a partial file
behind and returns success.

3) Sites like GitHub generate repo tarballs on the fly, e.g.:
https://github.com/kubernetes/kubernetes/archive/refs/tags/v1.19.3.tar.gz
Since tags can move, downloading such a file twice may result in a
different file. Therefore HTTP "resume download" may corrupt files in
this case.

4) Git's "keyword expansion" feature may result in differences in the
source files being downloaded. For example, this file:

  https://github.com/kubernetes/kubernetes/blob/v1.19.3/staging/src/k8s.io/component-base/version/base.go

contains lines similar to:

  gitVersion  = "v0.0.0-master+$Format:%h$"

where %h is replaced with a short SHA when the tar file is
exported/downloaded. The length of the short SHA depends on git history
and can differ between exports, so downloading the same tarball twice
may produce different files.

Therefore HTTP "Range" header may corrupt files in this case as
well.

5) Curl is invoked with the "--retry" option and starts the download;
connection stalls; curl gives up, connects again, skips the 1st N
bytes and appends to the partial file. If the file changes while we
are doing this, it will end up corrupting the file. This is very
unlikely to happen and I haven't been able to reproduce this case.

Problems with HTTP Range header
===============================
Curl/wget "resume/continue download" feature has no way of verifying
whether the partial file on disk, and the one being re-requested, are in
fact the same file.  If the file changes on the server between
downloads, "resume download" will corrupt it.
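
The corruption can be reproduced locally without a web server. The
sketch below is only a simulation under stated assumptions: the two
"exports" are made-up stand-ins for a tarball member whose short SHA
changed length between requests, and head/tail stand in for the
interrupted download and the ranged retry.

```bash
#!/bin/bash
# Simulate "resume download" against a file that changed on the server.
workdir="$(mktemp -d)"

# Two exports of the "same" file; only the length of the short SHA differs.
printf 'gitVersion = "v0.0.0-master+abc1234"\nfooter\n'  > "$workdir/export1"
printf 'gitVersion = "v0.0.0-master+abc12345"\nfooter\n' > "$workdir/export2"

# First attempt stalls after 40 bytes of export1.
offset=40
head -c "$offset" "$workdir/export1" > "$workdir/download"

# The retry asks for "Range: bytes=40-" and appends the reply, but the
# server now serves export2 (tail -c +N counts bytes starting from 1).
tail -c "+$((offset + 1))" "$workdir/export2" >> "$workdir/download"

# The stitched file is intact only if it equals one of the two exports.
if cmp -s "$workdir/download" "$workdir/export1" ||
   cmp -s "$workdir/download" "$workdir/export2"; then
    result="intact"
else
    result="corrupted"
fi
echo "resumed download is $result"
rm -rf "$workdir"
```

The stitched file matches neither export, and nothing in the resume
protocol itself would have flagged it.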

Some web servers don't support this at all, which triggers case (2)
with wget.

Some web servers support the Range header, but require that the end
byte position be present. This is incompatible with wget & curl, which
send open-ended headers such as "Range: bytes=1234-", meaning "give me
the file from offset 1234 to EOF". This also triggers case (2).
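
To make the two header forms concrete, here is a small illustration.
range_header is a hypothetical helper written for this note; it is not
part of the patch or of wget/curl, and the sizes are example values.

```bash
#!/bin/bash
# range_header <offset> [<total_size>]
# Print the HTTP Range header for a download resumed at <offset>.
range_header () {
    local offset="$1" total_size="$2"
    if [ -z "$total_size" ]; then
        # Open-ended form, as sent by wget/curl when resuming.
        echo "Range: bytes=${offset}-"
    else
        # Closed form: the last byte position is inclusive, hence size - 1.
        echo "Range: bytes=${offset}-$((total_size - 1))"
    fi
}

range_header 1234        # prints: Range: bytes=1234-
range_header 1234 5678   # prints: Range: bytes=1234-5677
```

A client can only send the closed form if it already knows the total
size, which a resuming downloader usually does not.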

This patch
==========

* Always download the file to a temporary name, then rename into place

* Use curl instead of wget (better error handling). The only exception
is "recursive downloads", which curl doesn't support.

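The two points above can be sketched roughly as follows. This is a
minimal illustration, not the actual download_file helper from
utils.sh: the name safe_download and the exact set of curl options are
assumptions made for this example.

```bash
#!/bin/bash
# safe_download <url> <dest>
# Hypothetical helper: fetch <url> into a temporary file next to <dest>
# and rename it into place only if curl reports success.
safe_download () {
    local url="$1" dest="$2"
    local tmp
    # Create the temp file beside the destination so the final mv is a
    # rename within one filesystem. mktemp creates it with mode 0600;
    # chmod afterwards if other permissions are needed.
    tmp="$(mktemp "${dest}.XXXXXX")" || return 1
    # --fail: treat HTTP errors (404 etc.) as failures.
    # --location: follow redirects.
    # No resume and no --retry here: every attempt starts from byte 0.
    if curl --fail --location --silent --show-error --output "$tmp" "$url"; then
        mv -f "$tmp" "$dest"
    else
        # Never leave a zero-length or partial file behind.
        rm -f "$tmp"
        return 1
    fi
}
```

A caller would then use it like "safe_download <url> <save_path>/<name>
|| echo failed"; on failure nothing is left at the destination path, so
an existence check on that path remains a valid "already downloaded"
test.
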
Bug: https://bugs.launchpad.net/starlingx/+bug/1950017
Change-Id: Iaa89009ce23efe5b73ecb8163556ce6db932028b
Signed-off-by: Davlet Panech <davlet.panech@windriver.com>
2021-11-10 14:25:47 -05:00


#!/bin/bash -e
#
# SPDX-License-Identifier: Apache-2.0
#
#
# Download non-RPM files from http://mirror.centos.org/7.6.1810/os/x86_64/
#
DL_OTHER_FROM_CENTOS_REPO_DIR="$(dirname "$(readlink -f "${BASH_SOURCE[0]}" )" )"
source $DL_OTHER_FROM_CENTOS_REPO_DIR/url_utils.sh
source $DL_OTHER_FROM_CENTOS_REPO_DIR/utils.sh
usage () {
    echo "$0 [-D <distro>] [-s|-S|-u|-U] [-h] <other_download_list.ini> <save_path> [<force_update>]"
}
# Permitted values of dl_source
dl_from_stx_mirror="stx_mirror"
dl_from_upstream="upstream"
dl_from_stx_then_upstream="$dl_from_stx_mirror $dl_from_upstream"
dl_from_upstream_then_stx="$dl_from_upstream $dl_from_stx_mirror"
# Download from what source?
#   dl_from_stx_mirror = StarlingX mirror only
#   dl_from_upstream = Original upstream source only
#   dl_from_stx_then_upstream = Either source, STX preferred (default)
#   dl_from_upstream_then_stx = Either source, UPSTREAM preferred
dl_source="$dl_from_stx_then_upstream"
dl_flag=""
distro="centos"
MULTIPLE_DL_FLAG_ERROR_MSG="Error: Please use only one of: -s,-S,-u,-U"
multiple_dl_flag_check () {
    if [ "$dl_flag" != "" ]; then
        echo "$MULTIPLE_DL_FLAG_ERROR_MSG"
        usage
        exit 1
    fi
}
# Parse out optional arguments
while getopts "D:hsSuU" o; do
    case "${o}" in
        D)
            distro="${OPTARG}"
            ;;
        s)
            # Download from StarlingX mirror only. Do not use upstream sources.
            multiple_dl_flag_check
            dl_source="$dl_from_stx_mirror"
            dl_flag="-s"
            ;;
        S)
            # Download from StarlingX mirror first; fall back to upstream sources.
            multiple_dl_flag_check
            dl_source="$dl_from_stx_then_upstream"
            dl_flag="-S"
            ;;
        u)
            # Download from upstream only. Do not use StarlingX mirror.
            multiple_dl_flag_check
            dl_source="$dl_from_upstream"
            dl_flag="-u"
            ;;
        U)
            # Download from upstream first; fall back to StarlingX mirror.
            multiple_dl_flag_check
            dl_source="$dl_from_upstream_then_stx"
            dl_flag="-U"
            ;;
        h)
            # Help
            usage
            exit 0
            ;;
        *)
            usage
            exit 1
            ;;
    esac
done
shift $((OPTIND-1))
if [ $# -lt 2 ]; then
    usage
    exit -1
fi
download_list=$1
if [ ! -e "$download_list" ]; then
    echo "Error: download list '$download_list' does not exist"
    exit -1
fi
save_path=$2
upstream_url_prefix="http://mirror.centos.org/7.6.1810/os/x86_64/"
stx_mirror_url_prefix="$(url_to_stx_mirror_url "$upstream_url_prefix" "$distro")"
echo "NOTE: please ensure Internet access to $upstream_url_prefix !!"
force_update=$3
i=0
error_count=0
all=`cat $download_list`
for ff in $all; do
    ## skip commented_out item which starts with '#'
    if [[ "$ff" =~ ^'#' ]]; then
        echo "skip $ff"
        continue
    fi
    _type=`echo $ff | cut -d":" -f1-1`
    _name=`echo $ff | cut -d":" -f2-2`
    if [ "$_type" == "folder" ];then
        mkdir -p $save_path/$_name
        if [ $? -ne 0 ]; then
            echo "Error: mkdir -p '$save_path/$_name'"
            error_count=$((error_count + 1))
        fi
    else
        if [ -e "$save_path/$_name" ]; then
            echo "Already have $save_path/$_name"
            continue
        fi
        for dl_src in $dl_source; do
            case $dl_src in
                $dl_from_stx_mirror)
                    url_prefix="$stx_mirror_url_prefix"
                    ;;
                $dl_from_upstream)
                    url_prefix="$upstream_url_prefix"
                    ;;
                *)
                    echo "Error: Unknown dl_source '$dl_src'"
                    continue
                    ;;
            esac
            echo "remote path: $url_prefix/$_name"
            echo "local path: $save_path/$_name"
            if download_file $url_prefix/$_name; then
                file_name=`basename $_name`
                sub_path=`dirname $_name`
                if [ -e "./$file_name" ]; then
                    let i+=1
                    echo "$file_name is downloaded successfully"
                    mkdir -p $save_path/$sub_path
                    if [ $? -ne 0 ]; then
                        echo "Error: mkdir -p '$save_path/$sub_path'"
                        error_count=$((error_count + 1))
                    fi
                    \mv -f ./$file_name $save_path/$_name
                    if [ $? -ne 0 ]; then
                        echo "Error: mv -f './$file_name' '$save_path/$_name'"
                        error_count=$((error_count + 1))
                    fi
                    ls -l $save_path/$_name
                fi
                break
            else
                echo "Warning: failed to download $url_prefix/$_name"
            fi
        done
        if [ ! -e "$save_path/$_name" ]; then
            echo "Error: failed to download '$url_prefix/$_name'"
            error_count=$((error_count + 1))
            continue
        fi
    fi
done
echo ""
echo "Downloaded $i files in total"
if [ $error_count -ne 0 ]; then
    echo ""
    echo "Encountered $error_count errors"
    exit 1
fi
exit 0