I had a collection of diary recordings lying around in amr, m4a, mp3, ogg and perhaps one or two other formats. Several rounds of clean-up and backup had left me with a huge collection of files that needed to be reduced substantially.
Many files shared the same name (recording 0001, 0002, …) but carried different dates, and the dates were not even unambiguously tied to the time of creation; they could just as well reflect when the file was compressed, copied and so on. I therefore chose the MD5 checksum as the basis for judging whether the content of each file was original, and renamed every file to its md5sum followed by the file's UNIX epoch timestamp. That way I could sort all the files by name and pick the first file in each group as the original.
In the vast majority of the recordings I state the date and time aloud, but the extra UNIX epoch time in the file name was still useful.
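The effect of sorting on the new names can be sketched in a few lines of Perl. This is only a minimal sketch and not the script used for the clean-up: `pick_originals` is a hypothetical helper, and the names in the example are made up, with short stand-ins for real 32-character checksums.

```perl
use strict;
use warnings;

# Hypothetical helper: given names of the form <md5>_<epoch><ext>,
# return the first name of each checksum group in sort order.
sub pick_originals {
    my @names = sort @_;
    my (@keep, $previous);
    for my $name (@names) {
        # The checksum is everything before the first underscore.
        my ($md5) = $name =~ /^([^_]+)_/;
        next unless defined $md5;    # not in the expected format
        push @keep, $name if !defined $previous || $md5 ne $previous;
        $previous = $md5;
    }
    return @keep;
}

# Made-up example names; real checksums are 32 hex characters.
my @names = (
    'bbbb_1626526406.m4a',
    'aaaa_1626526500.mp3',
    'aaaa_1626526406.mp3',
);
print "$_\n" for pick_originals(@names);
```

Note that a plain string sort only agrees with chronological order while the epoch values have the same number of digits, which holds for ten-digit timestamps (anything from late 2001 onwards).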
The mobile phone is by far the most popular audio recorder today. The old dictaphones are very rarely in use, but their commercial recording formats compress far more efficiently than WAV or MP3 files. Philips and Olympus, the original developers of these compression formats, still sell dictaphones, and several models come with an app that may also let you record. When I last checked, however, the apps could not record in the commercial format. Instead, several makers of dictaphones and software offer translation tools and speech-to-text. The latter is really useful, but strangely enough it is very hard to use in a number of languages, Danish included. Speech-to-text is also much easier to use while recording than when playing back old recordings, where the error rate is much higher, depending of course on the recording quality as well.
I had perhaps 150-200 GB of files spread over a number of locations. I transferred the files to a common folder, where I ran the Perl script below with the command
perl md5naming.pl './filsamling/'
Perl
#!/usr/bin/perl
# md5naming.pl: move every file below the given directories into one target
# folder, renamed to <md5 of contents>_<mtime as UNIX epoch><extension>.
use strict;
use warnings;
use Cwd qw(abs_path);
use Digest::MD5;
use File::Copy qw(move);

use constant SUCCESS => 0;    # Code for successful execution.
use constant ERROR   => 1;    # Code for error in execution.

# Folder that receives the renamed files; adjust to your own setup.
my $target_dir = "C:/Users/GunnarLaier/Downloads/HashKey";

my $exit_status = SUCCESS;
my $digest      = Digest::MD5->new;

if (@ARGV < 1) {
    print "Insufficient number of arguments\n";
    print "Usage: $0 [dir1]..\n";
    exit ERROR;
}

#
# Check that each provided directory exists and is readable.
# If not, report the error and continue with the next one;
# otherwise process the directory.
#
# Directory names containing white space must be quoted
# on the command line.
#
foreach my $dirname (@ARGV) {
    $dirname = trimall($dirname);
    if (! -d $dirname) {
        print "$dirname is not a directory\n";
        $exit_status = ERROR;
        next;
    } elsif (! -r $dirname) {
        print "$dirname is not readable\n";
        $exit_status = ERROR;
        next;
    }
    $exit_status = ERROR if processfiles($dirname) != SUCCESS;
}
print "\n";
exit $exit_status;

#
# processfiles:
#   Read recursively through the directory. For each file,
#   calculate the MD5 checksum of the file contents, move the
#   file to $target_dir under the name <md5>_<mtime><extension>
#   and print the new full path.
#
# parameters:
#   string dirname: directory to process.
# return type: integer: success/failure code.
#
sub processfiles
{
    my $dirname = abs_path($_[0]);
    my $status  = SUCCESS;
    opendir(my $dirh, $dirname) or do {
        warn "Cannot open $dirname: $!\n";
        return ERROR;
    };
    #
    # Read all files and directories, excluding '.', '..' and
    # any other entry whose name starts with a dot.
    #
    my @files = sort grep { !/^\./ } readdir($dirh);
    closedir($dirh);

    foreach my $file (@files) {
        my $fullpath = $dirname . "/" . $file;
        if (-d $fullpath) {
            $status = ERROR if processfiles($fullpath) != SUCCESS;
            next;
        }
        my $md5code = getmd5checksum($fullpath);
        if (!defined $md5code) {
            warn "Skipping unreadable file $fullpath\n";
            $status = ERROR;
            next;
        }
        # Modification time as UNIX epoch
        my $mdate = (stat $fullpath)[9];
        # Keep the original extension (empty if there is none)
        my ($ext) = $file =~ /(\.[^.]+)$/;
        $ext = '' unless defined $ext;
        my $newfullpath = $target_dir . "/" . $md5code . "_" . $mdate . $ext;
        if (move($fullpath, $newfullpath)) {
            print "$newfullpath\n";
        } else {
            warn "Could not move $fullpath: $!\n";
            $status = ERROR;
        }
    }
    return $status;
}

#
# getmd5checksum:
#   Generate the MD5 checksum (128 bits, 32 hex characters) of the
#   file contents. Only the contents are hashed: owner, permissions
#   and inode can differ between copies and would hide duplicates.
#
# parameters: string file: file whose MD5 checksum is calculated.
# return type: hex MD5 checksum, or undef if the file cannot be read.
#
sub getmd5checksum
{
    my $file = shift;
    open(my $fh, '<', $file) or return;
    binmode($fh);    # hash the raw bytes, also on Windows
    $digest->reset();
    $digest->addfile($fh);
    close($fh);
    return $digest->hexdigest;
}

#
# trimall:
#   Trim leading and trailing white space characters from a string.
#
# parameters:
#   string str: input string from which spaces are removed.
# return type: string: trimmed string.
#
sub trimall
{
    my $arg = shift;
    $arg =~ s/^\s+|\s+$//g;
    return $arg;
}
In the script I have specified a folder to which all the files are moved under a new name of the form 34246d8224ed1b737f687b5db539f549_1626526406, that is, md5sum, underscore, UNIX epoch. The script also carries over the file extension of the original file, so that the file can still be opened without trouble. Next I run the second script, sortSelect.pl, which compares the files by md5sum and keeps only one file per checksum: the first in sort order, which is the copy with the earliest timestamp. That brought my collection down from 110 GB to 71 GB, 37937 files in total.
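A name in this format can be taken apart again when needed. The following is a minimal sketch under a few assumptions: `parse_hash_name` is a hypothetical helper, the `.m4a` extension in the example is made up, and the timestamp is shown in UTC rather than local time.

```perl
use strict;
use warnings;
use POSIX qw(strftime);

# Hypothetical helper: split a name of the form <md5>_<epoch><ext>
# into its parts. Returns an empty list if the name does not match.
sub parse_hash_name {
    my ($name) = @_;
    my ($md5, $epoch, $ext) = $name =~ /^([0-9a-f]{32})_(\d+)(\.[^.]+)?$/
        or return;
    return (
        md5   => $md5,
        epoch => $epoch,
        ext   => defined $ext ? $ext : '',
        # Timestamp in UTC, ISO 8601 style
        utc   => strftime('%Y-%m-%dT%H:%M:%SZ', gmtime($epoch)),
    );
}

# Example with the checksum and epoch value from the text above.
my %info = parse_hash_name('34246d8224ed1b737f687b5db539f549_1626526406.m4a');
print "$info{md5} $info{utc} $info{ext}\n";
```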
The script that performs the final comparison is almost identical to the one above. The two scripts can be merged into a single script that takes an option as input, and the folder paths can likewise be given as arguments to the script. In addition I will extend the script so that I get a CSV file with file name, file type, path, date/UNIX epoch, file size and md5sum, plus a column stating whether the file is a candidate for deletion. This can be used for sorting through pictures, documents, audio files and much more. The update will be the subject of a later blog post. Combined with search algorithms, this way of working makes it possible to reduce everyday data management problems, but it never caught on on the desktop, where the future more likely belongs to records management systems and version control, also for private use, with ever closer integration of online communication and telephone systems.
#!/usr/bin/perl
# sortSelect.pl: keep one file per checksum. Expects file names of the form
# <md5>_<epoch><extension>, as produced by md5naming.pl.
use strict;
use warnings;
use Cwd qw(abs_path);
use File::Copy qw(copy);

use constant SUCCESS => 0;    # Code for successful execution.
use constant ERROR   => 1;    # Code for error in execution.

# Folder that receives the selected files. It should be a different
# folder from the ones being read.
my $target_dir = "C:/Users/GunnarLaier/Downloads/HashKey";

my $exit_status = SUCCESS;

if (@ARGV < 1) {
    print "Insufficient number of arguments\n";
    print "Usage: $0 [dir1]..\n";
    exit ERROR;
}

#
# Check that each provided directory exists and is readable.
# If not, report the error and continue with the next one;
# otherwise process the directory.
#
foreach my $dirname (@ARGV) {
    $dirname = trimall($dirname);
    if (! -d $dirname) {
        print "$dirname is not a directory\n";
        $exit_status = ERROR;
        next;
    } elsif (! -r $dirname) {
        print "$dirname is not readable\n";
        $exit_status = ERROR;
        next;
    }
    $exit_status = ERROR if processfiles($dirname) != SUCCESS;
}
print "\n";
exit $exit_status;

#
# processfiles:
#   Read recursively through the directory. Sorting the names
#   places files with the same checksum next to each other. The
#   first file of each checksum group is copied to $target_dir
#   and its new full path is printed; the rest are duplicates.
#
# parameters:
#   string dirname: directory to process.
# return type: integer: success/failure code.
#
sub processfiles
{
    my $dirname = abs_path($_[0]);
    my $status  = SUCCESS;
    opendir(my $dirh, $dirname) or do {
        warn "Cannot open $dirname: $!\n";
        return ERROR;
    };
    #
    # Read all files and directories, excluding '.', '..' and
    # any other entry whose name starts with a dot.
    #
    my @files = sort grep { !/^\./ } readdir($dirh);
    closedir($dirh);

    my $md5textold = "";
    foreach my $file (@files) {
        my $fullpath = $dirname . "/" . $file;
        if (-d $fullpath) {
            $status = ERROR if processfiles($fullpath) != SUCCESS;
            next;
        }
        # The checksum is the part of the name before the first underscore
        my ($md5text) = $file =~ /^([^_]+)/;
        $md5text = $file unless defined $md5text;
        if ($md5textold ne $md5text) {
            # First file with this checksum in sort order: keep it
            my $newfullpath = $target_dir . "/" . $file;
            if (copy($fullpath, $newfullpath)) {
                print "$newfullpath\n";
            } else {
                warn "Could not copy $fullpath: $!\n";
                $status = ERROR;
            }
        } else {
            # Duplicate of the previous file; uncomment to delete it
            #unlink($fullpath);
        }
        $md5textold = $md5text;
    }
    return $status;
}

#
# trimall:
#   Trim leading and trailing white space characters from a string.
#
# parameters:
#   string str: input string from which spaces are removed.
# return type: string: trimmed string.
#
sub trimall
{
    my $arg = shift;
    $arg =~ s/^\s+|\s+$//g;
    return $arg;
}
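The CSV overview described earlier could be produced along the following lines. This is a rough sketch of the idea and not the announced update: the helper names, the column names and the rule that every copy except the one with the earliest modification time becomes a deletion candidate are all my assumptions. It uses only core modules and quotes the CSV fields by hand.

```perl
use strict;
use warnings;
use Digest::MD5;
use File::Basename qw(fileparse);
use File::Find qw(find);

# Quote one CSV field: wrap in double quotes, double embedded quotes.
sub csv_field {
    my ($value) = @_;
    $value =~ s/"/""/g;
    return qq("$value");
}

# MD5 of the file contents only, as 32 hex characters; undef on failure.
sub content_md5 {
    my ($path) = @_;
    open(my $fh, '<', $path) or return;
    binmode($fh);
    my $hex = Digest::MD5->new->addfile($fh)->hexdigest;
    close($fh);
    return $hex;
}

# Hypothetical helper: return CSV lines (header first) for all files
# below the given directories. Within each checksum group the copy with
# the earliest mtime is kept; the others are flagged for deletion.
sub inventory_csv {
    my @dirs = @_;
    my @cols = qw(name type path epoch size md5 delete_candidate);
    my @rows;
    find({ no_chdir => 1, wanted => sub {
        my $full = $File::Find::name;
        return unless -f $full;
        my $md5 = content_md5($full);
        return unless defined $md5;
        my ($size, $mtime) = (stat $full)[7, 9];
        my ($name, $dir, $ext) = fileparse($full, qr/\.[^.]*/);
        push @rows, { name => $name . $ext, type => $ext, path => $dir,
                      epoch => $mtime, size => $size, md5 => $md5,
                      full => $full };
    } }, @dirs);

    my %seen;
    my @lines = (join ',', map { csv_field($_) } @cols);
    for my $row (sort { $a->{md5} cmp $b->{md5}
                        || $a->{epoch} <=> $b->{epoch}
                        || $a->{full} cmp $b->{full} } @rows) {
        $row->{delete_candidate} = $seen{ $row->{md5} }++ ? 'yes' : 'no';
        push @lines, join ',', map { csv_field($row->{$_}) } @cols;
    }
    return @lines;
}

# Usage: perl inventory.pl dir1 [dir2 ...] > inventory.csv
if (@ARGV) {
    print "$_\n" for inventory_csv(@ARGV);
}
```

Because nothing is moved or deleted, the list can be reviewed in a spreadsheet before any file is touched.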
Identifying large files is also useful, and this can be done with PowerShell too. Here is a command that lists files larger than 500 MB.
PowerShell
Get-ChildItem C:\Users\GunnarLaier\Documents\FilerArkivRåData\ -recurse | where-object {$_.length -gt 524288000} | Sort-Object length | ft fullname, length -auto
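For completeness, the same listing can be sketched in Perl, the language of the scripts above. This is an illustrative equivalent under my own assumptions, not something used for the clean-up: `large_files` is a hypothetical helper and the script name in the usage comment is made up.

```perl
use strict;
use warnings;
use File::Find qw(find);

# Hypothetical helper: return [path, size] pairs for all files below
# $dir that are larger than $min_bytes, sorted by size, smallest first.
sub large_files {
    my ($dir, $min_bytes) = @_;
    my @hits;
    find({ no_chdir => 1, wanted => sub {
        my $path = $File::Find::name;
        return unless -f $path;
        my $size = (stat $path)[7];
        return unless defined $size;
        push @hits, [ $path, $size ] if $size > $min_bytes;
    } }, $dir);
    my @sorted = sort { $a->[1] <=> $b->[1] } @hits;
    return @sorted;
}

# Usage: perl largefiles.pl <dir>   (threshold: 500 MB = 524288000 bytes)
if (@ARGV) {
    printf "%12d  %s\n", $_->[1], $_->[0]
        for large_files($ARGV[0], 500 * 1024 * 1024);
}
```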