File cleanup and tagging

Gunnar Hellmund Laier
May 3, 2022

9 min read

Updated: Sep 29, 2022

I had a collection of diary recordings lying around in amr, m4a, mp3, ogg, and perhaps one or two other formats. Several rounds of cleanup and backup had left me with a huge collection of files that needed to be reduced substantially.

Many files had the same name (recording 0001, 0002, …) but different dates, and those dates were not even tied unambiguously to the time of creation; they could equally reflect when a file was compressed, copied, and so on. I therefore chose the MD5 checksum as the basis for judging whether a file's content was an original, and renamed each file to its md5sum followed by the file's Unix epoch timestamp. That let me sort all the files by name and pick the first file in each group of identical checksums as the original.
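The naming scheme can be illustrated with a minimal sketch. It assumes a hypothetical helper, `hashed_name`, that hashes only the file content and appends the modification time; the file names and timestamps in the demo are made up, and this is not the script used in the post.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Digest::MD5;
use File::Temp qw(tempdir);

# Hypothetical helper: build "<md5 of content>_<mtime><ext>" for one file.
# Only the content is hashed, so byte-identical copies share the same prefix.
sub hashed_name {
    my ($path) = @_;
    open(my $fh, '<', $path) or die "Cannot open $path: $!";
    binmode($fh);                     # audio files are binary data
    my $md5 = Digest::MD5->new->addfile($fh)->hexdigest;
    close($fh);
    my $mtime = (stat($path))[9];     # modification time, Unix epoch seconds
    my ($ext) = $path =~ /(\.[^.\/\\]+)$/;
    return $md5 . '_' . $mtime . (defined $ext ? $ext : '');
}

# Demo: two throwaway copies with the same content but different timestamps.
my $dir = tempdir(CLEANUP => 1);
for my $spec (['a.mp3', 1626526406], ['b.mp3', 1600000000]) {
    my ($name, $mtime) = @$spec;
    open(my $out, '>', "$dir/$name") or die "Cannot write $name: $!";
    binmode($out);
    print {$out} "same audio bytes";
    close($out);
    utime($mtime, $mtime, "$dir/$name") or die "utime failed: $!";
}

my @names = map { hashed_name("$dir/$_") } qw(a.mp3 b.mp3);
@names = sort @names;
print "$_\n" for @names;    # the copy with the smaller epoch sorts first
```

Because both names start with the same checksum, a plain sort by name places the copy with the smaller timestamp first.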

In the vast majority of recordings I stated the date and time aloud, but the Unix epoch time in the file name was still a useful supplement.
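To read such a timestamp, the epoch value can be converted back to a calendar date. A small sketch follows, assuming a hypothetical helper `epoch_to_utc`; the value 1626526406 is the one from the example file name given later in this post.

```perl
use strict;
use warnings;
use POSIX qw(strftime);

# Hypothetical helper: turn Unix epoch seconds into a readable UTC timestamp.
sub epoch_to_utc {
    my ($epoch) = @_;
    return strftime('%Y-%m-%d %H:%M:%S', gmtime($epoch));
}

print epoch_to_utc(1626526406), " UTC\n";    # prints 2021-07-17 12:53:26 UTC
```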

Using a mobile phone as the audio recorder is by far the most popular solution today. The old dictaphones are very rarely in use, but their proprietary recording formats compress far more efficiently than WAV or MP3 files. The original developers of these compression formats, Philips and Olympus, still sell dictaphones, and several models come with an app in which you may also be able to record. When I last checked, though, the apps could not record in the proprietary format; instead, several manufacturers of dictaphones and software offer translation tools and speech-to-text. The latter is really useful but, oddly enough, very hard to use in a number of languages, including Danish. Speech-to-text is also much easier to use while recording than when listening back to old recordings, where the error rate is much higher, depending of course on the recording quality as well.

I had perhaps 150-200 GB of files spread over a number of locations. I moved the files to a common folder, where I ran the Perl script below with the command

perl md5naming.pl './filsamling/'

md5naming.pl

Perl

#!/usr/bin/perl
use strict;
use warnings;
use Digest::MD5;
use File::Copy qw(move);

use constant SUCCESS => 0;    # Exit code for successful execution.
use constant ERROR   => 1;    # Exit code for an error during execution.

my $exit_status = SUCCESS;
my $digest      = Digest::MD5->new;

if (@ARGV < 1) {
    print "Insufficient number of arguments\n";
    print "Usage: $0 [dir1]..\n";
    exit ERROR;
}

# Check that each provided directory exists and is readable.
# If not, report the error and continue with the next one;
# otherwise process the directory.
# A directory name containing white space must be quoted
# when it is passed as an argument to the script.
foreach my $dirname (@ARGV) {
    chomp($dirname);
    $dirname = trimall($dirname);
    if (! -d $dirname) {
        print "$dirname is not a directory\n";
        $exit_status = ERROR;
        next;
    } elsif (! -r $dirname) {
        print "$dirname is not readable\n";
        $exit_status = ERROR;
        next;
    }
    processfiles($dirname);
}
print "\n";
exit $exit_status;

# processfiles:
#   Walk recursively through the directory. For each regular file,
#   compute the MD5 checksum (see getmd5checksum), move the file to the
#   target folder under the name <md5>_<mtime><extension>, and print
#   the new full path.
#
# parameters:
#   string dirname: directory to process.
# return type: integer: success/failure code.
sub processfiles
{
    use Cwd 'abs_path';
    my $dirname = abs_path($_[0]);
    opendir(my $dirh, $dirname) or do {
        warn "Cannot open $dirname: $!\n";
        return 1;
    };
    # Read all files and directories, excluding the current
    # directory '.' and the parent directory '..'.
    my @files = sort(grep { !/^\.\.?$/ } readdir($dirh));
    closedir($dirh);
    foreach my $file (@files) {
        my $fullpath = $dirname . "/" . $file;
        if (-d $fullpath) {
            processfiles($fullpath);
        } else {
            my $md5code = getmd5checksum($fullpath);
            # Modification time as Unix epoch seconds.
            my $mdate = (stat $fullpath)[9];
            # Keep the original extension, if there is one.
            my ($ext) = $file =~ /(\.[^.]+)$/;
            $ext = "" unless defined $ext;
            my $dirname2    = "C:/Users/GunnarLaier/Downloads/HashKey";
            my $newfullpath = $dirname2 . "/" . $md5code . "_" . $mdate . $ext;
            move($fullpath, $newfullpath)
                or warn "Could not move $fullpath: $!\n";
            print "$newfullpath\n";
        }
    }
    return 0;
}

# getmd5checksum:
#   Generate a hexadecimal MD5 checksum (128 bits, 32 hex digits) over
#   the string "[userid] [groupid] [permission] [inode]" followed by
#   the file contents.
#   Note: because the metadata string is part of the digest, two copies
#   get the same checksum only if their uid, gid, mode and inode values
#   are also the same.
#
# parameters:
#   string file: file for which the MD5 checksum is calculated.
# return type: MD5 checksum of the conceptual string plus file contents.
sub getmd5checksum
{
    my $file = shift;
    if (! -r $file) {
        return "Not readable";
    } else {
        open(my $fh, '<', $file) or return "";
        binmode($fh);    # read the audio data byte for byte
        $digest->reset();
        my $fileInfo = getfileinfo($file);
        $digest->add($fileInfo);
        $digest->addfile($fh);
        close($fh);
        return $digest->hexdigest;
    }
}

# getfileinfo:
#   Return a formatted string for the given file as
#   [uid] [gid] [mode] [ino]
#
# parameters:
#   string filename: file for which the conceptual string is built.
# return type: string: conceptual string in the
#   format [uid] [gid] [permission] [ino]
sub getfileinfo
{
    my $file = shift;
    my (undef, $ino, $mode, undef, $uid, $gid) = stat($file);
    my $oct = sprintf("%o", $mode & 07777);
    return $uid . " " . $gid . " " . $oct . " " . $ino;
}

# trimall:
#   Trim leading and trailing white space characters from a string.
#
# parameters:
#   string str: input string from which the white space is removed.
# return type: string: the trimmed string.
sub trimall
{
    my $arg = shift;
    $arg =~ s/^\s+|\s+$//g;
    return $arg;
}

In the script I have specified a folder to which all files are moved under a new name of the form 34246d8224ed1b737f687b5db539f549_1626526406, that is, md5sum, underscore, Unix epoch. The script also carries over the original file's extension, so the file still opens without problems. Next I run the second script, sortSelect.pl, which compares files by md5sum and keeps only the first file for each checksum; since the names sort by checksum and then by epoch, that is the copy with the oldest timestamp. This brought my collection down from 110 GB to 71 GB, 37937 files in total.
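The selection step can be sketched on plain file names, without touching the disk. The helper `keep_oldest_per_checksum` below is hypothetical, and the `.m4a`/`.mp3` extensions, the second timestamp, and the second checksum are made-up illustration values. It compares epochs numerically; a plain sort by name gives the same order as long as all epochs have the same number of digits, which holds for 10-digit epochs (September 2001 to the year 2286).

```perl
use strict;
use warnings;

# Hypothetical helper: given names of the form "<md5>_<epoch><ext>", keep one
# name per checksum, namely the one with the smallest epoch (the oldest copy).
sub keep_oldest_per_checksum {
    my @names = @_;
    my %best;    # md5 => [epoch, name]
    for my $name (@names) {
        my ($md5, $epoch) = $name =~ /^([0-9a-f]{32})_(\d+)/ or next;
        if (!exists $best{$md5} || $epoch < $best{$md5}[0]) {
            $best{$md5} = [$epoch, $name];
        }
    }
    my @kept = map { $_->[1] } values %best;
    @kept = sort @kept;
    return @kept;
}

my @kept = keep_oldest_per_checksum(
    '34246d8224ed1b737f687b5db539f549_1626526406.m4a',
    '34246d8224ed1b737f687b5db539f549_1600000000.m4a',
    'ffffffffffffffffffffffffffffffff_1500000000.mp3',
);
print "$_\n" for @kept;
```

Names that do not match the pattern are skipped rather than treated as their own group, which is a design choice of this sketch only.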

The script that performs the final comparison is almost identical to the one above. The two scripts could be merged into one script that takes an option as input, and the folder paths could likewise be passed as arguments. I also plan to extend the script to produce a CSV file with file name, file type, path, date/Unix epoch, file size, and md5sum, plus a column stating whether the file is a candidate for deletion. This can be used to sort through images, documents, audio files, and much more. The update will be the subject of a later blog post. Combined with search algorithms, this way of working makes it possible to reduce everyday data-management problems, but it never caught on on the desktop, where the future more likely lies in records-management systems and version control, also for private use, with ever greater integration of online communication and telephone systems.
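One possible shape for that CSV output is sketched below. It is not the planned script: the helpers `csv_field` and `inventory_csv`, the column names, and the sample records are all assumptions made for illustration. The quoting is hand-rolled to keep the sketch free of dependencies; a dedicated module such as Text::CSV from CPAN would likely be more robust in real use.

```perl
use strict;
use warnings;

# Quote one CSV field: wrap it in double quotes when needed, doubling any quotes.
sub csv_field {
    my ($value) = @_;
    $value = '' unless defined $value;
    if ($value =~ /[",\r\n]/) {
        $value =~ s/"/""/g;
        return qq("$value");
    }
    return $value;
}

# Hypothetical helper: each record is a hash with name, type, path, epoch,
# size and md5. Within each checksum the oldest record is kept ("no");
# every later record is marked as a deletion candidate ("yes").
sub inventory_csv {
    my @records = @_;
    my @columns = qw(name type path epoch size md5);
    my @lines   = (join(',', @columns, 'delete_candidate'));
    my %seen;
    my @sorted = sort {
        $a->{md5} cmp $b->{md5} || $a->{epoch} <=> $b->{epoch}
    } @records;
    for my $rec (@sorted) {
        my $candidate = $seen{ $rec->{md5} }++ ? 'yes' : 'no';
        push @lines, join(',', (map { csv_field($rec->{$_}) } @columns), $candidate);
    }
    return join("\n", @lines) . "\n";
}

# Made-up sample records: the same recording found in two backup folders.
print inventory_csv(
    { name => 'optagelse 0001.m4a', type => 'm4a', path => 'backup1',
      epoch => 1626526406, size => 2048, md5 => '34246d8224ed1b737f687b5db539f549' },
    { name => 'optagelse 0001.m4a', type => 'm4a', path => 'backup2',
      epoch => 1600000000, size => 2048, md5 => '34246d8224ed1b737f687b5db539f549' },
);
```

Marking candidates in a column instead of deleting right away leaves the final decision to a manual review of the list.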

sortSelect.pl

#!/usr/bin/perl
use strict;
use warnings;
use File::Copy qw(copy);

use constant SUCCESS => 0;    # Exit code for successful execution.
use constant ERROR   => 1;    # Exit code for an error during execution.

my $exit_status = SUCCESS;

if (@ARGV < 1) {
    print "Insufficient number of arguments\n";
    print "Usage: $0 [dir1]..\n";
    exit ERROR;
}

# Check that each provided directory exists and is readable.
# If not, report the error and continue with the next one;
# otherwise process the directory.
foreach my $dirname (@ARGV) {
    chomp($dirname);
    $dirname = trimall($dirname);
    if (! -d $dirname) {
        print "$dirname is not a directory\n";
        $exit_status = ERROR;
        next;
    } elsif (! -r $dirname) {
        print "$dirname is not readable\n";
        $exit_status = ERROR;
        next;
    }
    processfiles($dirname);
}
print "\n";
exit $exit_status;

# processfiles:
#   Walk recursively through the directory. The files are expected to
#   be named <md5>_<epoch><extension>, so that sorting by name groups
#   identical checksums together. Only the first file of each group is
#   copied to the target folder, and its new full path is printed.
#
# parameters:
#   string dirname: directory to process.
# return type: integer: success/failure code.
sub processfiles
{
    use Cwd 'abs_path';
    my $dirname = abs_path($_[0]);
    opendir(my $dirh, $dirname) or do {
        warn "Cannot open $dirname: $!\n";
        return 1;
    };
    # Read all files and directories, excluding the current
    # directory '.' and the parent directory '..'.
    my @files = sort(grep { !/^\.\.?$/ } readdir($dirh));
    closedir($dirh);
    my $md5textold = "string";
    foreach my $file (@files) {
        my $fullpath = $dirname . "/" . $file;
        if (-d $fullpath) {
            processfiles($fullpath);
        } else {
            # The checksum is everything before the first underscore.
            my ($md5text) = $file =~ /(^[^_]+)/;
            $md5text = "" unless defined $md5text;
            if ($md5textold ne $md5text) {
                my $dirname2    = "C:/Users/GunnarLaier/Downloads/HashKey";
                my $newfullpath = $dirname2 . "/" . $file;
                copy($fullpath, $newfullpath)
                    or warn "Could not copy $fullpath: $!\n";
                print "$newfullpath\n";
            } else {
                # Same checksum as the previous file; deletion is left disabled.
                #unlink($fullpath);
            }
            $md5textold = $md5text;
        }
    }
    return 0;
}

# getfileinfo:
#   Return a formatted string for the given file as
#   [uid] [gid] [mode] [ino]
#
# parameters:
#   string filename: file for which the conceptual string is built.
# return type: string: conceptual string in the
#   format [uid] [gid] [permission] [ino]
sub getfileinfo
{
    my $file = shift;
    my (undef, $ino, $mode, undef, $uid, $gid) = stat($file);
    my $oct = sprintf("%o", $mode & 07777);
    return $uid . " " . $gid . " " . $oct . " " . $ino;
}

# trimall:
#   Trim leading and trailing white space characters from a string.
#
# parameters:
#   string str: input string from which the white space is removed.
# return type: string: the trimmed string.
sub trimall
{
    my $arg = shift;
    $arg =~ s/^\s+|\s+$//g;
    return $arg;
}

Identifying large files is also useful, and it can be done with PowerShell as well. Here is a command that lists files larger than 500 MB.

PowerShell

Get-ChildItem C:\Users\GunnarLaier\Documents\FilerArkivRåData\ -recurse | where-object {$_.length -gt 524288000} | Sort-Object length | ft fullname, length -auto
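For readers who would rather stay in Perl, roughly the same listing can be sketched with the core module File::Find. The helper name `large_files` is hypothetical, and the folder to scan is taken from the command line instead of being hard-coded; the threshold 500 * 1024 * 1024 equals the 524288000 bytes used in the PowerShell command.

```perl
use strict;
use warnings;
use File::Find;

# Hypothetical helper: return [path, size] pairs for regular files larger than
# $min_bytes below $root, sorted by ascending size.
sub large_files {
    my ($root, $min_bytes) = @_;
    my @hits;
    find({
        no_chdir => 1,
        wanted   => sub {
            return unless -f $File::Find::name;
            my $size = (-s _) || 0;    # reuse the stat result from -f
            push @hits, [$File::Find::name, $size] if $size > $min_bytes;
        },
    }, $root);
    @hits = sort { $a->[1] <=> $b->[1] } @hits;
    return @hits;
}

# Usage: perl largefiles.pl <folder>   (the script name is only an example)
my $root = shift @ARGV;
if (defined $root) {
    printf "%s  %d\n", @$_ for large_files($root, 500 * 1024 * 1024);
}
```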
