Post 302267476 by cmcnorgan on Friday 12th of December 2008 12:29:27 PM
Finding longest common substring among filenames

I will be performing a task on several directories, each containing a large number of files (2500+) that follow a regular naming convention:

YYYY_MM_DD_XX.foo_bar.A.B.some_different_stuff.EXT

What I would like to do is automatically discover the part of the filenames that is common to all 2500 files, so that a script can use it as a base name. In practice, this will end up being "YYYY_MM_DD_XX.foo_bar."

I've figured out that I'll have to use ls to get all the filenames, but I don't know of any command that finds the longest substring common to a large number of strings. I thought perhaps some sed guru out there would find this problem trivial. You sed experts always blow my mind.
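
Since the shared part sits at the start of every filename, the "longest common substring" here is really a longest common prefix, which is much easier to compute. The following is only a rough sketch using awk rather than sed. It assumes one filename per line on standard input and no newlines inside the names; `common_prefix` is a made-up helper name, not a standard command.

```shell
# Hypothetical helper: print the longest prefix shared by every input line.
common_prefix() {
    awk '
        # Start with the whole first line as the candidate prefix.
        NR == 1 { prefix = $0; next }
        {
            # Drop trailing characters until the candidate
            # is a prefix of the current line (or is empty).
            while (prefix != "" && index($0, prefix) != 1)
                prefix = substr(prefix, 1, length(prefix) - 1)
        }
        END { print prefix }
    '
}
```

Run from inside one of the directories, something like `base=$(ls | common_prefix)` should leave the shared leading part in `base`. If the common text could sit in the middle of the names rather than at the start, this sketch would not find it.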
 
