Delimit file based on character length using awk

02-23-2016

Hi,

I need help with a problem I came across recently.

I have an input file that I need to delimit based on character length.

Code:

$ cat Input.txt
12345sda231453
asd760kjol62569
sdasw4g76gdf57

There is also a comma-separated file that gives the "start of the field" and the "length of the field" for each field. The last field has no length, so it runs to the end of the line.

Code:

$ cat start_length.csv
1,2
3,3
6,3
9,

Expected output is as follows:

Code:

12|345|sda|231453
as|d76|0kj|ol62569
sd|asw|4g7|6gdf57

I have used awk to get the expected result as follows:

Code:

$ awk 'BEGIN{OFS="|"}{print substr($0,1,2),substr($0,3,3),substr($0,6,3),substr($0,9)}' Input.txt
12|345|sda|231453
as|d76|0kj|ol62569
sd|asw|4g7|6gdf57

The problem is that I have hardcoded the "start of the field" and "length of the field" values in the awk command above. Our real files are bigger, with more than 200,000 (2 lakh) records and more than 200 fields, so hardcoding these values for each file is not practical.

Is there a way to read start_length.csv and loop over its entries to get the desired output?

Prathmesh

02-23-2016

Hello Prathmesh,

Could you please try the following and let me know if it helps. Here fields is the start/length file and main_file is the input file.

Code:

awk 'FNR==NR{A[++i]=$1;B[i]=$2;next} {for(j=1;j<=i;j++){if(B[j]){C=C?C OFS substr($0,A[j],B[j]):substr($0,A[j],B[j])} else {C=C?C OFS substr($0,A[j]):substr($0,A[j])}};print C;C=""}' FS="," fields OFS="|" main_file

Output will be as follows.

Code:

12|345|sda|231453
as|d76|0kj|ol62569
sd|asw|4g7|6gdf57
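
To reproduce this on the sample data from the question, the sketch below recreates both files in a scratch directory and runs the same awk program with start_length.csv in place of fields and Input.txt in place of main_file. The variable names dir and result are only illustrative and are not part of the original command.

```shell
# Recreate the sample files from the question in a scratch directory.
dir=$(mktemp -d)
printf '%s\n' 12345sda231453 asd760kjol62569 sdasw4g76gdf57 > "$dir/Input.txt"
printf '%s\n' 1,2 3,3 6,3 9, > "$dir/start_length.csv"

# Same awk program as above; only the file names differ.
result=$(awk 'FNR==NR{A[++i]=$1;B[i]=$2;next} {for(j=1;j<=i;j++){if(B[j]){C=C?C OFS substr($0,A[j],B[j]):substr($0,A[j],B[j])} else {C=C?C OFS substr($0,A[j]):substr($0,A[j])}};print C;C=""}' FS="," "$dir/start_length.csv" OFS="|" "$dir/Input.txt")
printf '%s\n' "$result"

# Clean up the scratch directory.
rm -rf "$dir"
```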

Thanks,
R. Singh

02-23-2016

Thanks, Ravinder. Your code is working fine. Can you please explain what exactly it does, so that I can understand it better?

Prathmesh

02-23-2016

Hello Prathmesh,

Could you please go through the following explanation and let me know if it helps.

Code:

awk 'FNR==NR{                                         # True only while the first file is read: FNR restarts at 1 for every file, while NR (the total number of records read) keeps increasing.
A[++i]=$1;                                            # Increment i, then store the start position ($1) in A[i].
B[i]=$2;                                              # Store the length ($2) in B[i]; i is not incremented again, so A and B share the same index.
next}                                                 # Skip the remaining actions for lines of the first file.
{for(j=1;j<=i;j++){                                   # For each line of the second file, loop over all field definitions; i now holds their count.
if(B[j]){                                             # A length was given (the last line of the fields file has none).
C=C?C OFS substr($0,A[j],B[j]):substr($0,A[j],B[j])}  # Append the substring starting at A[j] with length B[j] to C, preceded by OFS unless C is still empty.
else {                                                # No length was given.
C=C?C OFS substr($0,A[j]):substr($0,A[j])}};          # Same, but substr() without a length returns everything from A[j] to the end of the line.
print C;                                              # Print the assembled line.
C=""}' FS="," fields OFS="|" main_file                # Reset C for the next line. The assignments are applied in order: FS="," before fields is read, OFS="|" before main_file is read. FS stays a comma for main_file, which is harmless because only $0 is used there.
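
The two mechanisms this explanation relies on, the FNR==NR test and variable assignments placed between file names, can be seen in isolation with a small throwaway example. The file names, their contents and the demo variable are made up for this illustration only.

```shell
# Two tiny, made-up input files in a scratch directory.
tmp=$(mktemp -d)
printf 'a,b\n' > "$tmp/first.txt"
printf 'c,d\n' > "$tmp/second.txt"

# Print FNR, NR and the first field of every line.
# FS=, is applied before first.txt is read, FS=";" before second.txt is read.
demo=$(awk '{ print FNR, NR, $1 }' FS=, "$tmp/first.txt" FS=';' "$tmp/second.txt")
printf '%s\n' "$demo"

rm -rf "$tmp"
```

This prints "1 1 a" for the first file and "1 2 c,d" for the second: FNR equals NR only while the first file is being read, and the comma stops being a separator once FS has been reassigned for the second file.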

Thanks,
R. Singh

02-23-2016

For awks that can handle empty field separators, try

Code:

awk 'FNR == NR {S[NR] = $1; CNT = NR; next} {for (i=2; i<=CNT; i++) $S[i] = "|" $S[i]} 1' FS=, file2 FS="" OFS="" file1 
12|345|sda|231453
as|d76|0kj|ol62569
sd|asw|4g7|6gdf57

RudiC

02-23-2016

Thanks. I will go through it and let you know in case of any doubt.

Thanks, RudiC. Can you please explain the code?

Prathmesh

02-23-2016

With FS="", every character is a field of its own. The array S holds the char positions from file2, and file1's fields (= chars) identified by S are prefixed with | .
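
For awks that do not split on an empty FS, the same idea of prefixing the listed character positions with | can be sketched with substr() instead. This is an alternative under stated assumptions, not the one-liner above: it assumes the start positions in the first file are in ascending order, it uses the file names from the question, and dir and portable are illustrative variable names.

```shell
# Recreate the sample files from the question in a scratch directory.
dir=$(mktemp -d)
printf '%s\n' 12345sda231453 asd760kjol62569 sdasw4g76gdf57 > "$dir/Input.txt"
printf '%s\n' 1,2 3,3 6,3 9, > "$dir/start_length.csv"

portable=$(awk -F, '
FNR == NR { S[++n] = $1; next }      # remember each start position from the first file
{
    line = $0
    # Work from the rightmost position to the left, so that an inserted
    # delimiter never shifts a position that is still to be processed.
    for (k = n; k >= 2; k--)
        line = substr(line, 1, S[k] - 1) "|" substr(line, S[k])
    print line
}' "$dir/start_length.csv" "$dir/Input.txt")
printf '%s\n' "$portable"

rm -rf "$dir"
```

Like the FS="" version, this only needs the start positions; the lengths in the second column are ignored.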

RudiC
