awk issue splitting a fixed-width file containing line feed in data


 
Thread Tools Search this Thread
Top Forums Shell Programming and Scripting awk issue splitting a fixed-width file containing line feed in data
# 1  
Old 02-16-2017
awk issue splitting a fixed-width file containing line feed in data

Hi Forum.

I have the following script that splits a large fixed-width file into smaller multiple fixed-width files based on input segment type.

The main command in the script is:
Code:
awk -v search_col_pos=$search_col_pos -v search_str_len=$search_str_len -v segment_type="$segment_type" 'substr($0, search_col_pos, search_str_len) ~ segment_type {print $0}'

To call the script:
Code:
file_splitter.sh filename CO 33 2

Where CO is the segment_type to search for ($segment_type)
           33 is the start search position ($search_col_pos)
             2 is the length of the search string ($search_str_len)

The issue that I'm having is that some of the data columns in the source file contains some binary values along with a LF (Line Feed) character in it.

As a result, the awk command interprets the LF in the data as the end of the line and only partially returns the record. Anything after the LF in the data column is not returned. Therefore, I have an incomplete record.

How do I handle this issue?

Thanks.
# 2  
Old 02-16-2017
Wouldn't a decent, representative input sample be nice?
# 3  
Old 02-16-2017
Data Sample

Sure can. Thanks.

LF is on the 2nd record (After bolded red text)

Code:
00003061895700000000208001X UK1001249751+0000000752+0000000753+0000000754+0000000755+0000000756+0000000757+0000000758+0000000759+0000000760+0000000761+0000000762+0000000763+0000000764+0000000765+0000000766+0000000767+0000000768+0000000769+0000000770+0000000771+0000000772+0000000773+0000000774+0000000775+0000000776+0000000777+0000000778+0000000779+0000000780+0000000781+0000000782+0000000783+0000000784+0000000785+0000000786+0000000787+0000000788+0000000789+0000000790+0000000791+0000000792+0000000793+0000000794+0000000795+0000000796+0000000797+0000000798+0000000799+0000000800+0000000801+0000000802+0000000803+0000000804+0000000805+0000000806+0000000807+0000000808+0000000809+0000000810+0000000811+0000000812+0000000813+0000000814+0000000815+0000000816+0000000817+0000000818+0000000819+0000000820+0000000821+0000000822+0000000823+0000000824+0000000825+0000000826+0000000827+0000000828+0000000829+0000000830+0000000831+0000000832+0000000833+0000000834+0000000835+0000000836+0000000837+0000000838+0000000839+0000000840+0000000841+0000000842+0000000843+0000000844+0000000845+0000000846+0000000847+0000000848+0000000849+0000000850+0000000851+0000000852+0000000853+0000000854+0000000855+0000000856+0000000857+0000000858+0000000859+0000000860+0000000861+0000000862+0000000863+0000000864+0000000865+0000000866+0000000867+0000000868+0000000869+0000000870+0000000871+0000000872+0000000873+0000000874+0000000875+0000000876+0000000877+0000000878+0000000879+0000000880+0000000881+0000000882+0000000883+0000000884+0000000885+0000000886+0000000887+0000000888+0000000889+0000000890+0000000891+0000000892+0000000893+0000000894+0000000895+0000000896+0000000897+0000000898+0000000899+0000000900+0000000901+0000000902+0000000903+0000000904+0000000905+0000000906+0000000907+0000000908+0000000909+0000000910+0000000911+0000000912+0000000913+0000000914+0000000915+0000000916+0000000917+0000000918+0000000919+0000000920+0000000921+0000000922+0000000923+0000000924+0000000925+0000000926+0000000927+0000000928+0000000929+0000000930+0000000931+0000000932+0000000933+0000000934+0000000935+0000000936+0000000937+0000000938+0000000939+0000000940+0000000941+0000000942+0000000943+0000000944+0000000945+0000000946+0000000947+0000000948+0000000949+0000000950+0000000951+0000000952+0000000953+0000000954+0000000955+0000000956+0000000957+0000000958+0000000959+0000000960+0000000961+0000000962+0000000963+0000000964+0000000965+0000000966+0000000967+0000000968+0000000969+0000000970+0000000971+0000000972+0000000973+0000000974+0000000975+0000000976+0000000977+0000000978+0000000979+0000000980+0000000981+0000000982+0000000983+0000000984+0000000985+0000000986+0000000987+0000000988+0000000989+0000000990+0000000991+0000000992+0000000993+0000000994+0000000995+0000000996+0000000997+0000000998+0000000999+0000000
00003061895700000000208001X UF1101600000000000000000001001111111111110000000000000000000000000000000000000000000000000000100002000000000000000000000ON 000E MC02+000000000+000000000+000000091+000000000+000000100002+000001100004+000001420005+000002020003+000001200+000000000+000000000000000+000000404+000000420+000000020+000000018001+00000010000006000270000000000000000000000078200000000000001000000000000000000000+199500+199500+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000002+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+00000000000000900000100000000000000100000000000000000000240000000000+0001003+0000001+0000001+0000010+0000103+0000000+0000000+0000000+0000000+0000000+0000086+00000000000000000000000000020170220170208201612000000000000000000000000000000000000+00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000020150900201509N00000000000000000000000000000000000000000000000000000000000000000000000400N    5    166  7                             17029029                                                    002004005006                                                                                                        
‡     ƒÀ       D                     -ƒ                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            
00003061895700000000212001X CU0101306189570000000021200 X 00000000000000000000201509150000000+00000000100001982030600000000000000000001009915903222177622001000004999001001+00001200020150915+000012000+000000000+0000000000001000004999001001+00000360020150915+000003600+000000000+0000000000001000004999001001+00000000000000000+000000500+000000000+0000000000000000000000000000+00000000000000000+000000000+000000000+0000000000000000000000000000+00000000000000000+000000000+000000000+0000000000000000000000000000+00000000000000000+000000000+000000000+0000000000000000000000000000+00000000000000000+000000000+000000000+0000000000000000000000000000+00000000000000000+000000000+000000000+0000000000000000000000000000+00000000000000000+000000000+000000000+0000000000000000000000000000+00000000000000000+000000000+000000000+0000000000+000000000+000000000+000000000+000000000+000000000+000000000+000000000+0000000000000000000+000000000+00000000000+000000000+000000000+000000000+000000000+000000000+000000000+000000000+000000000+000000000+00000000000000000000000000000+000000000+000000000+000000000+000000000+000000000+000000000     00000000+000000000     00000000+00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000008040080200802008020082800828008280080200802000000080100801000000000+00023876+00000000CON120170208000P010

# 4  
Old 02-16-2017
So - the records are fixed width PLUS <LF> terminated? How come a "split" record is WAY longer than a normal record (not counting the spaces at the end), and why do the two normal records (1 & 3) differ in length? None of the records seem to have structure elements in common?
# 5  
Old 02-16-2017
Hi RudiC.

If you look at position 33, there are 3 different types of segments (UK, UX, CU).

That's why we cannot process the source file as is since each segment can have different data lengths.

We need to split the source file by segment types first and then process each sub file separately.

I hope I answered your question.

Thanks.

Last edited by pchang; 02-16-2017 at 05:08 PM.. Reason: more detail info
# 6  
Old 02-16-2017
for a program to process this file correctly you will need to list every possible segment type and it's length. The awk script can then fetch characters without relying on <LF> characters which can randomly appear within the actual segment.
# 7  
Old 02-17-2017
Try something like this to reconstruct the lines before processing:
Code:
awk -v search_col_pos=$search_col_pos -v search_str_len=$search_str_len -v segment_type="$segment_type" '
  function cond_print(record) { 
    if (substr(record, search_col_pos, search_str_len) ~ segment_type) print record
  }
  !/^[0-9]{25}/ {
    prev=prev RS $0
    next
  }
  {
    if(NR>1) cond_print(prev);
    prev=$0
  } 
  END {
    cond_print(prev)
  }
' file


Last edited by Scrutinizer; 02-17-2017 at 01:54 AM..
Login or Register to Ask a Question

Previous Thread | Next Thread

10 More Discussions You Might Find Interesting

1. Answers to Frequently Asked Questions

Fixed width file issue and resolutions

I have a fixed width file with no separators , but the file has JUNK characters in it and i know how to remove it. but in few cases these junk characters having created a mess by splitting single row into 2 or 3 rows. I need to put them back to a single line like rest of the rows. This fixed... (2 Replies)
Discussion started by: santoshkm
2 Replies

2. Shell Programming and Scripting

Replace using awk on fixed width file.

All, I used to use following command to replace specific location in a fixed width file. Recently looks like my command stopped working as intended. We are on AIX unix. awk 'function repl(s,f,t,v) { return substr(s,1,f-1) sprintf("%-*s", t-f+1, v) substr(s,t+1) } NR<=10 {... (3 Replies)
Discussion started by: pinnacle
3 Replies

3. Post Here to Contact Site Administrators and Moderators

How to sum up data in fixed width file with decimal point?

HI Everyone, I have below source file AAA|NAME1|ADDRESS1|300.20 BBB|NAME2|ADDRESS2|400.31 CCC|NAME3|ADDRESS3|300.34 I have requirement where I need to sum up fourth field in above fixed width pipe delimited flat file. When I use below code, it gives me value 1001.00 But I am expecting... (1 Reply)
Discussion started by: patricjemmy6
1 Replies

4. Shell Programming and Scripting

Print column details from fixed width file using awk command

hi, i have a fixed width file with multiple columns and need to print data using awk command. i use: awk -F "|" '($5 == BH) {print $1,$2,$3}' <non_AIM target>.txt for a delimiter file. but now i have a fixed width file like below: 7518 8269511BH 20141224951050N8262 11148 8269511BH... (5 Replies)
Discussion started by: kcdg859
5 Replies

5. Shell Programming and Scripting

Remove new line character and add space to convert into fixed width file

I have a file with different record length. The file as to be converted into fixed length by appending spaces at the end of record. The length should be calculated based on the record with maximum length in the file. If the length is less than the max length, the spaces should be appended... (4 Replies)
Discussion started by: Amrutha24
4 Replies

6. Shell Programming and Scripting

Ignore Header and Footer and Sort the data in fixed width file

Hi Experts, I want to Sort the data in fixed width file where i have Header and Footer also in file. I m using below commad to do the sort based on field satarting from 15 position to 17 position , but it is not ignoring the Header and Footer of the file while sorting. In the output i am... (5 Replies)
Discussion started by: sasikari
5 Replies

7. Shell Programming and Scripting

awk: creating a fixed-width single file from 2 different files

I have to create a single file from three files, Please see below for samples: day.txt 20090101 20090102 item.txt 123456789101 12345678910209 1234567891 str.txt 1 12 123 output.txt 20090101123456789101 1 0 2009010112345678910209 12 ... (2 Replies)
Discussion started by: tamahomekarasu
2 Replies

8. Shell Programming and Scripting

Appending string (charachters inside the line) to a fixed width file using awk or sed

Source File: abcdefghijklmnop01qrstuvwxyz abcdefghijklmnop02qrstuvwxyz abcdefghijklmnop03qrstuvwxyz abcdefghijklmnop04qrstuvwxyz abcdefghijklmnop05qrstuvwxyz Whatever characters are in 17-18 on each line of the file, it should be concatenated to the same line at the character number... (6 Replies)
Discussion started by: tamahomekarasu
6 Replies

9. Shell Programming and Scripting

edit entire column from a fixed-width file using awk or sed

Col1 Col2 Col3 Col4 12 Completed 08 0830 12 In Progress 09 0829 11 For F U 07 0828 Considering the file above, how could i replace the third column the most efficient way? The actual file size is almost 1G. I am... (10 Replies)
Discussion started by: tamahomekarasu
10 Replies

10. UNIX for Dummies Questions & Answers

Fixed Width file using AWK

I am using the following command at the Unix prompt to make my 'infile' into a fixed width file of 100 characters. awk '{printf "%-100s\n",$0}' infile > outfile However, there are some records with a special character "©" These records are using 3 characters in place of one and my record... (2 Replies)
Discussion started by: alok.benjwal
2 Replies
Login or Register to Ask a Question