Transpose Messy Data


 
# 1  
Old 05-04-2015

I have a messy, pipe-delimited ("|") input dataset.

I would like to create a file containing the ID plus each component of field 4 (the components are delimited by ";"), one component per line, in a long, skinny shape for easier processing.

A couple of complications are that field 4 may contain both commas and linefeed characters from the source.

Sample data looks like:

Code:
ID1|VAR2|VAR3|VAR4|VAR5
ID2|VAR2|VAR3|PART1;PART2|1;2
ID3|VAR2|VAR3|A, B, C;PART2;BEFORE LF\nAFTER LF|1;2;3
ID4|VAR2|VAR3|1;2;3,;4|1;2;3;4

I would like output like:

Code:
ID1|VAR4
ID2|PART1
ID2|PART2
ID3|A, B, C
ID3|PART2
ID3|BEFORE LF  AFTER LF
ID4|1
ID4|2
ID4|3
ID4|4

Is there an elegant way to do this at the command line?

Thanks!

Last edited by Corona688; 05-04-2015 at 04:38 PM..
# 2  
Old 05-04-2015
To keep the forums high quality for all users, please take the time to format your posts correctly.

First of all, use Code Tags when you post any code or data samples so others can easily read your code. You can easily do this by highlighting your code and then clicking on the # in the editing menu. (You can also type code tags [code] and [/code] by hand.)



Second, avoid adding color or different fonts and font size to your posts. Selective use of color to highlight a single word or phrase can be useful at times, but using color, in general, makes the forums harder to read, especially bright colors like red.

Third, be careful when you cut and paste: edit out any odd characters and make sure all links are working properly.

Thank You.

The UNIX and Linux Forums
# 3  
Old 05-04-2015
What have you tried to solve this problem?

I don't see anything in your description that explains why the transformations shown in red above happened. What input characters are supposed to be changed to spaces in the output? (The string "\n" is not a linefeed character, but it can be used in a format string to cause some programs to print a linefeed character.) What input characters are supposed to be deleted from the output?
# 4  
Old 05-04-2015
Thanks for the reply.

The data file is a large data file from the Federal government.

I would like to read the data using stat software packages.

The lines end with \r\n .

Field 4 is a text field. It can have a \n character embedded in it. Stat software packages tend to incorrectly split a record with an embedded \n character into two records. Commas are a legitimate part of field 4 (so I should not have removed them in the sample output). Field 4 is all caps.
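One quick way to see which physical lines are affected is to count the pipe-delimited fields on each line. The sketch below is only illustrative: the function name report_short_lines is made up here, and it assumes that a well-formed record has exactly five fields, as in the sample data.

```shell
# Hypothetical helper: list physical lines that do not have exactly
# five pipe-delimited fields (five is an assumption based on the sample).
report_short_lines() {
    awk -F'|' 'NF != 5 { printf "%d: %d field(s)\n", NR, NF }' "$1"
}
```

Calling report_short_lines patienttest2.txt prints a line number and field count for every line that was probably split by an embedded linefeed. A trailing carriage return does not change the field count, so it should not matter here.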

I have tried things like

Code:
awk -F"|" '{ print $1f"|"$4F }' < patienttest2.txt | sed 's/,/comma/g' | sed 's/|/,/' | sed 's/;/,/g' | awk -F , '{for (i=2;i<=NF;i++) if ($i>=0) print $1 FS $i}' | sed 's/comma/,/g'

The first awk extracts field 1 (the ID variable) and field 4 (the value of interest) from the file.

The first sed changes "," to "comma", because lowercase letters do not occur in the data file.

The next sed changes the pipe to a comma.

The second awk statement reshapes fields 1 and 4 so that there is a line with the ID variable for each component of field 4.

The last sed swaps "comma" back to ",".

Thanks!
# 5  
Old 05-05-2015
I don't see why you need to modify commas for what you seem to be trying to do.
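To illustrate the point about commas: if awk splits each line on "|" and then splits field 4 on ";", commas are never treated as delimiters and need no placeholder. This is a minimal sketch only; the name reshape_simple is hypothetical, and it assumes every record sits on a single line (it does nothing about embedded newlines).

```shell
# Hypothetical helper: print ID plus each ";"-separated component of
# field 4. Commas are left alone. Assumes one complete record per line.
reshape_simple() {
    awk -F'|' '{
        n = split($4, part, ";")    # only ";" separates components
        for (i = 1; i <= n; i++)
            print $1 "|" part[i]
    }' "$1"
}
```

Run as reshape_simple patienttest2.txt, this keeps a value such as "A, B, C" intact as one component.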

How is the <newline> (or <carriage-return><newline>) incorrectly inserted in field 4 supposed to be modified? Should it/they be removed, replaced by a single space, or replaced by two spaces (as in your sample)?

Does the software that creates your input ever incorrectly insert <newline> characters in field 5? Does it ever incorrectly insert <newline> characters in fields 1, 2, or 3?

Can there be more than one <newline> character incorrectly inserted in field 4 for what should be a single input line?
# 6  
Old 05-05-2015
Thanks for the reply!

There is no need to modify the commas if the second awk statement could split the lines on the pipe characters.

The <newline> should be replaced by a single space.

Fields 1, 2, and 3 are numeric fields, so they never have <newline> characters.

Field 5 is a text field and can have <newline> characters incorrectly inserted.

Yes, fields 4 and 5 can each have multiple <newline> characters.

Thanks!
# 7  
Old 05-05-2015
If I understand your problem correctly, I don't see any need for anything but one awk script for this problem. Try:
Code:
awk '
BEGIN {	FS = OFS = "|"
}
{	while(NF < 5) {
		if(NF <= 1) {
			# Read a continuation line for field 5 or 1st line
			# of next record.
			if(getline != 1) {
				# Break out on EOF
				break
			}
		} else {# Read continuation line for field 4.
			if((getline x) != 1) {
				# We should not hit EOF in the middle of a
				# continued line, but check for it anyway.
				break
			}
			$0 = $0 " " x	# Replace incorrect <newline> with a
					# space.
			$1 = $1		# Reset NF after combining lines.
		}
	}
	# Discard <carriage-return>s.
	gsub(/\r/, "")
	n = split($4, sf, ";")
	for(i = 1; i <= n; i++)
		print $1, sf[i]
}' patienttest2.txt

If you want to try this on a Solaris/SunOS system, change awk to /usr/xpg4/bin/awk.

If patienttest2.txt contains:
Code:
ID1|VAR2|VAR3|VAR4|VAR5
ID2|VAR2|VAR3|PART1;PART2|1;2
ID3|VAR2|VAR3|A, B, C;PART2;BEFORE LF
AFTER LF|1;2;3
ID4|VAR2|VAR3|1;2;3,;4|1;2;3;4
ID5|VAR2|VAR3|1,
2;3
4;5
6|f5
f6
f7
ID6|VAR2|VAR3|A,b;C,d|a
con

(with either <carriage-return><newline> line terminators or <newline> line terminators), the script produces the output:
Code:
ID1|VAR4
ID2|PART1
ID2|PART2
ID3|A, B, C
ID3|PART2
ID3|BEFORE LF AFTER LF
ID4|1
ID4|2
ID4|3,
ID4|4
ID5|1, 2
ID5|3 4
ID5|5 6
ID6|A,b
ID6|C,d

Does this match what you're trying to do?
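A possible variation, offered only as a sketch and not as the method used in the script above: if every genuine record ends in <carriage-return><newline> and every incorrectly inserted line break is a bare <newline> (an assumption the thread does not confirm), the record separator itself can do the joining. This relies on an awk that accepts a multi-character RS, such as gawk or mawk; POSIX leaves that behavior unspecified. The function name split_field4 is made up for this example.

```shell
# Hypothetical alternative: treat CR LF as the record separator so that
# any bare LF inside a record is ordinary data. Needs an awk that
# accepts a multi-character RS (for example gawk or mawk).
split_field4() {
    awk -v RS='\r\n' 'BEGIN { FS = OFS = "|" }
    {
        gsub(/\n/, " ")                 # stray newline -> single space
        n = split($4, part, ";")        # components of field 4
        for (i = 1; i <= n; i++)
            print $1, part[i]
    }' "$1"
}
```

Under that assumption, stray newlines in field 5 are also folded into the record, so no separate handling of field 5 continuation lines is needed. It would not work on a file that has already been converted to plain <newline> terminators.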