Help reformatting column


 
# 1  
Old 02-05-2017

Hello UNIX experts,

I'm stumped trying to find a way to reformat a column. The input is a two-column, tab-delimited file. For every term that appears in column 2, I would like to summarize whether that term appears for each entry in column 1. In other words, make a header for each term that appears in column 2, then mark whether that term is present for each entry in column 1. It's somewhat complicated to explain, so here is an example:

Code:
$ cat input.txt
sample1	workflow1;workflow2;workflow3;workflow4;workflow5
sample2	workflow2;workflow3;workflow1;workflow5
sample3	workflow3;workflow1;workflow4;workflow5;workflow2
sample4	workflow3;workflow1;workflow4;workflow5;workflow2
sample5	workflow3;workflow1;workflow4;workflow5;workflow2
sample6	workflow1
sample7	workflow8

Code:
$ cat output.txt
Sample	workflow1	workflow2	workflow3	workflow4	workflow5	workflow8	
sample1	X	X	X	X	X	NA
sample2	X	X	X	NA	X	NA
sample3	X	X	X	X	X	NA
sample4	X	X	X	X	X	NA
sample5	X	X	X	X	X	NA
sample6	X	NA	NA	NA	NA	NA
sample7	NA	NA	NA	NA	NA	X

# note: X is "yes, this workflow exists for this sample"

The values "workflow1", "workflow2", etc. may contain special characters such as underscores, hyphens, and colons (e.g. workflow-1, work:flow1), so matches must be exact. I wrote working bash code, but it is terribly slow and poorly written. Right now, a 2,000-line file takes 20 minutes because my code sucks and loops over every sample/workflow pair! For reference, my strategy:

Code:
# For each sample and workflow in $file, use grep to check whether the workflow is listed, then write X or NA to a reply file.
for SAMPLE in $(cat "$file.samples"); do        # column 1 from $file
  echo "Preparing $SAMPLE"
  for WORKFLOW in $(cat "$file.workflows"); do  # all the unique terms from column 2
    # Put one term per line so grep -Fx compares whole terms literally,
    # including the first and last terms, which have no surrounding semicolons.
    if grep -w "$SAMPLE" "$file" | cut -f2 | tr ';' '\n' | grep -Fxq -- "$WORKFLOW"; then
      echo "X" > "$file.$SAMPLE.$WORKFLOW.reply"
    else
      echo "NA" > "$file.$SAMPLE.$WORKFLOW.reply"
    fi
  done
  echo "$SAMPLE" > "$file.sample"
  paste "$file.sample" "$file.$SAMPLE."*.reply >> "$file.sum.txt"
  rm "$file.sample" "$file.$SAMPLE."*.reply
done
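
On the exact-match requirement: a minimal sketch, assuming a grep that supports the standard `-F`, `-x`, and `-q` options, of testing one term against a semicolon-separated list so that characters such as `.`, `-`, or `:` are compared literally instead of as a pattern. `has_term` is a hypothetical helper name used only for illustration.

```shell
#!/bin/sh
# has_term LIST TERM: succeed if TERM is exactly one of the ';'-separated
# items in LIST. (Hypothetical helper, for illustration only.)
has_term() {
  # One item per line; -F = literal string, -x = whole line, -q = no output.
  printf '%s\n' "$1" | tr ';' '\n' | grep -Fxq -- "$2"
}

list='work:flow1;workflow-1;workflow10'
has_term "$list" 'workflow-1' && echo "X" || echo "NA"   # prints X
has_term "$list" 'workflow1'  && echo "X" || echo "NA"   # prints NA (no partial match on workflow10)
has_term "$list" 'work.flow1' && echo "X" || echo "NA"   # prints NA ('.' is not a wildcard here)
```

This still starts several processes per check, so it only addresses correctness of the match, not the speed problem.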

Many many thanks in advance!!

Torch
# 2  
Old 02-05-2017
Try this awk script:

Code:
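awk '
{
  # Split the semicolon-separated workflow list in column 2 into V[1..values]
  values=split($2,V,";")
  sample[NR]=$1                 # remember the sample names in input order
  for(i=1;i<=values;i++) {
      if(!(V[i] in CI)) {       # first time this workflow is seen
         CI[V[i]]               # mark it as known
         CH[++col]=V[i]         # and assign it the next output column
      }
      CNT[NR,V[i]]              # record that this line lists this workflow
  }
}
END {
   # Header row: one column per workflow, in order of first appearance
   printf "Sample"
   for(i=1;i in CH;i++) printf "\t%s", CH[i]
   # One row per input line: X if the (line, workflow) pair was recorded, else NA
   for(ln=1;ln<=NR;ln++) {
      printf "\n%s",sample[ln]
      for(i=1;i in CH;i++)
         printf "\t%s", ln SUBSEP CH[i] in CNT ? "X" : "NA"
   }
   printf "\n"
}' input.txt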
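
A possible variant of the script above, sketched under the assumption that the fields are strictly tab-separated, so that a workflow name containing a space would stay intact. It sets `-F'\t'` explicitly and uses the parenthesized `(r, header[c]) in has` form of the membership test. The names `pivot`, `seen`, `header`, and `has` are illustrative and do not come from the original answer.

```shell
#!/bin/sh
# pivot [FILE...]: read "sample<TAB>term1;term2;..." lines and print an
# X/NA presence matrix. (Hypothetical wrapper name, for illustration only.)
pivot() {
  awk -F'\t' '
  {
    sample[NR] = $1                       # sample names in input order
    n = split($2, V, ";")                 # terms listed on this line
    for (i = 1; i <= n; i++) {
      if (!(V[i] in seen)) {              # new term: give it a column
        seen[V[i]]
        header[++ncol] = V[i]
      }
      has[NR, V[i]]                       # this line lists this term
    }
  }
  END {
    printf "Sample"
    for (c = 1; c <= ncol; c++) printf "\t%s", header[c]
    printf "\n"
    for (r = 1; r <= NR; r++) {
      printf "%s", sample[r]
      for (c = 1; c <= ncol; c++)
        printf "\t%s", ((r, header[c]) in has) ? "X" : "NA"
      printf "\n"
    }
  }' "$@"
}

# Small made-up input with a space, a hyphen, and a colon in the term names.
printf 'sample1\tworkflow1;work flow-2\nsample2\twork flow-2\nsample3\twork:flow8\n' | pivot
```

Called as `pivot input.txt`, it should give the same table as the original script for the sample data in this thread, because the columns are still ordered by first appearance.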

These 2 Users Gave Thanks to Chubler_XL For This Post:
# 3  
Old 02-05-2017
Quote:
Originally Posted by Chubler_XL
Try this awk script:


A beautiful solution! It appears to work as expected, and is ~2000x faster than grep'ing to disk a million times.

Thanks very much!