Check group consistencies


 
# 1  
Old 01-09-2015

Hello masters,

Please help here. I have 4 columns, and I am looking for consistent 'geno' values within 'line', 'part' combinations. If the geno values are not consistent within a 'line', 'part' block, then we delete that block. One complication is that geno values are always 2 characters, but AT and TA mean the same thing; similarly CG = GC, etc.

I have 2 requests

1) Convert all 'geno' values in the whole dataset like AT and TA to either one or the other. I don't care which one, as long as they are consistent across the whole dataset.
2) Delete inconsistent 'line', 'part' combinations from the data. If a combination is consistent, then just keep any one row; in the example I have kept the first one.

Code:
line part serial geno
ax211 part1 1234 AA
gf345  part1 1345 TT
gf345  part1  3456 AA
gf345  part1 1346 TT
ax211 part2 1834 AA
gf345  part2 1395 TT
gf345  part2  3656 AA
gf345  part2 13746 TT
ax211 part3 1634 AA
gf345  part3 13345 AT
gf345  part3  34256 TA
gf345  part3 13446 AT

expected output

Code:
line part serial geno
ax211 part1 1234 AA
ax211 part2 1834 AA
ax211 part3 1634 AA
gf345  part3 13345 AT

For the first part I can try the following

Code:
sed "s/TA/AT/g;s/TC/CT/g;s/TG/GT/g;s/GC/CG/g;s/GA/AG/g;s/CA/AC/g;" file

but is there a smarter way to do this?

for the second part this is what I tried,

Code:
awk     'FNR==NR        {a[$1$2]+=$4;  b[$1$2]=$4;next}
                    $1$2 in b  {if (a[$1$2] ==1 ) print $0 }
        ' file file

# 2  
Old 01-09-2015
Being a bit short on time, I will deal only with the first part:

Quote:
Originally Posted by ritakadm
1) Convert all 'geno' values in the whole dataset like AT and TA to either one or the other. I dont care which one, as long as they are consistent across the whole dataset.
For the first part I can try the following

Code:
sed "s/TA/AT/g;s/TC/CT/g;s/TG/GT/g;s/GC/CG/g;s/GA/AG/g;s/CA/AC/g;" file

but is there a smarter way to do this?
No, your solution is good. It is possible to do the same in awk, though, and if you want to use awk anyway for the second part of your requirement, you might want to do it all in one pass instead of something like sed '....' file | awk '....'. This, of course, only applies if your two requirements are always done together as one step. If not, this point is moot.

I would make the regexp a bit more robust, though (here just the first rule as example):
Code:
sed "s/TA/AT/g" file

This will change every "TA" to "AT". But in fact you are interested only in changing the "TA" at the end of the line. Anchoring the pattern there also improves your code, because accidental changes somewhere in the middle of a line can no longer happen. Therefore:

Code:
sed "s/TA$/AT/g" file

For the same reason you can drop the "g" option of the substitution: you want to change only one "TA" per line (the one immediately before the line end), hence:

Code:
sed "s/TA$/AT/" file

Finally, as long as it is possible you should enclose sed-expressions in single quotes instead of double quotes. In your special case it doesn't matter but some meta-characters are expanded within double-quotes whereas in single quotes nothing is expanded at all. Therefore, finally:

Code:
sed 's/TA$/AT/' file

One last tip: I usually write sed scripts one command per line; it is easier to read this way, IMHO:

Code:
sed 's/TA$/AT/
     s/TC$/CT/
     s/TG$/GT/
     s/GC$/CG/
     s/GA$/AG/
     s/CA$/AC/' file
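To illustrate why the `$` anchor matters, here is a small sketch. The row is made up (a line name that happens to contain "TA"), and it assumes there is no trailing whitespace or carriage return after the geno column; otherwise the anchored pattern would not match at all.

```shell
# Hypothetical row whose line name happens to contain "TA".
row='TA211 part1 1234 TA'

# Unanchored: rewrites the line name as well as the geno column.
printf '%s\n' "$row" | sed 's/TA/AT/g'    # AT211 part1 1234 AT

# Anchored: touches only the two characters right before the line end.
printf '%s\n' "$row" | sed 's/TA$/AT/'    # TA211 part1 1234 AT
```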

I hope this helps.

bakunin

Last edited by bakunin; 01-09-2015 at 09:43 PM..
# 3  
Old 01-09-2015
Thanks, very helpful explanation! I was wondering if there is a shortcut for matching reversed 'geno' values without spelling out each pair in sed, like converting all generic xy$ and yx$ values to xy. Still waiting on some help with part 2.
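One possible generic approach, sketched in awk rather than sed: put the two letters of the geno column in alphabetical order, so any yx becomes xy without listing the pairs. The function name `normalize_geno` is made up for this example, and it assumes geno is always field 4. Rows that get changed are re-joined with single spaces, because assigning to `$4` makes awk rebuild the line.

```shell
# normalize_geno: hypothetical helper; reads "line part serial geno" rows.
normalize_geno() {
  awk '
  # Only touch 2-character values whose letters are out of order,
  # so a header such as "geno" is left alone.
  length($4) == 2 && substr($4, 1, 1) > substr($4, 2, 1) {
    $4 = substr($4, 2, 1) substr($4, 1, 1)
  }
  { print }
  ' "$@"
}

printf 'gf345 part3 34256 TA\ngf345 part3 13345 AT\n' | normalize_geno
```

It can read from a pipe as above, or from a file with `normalize_geno file`.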
# 4  
Old 01-11-2015
Is your file sorted? How many rows?

---------- Post updated 01-11-15 at 03:50 PM ---------- Previous update was 01-10-15 at 10:40 PM ----------

A very crude solution (maybe the experts can help with a better one). It could be problematic if your file is huge, but it is worth a try.

Code:
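awk 'FNR==NR {                       # first pass: count rows
       a[$1" "$2]++                  # rows per line/part block
       b[$1" "$2" "$4]++             # rows per line/part/geno
       next
     }
     $1" "$2 in a {                  # second pass: first row of each block
       $4 = ( a[$1" "$2] == b[$1" "$2" "$4] ) ? $4 : "multiple"
       delete a[$1" "$2]
       delete b[$1" "$2" "$4]
       print
     }' file file | awk '$4!="multiple"'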

# 5  
Old 01-11-2015
Another way:
Code:
awk '
  NR==1{
    print
    next
  }
  {
    i=$1 SUBSEP $2
    if(i in A) {
      if(B[i]!=$4 && C[i]!=$4) N[i]
    }
    else {
      A[i]=$0
      B[i]=$4
      C[i]=substr($4,2) substr($4,1,1)
    }
  }
  END {
    for(i in A)
      if(!(i in N))
        print A[i]
  }
' file

# 6  
Old 01-12-2015
Hi senhia83,

Thank you. Your script may be problematic, since the file is read twice.

Hi Scrutinizer, will your script work on unsorted data? My file is 23 GB, and when I try to use

Code:
sort -k2,2 -k1,1

I'm getting

Code:
sort: write failed: /tmp/sortmcTb: No space left on device




Any alternative suggestions on how to prepare the data?

Also, I don't have a header line; should I just leave out this part?

Code:
NR==1{
    print
    next
  }


Last edited by ritakadm; 01-13-2015 at 12:13 AM..
# 7  
Old 01-13-2015
Yes, Scrutinizer's script works with unsorted data.
Yes, that section prints the first line and then skips to the next line, so leave it out if you have no header.
The script uses a lot of memory, because it stores one row per group and only prints them in the END section, after reading the whole file.
Senhia's script reads the file twice, which takes time, but it may use less memory.
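Sorting is not required for these scripts, but if a sorted copy is wanted anyway, the "No space left on device" error usually means /tmp is too small for sort's temporary files. A minimal sketch, assuming a sort that accepts `-T` (GNU and BSD sort do); the function name and the paths are placeholders:

```shell
# sort_with_scratch DIR FILE: same sort keys as in the question,
# but temporary files go to DIR instead of /tmp.
# DIR stands for any directory on a filesystem with enough free space.
sort_with_scratch() {
  mkdir -p "$1" && sort -T "$1" -k2,2 -k1,1 "$2"
}

# Example call (paths are hypothetical):
#   sort_with_scratch /bigdisk/tmp file > file.sorted
```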

---------- Post updated at 01:14 PM ---------- Previous update was at 12:24 PM ----------

The following is a little improvement regarding memory consumption and speed:
Code:
awk '
{
  i=$1 FS $2
  if (i in A) {
    if (!((i,$4) in B)) {A[i]=""}
  } else {
    A[i]=$0
    B[i,$4]; B[i,substr($4,2,1) substr($4,1,1)]
  }
}
END {for (i in A) if (A[i]!="") print A[i]}
' file

Doesn't need a sorted file.

---------- Post updated at 01:59 PM ---------- Previous update was at 01:14 PM ----------

The following uses minimum memory but is a little slower:
Code:
awk '
{
  i=$1 FS $2
  if (i in A) {
    if (B[i]!=$4 && B[i]!=substr($4,2) substr($4,1,1)) A[i]=""
  } else {
    A[i]=$3
    B[i]=$4
  }
}
END {for (i in A) if (A[i]!="") print i,A[i],B[i]}
' file
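Both requests from the first post can also be combined in one pass, as suggested earlier in the thread. The following is only a sketch under a few assumptions: no header line, geno is always field 4, and the function name `consistent_groups` is made up. It normalizes each geno (TA becomes AT) before comparing, so the kept row is printed with the normalized value.

```shell
# consistent_groups: hypothetical helper; rows are "line part serial geno".
consistent_groups() {
  awk '
  {
    # Canonical geno: two letters in alphabetical order (TA -> AT).
    g = $4
    if (length(g) == 2 && substr(g, 1, 1) > substr(g, 2, 1))
      g = substr(g, 2, 1) substr(g, 1, 1)

    key = $1 FS $2
    if (key in geno) {
      if (geno[key] != g) bad[key] = 1   # mixed genos: mark the block
    } else {
      geno[key] = g                      # first row of this block
      serial[key] = $3
    }
  }
  END {
    for (key in geno)
      if (!(key in bad)) print key, serial[key], geno[key]
  }' "$@"
}

# Sample rows from the first post; sort only makes the output order stable.
consistent_groups <<'EOF' | sort
ax211 part1 1234 AA
gf345 part1 1345 TT
gf345 part1 3456 AA
gf345 part1 1346 TT
ax211 part3 1634 AA
gf345 part3 34256 TA
gf345 part3 13446 AT
EOF
```

Memory use grows with the number of distinct 'line', 'part' blocks, not with the number of rows, since only the first serial and geno of each block are stored.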


Last edited by MadeInGermany; 01-13-2015 at 02:20 PM..
 