Selecting and deleting specific lines with a condition

09-08-2011

I have a set of data as below:

Quote:

HBOND SUMMARY
output to file HB_lowLyo_D_lipid_A_water_001_064.tbl,
data was sorted, intra-residue interactions are NOT included,
Distance cutoff is 4.00 angstroms, angle cutoff is 120.00 degrees
Hydrogen bond information dumped for occupancies > 0.00

DONOR ACCEPTORH ACCEPTOR
atom# res@atom atom# res@atom atom# res@atom %occupied distance angle
| 4645 58@O12 | 23489 1174@H1 23488 1174@O | 22.79 2.945 ( 0.28) 26.79 (14.41)
| 4645 58@O12 | 23490 1174@H2 23488 1174@O | 22.49 2.965 ( 0.31) 28.01 (14.47)
| 2701 34@O12 | 23333 1122@H1 23332 1122@O | 20.60 2.965 ( 0.23) 30.07 (14.18)
| 2701 34@O12 | 23334 1122@H2 23332 1122@O | 19.74 2.963 ( 0.23) 31.43 (13.88)
| 271 4@O12 | 23334 1122@H2 23332 1122@O | 19.70 2.825 ( 0.19) 21.92 (12.15)
| 271 4@O12 | 23333 1122@H1 23332 1122@O | 19.55 2.826 ( 0.19) 22.22 (12.71)
| 4655 58@O16 | 21156 396@H2 21154 396@O | 19.43 2.933 ( 0.22) 31.95 (15.18)
| 4658 58@O15 | 21156 396@H2 21154 396@O | 18.96 3.163 ( 0.27) 37.03 (14.63)
| 4310 54@O26 | 23202 1078@H2 23200 1078@O | 18.73 2.821 ( 0.24) 25.87 (13.92)
| 4655 58@O16 | 21155 396@H1 21154 396@O | 18.63 2.917 ( 0.22) 31.91 (15.00)
| 1820 23@O16 | 21167 400@H1 21166 400@O | 18.14 2.910 ( 0.22) 27.20 (13.87)
| 1820 23@O16 | 21168 400@H2 21166 400@O | 17.96 2.907 ( 0.21) 26.69 (13.86)
| 3845 48@O16 | 23454 1162@H2 23452 1162@O | 17.68 2.991 ( 0.31) 28.45 (14.88)
| 4658 58@O15 | 21155 396@H1 21154 396@O | 17.31 3.177 ( 0.27) 38.82 (14.69)
| 3845 48@O16 | 23453 1162@H1 23452 1162@O | 17.29 3.016 ( 0.32) 28.84 (14.57)
| 1489 19@O13 | 23201 1078@H1 23200 1078@O | 16.66 2.884 ( 0.23) 31.39 (15.56)
| 3824 48@O26 | 21099 377@H2 21097 377@O | 15.44 2.992 ( 0.30) 30.78 (15.01)
| 4253 53@O15 | 23454 1162@H2 23452 1162@O | 14.98 2.961 ( 0.27) 33.71 (15.09)
| 1459 19@O22 | 23201 1078@H1 23200 1078@O | 14.84 3.012 ( 0.33) 35.08 (16.12)
| 1081 14@O12 | 21173 402@H1 21172 402@O | 14.76 2.937 ( 0.24) 27.54 (14.26)
| 4253 53@O15 | 23453 1162@H1 23452 1162@O | 14.63 2.955 ( 0.25) 33.68 (15.11)
| 1081 14@O12 | 21174 402@H2 21172 402@O | 14.41 2.944 ( 0.25) 28.34 (14.35)
| 3824 48@O26 | 21098 377@H1 21097 377@O | 13.70 3.002 ( 0.30) 31.00 (15.21)
| 3845 48@O16 | 21156 396@H2 21154 396@O | 13.06 2.934 ( 0.26) 27.71 (14.05)
.
.
.
few thousand lines

The first field, $1, is "|".
The part before "@" in the 3rd field ($3) and the 6th field ($6) of my data file is a molecule number, and the "number-molecule" arrangement is as below:

HTML Code:

   1    2   3   4   5   6   7       8

   9    10  11  12  13  14  15      16
  17    18  19  20  21  22  23      24
  25    26  27  28  29  30  31      32
  33    34  35  36  37  38  39      40
  41    42  43  44  45  46  47      48
  49    50  51  52  53  54  55      56

  57    58  59  60  61  62  63      64

Each line of the data file pairs one of these numbers in the 3rd field with another in the 6th field.

I want to select the pairs in the data file that are made only of numbers from the outermost lines (the border) of the above number-molecule ordering.

In short, ANY PAIR made only of the numbers

HTML Code:

 (1 2 3 4 5 6 7 8   57 58 59 60 61 62 63 64   9 17 25 33 41 49 57   8 16 24 32 40 48 56 64)

in other words

1 , 2
1 , 3
1 , 4
.
.
1 , 57
1 , 58
1 , 59
.
.
.
2, 1
2, 3
2, 4
2, 5
.
.
.
2, 57
2, 58
2, 59
.
.
.

need to be deleted from the data file.

To achieve this I wrote the awk script below. As a first test it should print the lines that I want to delete, but it fails to select those lines.

Code:

 #!/usr/bin/awk -f

 BEGIN  {
   i=0
   for (n=1; n<=8; n++) set[i++] = n;
   for (n=57; n<=64; n++) set[i++] = n;
   for (n=9; n<=49; n+=8) {set[i++] = n; set[i++] = n+7};
    }


 ($1== "|") {
     split($3, res1, "@"); split($6, res2, "@"); #print res1[1], res2[1]

     if ( (res1[1] in set) == (res2[1] in set) ); 

     {
       print;
      }

 }

Can I get any help with this?

Thanks in advance

vjramana

09-08-2011

Generally, "delete" means writing a new file without the unwanted lines. I am a bit confused about the objective. It may be a multi-pass job: collect information, rearrange it to decide what to do, and then apply those results to the original. Where did it get hard?

DGPickett

09-08-2011

I'm afraid it's hard to understand your problem. Could you show which rows you expect to keep or delete (and please explain once more why) for this test set:

Code:

awk '/^\|/ {print $3, $6}' INPUTFILE
58@O12 1174@H1
58@O12 1174@H2
34@O12 1122@H1
34@O12 1122@H2
4@O12 1122@H2
4@O12 1122@H1
58@O16 396@H2
58@O15 396@H2
54@O26 1078@H2
58@O16 396@H1
23@O16 400@H1
23@O16 400@H2
48@O16 1162@H2
58@O15 396@H1
48@O16 1162@H1
19@O13 1078@H1
48@O26 377@H2
53@O15 1162@H2
19@O22 1078@H1
14@O12 402@H1
53@O15 1162@H1
14@O12 402@H2
48@O26 377@H1
48@O16 396@H2

yazu

09-08-2011

Here is another set of data, for clarity.

Quote:

DONOR ACCEPTORH ACCEPTOR
atom# res@atom atom# res@atom atom# res@atom %occupied distance angle
| 4726 59@O12 | 1487 19@H12 1486 19@O12 | 85.66 2.819 ( 0.18) 21.85 (12.11)
| 1499 19@O15 | 1730 24@H12 1729 24@O12 | 83.15 3.190 ( 0.31) 22.36 (12.73)
| 1216 16@O22 | 1460 17@H22 1459 17@O22 | 75.74 2.757 ( 0.14) 24.55 (13.66)
| 4232 53@O25 | 4143 52@H24 4142 52@O24 | 74.35 2.916 ( 0.25) 28.27 (13.26)
| 3683 46@O16 | 4163 52@H13 4162 52@O13 | 73.78 2.963 ( 0.29) 23.65 (14.14)
| 4162 52@O13 | 4079 51@H12 4078 51@O12 | 73.68 2.841 ( 0.19) 21.25 (11.87)
| 3764 47@O16 | 3825 48@H26 3824 48@O26 | 70.52 2.973 ( 0.28) 26.88 (13.14)
| 193 3@O13 | 353 5@H12 352 5@O12 | 67.49 2.780 ( 0.17) 17.85 (10.90)
| 3035 38@O16 | 3350 42@H12 3349 42@O12 | 67.19 2.790 ( 0.16) 18.72 (10.47)
| 686 9@O16 | 893 12@H22 892 12@O22 | 66.87 2.905 ( 0.22) 26.53 (10.90)
| 1478 19@O25 | 1703 22@H22 1702 22@O22 | 64.37 2.864 ( 0.21) 31.87 (14.12)
| 3521 44@O16 | 747 10@H26 746 10@O26 | 63.71 2.941 ( 0.27) 26.82 (13.51)
| 1313 17@O26 | 1217 16@H22 1216 16@O22 | 63.09 2.807 ( 0.16) 22.23 (11.92)
| 4159 52@O12 | 3684 46@H16 3683 46@O16 | 62.43 2.900 ( 0.22) 35.69 (12.23)
| 4331 54@O16 | 1490 19@H13 1489 19@O13 | 61.80 2.989 ( 0.29) 26.58 (14.32)
| 3440 43@O16 | 3906 49@H26 3905 49@O26 | 60.17 2.964 ( 0.28) 28.61 (13.24)
| 1334 17@O16 | 1247 16@H13 1246 16@O13 | 59.31 2.828 ( 0.18) 25.35 (12.61)
| 1729 22@O12 | 1557 20@H26 1556 20@O26 | 58.11 3.036 ( 0.27) 32.81 (11.84)
| 4151 52@O25 | 4484 56@H12 4483 56@O12 | 57.67 2.917 ( 0.32) 27.71 (15.02)
| 1502 19@O11 | 1730 22@H12 1729 22@O12 | 57.53 3.184 ( 0.26) 41.62 (13.24)
| 3014 38@O26 | 3353 42@H13 3352 42@O13 | 57.42 2.884 ( 0.24) 22.59 (12.87)
| 3524 44@O15 | 3917 49@H12 3916 49@O12 | 57.35 3.227 ( 0.35) 25.52 (13.61)
| 2390 30@O15 | 2756 35@H22 2755 35@O22 | 57.28 3.074 ( 0.33) 31.27 (14.44)
| 1739 22@O16 | 5115 64@H24 5114 64@O24 | 56.78 2.876 ( 0.28) 20.94 (13.42)
| 4574 57@O16 | 5061 63@H16 5060 63@O16 | 56.57 2.956 ( 0.25) 30.52 (14.00)
| 2846 36@O24 | 3566 45@H22 3565 45@O22 | 55.92 2.880 ( 0.24) 22.85 (12.39)
| 605 8@O16 | 839 11@H12 838 11@O12 | 55.67 2.894 ( 0.24) 25.45 (13.25)

In the first line, the residue number in field 3 ($3) is 59 and the residue number in field 6 ($6) is 19. According to the number-molecule arrangement, 59 is in the outermost lines and 19 is not. So this line should NOT be deleted.

In the second line, the number in field 3 ($3) is 19 and the number in field 6 ($6) is 24. Number 19 is not in the outermost lines but 24 is. This line should also not be deleted, since the two numbers are NOT both in the outermost lines.

In the third line, the number in field 3 ($3) is 16 and the number in field 6 ($6) is 17. Since both numbers of this pair belong to the outermost lines, this line should be deleted.

So a line should be deleted only when both of its numbers are in the outermost lines. This is what I need to achieve, and my code simply does not work as I wanted.

Thanks in advance.

vjramana

09-09-2011

So, this is a negative join. Seems like his approach should be good: you need to save the outer numbers in an array, and then as you go through the lines, look them up and decide whether to copy each line. You could use while read in ksh/bash and put @ in IFS to split that field into two. You could also test mathematically whether a number is in the left or right column: (( (N%8) < 2 )).

What about when fields 6 and 8 do not match? No difference?
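
A rough sketch of the while-read idea, assuming a POSIX-style shell. The function names is_edge and copy_unless_both_edge are made up for illustration, and the 1..64 range check is an assumption based on the 8x8 arrangement in the first post (without it, a water residue such as 400 would pass the modulus test):

```shell
# is_edge N: succeeds when N is a border number of the 8x8 grid numbered 1..64
is_edge() {
    [ "$1" -ge 1 ] && [ "$1" -le 64 ] || return 1
    [ "$1" -le 8 ] || [ "$1" -ge 57 ] || [ $(( $1 % 8 )) -lt 2 ]
}

# copy_unless_both_edge: reads the table on stdin, writes the kept lines to stdout
copy_unless_both_edge() {
    while IFS= read -r line; do
        # Split a copy of the line on blanks and "@":
        # "| 4645 58@O12 | 23489 1174@H1 ..." -> | 4645 58 O12 | 23489 1174 H1 ...
        IFS=' @' read -r bar atom1 res1 name1 bar2 atom2 res2 rest <<EOF
$line
EOF
        if [ "$bar" = "|" ] && is_edge "$res1" && is_edge "$res2"; then
            continue            # both residues are on the border: drop the line
        fi
        printf '%s\n' "$line"   # everything else, headers included, is copied
    done
}
```

It could be run as copy_unless_both_edge < INPUTFILE > OUTPUTFILE, where both file names are placeholders. For a few thousand lines a shell loop is likely to be slower than awk.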

Last edited by DGPickett; 09-09-2011 at 06:02 PM..

DGPickett

09-09-2011

Seems to me that you could use modulus to simplify the tests.

Your number x (assumed to be less than or equal to 64?)

if x % 8 = 0 it's in the right hand column
if x % 8 = 1 it's in the left hand column

then you just have the ranges

2<=x<=7

and

58<=x<=63
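
The modulus test could be written as an awk function, roughly as below. This is only a sketch: is_outer and outer_pairs are hypothetical names, and the explicit 1..64 guard is an assumption so that larger residue numbers in field 6 (for example 400, where 400 % 8 is 0) are not treated as outer.

```shell
# outer_pairs: print only the "|" lines whose two residues are both border numbers
outer_pairs() {
    awk '
    # True when x is on the border of the 8x8 layout numbered 1..64.
    function is_outer(x) {
        x += 0                                   # force a numeric comparison
        if (x < 1 || x > 64) return 0            # e.g. water residue 400 has 400 % 8 == 0
        return (x <= 8 || x >= 57 || x % 8 == 0 || x % 8 == 1)
    }
    $1 == "|" {
        split($3, a, "@"); split($6, b, "@")
        if (is_outer(a[1]) && is_outer(b[1])) print
    }'
}

printf '%s\n' \
    '| 1216 16@O22 | 1460 17@H22 1459 17@O22 | 75.74' \
    '| 1820 23@O16 | 21167 400@H1 21166 400@O | 18.14' | outer_pairs
```

With these two sample lines, only the 16/17 line is printed, since 23 is not an outer number.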

---------- Post updated at 05:31 PM ---------- Previous update was at 05:03 PM ----------

Ahhh...just realized, this test is wrong!

Code:

if ( (res1[1] in set) == (res2[1] in set) );

You can't test for the value of the array element this way, only that the subscript exists!

You could set up your array differently: instead of using set[i++], why not use set[n]? Then your test should work, as you would have elements as follows:

Code:

set[1] through set[8]
set[9], set[16]
set[17], set[24]
set[25], set[32]
set[33], set[40]
set[41], set[48]
set[49], set[56]
set[57] through set[64]
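
To see the difference between subscripts and values, here is a small check that builds the array both ways. The names by_index, by_value and in_demo are only for illustration; by_index is filled exactly like the original set[i++] loop.

```shell
in_demo() {
    awk 'BEGIN {
        # by_index: filled like the original script, subscripts are 0, 1, 2, ...
        i = 0
        for (n = 1; n <= 8; n++)   by_index[i++] = n
        for (n = 57; n <= 64; n++) by_index[i++] = n
        for (n = 9; n <= 49; n += 8) { by_index[i++] = n; by_index[i++] = n + 7 }

        # by_value: the outer numbers themselves are the subscripts
        for (k = 0; k < i; k++) by_value[by_index[k]] = 1

        # "in" only asks whether the subscript exists
        printf "19 by_index:%d by_value:%d\n", (19 in by_index), (19 in by_value)
        printf "59 by_index:%d by_value:%d\n", (59 in by_index), (59 in by_value)
    }'
}

in_demo
```

This prints "19 by_index:1 by_value:0" and "59 by_index:0 by_value:1". With set[i++] the subscripts are just 0 to 27, so 19 counts as present and 59 does not, which is the opposite of what the script needs.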

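Two other things I noticed: remove the semicolon after the test, and change "==" to "&&". With the semicolon, the if statement has an empty body and the print block runs for every line; with "==", the test is also true when neither number is in the set.

Code:

if ( (res1[1] in set) && (res2[1] in set) )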

This is the biggest problem your script had other than trying to
use "in" to test the set values instead of the subscripts.

This code worked for me.

Code:

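#!/usr/bin/awk -f

# Build a lookup table keyed by residue number, so that "n in set"
# is true exactly for the outer numbers of the 8x8 arrangement.
BEGIN {
    for (n = 1; n <= 8; n++) set[n] = n         # top row
    for (n = 9; n <= 49; n += 8) {
        set[n] = n                              # left column
        set[n+7] = n + 7                        # right column
    }
    for (n = 57; n <= 64; n++) set[n] = n       # bottom row
}

# Data lines start with "|"; $3 and $6 look like "58@O12".
$1 == "|" {
    split($3, res1, "@"); split($6, res2, "@")
    if ((res1[1] in set) && (res2[1] in set)) {   # <--- no ';' here!
        print
    }
}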

Quote:

# ./udc1.awk < udc1.txt
| 1216 16@O22 | 1460 17@H22 1459 17@O22 | 75.74 2.757 ( 0.14) 24.55 (13.66)
| 193 3@O13 | 353 5@H12 352 5@O12 | 67.49 2.780 ( 0.17) 17.85 (10.90)
| 1313 17@O26 | 1217 16@H22 1216 16@O22 | 63.09 2.807 ( 0.16) 22.23 (11.92)
| 1334 17@O16 | 1247 16@H13 1246 16@O13 | 59.31 2.828 ( 0.18) 25.35 (12.61)
| 4574 57@O16 | 5061 63@H16 5060 63@O16 | 56.57 2.956 ( 0.25) 30.52 (14.00)

udc1.txt was both your first and second examples put together in that order.

Last edited by rwuerth; 09-09-2011 at 07:27 PM..

rwuerth

09-11-2011

Dear sir,

Thanks so much for your kind reply. The code now works perfectly for my needs. I have one related question. I put "print" at the end of the code to check that it selects exactly the lines I do not want. Now, if I want to delete those selected lines, what command should I use?
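
One possible way to do the deletion, offered as a sketch and not checked against the full data file: awk does not edit the file in place, so the usual approach is to invert the selection, write everything else to a new file, and replace the original afterwards. The function name drop_outer_pairs and the file names in the note below are placeholders.

```shell
# drop_outer_pairs: copy stdin to stdout, leaving out the lines in which
# both residue numbers are outer numbers of the 8x8 arrangement.
drop_outer_pairs() {
    awk '
    BEGIN {
        for (n = 1; n <= 8; n++)   set[n] = 1     # top row
        for (n = 57; n <= 64; n++) set[n] = 1     # bottom row
        for (n = 9; n <= 49; n += 8) {            # left and right columns
            set[n] = 1
            set[n+7] = 1
        }
    }
    $1 == "|" {
        split($3, res1, "@"); split($6, res2, "@")
        if ((res1[1] in set) && (res2[1] in set)) next   # skip this line
    }
    { print }                                     # keep everything else
    '
}
```

For example: drop_outer_pairs < hbond.tbl > hbond_filtered.tbl. Header lines that do not start with "|" are passed through unchanged.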

vjramana
