Help in awk/bash

01-05-2013

Registered User

50, 0

Join Date: Dec 2012

Last Activity: 12 August 2013, 3:07 AM EDT

Posts: 50

Thanks Given: 52

Thanked 0 Times in 0 Posts

I have uploaded part of the first file and the full second file. I have posted real values for the second file, but the first file is very big.

Last edited by bioinfo; 01-05-2013 at 04:54 PM..

bioinfo


01-05-2013

Registered User

12,315, 4,560

Join Date: Jul 2012

Last Activity: 22 November 2019, 4:29 PM EST

Location: San Jose, CA, USA

Posts: 12,315

Thanks Given: 952

Thanked 4,560 Times in 3,818 Posts

Quote:

Originally Posted by RudiC

As much as I want to help, I am sorry to say I can't. Thank you for the effort of explaining your input in detail, but post #4 does not relate to post #1 by any means. E.g. group No. 10 being centered at 052 here and 051 there, having 31 branches here and 30 there, groups showing up here not showing up there and vice versa, and groups in file2 not being represented in file 1.
On top of that, I still can't see what pattern to fill in (see my post #2), where to get it, or based on what rule, even if I take file g.txt to be a distilled version of file1 and file2.
It would be helpful if you posted a minimum number of input files (e.g. atoms.txt and g.txt) with interrelating data, an output file, and a set of understandable rules on how to get from one to the other.

You said you had two files: atom.txt and g.txt. I am assuming that atom.txt is in the same format as 11.txt in your last thread with the same title as this thread. You have not given us anything that includes even a single complete line (after the header line) from the file g.txt. And you have not shown us what you want to appear in G10.txt, or in any other G*.txt file, that we can match against what you have shown us from atom.txt.

With the data you gave us in message #4 in this thread, the First file gives us an indication of what might appear in g.txt for groups 0 through 5, but none of them are listed in g.txt in message #1 nor in the Second file in message #4 in this thread.

If you don't give us coherent sample data together with sample output that matches it, it is EXTREMELY hard to figure out what you want. I think I'm close to figuring out what you want done and expect to post something later this afternoon. But I have no confidence that it will be at all close to what you want, because the specification of what you want is so vague. And you haven't given us sample input and output that we can use to determine if a possible solution we might develop does what you want done.

Don Cragun


01-05-2013


Quote:

You said you had two files: atom.txt and g.txt. I am assuming that atom.txt is in the same format as 11.txt in your last thread with the same title as this thread.

Yes, atom.txt is the same as 11.txt. While posting in the new thread I just used a new name. I am explaining my problem again with more detail and more concise data. I have two files: atom.txt (or 11.txt from the other thread) and g.txt (which I made using data from the raw files file 1 and file 2). If you feel that it will be easier to retrieve data directly from file 1 and file 2 rather than using g.txt for retrieving patterns from atom.txt, I will be happy to go for that too.

g.txt (made more concise and short; in reality this file has 10 groups selected from the more than 600 groups in file 1. The groups in g.txt are chosen by decreasing number of branches, but I am showing only 2 here)

Code:

Group   Centre      Branches       Id_of_Branches
 3       006          6         009,004,008,007,005,006
 5       012          2         012,013

file 1:

Code:

Group: 0 Number of Branches: 1
0    001
Centre: 001 Branches: 1
Group: 1 Number of Branches: 1
0    002
Centre: 002 Branches: 1
Group: 2 Number of Branches: 1
0    003
Centre: 003 Branches: 1
Group: 3 Number of Branches: 6
0    009
1    004
2    008
3    007
4    005
5    006
Centre: 006 Branches: 6
Group: 4 Number of Branches: 2
0    010
1    011
Centre: 010 Branches: 2
Group: 5 Number of Branches: 2
0    012
1    013
Centre: 012 Branches: 2
... (continues for more than 600 groups)

file2:

Code:

Group No:
 3        Centre: 006 Branches: 6                   
 5        Centre: 012 Branches: 2
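Since deriving the data directly from file 1 and file 2 was offered as an alternative to the hand-made g.txt, here is a minimal sketch of how the g.txt rows could be generated from the two raw files. It assumes the layouts shown above, and the names file1.txt and file2.txt are hypothetical. The Group numbers listed in file 2 select which groups from file 1 are kept, and the output fields are separated by single spaces rather than aligned columns.

```shell
# Run in an empty scratch directory: this writes file1.txt, file2.txt and g.txt.
# file1.txt and file2.txt are hypothetical names for "file 1" and "file2" above.
cat > file1.txt <<'EOF'
Group: 2 Number of Branches: 1
0    003
Centre: 003 Branches: 1
Group: 3 Number of Branches: 6
0    009
1    004
2    008
3    007
4    005
5    006
Centre: 006 Branches: 6
Group: 4 Number of Branches: 2
0    010
1    011
Centre: 010 Branches: 2
Group: 5 Number of Branches: 2
0    012
1    013
Centre: 012 Branches: 2
EOF

cat > file2.txt <<'EOF'
Group No:
 3        Centre: 006 Branches: 6
 5        Centre: 012 Branches: 2
EOF

awk '
BEGIN { print "Group Centre Branches Id_of_Branches" }
FNR == NR {
    # First file (file2.txt): remember the wanted Group numbers,
    # skipping the "Group No:" header line and any empty lines.
    if (FNR > 1 && NF) want[$1] = 1
    next
}
# Second file (file1.txt): a "Group:" line starts a new group.
$1 == "Group:"  { gid = $2; ids = ""; next }
# A "Centre:" line ends the group; print it only if file2.txt listed it.
$1 == "Centre:" {
    if (gid in want) printf "%s %s %s %s\n", gid, $2, $4, ids
    next
}
# Branch lines look like "<index> <id>"; collect the ids, comma separated.
NF == 2 { ids = (ids == "" ? $2 : ids "," $2) }
' file2.txt file1.txt > g.txt

cat g.txt
```

Because file 2 is read first, the sketch holds only the list of wanted group numbers in memory, so the size of file 1 should not matter much.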

Required output:
For each value of Id_of_Branches in g.txt, I wish to retrieve the corresponding pattern from atom.txt.
Therefore, for this sample data, I require 3 output files: 2 files containing the patterns for all IDs in each of the 2 groups, and a 3rd file containing the patterns for the Centre ID of every group:

Code:

(1) g3.txt
#009
ATOM 1 N SER A 1 35.092 83.194 140.076 1.00 0.00 N 
ATOM 2 CA SER A 1 35.216 83.725 138.725 1.00 0.00 C 
TER
ENDMDL
#004
ATOM 1 N SER A 1 34.092 83.194 140.076 1.00 0.00 N 
ATOM 2 CA SER A 1 35.216 83.725 138.725 1.00 0.00 C 
TER
ENDMDL
#008
ATOM 1 N SER A 1 45.092 83.194 140.076 1.00 0.00 N 
ATOM 2 CA SER A 1 35.216 83.725 138.725 1.00 0.00 C 
TER
ENDMDL
#007
ATOM 1 N SER A 1 50.092 83.194 140.076 1.00 0.00 N 
ATOM 2 CA SER A 1 65.216 83.725 138.725 1.00 0.00 C 
TER
ENDMDL
#005
ATOM 1 N SER A 1 90.092 83.194 140.076 1.00 0.00 N 
ATOM 2 CA SER A 1 89.216 83.725 138.725 1.00 0.00 C 
TER
ENDMDL
#006
ATOM 1 N SER A 1 67.092 83.194 140.076 1.00 0.00 N 
ATOM 2 CA SER A 1 23.216 83.725 138.725 1.00 0.00 C 
TER
ENDMDL

(2) g5.txt
#012
ATOM 1 N SER A 1 37.092 83.194 140.076 1.00 0.00 N 
ATOM 2 CA SER A 1 37.216 83.725 138.725 1.00 0.00 C 
TER
ENDMDL
#013
ATOM 1 N SER A 1 40.092 83.194 140.076 1.00 0.00 N 
ATOM 2 CA SER A 1 31.216 83.725 138.725 1.00 0.00 C 
TER
ENDMDL

(3) Centre.txt (For Id from centre of all groups)
#006
ATOM 1 N SER A 1 67.092 83.194 140.076 1.00 0.00 N 
ATOM 2 CA SER A 1 23.216 83.725 138.725 1.00 0.00 C 
TER
ENDMDL
#012
ATOM 1 N SER A 1 37.092 83.194 140.076 1.00 0.00 N 
ATOM 2 CA SER A 1 37.216 83.725 138.725 1.00 0.00 C 
TER
ENDMDL

I hope this makes my problem clearer.

bioinfo


01-05-2013


Hi bioinfo. The awk script I had been testing against your earlier messages didn't work with the new details you provided in message #10 in this thread. (The output filenames changed from Gx to gx, where x is a one- to three-digit string; the list of branches changed from comma and space separators to just comma separators; and I was guessing completely wrong about what you wanted in one of the output files.) I think the script below does what you want. It is LONG, but the vast majority of it is just comments. Hopefully they will help you figure out how it works:

Code:

awk '
# All data is assumed to meet the requirements stated below, so this script
# does not perform any data verification.  If any data fails to meet these
# assumptions, results are unspecified.
BEGIN {
    # Initialize variables that do not have default values set by awk.
    cf = "Centre.txt"
    rc = "001"
}
FNR == NR {
    # Process lines from atom.txt.  Assumed format is that each entry in this
    # file is a multiple line value with the final line of each entry matching
    # the ERE "^ENDMDL$".  Entries from this file are stored in array r with
    # the index being the entry number (starting with 001).  The variable rc is
    # the index for the value being accumulated.  I use a 3 digit string with
    # leading zero fill to match the format of the Centre-ID and Branch-ID
    # values that will be found in g.txt.
    r[rc] = r[rc] $0 "\n"
    if($0 == "ENDMDL")
        # End of entry found.  Set rc for the next entry to be processed.
        rc = sprintf("%03d", rc + 1)
    next
}
FNR == 1 {
    # Skip the header line on subsequent file(s).  The file g.txt is assumed to
    # be the first such file.  Any number of other files in the same format can
    # be used in addition to or instead of g.txt.
    next
}
{   # Process lines from subsequent files.  Assumed format is:
    #   Group   Centre      Branches              Id_of_Branches
    #   gid     cid         bcnt         bid[1],bid[2],...,bid[bcnt]
    # where gid is a 1-3 digit Group-ID, cid is a 3 digit (zero filled)
    # Centre-ID, bcnt is a count of the number of Branch-IDs to follow, and
    # each bid field is a 3 digit (zero filled) Branch-ID.  The header line
    # has already been discarded.  Commas will be converted to spaces so bid
    # values can be used directly.  It is assumed that each line contains
    # $3 + 3 fields.
    #
    # Create a file named gx.txt (where x is the Group-ID from this line):
    #   Note that it would seem logical to expand x to a 3 digit zero filled
    #   value so the created g* files would sort into Group-ID order, but that
    #   is not what was requested.
    #   One entry from atom.txt (with the entry number determined by the
    #   Branch-ID) will be written to this file for each Branch-ID on this
    #   line.
    #
    # Also create a file named Centre.txt that will contain one entry from
    #   atom.txt (with the entry number determined by the Centre-ID) for each
    #   line processed.
    #   Note: I assume that a Centre-ID is also a Branch-ID and that the value
    #   given as the cid should also appear as one of the Branch-IDs appearing
    #   on each line.
    #
    # Replace commas on input lines with spaces so the Branch-IDs can be used
    # directly without splitting $4 into another array and processing it in a
    # different loop (besides that some descriptions of this input file say
    # elements are comma separated and others say comma-space separated or
    # terminated; this works either way):
    gsub(/,/, " ")
    # Create the g*.txt file for this line.  Uncomment one of the following two
    # lines.  The 1st line provides requested names, the 2nd line creates names
    # that will sort correctly by Group-ID when looking at output by ls and
    # when having the shell match the patterns g*.txt and g???.txt and groups
    # in the list do not all contain the same number of digits.
    gf = "g" $1 ".txt"
    #gf = sprintf("g%03d.txt", $1)
    for(i = 4; i <= NF; i++) printf("#Id %s\n%s", $i, r[$i]) > gf
    close(gf)
    # Add entry to Centre.txt:
    printf("#Id %s\n%s", $2, r[$2]) > cf
}' atom.txt g.txt
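Two idioms in the script are easy to miss when reading the comments, so here is a small sketch that isolates them. The file names pad.out and split.out are made up for this illustration. The first command shows that adding 1 to the string "001" gives the number 2, and that sprintf restores the zero padding. The second shows that calling gsub() on the whole record makes awk re-split it, so the comma-separated list turns into separate fields.

```shell
# Run in a scratch directory: writes pad.out and split.out (hypothetical names).

# 1) Zero-padded counter: "001" + 1 is the number 2; sprintf pads it back to "002".
awk 'BEGIN { rc = "001"; rc = sprintf("%03d", rc + 1); print rc }' > pad.out

# 2) gsub() with no third argument edits $0, which makes awk re-split the fields.
#    Prints: field count before, field count after, first Branch-ID, last Branch-ID.
echo ' 3       006          6         009,004,008,007,005,006' |
awk '{ before = NF; gsub(/,/, " "); print before, NF, $4, $NF }' > split.out

cat pad.out split.out
```

After the substitution the Branch-IDs occupy fields 4 through NF, which is why the script's loop can start at i = 4 without a separate split() call.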

As always, if you're running on a Solaris system, use /usr/xpg4/bin/awk or nawk instead of awk.
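For anyone who wants to try the script on the sample data from this thread, the following is a hedged smoke test rather than part of the original answer. It uses a condensed copy of the script with the comments stripped, and it builds a stand-in atom.txt whose coordinates are made-up filler that only encodes the entry number. Note that the script writes each header as `#Id 009`, whereas the sample output earlier in the thread shows `#009`; the sketch keeps the script's form.

```shell
# Run in an empty scratch directory: writes atom.txt, g.txt, g3.txt, g5.txt
# and Centre.txt.

# Stand-in atom.txt with 13 entries.  The x coordinate (7th field) is made-up
# filler that encodes the entry number so each entry is recognisable.
awk 'BEGIN {
    for (i = 1; i <= 13; i++)
        printf "ATOM 1 N SER A 1 %d.092 83.194 140.076 1.00 0.00 N\nTER\nENDMDL\n", i
}' > atom.txt

# g.txt exactly as posted in the thread.
cat > g.txt <<'EOF'
Group   Centre      Branches       Id_of_Branches
 3       006          6         009,004,008,007,005,006
 5       012          2         012,013
EOF

# Condensed form of the script above (comments stripped, same logic).
awk '
BEGIN { cf = "Centre.txt"; rc = "001" }
FNR == NR {
    r[rc] = r[rc] $0 "\n"
    if ($0 == "ENDMDL") rc = sprintf("%03d", rc + 1)
    next
}
FNR == 1 { next }
{
    gsub(/,/, " ")
    gf = "g" $1 ".txt"
    for (i = 4; i <= NF; i++) printf("#Id %s\n%s", $i, r[$i]) > gf
    close(gf)
    printf("#Id %s\n%s", $2, r[$2]) > cf
}' atom.txt g.txt

# Count the entries written to each output file.
grep -c '^#Id' g3.txt g5.txt Centre.txt
```

With this input, g3.txt should hold six entries in the order given on the g.txt line, g5.txt two entries, and Centre.txt one entry per group.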


Don Cragun


01-06-2013


Thanks. I will try it and let you know.

---------- Post updated at 11:33 PM ---------- Previous update was at 08:18 PM ----------

Yippee. It's working.
Thanks a lot. You are a GENIUS.

bioinfo
