Forum: Shell Programming and Scripting (post questions about KSH, CSH, SH, BASH, PERL, PHP, SED, AWK and other shell scripts and shell scripting languages here)
Molecular biologist requires help re: search / replace script
Monday April 07, 2008
Hello - I was wondering if someone could help me?

I have some basic knowledge of awk, etc., and can create simple scripts (e.g. a search_replace.awk file) that can be called from the command line:

    $ awk -f search_replace.awk <file to be searched>

I have a tab-delimited table of data (text), essentially as follows (for simplicity):

    a pp b
    a pp c
    a pp d
    a pp e
    a pp b
    a pp e
    a gi b
    a pp a
    b pp a
    d pp a
    t gi u
    t gi v
    t gi w
    t gi x
    t gi y
    t gi z
    z gi t
    y gi t
    v gi t
    y gi t
    t pp z

I want to be able to take each line, in succession, and search it against the entire file, removing duplicates. I know that I can easily do this using the uniq command (on a *sorted* file), but I also need to be able to identify mirror-image or reverse duplicates, e.g.

    a pp b
    a pp b
    a pp b
    b pp a
    a pp b
    b pp a

should be reduced to a single line,

    a pp b

(since "b pp a" is 'the same' as "a pp b"). Is this clear?

Additionally, my actual file contains additional columns (fields, per row); I would like to ignore (but keep) these additional fields, searching and replacing based only on (in the example above) fields $1, $2, $3. I think that it is possible to specify fields with regard to search / replace operations, etc.

Lastly (I know that I am asking a lot), it would be ideal if the output could also keep track of how many duplicated lines there were, adding a column of "weights" (1; 2; 3; 4; etc.) indicating the number of times each line occurred in the source file, with 1 = no duplicates, 2 = one duplicate, etc. In the six-line example above, this would be

    6 a pp b

I have played around with the command line and some simple scripts, but this is a little beyond my grasp. I'm guessing one solution would be a grep operation, piped to / from an awk or sed command, perhaps?

FYI, I am a molecular biologist / geneticist; I am trying to sort a file of perhaps 150,000-200,000 lines, each containing 7-8 fields, for loading into a data visualization / analysis program.
In the example above, the first and third columns represent specific genes, with the middle (2nd) column establishing the relationship between the first and the second gene. Note that the relationship "pp" is different from "gi", thus "a pp b" is different from "a gi b".

The reason for all of this is that if I do not remove duplicate mappings (including the reverse or "mirror images," e.g. "a pp b" = "b pp a"), then I get extra lines appearing in my analysis program (Cytoscape), which complicates the display (relationships between groups of genes). The reason that I asked for the "weights" is that I want to weight the edges (lines) connecting my nodes (genes) in Cytoscape according to how many times each relationship was reported from various assays (different types of experiments; independent analyses).

If anyone could suggest some solutions, that would be *very* much appreciated!

Thanking you all in advance,

Sincerely, Greg S. :-)
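One possible approach, offered as a minimal sketch rather than a definitive answer: a single awk pass that builds a direction-independent key from fields $1, $2, $3 and counts how often each key occurs. It assumes the file is strictly tab-delimited, that the weight should go in a new first column (as in the "6 a pp b" example), and that the line to keep is the first occurrence exactly as it appeared, extra fields included. The function name dedupe_edges is made up for illustration.

```shell
# Collapse duplicate and mirror-image rows, prefixing each kept row with its count.
# Reads the named files, or standard input when no file is given.
dedupe_edges() {
  awk -F'\t' -v OFS='\t' '
    {
      # Build a canonical key: put the two gene names in string order so that
      # "b pp a" and "a pp b" map to the same key. The relationship ($2) is
      # part of the key, so "a pp b" and "a gi b" stay distinct.
      # Appending "" forces a string (not numeric) comparison.
      if (($1 "") <= ($3 ""))
        key = $1 FS $2 FS $3
      else
        key = $3 FS $2 FS $1

      # Remember the first full line seen for each key, and its position.
      if (!(key in count)) {
        order[++n] = key
        first[key] = $0
      }
      count[key]++
    }
    END {
      # Print in order of first appearance: weight, then the original line.
      for (i = 1; i <= n; i++)
        print count[order[i]], first[order[i]]
    }
  ' "$@"
}
```

Usage would look something like `dedupe_edges interactions.tsv > edges_weighted.tsv` (both file names are placeholders). Because everything is held in awk arrays, no prior sort is needed, and a file of 150,000-200,000 lines should fit comfortably in memory. If the weight is wanted as the last column instead, swapping the two arguments of the final print should be enough.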
Tags: linux, ubuntu