Molecular biologist requires help re: search / replace script Post: 302182845


Post 302182845 by gstuart on Monday 7th of April 2008 03:21:07 PM


Hello - I was wondering if someone could help me? I have some basic knowledge of awk, etc., and can create simple scripts (e.g. a search_replace.awk file) that can be called from the command line:

$ awk -f search_replace.awk <file to be searched>

I have a tab-delimited table of data (text), essentially as follows (for simplicity),

a pp b
a pp c
a pp d
a pp e
a pp b
a pp e
a gi b
a pp a
b pp a
d pp a
t gi u
t gi v
t gi w
t gi x
t gi y
t gi z
z gi t
y gi t
v gi t
y gi t
t pp z

I want to be able to take each line, in succession, and search it against the entire file, removing duplicates. I know that I can easily do this using the uniq command (on a *sorted* file), but I also need to be able to identify mirror-image or reverse duplicates, e.g.

a pp b
a pp b
a pp b
b pp a
a pp b
b pp a

should be reduced to a single line,

a pp b

(since "b pp a" is 'the same' as "a pp b").

Is this clear?

Additionally, my actual file contains additional columns (fields, per row); I would like to ignore (but keep) these additional fields, just searching and replacing based on the (in the example above) fields $1, $2, $3. I think that it is possible to specify fields with regard to search / replace operations, etc.
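A minimal sketch of one way to approach this in awk, not a definitive solution: build a key from fields $1, $2, $3 with the two gene names placed in a fixed order, so that a line and its mirror image share the same key, then print only the first line seen for each key. The function name dedupe_mirror is hypothetical, and the sketch assumes the file really is tab-delimited. Extra columns are kept, but only those of the first occurrence survive.

```shell
# Hypothetical helper: drop exact and mirror-image duplicates,
# comparing only fields 1-3 of a tab-delimited file.
dedupe_mirror() {
  awk -F'\t' '
    {
      # Force a string comparison, then put the two gene names in a
      # fixed order so "b pp a" and "a pp b" produce the same key.
      if (($1 "") <= ($3 ""))
        key = $1 FS $2 FS $3
      else
        key = $3 FS $2 FS $1

      # Print the whole line (extra columns included) the first time
      # its key appears; skip every later occurrence.
      if (!(key in seen)) {
        seen[key] = 1
        print
      }
    }' "$@"
}
```

It could be called as `dedupe_mirror interactions.tsv > unique.tsv`, where both file names are placeholders. Because the relationship in $2 is part of the key, "a pp b" and "a gi b" are treated as different lines, and no prior sorting is needed.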

Lastly (I know that I am asking a lot), it would be ideal if the output could also keep track of how many duplicated lines there were, adding a column of "weights" (1; 2; 3; 4; etc.) indicating the number of occurrences in the source file, with 1 = no duplicates, 2 = one duplicate, etc.

In the six-line example above, this would be

6 a pp b
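The weight column could be produced along these lines; again this is only a sketch under the same assumptions (tab-delimited input, hypothetical function name count_mirror). It counts every line per order-independent key, remembers the first line seen for each key, and prints the count in front of that line once the whole file has been read.

```shell
# Hypothetical helper: collapse exact and mirror-image duplicates and
# prefix each surviving line with the number of times it occurred.
count_mirror() {
  awk -F'\t' -v OFS='\t' '
    {
      # Same order-independent key as before, built from fields 1-3.
      if (($1 "") <= ($3 ""))
        key = $1 FS $2 FS $3
      else
        key = $3 FS $2 FS $1

      # Remember the first line and the order of first appearance.
      if (!(key in count)) {
        order[++n] = key
        first[key] = $0
      }
      count[key]++
    }
    END {
      # Emit "weight <TAB> original line" in first-seen order.
      for (i = 1; i <= n; i++)
        print count[order[i]], first[order[i]]
    }' "$@"
}
```

The whole table of keys is held in memory, which should be unproblematic for a file of a few hundred thousand lines. The weight is written as a new first column here; moving it to the end would only require swapping the two arguments of the final print.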

I have played around with the command line and some simple scripts, but this is a little beyond my grasp. I'm guessing one solution would be a grep operation, piped to / from an awk or sed command, perhaps?

FYI, I am a molecular biologist / geneticist; I am trying to sort a file of perhaps 150,000-200,000 lines, each containing 7-8 fields, for loading into a data visualization / analysis program. In the example above, the first and third columns represent specific genes, with the middle (2nd) column establishing the relationship between the first and the second gene. Note that the relationship "pp" is different from "gi", thus

a pp b

is different from

a gi b

The reason for all of this is that if I do not remove duplicate mappings (including the reverse or "mirror images," e.g. "a pp b" = "b pp a"), then I get extra lines appearing in my analysis program (Cytoscape), which complicate the display (relationships between groups of genes). The reason that I asked for the "weights" is that I want to weight the edges (lines) connecting my nodes (genes) in Cytoscape according to how many times this relationship was reported, from various assays (different types of experiments; independent analyses).

If anyone could suggest some solutions, that would be *very* much appreciated!

Thanking you all in advance,

Sincerely, Greg S. :-)

gstuart


