04-07-2008
Molecular biologist requires help re: search / replace script
Monday April 07, 2008
Hello - I was wondering if someone could help me? I have some basic knowledge of awk, etc., and can create simple scripts (e.g. a search_replace.awk file) that can be called from the command line:
$ awk -f search_replace.awk <file to be searched>
I have a tab-delimited table of data (text), essentially as follows (for simplicity),
a pp b
a pp c
a pp d
a pp e
a pp b
a pp e
a gi b
a pp a
b pp a
d pp a
t gi u
t gi v
t gi w
t gi x
t gi y
t gi z
z gi t
y gi t
v gi t
y gi t
t pp z
I want to be able to take each line, in succession, and search it against the entire file, removing duplicates. I know that I can easily do this using the uniq command (on a *sorted* file), but I also need to be able to identify mirror-image or reverse duplicates, e.g.
a pp b
a pp b
a pp b
b pp a
a pp b
b pp a
should be reduced to a single line,
a pp b
(since "b pp a" is 'the same' as "a pp b").
Is this clear?
Additionally, my actual file contains additional columns (fields, per row); I would like to ignore (but keep) these additional fields, just searching and replacing based on the (in the example above) fields $1, $2, $3. I think that it is possible to specify fields with regard to search / replace operations, etc.
Lastly (I know that I am asking a lot), it would be ideal if the output could also keep track of how many duplicated lines there were, adding a column of "weights" (1, 2, 3, 4, etc.) indicating the number of times each line occurred in the source file, with 1 = no duplicates, 2 = one duplicate, etc.
In the six-line example above, this would be
6 a pp b
I have played around with the command line and some simple scripts, but this is a little beyond my grasp. I'm guessing one solution would be a grep operation, piped to / from an awk or sed command, perhaps?
FYI, I am a molecular biologist / geneticist; I am trying to sort a file of perhaps 150,000-200,000 lines, each containing 7-8 fields, for loading into a data visualization / analysis program. In the example above, the first and third columns represent specific genes, with the middle (2nd) column establishing the relationship between the first and the second gene. Note that the relationship "pp" is different from "gi", thus
a pp b
is different from
a gi b
The reason for all of this is that if I do not remove duplicate mappings (including the reverse or "mirror images," e.g. "a pp b" = "b pp a"), then I get extra lines appearing in my analysis program (Cytoscape), which complicate the display (relationships between groups of genes). The reason that I asked for the "weights" is that I want to weight the edges (lines) connecting my nodes (genes) in Cytoscape according to how many times this relationship was reported, from various assays (different types of experiments; independent analyses).
If anyone could suggest some solutions, that would be *very* much appreciated!
Thanking you all in advance,
Sincerely, Greg S. :-)
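A minimal sketch of one way to approach the request above, using awk alone rather than grep piped to sed. It assumes the file is tab-delimited with the two gene names in fields $1 and $3 and the relationship in $2, as in the example. The function name dedup_mirror and the array names are made up for illustration. The idea is to build a key in which the two gene names are always in the same (sorted) order, so that "a pp b" and "b pp a" collapse to one key, while "a gi b" stays separate because the relationship is part of the key. The first row seen for each key is kept whole, extra fields included, and a count is printed in front of it as the weight.

```shell
# dedup_mirror: collapse duplicate and mirror-image rows of a tab-delimited file.
# Hypothetical helper name; reads the named files, or standard input if none are given.
dedup_mirror() {
    awk -F'\t' -v OFS='\t' '
    {
        # Force a string comparison of the two gene names and put them in sorted
        # order, so a row and its mirror image produce the same key.
        # The relationship ($2) stays in the key: "a pp b" is not "a gi b".
        if (($1 "") <= ($3 ""))
            key = $1 SUBSEP $2 SUBSEP $3
        else
            key = $3 SUBSEP $2 SUBSEP $1

        if (!(key in count)) {
            order[++n] = key    # remember the order in which keys first appear
            first[key] = $0     # keep the entire first row, extra columns included
        }
        count[key]++            # number of rows (direct or mirrored) with this key
    }
    END {
        # Print the weight as a new first column, followed by the kept row.
        for (i = 1; i <= n; i++)
            print count[order[i]], first[order[i]]
    }' "$@"
}
```

It could be called as, say, dedup_mirror genes.tsv > genes_weighted.tsv (both file names are placeholders). For the six-line example above, this prints a single line with the fields 6, a, pp, b separated by tabs. Everything is held in memory until the end, which should be workable for a file of a few hundred thousand lines, and no prior sorting is needed. Only the extra columns of the first occurrence are kept; if the extra columns differ between duplicates, the later ones are dropped.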