04-07-2008
Molecular biologist requires help re: search / replace script
Monday April 07, 2008
Hello - I was wondering if someone could help me? I have some basic knowledge of awk, etc., and can create simple scripts (e.g. a search_replace.awk file) that can be called from the command line:
$ awk -f search_replace.awk <file to be searched>
I have a tab-delimited table of data (text), essentially as follows (for simplicity),
a pp b
a pp c
a pp d
a pp e
a pp b
a pp e
a gi b
a pp a
b pp a
d pp a
t gi u
t gi v
t gi w
t gi x
t gi y
t gi z
z gi t
y gi t
v gi t
y gi t
t pp z
I want to be able to take each line, in succession, and search it against the entire file, removing duplicates. I know that I can easily do this using the uniq command (on a *sorted* file), but I also need to be able to identify mirror-image or reverse duplicates, e.g.
a pp b
a pp b
a pp b
b pp a
a pp b
b pp a
should be reduced to a single line,
a pp b
(since "b pp a" is 'the same' as "a pp b").
Is this clear?
Additionally, my actual file contains additional columns (fields, per row); I would like to ignore (but keep) these additional fields, just searching and replacing based on the (in the example above) fields $1, $2, $3. I think that it is possible to specify fields with regard to search / replace operations, etc.
Lastly (I know that I am asking a lot), it would be ideal if the output could also keep track of how many duplicated lines there were, adding a column of "weights" (1; 2; 3; 4; etc.) indicating how many times each line occurred in the source file, with 1 = no duplicates, 2 = one duplicate, etc.
In the six-line example above, this would be
6 a pp b
I have played around with the command line and some simple scripts, but this is a little beyond my grasp. I'm guessing one solution would be a grep operation, piped to / from an awk or sed command, perhaps?
FYI, I am a molecular biologist / geneticist; I am trying to sort a file of perhaps 150,000-200,000 lines, each containing 7-8 fields, for loading into a data visualization / analysis program. In the example above, the first and third columns represent specific genes, with the middle (2nd) column establishing the relationship between the first gene and the second gene. Note that the relationship "pp" is different from "gi", thus
a pp b
is different from
a gi b
The reason for all of this is that if I do not remove duplicate mappings (including the reverse or "mirror images," e.g. "a pp b" = "b pp a"), then I get extra lines appearing in my analysis program (Cytoscape), which complicate the display (relationships between groups of genes). The reason that I asked for the "weights" is that I want to weight the edges (lines) connecting my nodes (genes) in Cytoscape according to how many times each relationship was reported, from various assays (different types of experiments; independent analyses).
If anyone could suggest some solutions, that would be *very* much appreciated!
Thanking you all in advance,
Sincerely, Greg S. :-)
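One possible approach to the request above, as a minimal sketch rather than a definitive solution: a single awk pass that builds an order-independent key from fields $1, $2, $3, counts how often each key occurs, and prints the first line seen for each key with the count prepended as the "weight". The function name dedupe_mirror is hypothetical. The sketch assumes the file really is tab-delimited, that keeping the extra columns of the first occurrence is acceptable, and that the weight should be the total number of occurrences (as in the "6 a pp b" example).

```shell
# dedupe_mirror: hypothetical helper name, not part of the original post.
# Reads tab-delimited lines from the named files (or stdin) and prints
# one line per unique gene pair + relationship, prefixed by its count.
dedupe_mirror() {
  awk -F'\t' -v OFS='\t' '
    {
      # Order-independent key: put the gene name that sorts first in front,
      # so "a pp b" and "b pp a" map to the same key. Appending "" forces a
      # string comparison even if the names look like numbers.
      if (($1 "") <= ($3 ""))
        key = $1 FS $2 FS $3
      else
        key = $3 FS $2 FS $1

      # Remember the first full line (including extra columns) for each key,
      # and the order in which keys first appear.
      if (!(key in count)) {
        order[++n] = key
        first[key] = $0
      }
      count[key]++
    }
    END {
      # Weight column first, then the original line as first seen.
      for (i = 1; i <= n; i++)
        print count[order[i]], first[order[i]]
    }
  ' "$@"
}
```

It could be called as `dedupe_mirror input.txt > output.txt`, where both file names are placeholders. No prior sort is needed, and memory use grows with the number of unique relationships rather than the number of lines, which should be manageable for a file of this size. If the weight is wanted as the last column instead, swapping the two arguments of the print statement would do that.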