subsetting data


 
# 8  
Old 04-03-2010
Hi.

You didn't say so explicitly, but it looks like you want each matching line as well as the line immediately following it. If so, then modern versions of the command grep can do this:
Code:
#!/usr/bin/env bash

# @(#) s1	Demonstrate print matching line plus the next line.

# Infrastructure details, environment, commands for forum posts. 
set +o nounset
LC_ALL=C ; LANG=C ; export LC_ALL LANG
echo ; echo "Environment: LC_ALL = $LC_ALL, LANG = $LANG"
echo "(Versions displayed with local utility \"version\")"
c=$( ps | grep $$ | awk '{print $NF}' )
version >/dev/null 2>&1 && s=$(_eat $0 $1) || s=""
[ "$c" = "$s" ] && p="$s" || p="$c"
version >/dev/null 2>&1 && version "=o" $p specimen grep
set -o nounset

FILE1=data1
FILE2=data2

echo
specimen data1 data2 \
|| { head -5 $FILE ; echo " --" ; tail -5 $FILE; }

echo
echo " Results:"
grep -f $FILE2 -A 1 $FILE1

echo
echo " Results, removing separator:"
grep -f $FILE2 -A 1 $FILE1 |
grep -v -e '^--'

exit 0

producing:
Code:
% ./s1

Environment: LC_ALL = C, LANG = C
(Versions displayed with local utility "version")
OS, ker|rel, machine: Linux, 2.6.26-2-amd64, x86_64
Distribution        : Debian GNU/Linux 5.0 
GNU bash 3.2.39
specimen (local) 1.15
GNU grep 2.5.3

Whole: 5:0:5 of 8 lines in file "data1"
>chr1 strand:+ excise_beg:554293 excise_end:554402
TAATATATTAGATTTGACCTTCAGCAAGGTCAAAGGGAGTCCGAACTAGTCT
>chr2 strand:+ excise_beg:554542 excise_end:554651
ACAGCATACCCCCGATTCCGCTACGACCAACTCATACACCTCCTATGAAAAAA
>chr3 strand:+ excise_beg:554497 excise_end:554606
GTCACCAAGACCCTACTTCTGACCTCCCTGTTCTTATGAATTCGAACAGCATA
>chr4 strand:+ excise_beg:554654 excise_end:554763
CCAGCATTCCCCCTCAAACCTAAGAAATATGTCTGATAAAAGAGTTACTTTGATA

Whole: 5:0:5 of 2 lines in file "data2"
chr1
chr3

 Results:
>chr1 strand:+ excise_beg:554293 excise_end:554402
TAATATATTAGATTTGACCTTCAGCAAGGTCAAAGGGAGTCCGAACTAGTCT
--
>chr3 strand:+ excise_beg:554497 excise_end:554606
GTCACCAAGACCCTACTTCTGACCTCCCTGTTCTTATGAATTCGAACAGCATA

 Results, removing separator:
>chr1 strand:+ excise_beg:554293 excise_end:554402
TAATATATTAGATTTGACCTTCAGCAAGGTCAAAGGGAGTCCGAACTAGTCT
>chr3 strand:+ excise_beg:554497 excise_end:554606
GTCACCAAGACCCTACTTCTGACCTCCCTGTTCTTATGAATTCGAACAGCATA

One drawback is that grep automatically prints a separator line ("--") between groups of matches. The second command pipes the output through an additional grep to eliminate that separator if desired.

To send the output to a 3rd file (rather than the display), use the redirection operator, ">", as noted above.
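
If the separator is unwanted in the first place, a single awk pass can print each matching line plus the following line without producing one. This is only a minimal sketch, not part of the script above: the temporary directory and the shortened sample sequences are made up for illustration, and it treats each line of data2 as a fixed string rather than a regular expression, which differs from what grep -f does.

```shell
#!/usr/bin/env bash
# Sketch: print each matching line plus the next line, with no "--" separator.
# The sample files are hypothetical (sequences shortened from data1).
set -o nounset

dir=$(mktemp -d)
printf '%s\n' \
  '>chr1 strand:+ excise_beg:554293 excise_end:554402' 'TAATATATTAGA' \
  '>chr2 strand:+ excise_beg:554542 excise_end:554651' 'ACAGCATACCCC' \
  '>chr3 strand:+ excise_beg:554497 excise_end:554606' 'GTCACCAAGACC' \
  > "$dir/data1"
printf '%s\n' chr1 chr3 > "$dir/data2"

# While reading the first file (NR == FNR), remember every pattern.
# For the second file, print a line if it contains a pattern, or if the
# previous line did; "after" carries that state from line to line.
result=$(awk '
  NR == FNR { pat[$0] = 1; next }
  {
    hit = 0
    for (p in pat) if (index($0, p) > 0) { hit = 1; break }
    if (hit || after) print
    after = hit
  }
' "$dir/data2" "$dir/data1")

printf '%s\n' "$result"
rm -rf "$dir"
```

Because no separator is ever generated, this variant also avoids the risk that the second grep removes a genuine data line that happens to begin with "--".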

Best wishes ... cheers, drl
# 9  
Old 04-03-2010
Hi jim mcnamara
I still get an error:

awk: extra ) at source line 2
context is
if(FILENAME=="file1" && index($1,">")!=1 ){ arr[key]=arr[key] " " >>> $0) <<<
awk: syntax error at source line 2
awk: illegal statement at source line 2
extra )

---------- Post updated at 09:27 AM ---------- Previous update was at 09:26 AM ----------

Hi rsivasan
it works. Thanks

---------- Post updated at 09:36 AM ---------- Previous update was at 09:27 AM ----------

Hi drl
that is exactly what I want; the matching line and the next line.
where do I add '>' to get the output in a file?
Thanks
# 10  
Old 04-03-2010
Hi.
Quote:
Originally Posted by jdhahbi
... Hi drl
that is exactly what I want; the matching line and the next line.
where do I add '>' to get the output in a file? ...
In the section:
Code:
grep -f $FILE2 -A 1 $FILE1 |
grep -v -e '^--'

append the redirection "> third_file_name" to the last line:
Code:
grep -f $FILE2 -A 1 $FILE1 |
grep -v -e '^--' > third_file_name

That is a very basic operation for the command line ... cheers, drl
# 11  
Old 04-03-2010
hi drl
I get an error:



$ ./my_code.sh

Environment: LC_ALL = C, LANG = C
(Versions displayed with local utility "version")

./my_code.sh: line 20: specimen: command not found
./my_code.sh: line 21: FILE: unbound variable

---------- Post updated at 10:40 AM ---------- Previous update was at 09:53 AM ----------

Quote:
Originally Posted by rsivasan
cat file2|while read r1
do
grep $r1 file1
done
there are two problems:
Problem 1: chr1 will also get chr10, chr11, ...
How do you change the code to get chr1 only? There is a space after chr1.
Problem 2: it does not print the next line.

thanks

Last edited by jdhahbi; 04-03-2010 at 03:04 PM..
# 12  
Old 04-03-2010
Hi.
Quote:
Originally Posted by jdhahbi
hi drl
I get an error:

$ ./my_code.sh

Environment: LC_ALL = C, LANG = C
(Versions displayed with local utility "version")

./my_code.sh: line 20: specimen: command not found
./my_code.sh: line 21: FILE: unbound variable
Remove the 2 lines:
Code:
specimen data1 data2 \
|| { head -5 $FILE ; echo " --" ; tail -5 $FILE; }

they were from a different script. In my environment that construct will work, not in yours ... cheers, drl
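
For readers without the local "specimen" utility, a portable stand-in for those two lines could look like the sketch below. It is an assumption about what the preview was meant to do, not drl's own code: it loops over FILE1 and FILE2 explicitly, so nothing refers to the unset variable FILE while "set -o nounset" is active. The small sample files are hypothetical placeholders.

```shell
#!/usr/bin/env bash
# Sketch: show the first lines of each input file without "specimen".
set -o nounset

dir=$(mktemp -d)
FILE1=$dir/data1
FILE2=$dir/data2
# Hypothetical stand-ins for the real data1 and data2.
printf '%s\n' '>chr1 strand:+' 'TAATAT' '>chr3 strand:+' 'GTCACC' > "$FILE1"
printf '%s\n' chr1 chr3 > "$FILE2"

# Quote every expansion and name each file explicitly.
preview=$(
  for f in "$FILE1" "$FILE2"; do
    echo " First lines of $(basename "$f"):"
    head -n 5 "$f"
    echo " --"
  done
)

printf '%s\n' "$preview"
rm -rf "$dir"
```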
# 13  
Old 04-03-2010
Quote:
Originally Posted by drl
Hi.

Remove the 2 lines:
Code:
specimen data1 data2 \
|| { head -5 $FILE ; echo " --" ; tail -5 $FILE; }

they were from a different script. In my environment that construct will work, not in yours ... cheers, drl

it works except for one issue:
chr1 will get chr10, chr11,..
how do you change the code to get chr1 only? There is a space after chr1.
# 14  
Old 04-03-2010
Hi.
Quote:
Originally Posted by jdhahbi
it works except for one issue:
chr1 will get chr10, chr11,..
how do you change the code to get chr1 only? There is a space after chr1.
Observations:

1) there was no "chr10", "chr11" in your sample data set.

2) Your pattern file includes the 4 characters "chr1" on one of the lines. Where do you think a space should be placed?

Best wishes ... cheers, drl
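
One way to act on the second observation is to put the space, and an anchor, into each pattern before handing the patterns to grep. The following is a minimal sketch under stated assumptions: every header in data1 begins with ">" followed by the name and a single space, the names in data2 contain no regular-expression metacharacters, and the "chr10" record with its coordinates is invented here purely to show the difference. The file name "patterns" is likewise hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: match "chr1" exactly, so that "chr10", "chr11", ... are not selected.
set -o nounset

dir=$(mktemp -d)
# Hypothetical sample data; the chr10 record is made up for illustration.
printf '%s\n' \
  '>chr1 strand:+ excise_beg:554293 excise_end:554402' 'TAATATATTAGA' \
  '>chr10 strand:+ excise_beg:1 excise_end:2' 'CCCCGGGGAAAA' \
  '>chr3 strand:+ excise_beg:554497 excise_end:554606' 'GTCACCAAGACC' \
  > "$dir/data1"
printf '%s\n' chr1 chr3 > "$dir/data2"

# Rewrite each name as an anchored pattern: "chr1" becomes "^>chr1 "
# (start of line, ">", the name, then the trailing space).
sed 's/.*/^>& /' "$dir/data2" > "$dir/patterns"

# Same pipeline as before, but using the anchored patterns.
result=$(grep -A 1 -f "$dir/patterns" "$dir/data1" | grep -v -e '^--')

printf '%s\n' "$result"
rm -rf "$dir"
```

With the trailing space in the pattern, ">chr10 ..." no longer matches, because the character after "chr1" on that line is "0" rather than a space. If the names could contain characters such as "." or "*", they would need escaping before being used as patterns.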
 