Filtering files

# 15  
Old 06-08-2012
Please show the input you have, the output you want, and the output you actually get.
# 16  
Old 06-08-2012
Quote:
Originally Posted by alecapo
I apologize.
Well, I need to separate thousands of markers by name. I have a file (names) with the markers that I need separated, and I want to select those names from a master file (which contains all the markers) and create a new file with them, in the same order as in the "names" file and including all values:

Code:
masterfile.txt (tab separated):

Albumin1A713G   1   1   3   3   1   3   1   3   1        
Albumin1TC1894   1   1   1   1   1   1   1   1   1        
Albumin5G186T   1   1   1   1   1   1   1   1   1        
AY388580_a   0   0   1   2   1   2   1   2   1        
AY388582_a   3   3   1   3   1   3   1   3   1        
AY388585_a   1   1   1   3   1   3   1   1   1        
AY388587_a   1   1   1   1   1   1   1   3   1        
AY388588_a   1   3   1   1   1   1   1   1   1        
AY388589_a   1   1   1   1   1   1   1   1   1        
AY388591_a   1   1   1   2   1   2   2   2   1

names.txt

Albumin1A713G
AY388580_a
AY65789_a
AY388591_a   

desired output.txt:

Albumin1A713G   1   1   3   3   1   3   1   3   1        
AY388580_a   0   0   1   2   1   2   1   2   1        
AY388591_a   1   1   1   2   1   2   2   2   1

I hope it is understandable this time.
The current code is:
Code:
awk 'NR==FNR { A[$1]++; O[++L]=$1; next }; $1 in A { A[$1]=$0 }; END { for(N=1; N<=L; N++) print O[N], A[O[N]]; }' names.txt masterfile.txt > output.txt

And what I'm getting is:
Code:
Albumin1A713G   1   1   3   3   1   3   1   3   1        
AY388580_a   0   0   1   2   1   2   1   2   1 
AY65789_a       
AY388591_a   1   1   1   2   1   2   2   2   1

I wonder if it's possible to remove those blank entries for the "not found" markers, to get it like this:
Code:
Albumin1A713G   1   1   3   3   1   3   1   3   1        
AY388580_a   0   0   1   2   1   2   1   2   1        
AY388591_a   1   1   1   2   1   2   2   2   1

# 17  
Old 06-08-2012
I see my mistake now.

Code:
awk 'NR==FNR { A[$1]++; O[++L]=$1; next }; $1 in A { A[$1]=$0 }; END { for(N=1; N<=L; N++) if(A[O[N]] != 1) print O[N], A[O[N]]; }' names.txt masterfile.txt > output.txt

# 18  
Old 06-08-2012
Hi.

When I used the equivalent of:
Code:
grep -f names.txt masterfile.txt

I got the desired output, as in post #4.

Perhaps I misunderstood something in the request ... cheers, drl

Last edited by drl; 06-08-2012 at 05:55 PM..
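One thing to watch with grep -f: the names are treated as regular expressions and can match anywhere on the line, so a short marker name could also hit a longer name that merely contains it. If that ever becomes a problem, a fixed-string, whole-word variant is a reasonable sketch (options as in GNU grep; check your local grep):
Code:
grep -wFf names.txt masterfile.txt > output.txt

Note that this still prints the lines in masterfile.txt order, not names.txt order.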
# 19  
Old 06-08-2012
Quote:
Originally Posted by Corona688
I see my mistake now.

Code:
awk 'NR==FNR { A[$1]++; O[++L]=$1; next }; $1 in A { A[$1]=$0 }; END { for(N=1; N<=L; N++) if(A[O[N]] != 1) print O[N], A[O[N]]; }' names.txt masterfile.txt > output.txt

Thanks Corona688, the code doesn't seem to work. Strangely, some of the blank entries are removed but others are not.

Quote:
Originally Posted by drl
Hi.

When I used the equivalent of:
Code:
grep -f names.txt masterfile.txt

I got the desired output, as in post #4.

Perhaps I misunderstood something in the request ... cheers, drl
Thanks drl, your code removes the blanks and filters the names, but the output does not keep the same order as in names.txt.


Thanks guys, I really appreciate your help. Please don't worry about this; I can still use the previous code and remove the blanks by hand. I don't want to be a bother any more.
Thanks a lot!

Last edited by alecapo; 06-08-2012 at 08:16 PM..
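If the names.txt order matters and the not-found markers should simply be skipped, a small awk sketch (reading the master file first, then walking names.txt) handles both at once; the array name "line" is only illustrative:
Code:
awk 'NR==FNR { line[$1] = $0; next }     # masterfile.txt: remember each row by its marker name
     ($1 in line) { print line[$1] }     # names.txt: print in this order, skipping absent names
' masterfile.txt names.txt > output.txt

Names that appear more than once in names.txt would be printed once per occurrence.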
# 20  
Old 06-09-2012
Hi, alecapo.
Quote:
Originally Posted by alecapo
... Thanks drl, your code removes the blanks and filters the names, but the output does not keep the same order as in names.txt ...
Observation: the sample data was not representative because it was already in the correct order. We can order files arbitrarily by using a custom collating sequence. One program that can do that is msort.

In this script, I have randomly ordered the main file, then used grep as before, and then ordered the output based on the names file as the alternate collating sequence. Most of the code is supporting: displaying the environment, versions, etc., and then comparing the output file with the desired output:
Code:
#!/usr/bin/env bash

# @(#) s2	Demonstrate msort alternate collating sequence.
# See: http://freecode.com/projects/msort

# Section 1, setup, pre-solution.
# Infrastructure details, environment, debug commands for forum posts. 
# Uncomment export command to run script as external user.
# export PATH="/usr/local/bin:/usr/bin:/bin"
set +o nounset
pe() { for _i;do printf "%s" "$_i";done; printf "\n"; }
pl() { pe;pe "-----" ;pe "$*"; }
LC_ALL=C ; LANG=C ; export LC_ALL LANG
db() { ( printf " db, ";for _i;do printf "%s" "$_i";done;printf "\n" ) >&2 ; }
db() { : ; }
edges() { local _f _n _l;: ${1?"edges: need file"}; _f=$1;_l=$(wc -l $_f);
  head -${_n:=3} $_f ; pe "--- ( $_l: lines total )" ; tail -$_n $_f ; }
C=$HOME/bin/context && [ -f $C ] && $C grep msort
set -o nounset

FILE1=${1-data1}
shift
FILE2=${1-data2}

# Display sample data files.
pe
# specimen $FILE1 $FILE2 
# edges $FILE1 3
# edges $FILE2 3
head $FILE1 $FILE2 expected-output.txt

# Section 2, solution.
pl " Results:"
db " Section 2: solution."
grep -f $FILE2 $FILE1 |
msort -q -n 1,1 -u n -l -c lexicographic -s $FILE2 |
tee f1


# Section 3, post-solution, check results, clean-up, etc.
v1=$(wc -l <expected-output.txt)
v2=$(wc -l < f1)
pl " Comparison of $v2 created lines with $v1 lines of desired results:"
db " Section 3: validate generated calculations with desired results."

pl " Comparison with desired results:"
if [ ! -f expected-output.txt -o ! -s expected-output.txt ]
then
  pe " Comparison file \"expected-output.txt\" zero-length or missing."
  exit
fi
if cmp expected-output.txt f1
then
  pe " Succeeded -- files have same content."
else
  pe " Failed -- files not identical -- detailed comparison follows."
  if diff -b expected-output.txt f1
  then
    pe " Succeeded by ignoring whitespace differences."
  fi
fi

exit 0

producing:
Code:
% ./s2

Environment: LC_ALL = C, LANG = C
(Versions displayed with local utility "version")
OS, ker|rel, machine: Linux, 2.6.26-2-amd64, x86_64
Distribution        : Debian GNU/Linux 5.0.8 (lenny) 
bash GNU bash 3.2.39
grep GNU grep 2.5.3
msort 8.44

==> data1 <==
AY388580_a   0   0   1   2   1   2   1   2   1        
Albumin5G186T   1   1   1   1   1   1   1   1   1        
AY388585_a   1   1   1   3   1   3   1   1   1        
AY388587_a   1   1   1   1   1   1   1   3   1        
AY388589_a   1   1   1   1   1   1   1   1   1        
Albumin1A713G   1   1   3   3   1   3   1   3   1        
AY388588_a   1   3   1   1   1   1   1   1   1        
AY388582_a   3   3   1   3   1   3   1   3   1        
AY388591_a   1   1   1   2   1   2   2   2   1
Albumin1TC1894   1   1   1   1   1   1   1   1   1        

==> data2 <==
Albumin1A713G
AY388580_a
AY388591_a   

==> expected-output.txt <==
Albumin1A713G   1   1   3   3   1   3   1   3   1        
AY388580_a   0   0   1   2   1   2   1   2   1        
AY388591_a   1   1   1   2   1   2   2   2   1

-----
 Results:
Albumin1A713G   1   1   3   3   1   3   1   3   1        
AY388580_a   0   0   1   2   1   2   1   2   1        
AY388591_a   1   1   1   2   1   2   2   2   1

-----
 Comparison of 3 created lines with 3 lines of desired results:

-----
 Comparison with desired results:
 Succeeded -- files have same content.

Best wishes ... cheers, drl
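For anyone without msort installed, the same "order by an external list" idea can be sketched with standard tools: rank the wanted names by their position in names.txt, tag each grep hit with its rank, sort numerically, and strip the tag again (the intermediate file name f2 and the array name "rank" are just for illustration):
Code:
grep -wFf names.txt masterfile.txt > f2
awk 'NR==FNR { rank[$1] = NR; next }              # names.txt: position of each name
     ($1 in rank) { print rank[$1] "\t" $0 }      # f2: prefix each hit with its rank
' names.txt f2 |
  sort -n -k1,1 | cut -f2- > output.txt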
# 21  
Old 06-09-2012
Quote:
Originally Posted by alecapo
Thanks Corona688, the code doesn't seem to work. Strangely, some of the blank entries are removed but others are not.
Then your input data doesn't genuinely resemble the data you posted; please post a sample which doesn't work.
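For reference, here is a sketch that keeps a separate "found" state instead of testing the counter value, which may behave better if a name is repeated in names.txt or a stored line could collide with the count (array names A and O kept from the earlier post):
Code:
awk 'NR==FNR { if (!($1 in A)) O[++L] = $1; A[$1] = ""; next }   # names.txt: record order once per name
     ($1 in A) { A[$1] = $0 }                                    # masterfile.txt: keep the matching row
     END { for (N = 1; N <= L; N++) if (A[O[N]] != "") print A[O[N]] }
' names.txt masterfile.txt > output.txt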