Filtering files

# 15  
Old 06-08-2012
Please show the input you have, the output you want, and the output you actually get.
# 16  
Old 06-08-2012
Quote:
Originally Posted by alecapo
I apologize.
Well, I need to separate thousands of markers by name. I have a file (names) with the markers that I need separated, and I want to select those names from a master file (which contains all the markers) and create a new file with them, in the same order as in the "names" file and including all values:

Code:
masterfile.txt (tab separated):

Albumin1A713G   1   1   3   3   1   3   1   3   1        
Albumin1TC1894   1   1   1   1   1   1   1   1   1        
Albumin5G186T   1   1   1   1   1   1   1   1   1        
AY388580_a   0   0   1   2   1   2   1   2   1        
AY388582_a   3   3   1   3   1   3   1   3   1        
AY388585_a   1   1   1   3   1   3   1   1   1        
AY388587_a   1   1   1   1   1   1   1   3   1        
AY388588_a   1   3   1   1   1   1   1   1   1        
AY388589_a   1   1   1   1   1   1   1   1   1        
AY388591_a   1   1   1   2   1   2   2   2   1

names.txt

Albumin1A713G
AY388580_a
AY65789_a
AY388591_a   

desired output.txt:

Albumin1A713G   1   1   3   3   1   3   1   3   1        
AY388580_a   0   0   1   2   1   2   1   2   1        
AY388591_a   1   1   1   2   1   2   2   2   1

I hope it is understandable this time.
The current code is:
Code:
awk 'NR==FNR { A[$1]++; O[++L]=$1; next }; $1 in A { A[$1]=$0 }; END { for(N=1; N<=L; N++) print O[N], A[O[N]]; }' names.txt masterfile.txt > output.txt

And what I'm getting is:
Code:
Albumin1A713G   1   1   3   3   1   3   1   3   1        
AY388580_a   0   0   1   2   1   2   1   2   1 
AY65789_a       
AY388591_a   1   1   1   2   1   2   2   2   1

I wonder if it's possible to remove those blank entries for the "not found" markers, to get it like this:
Code:
Albumin1A713G   1   1   3   3   1   3   1   3   1        
AY388580_a   0   0   1   2   1   2   1   2   1        
AY388591_a   1   1   1   2   1   2   2   2   1

# 17  
Old 06-08-2012
I see my mistake now.

Code:
awk 'NR==FNR { A[$1]++; O[++L]=$1; next }; $1 in A { A[$1]=$0 }; END { for(N=1; N<=L; N++) if(A[O[N]] != 1) print O[N], A[O[N]]; }' names.txt masterfile.txt > output.txt

# 18  
Old 06-08-2012
Hi.

When I used the equivalent of:
Code:
grep -f names.txt masterfile.txt

I got the desired output, as in post #4.

Perhaps I misunderstood something in the request ... cheers, drl

Last edited by drl; 06-08-2012 at 05:55 PM..
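One thing to watch with grep -f: the names are treated as regular expressions and can match anywhere on the line, so a short marker name could also hit a longer name that merely contains it. If that ever becomes a problem, a fixed-string, whole-word variant is a reasonable sketch (options as in GNU grep; check your local grep):
Code:
grep -wFf names.txt masterfile.txt > output.txt

Note that this still prints the lines in masterfile.txt order, not names.txt order.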
# 19  
Old 06-08-2012
Quote:
Originally Posted by Corona688
I see my mistake now.

Code:
awk 'NR==FNR { A[$1]++; O[++L]=$1; next }; $1 in A { A[$1]=$0 }; END { for(N=1; N<=L; N++) if(A[O[N]] != 1) print O[N], A[O[N]]; }' names.txt masterfile.txt > output.txt

Thanks Corona688, the code doesn't seem to work. Strangely, some of the blank entries are removed but others are not.

Quote:
Originally Posted by drl
Hi.

When I used the equivalent of:
Code:
grep -f names.txt masterfile.txt

I got the desired output, as in post #4.

Perhaps I misunderstood something in the request ... cheers, drl
Thanks drl, your code removes the blanks and filters the names, but the output does not keep the same order as in names.txt.


Thanks guys, I really appreciate your help. Please don't worry about this; I can still use the previous code and remove the blanks by hand. I don't want to be a bother any more.
Thanks a lot!

Last edited by alecapo; 06-08-2012 at 08:16 PM..
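If the names.txt order matters and the not-found markers should simply be skipped, a small awk sketch (reading the master file first, then walking names.txt) handles both at once; the array name "line" is only illustrative:
Code:
awk 'NR==FNR { line[$1] = $0; next }     # masterfile.txt: remember each row by its marker name
     ($1 in line) { print line[$1] }     # names.txt: print in this order, skipping absent names
' masterfile.txt names.txt > output.txt

Names that appear more than once in names.txt would be printed once per occurrence.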
# 20  
Old 06-09-2012
Hi, alecapo.
Quote:
Originally Posted by alecapo
... Thanks drl, your code removes the blanks and filters the names, but the output does not keep the same order as in names.txt ...
Observation: the sample data was not representative because it was already in the correct order. We can order files arbitrarily by using a custom collating sequence. One program that can do that is msort.

In this script, I have randomly ordered the main file, then used grep as before, and then ordered the output based on the names file as the alternate collating sequence. Most of the code is supporting: displaying the environment, versions, etc., and then comparing the output file with the desired output:
Code:
#!/usr/bin/env bash

# @(#) s2	Demonstrate msort alternate collating sequence.
# See: http://freecode.com/projects/msort

# Section 1, setup, pre-solution.
# Infrastructure details, environment, debug commands for forum posts. 
# Uncomment export command to run script as external user.
# export PATH="/usr/local/bin:/usr/bin:/bin"
set +o nounset
pe() { for _i;do printf "%s" "$_i";done; printf "\n"; }
pl() { pe;pe "-----" ;pe "$*"; }
LC_ALL=C ; LANG=C ; export LC_ALL LANG
db() { ( printf " db, ";for _i;do printf "%s" "$_i";done;printf "\n" ) >&2 ; }
db() { : ; }
edges() { local _f _n _l;: ${1?"edges: need file"}; _f=$1;_l=$(wc -l $_f);
  head -${_n:=3} $_f ; pe "--- ( $_l: lines total )" ; tail -$_n $_f ; }
C=$HOME/bin/context && [ -f $C ] && $C grep msort
set -o nounset

FILE1=${1-data1}
shift
FILE2=${1-data2}

# Display sample data files.
pe
# specimen $FILE1 $FILE2 
# edges $FILE1 3
# edges $FILE2 3
head $FILE1 $FILE2 expected-output.txt

# Section 2, solution.
pl " Results:"
db " Section 2: solution."
grep -f $FILE2 $FILE1 |
msort -q -n 1,1 -u n -l -c lexicographic -s $FILE2 |
tee f1


# Section 3, post-solution, check results, clean-up, etc.
v1=$(wc -l <expected-output.txt)
v2=$(wc -l < f1)
pl " Comparison of $v2 created lines with $v1 lines of desired results:"
db " Section 3: validate generated calculations with desired results."

pl " Comparison with desired results:"
if [ ! -f expected-output.txt -o ! -s expected-output.txt ]
then
  pe " Comparison file \"expected-output.txt\" zero-length or missing."
  exit
fi
if cmp expected-output.txt f1
then
  pe " Succeeded -- files have same content."
else
  pe " Failed -- files not identical -- detailed comparison follows."
  if diff -b expected-output.txt f1
  then
    pe " Succeeded by ignoring whitespace differences."
  fi
fi

exit 0

producing:
Code:
% ./s2

Environment: LC_ALL = C, LANG = C
(Versions displayed with local utility "version")
OS, ker|rel, machine: Linux, 2.6.26-2-amd64, x86_64
Distribution        : Debian GNU/Linux 5.0.8 (lenny) 
bash GNU bash 3.2.39
grep GNU grep 2.5.3
msort 8.44

==> data1 <==
AY388580_a   0   0   1   2   1   2   1   2   1        
Albumin5G186T   1   1   1   1   1   1   1   1   1        
AY388585_a   1   1   1   3   1   3   1   1   1        
AY388587_a   1   1   1   1   1   1   1   3   1        
AY388589_a   1   1   1   1   1   1   1   1   1        
Albumin1A713G   1   1   3   3   1   3   1   3   1        
AY388588_a   1   3   1   1   1   1   1   1   1        
AY388582_a   3   3   1   3   1   3   1   3   1        
AY388591_a   1   1   1   2   1   2   2   2   1
Albumin1TC1894   1   1   1   1   1   1   1   1   1        

==> data2 <==
Albumin1A713G
AY388580_a
AY388591_a   

==> expected-output.txt <==
Albumin1A713G   1   1   3   3   1   3   1   3   1        
AY388580_a   0   0   1   2   1   2   1   2   1        
AY388591_a   1   1   1   2   1   2   2   2   1

-----
 Results:
Albumin1A713G   1   1   3   3   1   3   1   3   1        
AY388580_a   0   0   1   2   1   2   1   2   1        
AY388591_a   1   1   1   2   1   2   2   2   1

-----
 Comparison of 3 created lines with 3 lines of desired results:

-----
 Comparison with desired results:
 Succeeded -- files have same content.

Best wishes ... cheers, drl
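For anyone without msort installed, the same "order by an external list" idea can be sketched with standard tools: rank the wanted names by their position in names.txt, tag each grep hit with its rank, sort numerically, and strip the tag again (the intermediate file name f2 and the array name "rank" are just for illustration):
Code:
grep -wFf names.txt masterfile.txt > f2
awk 'NR==FNR { rank[$1] = NR; next }              # names.txt: position of each name
     ($1 in rank) { print rank[$1] "\t" $0 }      # f2: prefix each hit with its rank
' names.txt f2 |
  sort -n -k1,1 | cut -f2- > output.txt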
# 21  
Old 06-09-2012
Quote:
Originally Posted by alecapo
Thanks Corona688, the code doesn't seem to work. Strangely, some of the blank entries are removed but others are not.
Then your input data doesn't genuinely resemble the data you posted; please post a sample which doesn't work.
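For reference, here is a sketch that keeps a separate "found" state instead of testing the counter value, which may behave better if a name is repeated in names.txt or a stored line could collide with the count (array names A and O kept from the earlier post):
Code:
awk 'NR==FNR { if (!($1 in A)) O[++L] = $1; A[$1] = ""; next }   # names.txt: record order once per name
     ($1 in A) { A[$1] = $0 }                                    # masterfile.txt: keep the matching row
     END { for (N = 1; N <= L; N++) if (A[O[N]] != "") print A[O[N]] }
' names.txt masterfile.txt > output.txt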