Visit The New, Modern Unix Linux Community


Failure using regex with awk in 'while read file' loop


 
Thread Tools Search this Thread
Top Forums Shell Programming and Scripting Failure using regex with awk in 'while read file' loop
# 1  
Failure using regex with awk in 'while read file' loop

I have a file1.txt with several 100k lines, each of which has a column 9 containing one of 60 "label" identifiers. Using an labels.txt file containing a list of labels, I'd like to extract 200 random lines from file1.txt for each of the labels in index.txt.

Using a contrived mini-example:
Code:
$ cat file1.txt 
H	0	328	100.0	-	0	0	38D150M140D	M01433:68:000000000-AAT0D:1:1111:13371:3239;barcodelabel=c8;	OTU_1;size=17947;
H	1	325	100.0	+	0	0	150M175D	M01433:68:000000000-AAT0D:1:1105:27659:19941;barcodelabel=c12;	OTU_2;size=101;
H	4	411	99.3	+	0	0	24D150M237D	M01433:68:000000000-AAT0D:1:2107:16393:23698;barcodelabel=g10;	OTU_5;size=64;
H	2	283	98.7	+	0	0	150M133D	M01433:68:000000000-AAT0D:1:2104:21919:3018;barcodelabel=c12;	OTU_3;size=80;
H	1	277	98.5	-	0	0	15I135M142D	M01433:68:000000000-AAT0D:1:2108:12616:12185;barcodelabel=c12;	OTU_2;size=101;
H	0	295	100.0	+	0	0	14D150M131D	M01433:68:000000000-AAT0D:1:1108:4978:15986;barcodelabel=g10;	OTU_1;size=17947;
H	29	312	97.6	-	0	0	25I125M187D	M01433:68:000000000-AAT0D:1:1109:20934:22671;barcodelabel=g15;	OTU_30;size=8;
H	0	315	99.3	-	0	0	88D150M77D	M01433:68:000000000-AAT0D:1:2114:17509:23920;barcodelabel=g10;	OTU_1;size=17947;

$ cat labels.txt
c12
g10

This is what I'm trying, but it results in empty files:
Code:
$ while read file
> do
> awk '/${file}/' file1.txt | gshuf -n 200 > ${file}.txt
> done < labels.txt

Desired output--two random lines for each label in labels.txt (i.e. may vary except for "label=c12" or "label=g12", respectively):
Code:
$ cat c12.txt
H	1	325	100.0	+	0	0	150M175D	M01433:68:000000000-AAT0D:1:1105:27659:19941;barcodelabel=c12;	OTU_2;size=101;
H	2	283	98.7	+	0	0	150M133D	M01433:68:000000000-AAT0D:1:2104:21919:3018;barcodelabel=c12;	OTU_3;size=80;

$ cat g10.txt
H	0	295	100.0	+	0	0	14D150M131D	M01433:68:000000000-AAT0D:1:1108:4978:15986;barcodelabel=g10;	OTU_1;size=17947;
H	0	315	99.3	-	0	0	88D150M77D	M01433:68:000000000-AAT0D:1:2114:17509:23920;barcodelabel=g10;	OTU_1;size=17947;

It seems like the problem is with the " awk '/${file}/' "? I say this because I can extract lines for each label but only if I explicitly specify the label regex (in this case g10.txt also has two random lines with "label=c12" instead of g10):
Code:
$ while read file
> do
> awk '/c12/' file1.txt | gshuf -n 2 > ${file}.txt
> done < labels.txt
$ cat c12.txt 
H	1	277	98.5	-	0	0	15I135M142D	M01433:68:000000000-AAT0D:1:2108:12616:12185;barcodelabel=c12;	OTU_2;size=101;
H	1	325	100.0	+	0	0	150M175D	M01433:68:000000000-AAT0D:1:1105:27659:19941;barcodelabel=c12;	OTU_2;size=101;

Thanks for any pointers.
# 2  
Try
Code:
awk 'FNR==NR{A[$1];next}$3 in A && A[$3] < limit {  print > sprintf("%s.txt",$3); A[$3]+=1 }' limit=2  labels.txt FS='[=;]' file1.txt

OR

Code:
awk 'FNR==NR{A[$1];next}{split($9,T,/[=;]/)}T[3] in A && A[T[3]] < limit{print > sprintf("%s.txt",T[3]); A[T[3]]+=1 }' limit=2 labels.txt  file1.txt


Test result


Code:
[akshay@localhost tmp]$ cat labels.txt 
c12
g10

Code:
[akshay@localhost tmp]$ cat file1.txt
H	0	328	100.0	-	0	0	38D150M140D	M01433:68:000000000-AAT0D:1:1111:13371:3239;barcodelabel=c8;	OTU_1;size=17947;
H	1	325	100.0	+	0	0	150M175D	M01433:68:000000000-AAT0D:1:1105:27659:19941;barcodelabel=c12;	OTU_2;size=101;
H	4	411	99.3	+	0	0	24D150M237D	M01433:68:000000000-AAT0D:1:2107:16393:23698;barcodelabel=g10;	OTU_5;size=64;
H	2	283	98.7	+	0	0	150M133D	M01433:68:000000000-AAT0D:1:2104:21919:3018;barcodelabel=c12;	OTU_3;size=80;
H	1	277	98.5	-	0	0	15I135M142D	M01433:68:000000000-AAT0D:1:2108:12616:12185;barcodelabel=c12;	OTU_2;size=101;
H	0	295	100.0	+	0	0	14D150M131D	M01433:68:000000000-AAT0D:1:1108:4978:15986;barcodelabel=g10;	OTU_1;size=17947;
H	29	312	97.6	-	0	0	25I125M187D	M01433:68:000000000-AAT0D:1:1109:20934:22671;barcodelabel=g15;	OTU_30;size=8;
H	0	315	99.3	-	0	0	88D150M77D	M01433:68:000000000-AAT0D:1:2114:17509:23920;barcodelabel=g10;	OTU_1;size=17947;

Code:
[akshay@localhost tmp]$ awk 'FNR==NR{A[$1];next}$3 in A && A[$3] < limit {  print > sprintf("%s.txt",$3); A[$3]+=1 }' limit=2  labels.txt FS='[=;]' file1.txt

Code:
[akshay@localhost tmp]$ cat c12.txt 
H	1	325	100.0	+	0	0	150M175D	M01433:68:000000000-AAT0D:1:1105:27659:19941;barcodelabel=c12;	OTU_2;size=101;
H	2	283	98.7	+	0	0	150M133D	M01433:68:000000000-AAT0D:1:2104:21919:3018;barcodelabel=c12;	OTU_3;size=80;

Code:
[akshay@localhost tmp]$ cat g10.txt 
H	4	411	99.3	+	0	0	24D150M237D	M01433:68:000000000-AAT0D:1:2107:16393:23698;barcodelabel=g10;	OTU_5;size=64;
H	0	295	100.0	+	0	0	14D150M131D	M01433:68:000000000-AAT0D:1:1108:4978:15986;barcodelabel=g10;	OTU_1;size=17947;


Last edited by Akshay Hegde; 04-07-2015 at 01:57 AM..
# 3  
This successfully extracts lines with the correct label. But I am looking for only two lines for each label in labels.txt, and the lines should be randomly chosen. Can this script be piped to " gshuf -n 2"--or the equivalent--before the file is created?

What does $3 do in this code?
# 4  
With FS='[=;]', the 1st field on each line in your sample data is the string of characters before the 1st semicolon, the 2nd field is the string barcodelabel, and the 3rd field (referenced in awk by $3) is the string between the equals sign and the 2nd semicolon (which is what you're looking for to match a string read from labels.txt).
# 5  
Do I have this basically right?
Code:
{
    A[$1]; # Build array on the first column of the file
    next  # Skip all proceeding blocks and process next line
}
$3 in A # Check in the value in column one of the second files is in the array
{
    # If so print it it to file named based on $3
print > sprintf("%s.txt",$3) 
}

If so, how can I adapt this code to give me only a random subset of matched lines for each array value? Is there a way to redirect the match specified by $3 in A to gshuf -n 2 before printing?
# 6  
pipe the 'awk' output into whatever you want to get you random output.
# 7  
I know that I could pipe the output file and get my desired output. But I would like to accomplish it all at once if possible--find matching lines in file1.txt based on the labels in labels.txt, and print a random subset of two of these matching lines to an output file named according to the label.

Previous Thread | Next Thread
Thread Tools Search this Thread
Search this Thread:
Advanced Search

Test Your Knowledge in Computers #118
Difficulty: Easy
A Unix-like OS is the one that works like Unix systems, however, Unix-like system do not necessarily conform to the Single UNIX Specification (SUS) or POSIX (Portable Operating System Interface) standards.
True or False?

10 More Discussions You Might Find Interesting

1. Shell Programming and Scripting

Failure: if grep "$Var" "$line" inside while read line loop

Hi everybody, I am new at Unix/Bourne shell scripting and with my youngest experiences, I will not become very old with it :o My code: #!/bin/sh set -e set -u export IFS= optl="Optl" LOCSTORCLI="/opt/lsi/storcli/storcli" ($LOCSTORCLI /c0 /vall show | grep RAID | cut -d " "... (5 Replies)
Discussion started by: Subsonic66
5 Replies

2. Shell Programming and Scripting

Use while loop to read file and use ${file} for both filename input into awk and as string to print

I have files named with different prefixes. From each I want to extract the first line containing a specific string, and then print that line along with the prefix. I've tried to do this with a while loop, but instead of printing the prefix I print the first line of the file twice. Files:... (3 Replies)
Discussion started by: pathunkathunk
3 Replies

3. Shell Programming and Scripting

For loop inside awk to read and print contents of files

Hello, I have a set of files Xfile0001 - Xfile0021, and the content of this files (one at a time) needs to be printed between some line (lines start with word "Generated") that I am extracting from another file called file7.txt and all the output goes into output.txt. First I tried creating a for... (5 Replies)
Discussion started by: jaldo0805
5 Replies

4. Shell Programming and Scripting

Using awk instead of while loop to read file

Hello, I have a huge file, I am currently using while loop to read and do some calculation on it, but it is taking a lot of time. I want to use AWK to read and do those calculations. Please suggest. currently doing: cat input2 | while read var1 do num=`echo $var1 | awk... (6 Replies)
Discussion started by: anand2308
6 Replies

5. UNIX for Dummies Questions & Answers

read regex from ID file, print regex and line below from source file

I have a file of protein sequences with headers (my source file). Based on a list of IDs (which are included in some of the headers), I'd like to print out only the specified sequences, with only the ID as header. In other words, I'd like to search source.txt for the terms in IDs.txt, and print... (3 Replies)
Discussion started by: pathunkathunk
3 Replies

6. Shell Programming and Scripting

IF awk in a while read line-loop

Hi As a newbe in scripting, i struggle hard with my first script. What i want to do is, bringing data of two files together. file1: .... 05/14/12-04:00:00 41253 4259 5135 5604 5812 5372 05/14/12-04:10:00 53408 5501 6592 7402 7354 6639 05/14/12-04:20:00 58748 6037 7292 8223... (13 Replies)
Discussion started by: IMPe
13 Replies

7. SCO

file system not getting mounted in read write mode after system power failure

After System power get failed File system is not getting mounted in read- write mode (1 Reply)
Discussion started by: gtkpmbpl
1 Replies

8. Shell Programming and Scripting

How to Read the entire file using while loop

Guys, I am trying to read the whole file using while loop but when i run the ssh part of the script it reads only the first line and exit after that. There are in total 134 lines in the file, but when the output is redirected, it does only for one line and comes to command prompt. pls help..... (11 Replies)
Discussion started by: sdosanjh
11 Replies

9. UNIX for Dummies Questions & Answers

How to read a file in unix using do....done loop

Hi , can some give me idea about how to use do...done while loop in UNIX to read the contents of a file.. (2 Replies)
Discussion started by: sreenusola
2 Replies

10. Shell Programming and Scripting

Read from a file and use the strings in a loop

Hello all, I need some help to create a script which contain a few strings on every line, and use those strings in a loop to fire some commands. for exmaple the file looks like tom dave bill andy paul I want to read one line at a time and use it in loop like command tom command dave... (3 Replies)
Discussion started by: xboxer21
3 Replies

Featured Tech Videos