Search files in directory for keywords using bash


 
# 1  
Old 01-11-2016

I have ~100 text files in a directory that I am trying to parse and output to a new file. I am looking for the column titles chr, start, stop, ref and alt, which should appear somewhere in each file. The first two fields of each new set of rows should also be printed. Since this is on a Windows OS, I used "path\to\folder" in the bash script.

Example of files to search (each is a separate file):
Code:
name1	1111	chr	start	stop	ref	alt	comment		
		1	10	25	a	t	snp		
		1	20	75	t	-	del		
		2	30	120	-	a	ins		
		10	10	80	a	g	snp		
name2	222	id	chr	start	stop	ref	alt	comment	
		1111	1	10	25	a	g	snp	
name3	333333	id	symbol	chr	start	stop	ref	alt	comment
		222	name	1	20	75	c	-	del
		222	name	2	30	120	-	t	ins

desired output
Code:
name1	1111	chr	start	stop	ref	alt
		1	10	25	a	t
		1	20	75	t	-
		2	30	120	-	a
		10	10	80	a	g
name2	222	chr	start	stop	ref	alt
		1	10	25	a	g
name3	333333	chr	start	stop	ref	alt
		1	20	75	c	-
		2	30	120	-	t

Thank you :).

bash tried
Code:
for f in "C:\Users/test\Desktop\file\folder*.txt" ; do
        bname=${f##*/}
        pref=${bname%%.bam}
        awk "/chr/{found=1}/start/{if(found)/stop/{if(found)/ref/{if(found)/alt/{if(found) $f print > ${pref}_edit.txt
done

# 2  
Old 01-11-2016
Is this a free-form input set of files? Are those spaces? Fixed-width columns? Can columns run into each other? For example, could you have:

Code:
start  stop
5500056000

where start is 55000 and stop is 56000

Just trying to clear up the unknowns...

Are column titles arbitrary, can the columns be in any order?

(I'm sure I could think of more things to ask)
# 3  
Old 01-11-2016
The inputs are Excel xlsx files that I converted to text in VBA, so they should all be tab-separated. There are 133 individual text files, with the column titles in varying order. In some it will be chr,start,stop,ref,alt, in others id,chr,start,stop,ref,alt, and in others name,symbol,id,chr,start,stop,ref,alt. Does this help? Thank you :).
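Since everything that follows relies on the files really being tab-separated, it may be worth checking that first. A minimal sketch, assuming the converted files end in .txt; the function name files_without_tabs is made up for illustration:

```shell
# Hypothetical helper: print the name of every .txt file in a directory
# that contains no tab character at all (so it is probably not
# tab-separated).
files_without_tabs() {
    dir=${1:-.}
    tab=$(printf '\t')                   # a literal tab, portable across shells
    for f in "$dir"/*.txt; do
        [ -e "$f" ] || continue          # the glob matched nothing
        grep -q "$tab" "$f" || printf '%s\n' "$f"
    done
}
```

Calling `files_without_tabs path/to/folder` should print nothing if every file contains at least one tab.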
# 4  
Old 01-11-2016
Are the lines with column titles always lines that begin with a non-whitespace character (e.g. name1)?
# 5  
Old 01-11-2016
Theoretically, you could have saved them as CSV (Comma Separated Values).
However, the separator (IFS) here seems to be NL/tab.

Your awk code is invalid.
Code:
1 ~/tmp $ LC_ALL=C sh ccmbade.sh 
awk: cmd. line:1: /chr/{found=1}/start/{if(found)/stop/{if(found)/ref/{if(found)/alt/{if(found) ccmbade.dat print > ccmbade.dat_edit.txt
awk: cmd. line:1:                                      ^ syntax error
awk: cmd. line:1: /chr/{found=1}/start/{if(found)/stop/{if(found)/ref/{if(found)/alt/{if(found) ccmbade.dat print > ccmbade.dat_edit.txt
awk: cmd. line:1:                                                     ^ syntax error
awk: cmd. line:1: /chr/{found=1}/start/{if(found)/stop/{if(found)/ref/{if(found)/alt/{if(found) ccmbade.dat print > ccmbade.dat_edit.txt
awk: cmd. line:1:                                                                    ^ syntax error
awk: cmd. line:1: /chr/{found=1}/start/{if(found)/stop/{if(found)/ref/{if(found)/alt/{if(found) ccmbade.dat print > ccmbade.dat_edit.txt
awk: cmd. line:1:                                                                                      ^ syntax error
awk: cmd. line:1: /chr/{found=1}/start/{if(found)/stop/{if(found)/ref/{if(found)/alt/{if(found) ccmbade.dat print > ccmbade.dat_edit.txt
awk: cmd. line:1:                                                                                                          ^ syntax error
awk: cmd. line:1: /chr/{found=1}/start/{if(found)/stop/{if(found)/ref/{if(found)/alt/{if(found) ccmbade.dat print > ccmbade.dat_edit.txt
awk: cmd. line:1:                                                                                                                       ^ unexpected newline or end of string
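The syntax errors above come from the shape of the awk program: every opening { needs a closing }, each if(found) needs a statement after it, and because the program is in double quotes the shell pastes $f into the program text instead of passing it as a file name. Separately, a glob inside quotes is not expanded, so the original loop would run once with the literal pattern. A minimal sketch of a well-formed version, assuming the goal of that attempt was to keep the lines containing all five keywords; the function name keyword_lines and the output directory argument are illustrative, not from the thread:

```shell
# Sketch: for each .txt file, write the lines that contain all five
# keywords to <name>_edit.txt in an output directory.
keyword_lines() {
    dir=$1
    outdir=${2:-.}
    for f in "$dir"/*.txt; do            # the glob stays OUTSIDE the quotes
        [ -e "$f" ] || continue          # the glob matched nothing
        bname=${f##*/}                   # strip the directory part
        pref=${bname%.txt}               # strip the .txt suffix
        # The awk program is in single quotes so the shell leaves it alone.
        # A pattern without an action prints the matching line.
        awk '/chr/ && /start/ && /stop/ && /ref/ && /alt/' "$f" \
            > "$outdir/${pref}_edit.txt"
    done
}
```

On Windows the directory would normally be written with forward slashes; the exact drive prefix depends on the environment (for example /c/Users/... in Git Bash or /cygdrive/c/Users/... in Cygwin). With the sample data this only keeps the title lines, so it fixes the syntax but does not yet produce the desired output.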

Having 3 different kinds of 'columns' doesn't really help.

You know, you don't have to use awk; you could use regular shell scripting,
if that is easier for you.

That goes for me too, so here is something to get you started:
Code:
for f in *.dat ; do
	bname=${f##*/}
	#pref=${bname%%.bam}	## don't have that
	#awk "/chr/{found=1}/start/{if(found)/stop/{if(found)/ref/{if(found)/alt/{if(found) $f print > ${pref}_edit.txt"
	while read content_line
	do
		if echo "$content_line" | grep -q ^name
		then	
			MODE="default"	# Reset parse mode
			echo "$content_line" | grep -v symbol | grep -q id && MODE=id
			echo "$content_line" | grep -q symbol && MODE=symbol
		fi
		
		case $MODE in
		default)	while read chr start stop ref alt comment;do
					line_print="$chr $start $stop $ref $alt"	# comment column is not wanted
				done<<<"$content_line" ##>> ccmcbabe.output
				;;
		id)		echo "id handling"	;;
		symbol)		echo "symbol handling"	;;
		esac
		
		echo "$MODE :: $line_print"
	done < "$f"
done

hth
EDIT:
Which then outputs as:
Code:
sh ccmbade.sh 
default :: name1 1111 chr start stop 
default :: 1 10 25 a t 
default :: 1 20 75 t - 
default :: 2 30 120 - a 
default :: 10 10 80 a g 
id handling
id :: 10 10 80 a g 
id handling
id :: 10 10 80 a g 
symbol handling
symbol :: 10 10 80 a g 
symbol handling
symbol :: 10 10 80 a g 
symbol handling
symbol :: 10 10 80 a g 
0 ~/tmp $
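The id and symbol branches above are left as stubs. One way to avoid a separate branch per layout is to look up the column positions from each title line. A rough sketch in plain bash (it needs bash, not sh), assuming tab-separated input, title lines that start with a non-empty first field, and all five titles present on every title line; extract_columns is a made-up name:

```shell
#!/usr/bin/env bash
# Hypothetical helper: read tab-separated text on stdin and print the
# first two fields plus chr/start/stop/ref/alt, located by title.
extract_columns() {
    local line out name i sep=$'\x1f'
    local -a cols idx want=(chr start stop ref alt)
    while IFS= read -r line || [[ -n $line ]]; do
        line=${line%$'\r'}                      # tolerate Windows line endings
        [[ -z $line ]] && continue              # skip empty lines
        # Tabs are IFS whitespace and would be collapsed by read, so swap
        # them for a non-whitespace separator before splitting.
        IFS=$sep read -r -a cols <<< "${line//$'\t'/$sep}"
        if [[ -n ${cols[0]} ]]; then            # title line: field 1 is set
            idx=()
            for name in "${want[@]}"; do
                for i in "${!cols[@]}"; do
                    [[ ${cols[i]} == "$name" ]] && idx+=("$i")
                done
            done
        fi
        out="${cols[0]}"$'\t'"${cols[1]}"
        for i in "${idx[@]}"; do out+=$'\t'"${cols[i]}"; done
        printf '%s\n' "$out"
    done
}
```

It would be used as `extract_columns < "$f"` inside the for loop. A shell loop like this is much slower than awk on large files, so it is mainly useful if awk feels unfamiliar.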


Last edited by sea; 01-11-2016 at 05:11 PM..
# 6  
Old 01-11-2016
Yes. If the two files below were used, name1,123 would be file 1 and name2,1234 would be file 2. Does this help? Thank you :).

Code:
 name1 123  chr  start  stop  ref  alt 
                      1    10     20    a    -
                      1    30    150  -     aaaa
 name2  1234 chr   start  stop  ref  alt
                       2     220   250   t     c

# 7  
Old 01-11-2016
Another way you could try:

Code:
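BEGIN {
  FS=OFS="\t"                              # tab-separated input and output
  header="chr,start,stop,ref,alt"          # wanted columns, in output order
  n=split("x,x," header,H,",")             # H[3..n] = wanted titles; H[1], H[2] are placeholders
}
{
  split($0,F)                              # save the original fields in F
  if($1!="") for(i=3; i<=NF; i++) O[$i]=i  # title line: map column title -> position
  $0=x                                     # empty the record (x is an unset variable)
  $1=F[1]; $2=F[2]                         # keep the first two fields as they are
  for(i=3; i<=n; i++) $i=F[O[H[i]]]        # copy each wanted column into place
  print
}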

Code:
cd path/to/folder
awk -f /path2/to/script *.txt > file.out

If there are too many files for a single command line, you could instead:
Code:
for i in *.txt
do
  cat "$i"
done |
awk -f /path2/to/script > file.out

--
Output with sample:
Code:
name1	1111	chr	start	stop	ref	alt
		1	10	25	a	t
		1	20	75	t	-
		2	30	120	-	a
		10	10	80	a	g
name2	222	chr	start	stop	ref	alt
		1	10	25	a	g
name3	333333	chr	start	stop	ref	alt
		1	20	75	c	-
		2	30	120	-	t
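The original attempt wrote one _edit.txt file per input file rather than a single file.out. The same column lookup can be run per file. A minimal sketch, assuming .txt inputs and the same layout assumptions as the awk script above (tab-separated, title lines start with a non-empty first field, all five titles present); edit_each is a made-up name and the awk program is a restatement of the idea, not the script from this post:

```shell
# Sketch: run the column-picking logic once per input file and write
# <name>_edit.txt next to each input.
edit_each() {
    dir=${1:-.}
    for f in "$dir"/*.txt; do
        [ -e "$f" ] || continue                    # the glob matched nothing
        case $f in *_edit.txt) continue ;; esac    # skip output of earlier runs
        awk '
            BEGIN { FS = OFS = "\t"; n = split("chr,start,stop,ref,alt", want, ",") }
            $1 != "" {                   # title line: remember column positions
                for (i = 3; i <= NF; i++) pos[$i] = i
            }
            {
                out = $1 OFS $2          # first two fields are always kept
                for (k = 1; k <= n; k++) out = out OFS $(pos[want[k]])
                print out
            }
        ' "$f" > "${f%.txt}_edit.txt"
    done
}
```

Running `edit_each path/to/folder` should leave name_edit.txt beside each name.txt.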
