Search files in directory for keywords using bash

01-11-2016


I have ~100 text files in a directory that I am trying to parse, writing the result to a new file. I am looking for the column titles chr, start, stop, ref, alt in each of the files. Those fields should appear somewhere in every file. The first two fields of each new set of rows should also be printed. Since this is on a Windows OS, I used "path\to\folder" in the bash script.

Example of files to search (each is a separate file):

Code:

name1	1111	chr	start	stop	ref	alt	comment		
		1	10	25	a	t	snp		
		1	20	75	t	-	del		
		2	30	120	-	a	ins		
		10	10	80	a	g	snp		
name2	222	id	chr	start	stop	ref	alt	comment	
		1111	1	10	25	a	g	snp	
name3	333333	id	symbol	chr	start	stop	ref	alt	comment
		222	name	1	20	75	c	-	del
		222	name	2	30	120	-	t	ins

desired output

Code:

name1	1111	chr	start	stop	ref	alt
		1	10	25	a	t
		1	20	75	t	-
		2	30	120	-	a
		10	10	80	a	g
name2	222	chr	start	stop	ref	alt
		1	10	25	a	g
name3	333333	chr	start	stop	ref	alt
		1	20	75	c	-
		2	30	120	-	t

Thank you


Bash I tried:

Code:

for f in "C:\Users/test\Desktop\file\folder*.txt" ; do
        bname=${f##*/}
        pref=${bname%%.bam}
        awk "/chr/{found=1}/start/{if(found)/stop/{if(found)/ref/{if(found)/alt/{if(found) $f print > ${pref}_edit.txt
done

cmccabe


01-11-2016


Is this a free-form input set of files? Are those spaces? Fixed-width columns? Can columns run into each other? For example, could you have:

Code:

start  stop
5500056000

where start is 55000 and stop is 56000

Just trying to clear up the unknowns...

Are column titles arbitrary, can the columns be in any order?

(I'm sure I could think of more things to ask)

cjcox


01-11-2016


The inputs are Excel xlsx files that I converted to text in VBA, so they should all be tab-separated. The input is 133 individual text files with the column titles in random order. In some it will be chr,start,stop,ref,alt, in others id,chr,start,stop,ref,alt, and in others name,symbol,id,chr,start,stop,ref,alt. Does this help? Thank you.

cmccabe


01-11-2016


Are the lines with column titles always lines that begin with a non-whitespace character (e.g. name1)?

cjcox


01-11-2016


Theoretically, you could have saved them as CSV (Comma Separated Values).
However, the IFS here seems to be NL/tab.

Your awk code is invalid.

Code:

1 ~/tmp $ LC_ALL=C sh ccmbade.sh 
awk: cmd. line:1: /chr/{found=1}/start/{if(found)/stop/{if(found)/ref/{if(found)/alt/{if(found) ccmbade.dat print > ccmbade.dat_edit.txt
awk: cmd. line:1:                                      ^ syntax error
awk: cmd. line:1: /chr/{found=1}/start/{if(found)/stop/{if(found)/ref/{if(found)/alt/{if(found) ccmbade.dat print > ccmbade.dat_edit.txt
awk: cmd. line:1:                                                     ^ syntax error
awk: cmd. line:1: /chr/{found=1}/start/{if(found)/stop/{if(found)/ref/{if(found)/alt/{if(found) ccmbade.dat print > ccmbade.dat_edit.txt
awk: cmd. line:1:                                                                    ^ syntax error
awk: cmd. line:1: /chr/{found=1}/start/{if(found)/stop/{if(found)/ref/{if(found)/alt/{if(found) ccmbade.dat print > ccmbade.dat_edit.txt
awk: cmd. line:1:                                                                                      ^ syntax error
awk: cmd. line:1: /chr/{found=1}/start/{if(found)/stop/{if(found)/ref/{if(found)/alt/{if(found) ccmbade.dat print > ccmbade.dat_edit.txt
awk: cmd. line:1:                                                                                                          ^ syntax error
awk: cmd. line:1: /chr/{found=1}/start/{if(found)/stop/{if(found)/ref/{if(found)/alt/{if(found) ccmbade.dat print > ccmbade.dat_edit.txt
awk: cmd. line:1:                                                                                                                       ^ unexpected newline or end of string
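For reference, here is a minimal sketch of a syntactically valid version of that loop. It only repairs the syntax: it prints every line that contains all five words and does not yet pick out the columns. The scratch directory and the sample.txt file are stand-ins I made up for the Windows folder, and I am assuming the extension to strip is .txt rather than .bam.

```shell
#!/bin/sh
# Hypothetical scratch directory standing in for the real folder.
dir=$(mktemp -d)
printf 'name1\t1111\tchr\tstart\tstop\tref\talt\tcomment\n\t\t1\t10\t25\ta\tt\tsnp\n' > "$dir/sample.txt"

# Quote the directory part, but leave the * outside the quotes so the shell expands it.
for f in "$dir"/*.txt; do
    bname=${f##*/}        # strip the leading path
    pref=${bname%.txt}    # strip the extension (assumed .txt, not .bam)
    # Single quotes keep the shell away from the awk program text;
    # the input file is passed as a separate argument after the program.
    awk '/chr/ && /start/ && /stop/ && /ref/ && /alt/' "$f" > "$dir/${pref}_edit.txt"
done

cat "$dir/sample_edit.txt"
```

The two changes that matter are the quoting (the program in single quotes, the file name outside it) and combining the patterns with && instead of nesting unfinished if statements.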

Having 3 different kinds of 'columns' doesn't really help.

You know, you don't have to use awk; you could use regular shell scripting.
If that is easier for you, that is.

That said (and it applies to me too), here is something to get you started:

Code:

for f in *.dat ; do
	bname=${f##*/}
	#pref=${bname%%.bam}	## dont have that
	#awk "/chr/{found=1}/start/{if(found)/stop/{if(found)/ref/{if(found)/alt/{if(found) $f print > ${pref}_edit.txt"
	while read content_line
	do
		if echo "$content_line" | grep -q ^name
		then	
			MODE="default"	# Reset parse mode
			echo "$content_line" | grep -v symbol | grep -q id && MODE=id
			echo "$content_line" | grep -q symbol && MODE=symbol
		fi
		
		case $MODE in
		default)	while read chr start stop ref alt comment;do
					line_print="$chr $start $stop $ref $alt $commet"
				done<<<"$content_line" ##>> ccmcbabe.output
				;;
		id)		echo "id handling"	;;
		symbol)		echo "symbol handling"	;;
		esac
		
		echo "$MODE :: $line_print"
	done < "$f"
done

hth
EDIT:
Which then outputs as:

Code:

sh ccmbade.sh 
default :: name1 1111 chr start stop 
default :: 1 10 25 a t 
default :: 1 20 75 t - 
default :: 2 30 120 - a 
default :: 10 10 80 a g 
id handling
id :: 10 10 80 a g 
id handling
id :: 10 10 80 a g 
symbol handling
symbol :: 10 10 80 a g 
symbol handling
symbol :: 10 10 80 a g 
symbol handling
symbol :: 10 10 80 a g 
0 ~/tmp $
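Note that the empty leading columns are gone in the output above. A small sketch of why, assuming the rows really are tab-separated: tab counts as IFS whitespace, so read strips leading tabs and the values shift left, while awk with FS set to a single tab keeps the empty fields in place. The variable names are only for illustration.

```shell
#!/bin/sh
# A data row like the sample: two empty leading fields, then the values.
row=$(printf '\t\t1\t10\t25\ta\tt\tsnp')

# read: leading tabs are IFS whitespace and get stripped,
# so the first variable receives "1" instead of an empty string.
read_first=$(printf '%s\n' "$row" | { read -r first rest; printf '%s' "$first"; })

# awk with a single-tab FS preserves empty fields: the same value is field 3.
awk_fields=$(printf '%s\n' "$row" | awk -F'\t' '{ print NF ":" $1 ":" $3 }')

echo "read, first variable: $read_first"
echo "awk, NF:\$1:\$3: $awk_fields"
```

So if the position of a column matters, a plain while read loop loses that information unless the titles are tracked some other way.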

Last edited by sea; 01-11-2016 at 05:11 PM..

sea


01-11-2016


Yes. If the two files below were used, name1,123 would be file 1 and name2,1234 would be file 2. Does this help? Thank you.

Code:

 name1 123  chr  start  stop  ref  alt 
                      1    10     20    a    -
                      1    30    150  -     aaaa
 name2  1234 chr   start  stop  ref  alt
                       2     220   250   t     c

cmccabe


01-11-2016


Another way you could try:

Code:

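BEGIN {
  FS=OFS="\t"                                # tab-separated input and output
  header="chr,start,stop,ref,alt"            # titles to keep, in output order
  n=split("x,x," header,H,",")               # H[3..n] = wanted titles; H[1], H[2] pad the two leading fields
}
{
  split($0,F)                                # save the original fields
  if($1!="") for(i=3; i<=NF; i++) O[$i]=i    # title row: remember the column of each title
  $0=x                                       # clear the record (x is unset, so empty)
  $1=F[1]; $2=F[2]                           # always keep the first two fields
  for(i=3; i<=n; i++) $i=F[O[H[i]]]          # copy the wanted columns, looked up by title
  print
}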

Code:

cd path/to/folder
awk -f /path2/to/script *.txt > file.out

There might be too many files for a single command line. In that case you could:

Code:

for i in *.txt
do
  cat "$i"
done |
awk -f /path2/to/script > file.out

--
Output with sample:

Code:

name1	1111	chr	start	stop	ref	alt
		1	10	25	a	t
		1	20	75	t	-
		2	30	120	-	a
		10	10	80	a	g
name2	222	chr	start	stop	ref	alt
		1	10	25	a	g
name3	333333	chr	start	stop	ref	alt
		1	20	75	c	-
		2	30	120	-	t
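If one output file per input is wanted instead (as the original attempt with _edit.txt suggests), the same title lookup could be run inside a loop. This is only a sketch under assumptions: the scratch directory and the files a.txt and b.txt are made up for the demonstration, the array names are my own, and the sub() call is there only in case the VBA export left Windows line endings.

```shell
#!/bin/sh
# Hypothetical scratch directory with two small inputs modelled on the sample.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf 'name1\t1111\tchr\tstart\tstop\tref\talt\tcomment\n\t\t1\t10\t25\ta\tt\tsnp\n' > a.txt
printf 'name3\t333333\tid\tsymbol\tchr\tstart\tstop\tref\talt\tcomment\n\t\t222\tname\t1\t20\t75\tc\t-\tdel\n' > b.txt

for f in *.txt; do
  case $f in *_edit.txt) continue ;; esac    # skip results left by an earlier run
  awk '
    BEGIN { FS = OFS = "\t"; n = split("chr,start,stop,ref,alt", want, ",") }
    { sub(/\r$/, "") }                       # assumption: input may carry CRLF line endings
    $1 != "" {                               # title row: record the column of each title
      split("", col)                         # forget the previous title row
      for (i = 3; i <= NF; i++) col[$i] = i
    }
    {
      out = $1 OFS $2                        # always keep the first two fields
      for (k = 1; k <= n; k++) {
        c = col[want[k]]
        out = out OFS (c ? $c : "")          # empty if a title was never seen
      }
      print out
    }
  ' "$f" > "${f%.txt}_edit.txt"
done

cat a_edit.txt b_edit.txt
```

Each input then gets its own result next to it, for example a.txt gives a_edit.txt.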

Scrutinizer

