Word matching and write other data


 
# 1  
Old 08-07-2012

Hi all,

I have 7 words

Code:
CAD
CD
HT
RA
T1D
T2D
BD

I also have one file that contains data in a large number of rows and columns.

From the 2nd column onwards, each column may contain one or more of the 7 words above, i.e. the words can appear in columns 2, 3, 4, 5, 6, 7, 8, 9, 10 and so on.

It's a big file in terms of columns, not rows.

Our input is:

Code:
CYP1B1  (PA27094)	paclitaxel (PA450761) RA (PA27094)	docetaxel (PA449383) RA (PA27094)	 RA (PA27094)	capecitabine (PA448771);cisplatin (PA449014);docetaxel (PA449383);epirubicin (PA449476);gemcitabine (PA449748) RA (PA27094)	capecitabine (PA448771);cisplatin (PA449014);docetaxel (PA449383);epirubicin (PA449476);gemcitabine (PA449748) RA								
HLA-DRA  (PA35071)	 T1D,T1D,T1D,T1D,T1D,T1D,T1D,T1D,T1D,T1D,T1D,RA,RA,RA,RA,RA,RA,RA,RA,RA,RA,RA,RA,RA												
ESR1  (PA156)	 T2D,BD (PA156)	conjugated estrogens (PA164754789);medroxyprogesterone (PA450344) T2D,BD (PA156)	Alkylating Agents (PA164712331);cisplatin (PA449014) T2D,BD (PA156)	tamoxifen (PA451581) T2D,BD									
HTR1A  (PA192)	paroxetine (PA450801);sertraline (PA451333) CAD,CD (PA192)	antidepressants (PA452229) CAD,CD (PA192)	antidepressants (PA452229) CAD,CD										
HTR1B  (PA29549)	paroxetine (PA450801) CD (PA29549)	clomipramine (PA449048);liothyronine (PA164778866);Lithium (PA164712869);nefazodone (PA450603);venlafaxine (PA451866) CD											
CHST3  (PA26503)	docetaxel (PA449383);thalidomide (PA451644) T2D,T2D (PA26503)	docetaxel (PA449383);thalidomide (PA451644) T2D,T2D (PA26503)	docetaxel (PA449383);thalidomide (PA451644) T2D,T2D										
HTR6  (PA29560)	atorvastatin (PA448500);pravastatin (PA451089);simvastatin (PA451363) T1D												
HTR7  (PA29561)	atorvastatin (PA448500);pravastatin (PA451089);simvastatin (PA451363) HT,HT (PA29561)	atorvastatin (PA448500);pravastatin (PA451089);simvastatin (PA451363) HT,HT (PA29561)	atorvastatin (PA448500);pravastatin (PA451089);simvastatin (PA451363) HT,HT										
ALDH3A1  (PA24697)	carboplatin (PA448803);cyclophosphamide (PA449165);thiotepa (PA451668) BD,BD												
ALDH3A2  (PA24698),SLC47A2 (PA162403847)	 BD,BD												
DRD1  (PA147)	bupropion (PA448687);nicotine (PA450626) HT,HT,HT,HT (PA147)	 HT,HT,HT,HT (PA147)	 HT,HT,HT,HT (PA147)	drotrecogin alfa (PA131548935) HT,HT,HT,HT									
NCF4  (PA31465)	doxorubicin (PA449412) RA

I want the output to be 7 files, one per word (HT, T1D, T2D and so on), each holding the data for that word.

For example, the RA file should contain:

Code:
CYP1B1  (PA27094)	paclitaxel (PA450761)  (PA27094)	docetaxel (PA449383)  capecitabine (PA448771);cisplatin (PA449014);docetaxel (PA449383);epirubicin (PA449476);gemcitabine (PA449748) 	capecitabine (PA448771);cisplatin (PA449014);docetaxel (PA449383);epirubicin (PA449476);gemcitabine (PA449748)

and the other entries in the same way, and likewise for the other 6 output files.


Moderator's Comments:
Mod Comment Please use code tags next time for your code and data, NOT quote tags... thanks.

Last edited by zaxxon; 08-07-2012 at 08:33 AM.. Reason: code tags
# 2  
Old 08-07-2012
Please specify how you need the output.
For example, if a column contains any of the 7 words, should all the columns before that column be printed, or something else?
# 3  
Old 08-07-2012
Yes, all the columns have to be printed.

As in this RA output, all the columns are printed, and neither the word itself nor the ID from the first column should be repeated in any other column. In this output the first column contains

CYP1B1 (PA27094)

and, compared to the input, the ID (PA27094) is not repeated in any other column, and neither is RA.

Code:
CYP1B1  (PA27094)	paclitaxel (PA450761)  docetaxel (PA449383)  capecitabine (PA448771);cisplatin (PA449014);docetaxel (PA449383);epirubicin (PA449476);gemcitabine (PA449748) 	capecitabine (PA448771);cisplatin (PA449014);docetaxel (PA449383);epirubicin (PA449476);gemcitabine (PA449748)


If possible, the same column should not be repeated either. Here in the output epirubicin (PA449476) is repeated, which I don't want, but I can manage with this.
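
A minimal sketch of the clean-up described above, assuming the data is tab-separated and that the last token of column 1 is the ID in parentheses (neither is confirmed in the thread). It drops the repeated ID from columns 2 onwards and skips columns that end up empty or that duplicate an earlier column. The file name dedup_demo.tsv and the sample row (a shortened CYP1B1 row with RA already removed) are made up for illustration.

```shell
# Hypothetical sample row: tab-separated, first-column ID repeated in later columns.
printf 'CYP1B1  (PA27094)\tpaclitaxel (PA450761) (PA27094)\tdocetaxel (PA449383) (PA27094)\t (PA27094)\tdocetaxel (PA449383)\n' |
awk '
BEGIN { FS = OFS = "\t" }
{
    n = split($1, head, " ")
    id = head[n]                 # assumption: last token of column 1 is "(ID)"
    line = $1
    split("", seen)              # forget the columns seen in the previous row
    for (j = 2; j <= NF; j++) {
        m = split($j, tok, " ")
        col = ""
        for (t = 1; t <= m; t++)         # keep every token except the repeated ID
            if (tok[t] != id)
                col = (col == "" ? tok[t] : col " " tok[t])
        if (col != "" && !(col in seen)) {   # skip empty and duplicate columns
            seen[col] = 1
            line = line OFS col
        }
    }
    print line
}' > dedup_demo.tsv

cat dedup_demo.tsv
```

On the sample row this keeps the first column plus the paclitaxel and docetaxel columns, each once. Rows whose first column holds two IDs (such as the ALDH3A2/SLC47A2 row) would need a different rule for picking the ID.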

Last edited by zaxxon; 08-07-2012 at 08:52 AM.. Reason: code tags, not quote tags. start reading PMs and comments. At 20 points you will be set to read only.
# 4  
Old 08-08-2012
Code:
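awk '
BEGIN {
	# Load the word list: a[] marks known words, b[] keeps them in file order.
	i = 1
	while ((getline < "7wordfile") > 0) {	# "> 0" also stops if the file is missing
		a[$1]++
		b[i++] = $1
	}
	close("7wordfile")
}
{
	for (j = 1; j <= NF; j++) {
		# A field may hold several comma-separated words, e.g. "T2D,BD".
		f = 0
		n = split($j, c, ",")
		for (k = 1; k <= n; k++)
			if (a[c[k]]) {
				a[c[k]] = 2	# flag: this word occurs in the current row
				f = 1
			}
		if (f)
			$j = ""		# blank the field; awk rebuilds $0 with single spaces
	}
	# Write the edited row to one file per flagged word, then reset the flag.
	for (k = 1; k < i; k++)
		if (a[b[k]] == 2) {
			a[b[k]] = 1
			print > b[k]
		}
}' inputfile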

For the output, see the files named after the 7 words in your current directory,
for example RA, T1D etc.
# 5  
Old 08-08-2012
Thanks a lot, it's working well, but the result does not keep the spacing between columns that the input has.

For example, the T2D output file shown below does not have the column spacing of the input file. Otherwise everything is fine. Please check it once.

Quote:
ESR1 (PA156) leflunomide (PA450192) (PA156) leflunomide (PA450192)
CHST3 (PA26503) docetaxel (PA449383);thalidomide (PA451644) (PA26503) docetaxel (PA449383);thalidomide (PA451644) (PA26503) docetaxel (PA449383);thalidomide (PA451644) (PA26503) docetaxel (PA449383);thalidomide (PA451644) (PA26503) docetaxel (PA449383);thalidomide (PA451644) (PA26503) docetaxel (PA449383);thalidomide (PA451644) (PA26503) docetaxel (PA449383);thalidomide (PA451644)
LPL (PA232) fenofibrate (PA449594)
GALNT14 (PA134920089) cisplatin (PA449014);fluorouracil (PA128406956);mitoxantrone (PA450526) (PA134920089) cisplatin (PA449014);fluorouracil (PA128406956);mitoxantrone (PA450526) (PA134920089) cisplatin (PA449014);fluorouracil (PA128406956);mitoxantrone (PA450526) (PA134920089) cisplatin (PA449014);fluorouracil (PA128406956);mitoxantrone (PA450526) (PA134920089) cisplatin (PA449014);fluorouracil (PA128406956);mitoxantrone (PA450526) (PA134920089) cisplatin (PA449014);fluorouracil (PA128406956);mitoxantrone (PA450526) (PA134920089) cisplatin (PA449014);fluorouracil (PA128406956);mitoxantrone (PA450526)
CTLA4 (PA27006) glatiramer acetate (PA449760)
CYP1A2 (PA27093) clozapine (PA449061) (PA27093) clozapine (PA449061) (PA27093) paroxetine (PA450801) (PA27093) leflunomide (PA450192) (PA27093) leflunomide (PA450192) (PA27093) leflunomide (PA450192) (PA27093) theophylline (PA451647) (PA27093) caffeine (PA448710) (PA27093) caffeine (PA448710) (PA27093) clozapine (PA449061) (PA27093) caffeine (PA448710) (PA27093) caffeine (PA448710) (PA27093) caffeine (PA448710) (PA27093) caffeine (PA448710) (PA27093) warfarin (PA451906) (PA27093) warfarin (PA451906) (PA27093) warfarin (PA451906)
Below is my input, which contains the data properly in columns:

Code:
CYP1B1  (PA27094)	paclitaxel (PA450761) RA (PA27094)	docetaxel (PA449383) RA (PA27094)	 RA (PA27094)	capecitabine (PA448771);cisplatin (PA449014);docetaxel (PA449383);epirubicin (PA449476);gemcitabine (PA449748) RA (PA27094)	capecitabine (PA448771);cisplatin (PA449014);docetaxel (PA449383);epirubicin (PA449476);gemcitabine (PA449748) RA
HLA-DRA  (PA35071)	 T1D,T1D,T1D,T1D,T1D,T1D,T1D,T1D,T1D,T1D,T1D,RA,RA,RA,RA,RA,RA,RA,RA,RA,RA,RA,RA,RA												
ESR1  (PA156)	 T2D,BD (PA156)	conjugated estrogens (PA164754789);medroxyprogesterone (PA450344) T2D,BD (PA156)	Alkylating Agents (PA164712331);cisplatin (PA449014) T2D,BD (PA156)	tamoxifen (PA451581) T2D,BD									
HTR1A  (PA192)	paroxetine (PA450801);sertraline (PA451333) CAD,CD (PA192)	antidepressants (PA452229) CAD,CD (PA192)	antidepressants (PA452229) CAD,CD										
HTR1B  (PA29549)	paroxetine (PA450801) CD (PA29549)	clomipramine (PA449048);liothyronine (PA164778866);Lithium (PA164712869);nefazodone (PA450603);venlafaxine (PA451866) CD											
CHST3  (PA26503)	docetaxel (PA449383);thalidomide (PA451644) T2D,T2D (PA26503)	docetaxel (PA449383);thalidomide (PA451644) T2D,T2D (PA26503)	docetaxel (PA449383);thalidomide (PA451644) T2D,T2D										
HTR6  (PA29560)	atorvastatin (PA448500);pravastatin (PA451089);simvastatin (PA451363) T1D												
HTR7  (PA29561)	atorvastatin (PA448500);pravastatin (PA451089);simvastatin (PA451363) HT,HT (PA29561)	atorvastatin (PA448500);pravastatin (PA451089);simvastatin (PA451363) HT,HT (PA29561)	atorvastatin (PA448500);pravastatin (PA451089);simvastatin (PA451363) HT,HT										
ALDH3A1  (PA24697)	carboplatin (PA448803);cyclophosphamide (PA449165);thiotepa (PA451668) BD,BD												
ALDH3A2  (PA24698),SLC47A2 (PA162403847)	 BD,BD												
DRD1  (PA147)	bupropion (PA448687);nicotine (PA450626) HT,HT,HT,HT (PA147)	 HT,HT,HT,HT (PA147)	 HT,HT,HT,HT (PA147)	drotrecogin alfa (PA131548935) HT,HT,HT,HT									
NCF4  (PA31465)	doxorubicin (PA449412) RA
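
The lost spacing is consistent with how awk treats fields: by default it splits on any run of whitespace, and assigning to a field makes it rebuild the line with single spaces (the default OFS). A small sketch of the effect, using a made-up row and hypothetical file names (ofs_demo.tsv, ofs_default.out, ofs_tab.out):

```shell
# Hypothetical tab-separated row with spaces inside the columns.
printf 'G1  (ID1)\tdrugA (ID2) RA\tdrugB (ID3)\n' > ofs_demo.tsv

# Default splitting: touching any field rebuilds $0 with single spaces,
# so tabs and double spaces are lost.
awk '{ $NF = $NF; print }' ofs_demo.tsv > ofs_default.out

# Tab as both input and output separator: the same assignment keeps the layout.
awk 'BEGIN { FS = OFS = "\t" } { $NF = $NF; print }' ofs_demo.tsv > ofs_tab.out

cat ofs_default.out ofs_tab.out
```

The first output line has every column boundary flattened to one space, while the second is identical to the input row.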


Last edited by manigrover; 08-08-2012 at 08:45 AM..
# 6  
Old 08-09-2012
How should this look in the HT file?
Code:
DRD1  (PA147)	bupropion (PA448687);nicotine (PA450626) HT,HT,HT,HT (PA147)	 HT,HT,HT,HT (PA147)	 HT,HT,HT,HT (PA147)	drotrecogin alfa (PA131548935) HT,HT,HT,HT

# 7  
Old 08-09-2012
The columns can be space-separated, or whatever you require.
Please let me know how you want it.
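
One possible way to keep the column layout, offered as a sketch only and not as the thread's accepted answer: treat the tab as both input and output separator, and remove the listed words token by token inside each column instead of blanking the whole column. It assumes the input really is tab-separated and that a word token is either a single word or a comma list made only of listed words (such as T2D,BD). The directory split_demo and the file names words.txt and input.tsv are hypothetical.

```shell
mkdir -p split_demo
printf '%s\n' CAD CD HT RA T1D T2D BD > split_demo/words.txt
{
    printf 'NCF4  (PA31465)\tdoxorubicin (PA449412) RA\n'
    printf 'ESR1  (PA156)\t T2D,BD (PA156)\ttamoxifen (PA451581) T2D,BD\n'
} > split_demo/input.tsv

awk -v outdir=split_demo '
BEGIN { FS = OFS = "\t" }
NR == FNR { if ($1 != "") word[$1] = 1; next }   # first file: the word list
{
    split("", hit)                        # words found in the current row
    for (j = 2; j <= NF; j++) {           # column 1 is left untouched
        n = split($j, tok, " ")
        kept = ""
        for (t = 1; t <= n; t++) {
            m = split(tok[t], part, ",")
            known = 0
            for (p = 1; p <= m; p++)
                if (part[p] in word) known++
            if (known == m) {             # token consists only of listed words
                for (p = 1; p <= m; p++) hit[part[p]] = 1
            } else {
                kept = (kept == "" ? tok[t] : kept " " tok[t])
            }
        }
        $j = kept                         # $0 is rebuilt with tabs (OFS)
    }
    for (w in hit) print > (outdir "/" w ".tsv")
}' split_demo/words.txt split_demo/input.tsv
```

With the two sample rows, split_demo/RA.tsv receives the NCF4 row, and the ESR1 row goes to both T2D.tsv and BD.tsv with the tabs between columns intact. The repeated (PA156) is deliberately left alone in this sketch.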