Search a column and return a set of words


 
# 1  
Old 09-21-2015

Hi

I have two files. INPUT.txt is a text file of sentences, and SEARCH.txt has two or three columns, assumed to be separated by tabs. I need help writing a script that takes each sentence from INPUT.txt in consecutive sets of five words (color-coded in the original post: blue for set one, green for set two, red for set three, and so on) and searches the second column of SEARCH.txt for each set.

A five-word entry in the second column of SEARCH.txt qualifies if it matches at least four of the five words in the input set. Among the qualifying entries, the script should return the one whose value in the first column is the smallest (e.g. treating -2.922845 as bigger than -2.927181). The search is carried out for each set of five words. If fewer than five words remain in the sentence, the search must stop.

Format of INPUT.txt file.
Code:
hai wafam cherol makha palli adubu madu ma yaakhidre haikhre tamlakle .
mahak aroiba yaahip tankhi hai machagi matamda saramba gatetu kaikhere mahakkisu aroiba yaahip tankhi hai  haikhre .

Format of SEARCH.txt file.

Code:
-0.9725326      arna thamlamba nongchup santhong gani -0.014587925
-0.9777407      tainaba amanba yamna uningdraba  -0.014587925
-0.9700631      aeroplane adu indira parktara ama     -0.014587925
-1.2438936      mahakki aroiba yaahip tankhi hai -0.014587925
-0.97742474     aroiba yaahip tankhi hai hairi    -0.014587925
-1.391722       hai wafam cherolna makha palli     -0.6328273
-2.922845       hai wafam cherolduna makha palli -0.1190167
-2.915667       hai wafam cherolsina makha palli  -0.5702463
-2.927181       hai wafam paochena makha palli  -0.1963889
-2.925497       hai wafam khangnaduna   -0.6328273
-2.855543       hai wafam ngasigi 
-2.926619       hai wafam thamkharabani
-1.635051       hai wafam thamlamle    -0.4567362
-1.078001       hai wafam thamlamli    -0.8960688
-1.023442       adubu madu makhada yaakhidre haikhre -0.1234433
-1.432234       adubu madu makha yaakhidre haikhre  -0.5432345
-1.1278934      changangei air fieldda hongdok pikhraga   -0.014587925
-0.9567379      nupa machagi matamda saramba gatetu     -0.014587925
-0.5984392      machagi matamda saramba gatetu kaire       -0.014587925
-1.250842       leiriba aduda santri khara thamkhre        -0.014587925

The expected format of OUTPUT.txt is given below.


Code:
hai wafam paochena makha palli adubu madu makha yaakhidre haikhre tamlakle.
mahakki aroiba yaahip tankhi hai nupa machagi matamda saramba gatetu mahakki aroiba yaahip tankhi hai haikhre

Thanks in advance.

Last edited by my_Perl; 09-21-2015 at 10:47 PM.. Reason: Editing
# 2  
Old 09-22-2015
Try
Code:
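awk '
# First file (SEARCH.txt, tab-separated): keep only five-word patterns,
# keyed by the pattern text, with the first-column value stored as the score.
NR == FNR {
        if (5 == split($2, T, " ")) PAT[$2] = $1
        next
}

# Second file (INPUT.txt, space-separated): walk each sentence five words at a time.
{
        for (j = 0; j < NF; j += 5) {
                TMP = ""
                MIN = 1E100
                for (p in PAT) {
                        # Count how many words of this set also occur in pattern p.
                        CNT = 0
                        split(p, X, " ")
                        for (i = 1 + j; i <= 5 + j; i++)
                                for (k = 1; k <= 5; k++)
                                        if ($i == X[k]) CNT++
                        # Keep the pattern with at least four hits and the smallest score.
                        if (CNT >= 4 && PAT[p] < MIN) {
                                MIN = PAT[p]
                                TMP = p
                        }
                }
                # Print the best pattern, or the original five words if none matched.
                if (TMP) printf "%s ", TMP
                else     printf "%s %s %s %s %s ", $(j+1), $(j+2), $(j+3), $(j+4), $(j+5)
        }
        printf "\n"
}
' FS="\t" OFS="\t" SEARCH.txt  FS=" " INPUT.txt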
hai wafam paochena makha palli adubu madu makha yaakhidre haikhre tamlakle .    
mahakki aroiba yaahip tankhi hai nupa machagi matamda saramba gatetu mahakki aroiba yaahip tankhi hai haikhre .

This User Gave Thanks to RudiC For This Post:
# 3  
Old 09-27-2015
Hi

I tried running this awk script. It worked fine for a small SEARCH.txt, but with a large one of 10 million lines (tuples) I am unable to get any output. Please advise me how to proceed. Thanks in advance.
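A possible direction for the large-file case, offered as a sketch and not as a fix confirmed in this thread: the script in post #2 loops over every stored pattern for every five-word set, so the work per set grows with the size of SEARCH.txt. The variant below builds a word-to-pattern index while reading SEARCH.txt and then scores only the patterns that share at least one word with the current set. The function name `match_sets` and the arrays `IDX`, `VAL`, `CNT` and `ID` are hypothetical names chosen for illustration. It assumes SEARCH.txt is tab-separated, and it copies a trailing set of fewer than five words through unchanged without searching, as post #1 asks. It has not been tried on 10 million lines, and the index itself needs a good deal of memory.

```shell
# Hypothetical wrapper; usage: match_sets SEARCH.txt INPUT.txt
match_sets() {
    awk '
    # Pass 1 (SEARCH.txt): store each five-word pattern and index it by word.
    NR == FNR {
        if (split($2, W, " ") == 5) {
            n++
            PAT[n] = $2
            VAL[n] = $1
            # One index entry per word occurrence, so the hit counts below
            # equal those of the nested loops in post #2.
            for (k = 1; k <= 5; k++) IDX[W[k]] = IDX[W[k]] " " n
        }
        next
    }
    # Pass 2 (INPUT.txt): score only patterns that share a word with the set.
    {
        out = ""
        for (j = 0; j < NF; j += 5) {
            if (j + 5 > NF) {
                # Fewer than five words left: copy them and stop searching.
                for (i = j + 1; i <= NF; i++) out = out (out == "" ? "" : " ") $i
                break
            }
            split("", CNT)                      # clear the per-set hit counters
            for (i = j + 1; i <= j + 5; i++) {
                if (!($i in IDX)) continue
                m = split(IDX[$i], ID, " ")
                for (k = 1; k <= m; k++) CNT[ID[k]]++
            }
            best = ""
            min = 1E100
            for (id in CNT)
                if (CNT[id] >= 4 && VAL[id] + 0 < min) {
                    min = VAL[id] + 0
                    best = id
                }
            if (best != "") seg = PAT[best]
            else {
                # No pattern qualified: keep the original five words.
                seg = $(j + 1)
                for (i = j + 2; i <= j + 5; i++) seg = seg " " $i
            }
            out = out (out == "" ? "" : " ") seg
        }
        print out
    }
    ' FS='\t' "$1" FS=' ' "$2"
}
```

Unlike the script in post #2, this sketch keeps every SEARCH.txt line as its own entry instead of overwriting duplicates, and it prints no trailing space. If memory is the limiting factor, filtering SEARCH.txt down to the words that actually occur in INPUT.txt before indexing may be worth trying.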
# 4  
Old 10-06-2015
Hi

I need help writing the regular expression for the case where the separator between the first and second columns can take either of two forms:

(1) one blank space followed by a tab in some lines, and
(2) a tab followed by one blank space in other lines.


Thanks in advance.
# 5  
Old 10-06-2015
How about adding sub (/ \t|\t /, "\t"); (that is, <space><TAB> or <TAB><space> replaced by a single <TAB>) before splitting $2 of SEARCH.txt?
This User Gave Thanks to RudiC For This Post:
# 6  
Old 10-06-2015
Hi


I did

Code:
awk ' NR==FNR  {sub (/| /, "\t");  if (5 == split ($2, T, " ")) PAT[$2]=$1
         next
}

Please correct me if I am wrong.

Last edited by my_Perl; 10-06-2015 at 01:49 PM.. Reason: Editing
# 7  
Old 10-06-2015
Did you use <space><TAB>|<TAB><space> in the sub call?

That said, I found that even though this replaces your field separator patterns with single <TAB>s and fixes fields 1 and/or 2 by removing the stray spaces, it doesn't change the operation of the script dramatically.
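To check the separator fix from posts #5 to #7 in isolation, a minimal sketch follows. It assumes the two separator forms described in post #4, and the helper name `second_column` is hypothetical. It prints only the second column after the substitution, which makes it easy to see that no stray space is left attached to the pattern.

```shell
# Hypothetical helper: print column 2 of a tab-separated file after
# collapsing "<space><TAB>" or "<TAB><space>" into a single TAB.
second_column() {
    awk -F '\t' '
    {
        # sub() without a target edits the whole record, and awk then
        # re-splits the fields using the current field separator.
        sub(/ \t|\t /, "\t")
        print $2
    }' "$@"
}

# Usage: second_column SEARCH.txt | head
```

In the script from post #2 the same `sub` call would go at the start of the `NR==FNR` block, before the `split` of `$2`.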
This User Gave Thanks to RudiC For This Post: