Script to find string based on pattern and search for its corresponding rows in column


# 1  

Experts,
I need your help with this awk script.

We have a single input file with two comma-separated columns. All lookups are done within this one file (column 1 and column 2), but the result has to be written to a separate output file.
First, find the rows whose column 1 string contains 9K. For each match (for example BGL_0BC_901_1AG_A_CASR9KTR176), pick the corresponding column 2 values from those rows.
If there are 5 such rows, take each column 2 value in turn, search for it in column 1, pick up the column 2 values of the rows where it is found, and build the output file in the format below.
Every line of the output file should start with the column 1 string containing 9K.

Code:
BGL_0BC_901_1AG_A_CASR9KTR176,BGL_CHT_903_1AC_B_CASR920R879,BGL_BAM_912_2AC_B_CASR920R879,BGL_BAM_912_1AC_B_CASR920R879,BGL_BAM_907_1AC_B_CASR920R879

For example :

Suppose a string in column 1 contains 9K, e.g. we find BGL_0BC_901_1AG_A_CASR9KTR176 in column 1. Now pick its corresponding row value (BGL_KMR_919_1AC_B_CASR920R899) from column 2 and look for that value in column 1.

Code:
column1,column2
BGL_0BC_901_1AG_A_CASR9KTR176,BGL_KMR_919_1AC_B_CASR920R899
BGL_0BC_901_1AG_A_CASR9KTR176,BGL_CHT_903_1AC_B_CASR920R879
BGL_0BC_901_1AG_A_CASR9KTR176,BGL_0UT_901_1AC_CASR903R551
BGL_0BC_901_1AG_A_CASR9KTR176,BGL_YOT_919_1AC_CASR903R458
BGL_0BC_901_1AG_A_CASR9KTR176,BGL_0BC_901_1AC_T_CASR920R504
BGL_CHT_903_1AC_B_CASR920R879,BGL_BAM_910_1AC_B_CASR920R879
BGL_BAM_910_1AC_B_CASR920R879,BGL_CHT_903_1AC_B_CASR920R879
BGL_BAM_910_1AC_B_CASR920R879,BGL_BAM_912_2AC_B_CASR920R879
BGL_BAM_912_2AC_B_CASR920R879,BGL_BAM_910_1AC_B_CASR920R879
BGL_BAM_912_2AC_B_CASR920R879,BGL_BAM_912_1AC_B_CASR920R879
BGL_BAM_912_1AC_B_CASR920R879,BGL_BAM_907_1AC_B_CASR920R879
BGL_BAM_912_1AC_B_CASR920R879,BGL_BAM_912_2AC_B_CASR920R879
BGL_2BC_901_1AG_A_CASR9KTR124,BGL_BGM_908_1AC_CASR903R173
BGL_2BC_901_1AG_A_CASR9KTR124,BGL_ABT_932_1AC_CASR903R963
BGL_2BC_901_1AG_A_CASR9KTR124,BGL_2BC_901_1AC_T_CASR920R948
BGL_2BC_901_1AG_A_CASR9KTR124,BGL_2BC_901_1AC_T_CASR920R948
BGL_2BC_901_1AG_A_CASR9KTR124,BGL_ABT_918_1AC_CASR903R963

If the value (BGL_KMR_919_1AC_B_CASR920R899) is found in column 1, pick its corresponding row value from column 2; if it is not found, write the result to the new file as below.

output file :
Code:
BGL_0BC_901_1AG_A_CASR9KTR176,BGL_KMR_919_1AC_B_CASR920R899

Continuing: start again from BGL_0BC_901_1AG_A_CASR9KTR176, take the second row value (BGL_CHT_903_1AC_B_CASR920R879) from column 2 and look for it in column 1. If found, pick its corresponding row value (BGL_BAM_910_1AC_B_CASR920R879) from column 2.

Code:
column1                            column 2
BGL_CHT_903_1AC_B_CASR920R879,BGL_BAM_910_1AC_B_CASR920R879

Now look for the value (BGL_BAM_910_1AC_B_CASR920R879) in column 1; if found, pick its corresponding row value (BGL_BAM_912_2AC_B_CASR920R879) from column 2. In this case we don't have to consider BGL_CHT_903_1AC_B_CASR920R879, as we have already looked it up in the step above.

Code:
column 1                          column 2
BGL_BAM_910_1AC_B_CASR920R879,BGL_CHT_903_1AC_B_CASR920R879
BGL_BAM_910_1AC_B_CASR920R879,BGL_BAM_912_2AC_B_CASR920R879

Now look for the row value picked from column 2 (BGL_BAM_912_2AC_B_CASR920R879) in column 1; if found, pick its corresponding column 2 value (BGL_BAM_912_1AC_B_CASR920R879). Here too we don't have to consider BGL_BAM_910_1AC_B_CASR920R879, as we have already looked it up in the step above.

Code:
BGL_BAM_912_2AC_B_CASR920R879,BGL_BAM_910_1AC_B_CASR920R879
BGL_BAM_912_2AC_B_CASR920R879,BGL_BAM_912_1AC_B_CASR920R879

Now look for the row value picked from column 2 (BGL_BAM_912_1AC_B_CASR920R879) in column 1; if found, pick its corresponding column 2 value (BGL_BAM_907_1AC_B_CASR920R879). Here too we don't have to consider BGL_BAM_912_2AC_B_CASR920R879, as we have already looked it up in the step above.

Code:
BGL_BAM_912_1AC_B_CASR920R879,BGL_BAM_907_1AC_B_CASR920R879
BGL_BAM_912_1AC_B_CASR920R879,BGL_BAM_912_2AC_B_CASR920R879

Now look for the row value picked from column 2 (BGL_BAM_907_1AC_B_CASR920R879) in column 1. If it has no corresponding row there, stop the search and append the result to the sample output file above, in the format below.

Code:
BGL_0BC_901_1AG_A_CASR9KTR176,BGL_KMR_919_1AC_B_CASR920R899
BGL_0BC_901_1AG_A_CASR9KTR176,BGL_CHT_903_1AC_B_CASR920R879,BGL_BAM_912_2AC_B_CASR920R879,BGL_BAM_912_1AC_B_CASR920R879,BGL_BAM_907_1AC_B_CASR920R879

Similarly, the lookup done for BGL_0BC_901_1AG_A_CASR9KTR176 has to be repeated for its other column 2 values throughout the CSV file, performing the same operation as above.

Similarly, the same operation has to be performed for all other strings in column 1 of this CSV file that contain 9K, such as BGL_2BC_901_1AG_A_CASR9KTR124.

This is the code I am trying, but I don't know how to complete it for the requirement above:

Code:
awk '
NR==FNR{
    assoc[$1]=$2
    next
}
FNR!=1{         
    printf "%s,%s", $1,$2
    seen[$1]; seen[$2]
    search=$2 
    while((search in assoc) && !(assoc[search] in seen)){
        search=assoc[search]
        printf ",%s", search
        seen[search]
    }
    print ""
    for(var in seen){ 
         delete seen[var]
    }
}' inputfile.csv inputfile.csv > output.csv

# 2  
Hi, the field separator (FS) does not seem to be set to a comma:
Code:
awk -F, '

Then I get this output with your script:

Code:
BGL_0BC_901_1AG_A_CASR9KTR176,BGL_KMR_919_1AC_B_CASR920R899
BGL_0BC_901_1AG_A_CASR9KTR176,BGL_CHT_903_1AC_B_CASR920R879,BGL_BAM_910_1AC_B_CASR920R879,BGL_BAM_912_2AC_B_CASR920R879,BGL_BAM_912_1AC_B_CASR920R879
BGL_0BC_901_1AG_A_CASR9KTR176,BGL_0UT_901_1AC_CASR903R551
BGL_0BC_901_1AG_A_CASR9KTR176,BGL_YOT_919_1AC_CASR903R458
BGL_0BC_901_1AG_A_CASR9KTR176,BGL_0BC_901_1AC_T_CASR920R504
BGL_CHT_903_1AC_B_CASR920R879,BGL_BAM_910_1AC_B_CASR920R879,BGL_BAM_912_2AC_B_CASR920R879,BGL_BAM_912_1AC_B_CASR920R879
BGL_BAM_910_1AC_B_CASR920R879,BGL_CHT_903_1AC_B_CASR920R879
BGL_BAM_910_1AC_B_CASR920R879,BGL_BAM_912_2AC_B_CASR920R879,BGL_BAM_912_1AC_B_CASR920R879
BGL_BAM_912_2AC_B_CASR920R879,BGL_BAM_910_1AC_B_CASR920R879
BGL_BAM_912_2AC_B_CASR920R879,BGL_BAM_912_1AC_B_CASR920R879
BGL_BAM_912_1AC_B_CASR920R879,BGL_BAM_907_1AC_B_CASR920R879
BGL_BAM_912_1AC_B_CASR920R879,BGL_BAM_912_2AC_B_CASR920R879
BGL_2BC_901_1AG_A_CASR9KTR124,BGL_BGM_908_1AC_CASR903R173
BGL_2BC_901_1AG_A_CASR9KTR124,BGL_ABT_932_1AC_CASR903R963
BGL_2BC_901_1AG_A_CASR9KTR124,BGL_2BC_901_1AC_T_CASR920R948
BGL_2BC_901_1AG_A_CASR9KTR124,BGL_2BC_901_1AC_T_CASR920R948
BGL_2BC_901_1AG_A_CASR9KTR124,BGL_ABT_918_1AC_CASR903R963
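
A quick way to see why the separator matters is to count fields with and without `-F,`. This is only a minimal sketch: the variable names (`line`, `default_nf`, `comma_nf`) are made up for illustration, and the sample line is taken from the shortened input used later in the thread.

```shell
# Hypothetical one-line sample; variable names are illustrative only.
line='9K1,A1'

# Default FS (whitespace): the whole line is one field, so $2 would be empty.
default_nf=$(printf '%s\n' "$line" | awk '{ print NF }')

# With -F, the same line splits into two fields.
comma_nf=$(printf '%s\n' "$line" | awk -F, '{ print NF }')

echo "default FS: $default_nf field(s), comma FS: $comma_nf field(s)"
```

Without `-F,` the script in post #1 would see the entire row as `$1`, so no lookup on `$2` could succeed.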

# 3  
Hi @Scrutinizer, the issue is that each line of the output file has to start with the string containing 9K, like below.

Code:
BGL_0BC_901_1AG_A_CASR9KTR176,BGL_KMR_919_1AC_B_CASR920R899
BGL_0BC_901_1AG_A_CASR9KTR176,BGL_CHT_903_1AC_B_CASR920R879,BGL_BAM_912_2AC_B_CASR920R879,BGL_BAM_912_1AC_B_CASR920R879,BGL_BAM_907_1AC_B_CASR920R879

For example, consider the sample input file below:

I need to find the rows whose column 1 string contains 9K and then pick the corresponding column 2 value. Suppose we check 9K1 and pick A1: look for A1 in column 1, and if it is not there, output the result as shown in the expected output below.
Then check 9K1 again and pick A2: look for A2 in column 1, and if found, pick its corresponding row value B2.
Now look for B2 in column 1 and pick C2 rather than A2, as A2 has already been considered in the lookup. Then look for C2 in column 1 and pick D2 rather than B2.
This has to be done for all rows whose column 1 contains 9K, as there can be rows with 9K2, 9K3, 9K4 in column 1 with corresponding data in column 2.

Sample Input File :
Code:
9K1,A1
9K1,A2
9K1,A3
9K1,A4 
9K1,A5 
A2,B2
B2,A2
B2,C2
C2,B2
C2,D2
A5,B5
B5,C5
B5,A5
9K1,A6
A6,B6
B6,A6
B6,C6

Output required :

Code:
9K1,A1
9K1,A2,B2,C2,D2
9K1,A3
9K1,A4
9K1,A5,B5,C5
9K1,A6,B6,C6

But below is the output returned by the code (which does not match the required output):

Code:
9K1,A3
9K1,A4
9K1,A5
9K1,A6,B6,C6


Last edited by as7951; 07-25-2019 at 11:18 PM..
# 4  
You should store all the $2 values for each $1 as a CSV list in assoc. Then use a recursive function to walk through assoc for each match:

Code:
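awk -F, '
# Print val and, recursively, every value reachable from it via assoc[],
# skipping anything already printed on this output line.
# n, i and newvals are local variables.
function prn_assoc(val,    n, i, newvals)
{
   if(val in seen) return
   seen[val]
   printf ",%s",val
   # split() uses FS (a comma here); walk by index to keep input order
   n = split(assoc[val], newvals)
   for(i = 1; i <= n; i++)
       prn_assoc(newvals[i])
}
# First pass: collect all $2 values for each $1 as a comma-separated list
NR==FNR{
    if($1 in assoc)
       assoc[$1]=assoc[$1] FS $2
    else assoc[$1]=$2
    next
}
# Second pass: start a chain from every row whose $1 contains 9K
$1~"9K" {
   printf "%s", $1
   split("", seen)      # clear seen[] for this output line
   seen[$1]
   prn_assoc($2)
   printf "\n"
}' inputfile.csv inputfile.csv > output.csv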


Last edited by Chubler_XL; 07-25-2019 at 05:34 PM.. Reason: simplify prn_assoc() and ensure $1 populated in seen[]
# 5  
@Chubler_XL, with your code I get the 6 rows below, but one row (9K1,A5,B5,C5) is missing; instead the output has a row with just 9K1,A5.
Could you please help to fix this?

Missing data :
Code:
9K1,A5,B5,C5

Output :
Code:
9K1,A1
9K1,A2,B2,C2,D2
9K1,A3
9K1,A4
9K1,A5
9K1,A6,B6,C6


Below is Required output :

Code:
9K1,A1
9K1,A2,B2,C2,D2
9K1,A3
9K1,A4
9K1,A5,B5,C5
9K1,A6,B6,C6


Last edited by as7951; 07-25-2019 at 11:26 PM..
# 6  
Rows 4 and 5 of your input file have a space following the 2nd field:

Code:
9K1,A1
9K1,A2
9K1,A3
9K1,A4<SPACE>
9K1,A5<SPACE>

This is why they don't match the keys later on, which causes your missing data.
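
If the input file cannot be fixed at the source, one option is to strip trailing whitespace before the lookup runs. The sketch below is an assumption about how that could be done, not something posted in the thread; `strip_trailing_ws` is a hypothetical helper name.

```shell
# strip_trailing_ws is a hypothetical helper, not from the thread.
# It removes trailing spaces, tabs and carriage returns from every line,
# so "9K1,A5 " becomes "9K1,A5" and matches the key "A5" later on.
strip_trailing_ws() {
    awk '{ sub(/[ \t\r]+$/, ""); print }' "$@"
}

# Example: two of the affected rows followed by a clean one.
printf '%s\n' '9K1,A4 ' '9K1,A5 ' 'A5,B5' | strip_trailing_ws
```

The cleaned output could be saved to a new file and passed to the lookup script twice in place of inputfile.csv. This only trims the end of each line, which is enough here because the stray space follows the last field.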
# 7  
Hi Chubler_XL,

Thank you very much..
Really appreciate your help.
Got the expected output.