Request to check: remove duplicates only in first column


 
# 15  
Old 07-26-2012
Yep, working on it now. I think I've found the problem: it seems you have some duplicate lines. I will post some code soon.

---------- Post updated at 10:26 PM ---------- Previous update was at 10:09 PM ----------

The data is already in order, so all you need to do is run the following code on the data set, and then my code should work. Give it a try.

Code:
cat <data_file.txt> | uniq > <output_file>.txt

# 16  
Old 07-28-2012
Hi, did this work for you?
# 17  
Old 08-01-2012
Hi,

Thanks for the reply.

It is still showing some errors, but thanks for your patience.

Code:
bash-3.2$ uniq -c sarattdnewdruggene.txt >sarattdnewdruggene3.txt
bash-3.2$ cat <sarattdnewdruggene.txt> | uniq > <sarattdnewdruggene4>.txt
bash: syntax error near unexpected token `|'
bash-3.2$ cat <sarattdnewdruggene.txt> | uniq > <sarattdnewdruggene4.txt>
bash: syntax error near unexpected token `|'
bash-3.2$ cat <sarattdnewdruggene.txt> | uniq | <sarattdnewdruggene4.txt>
bash: syntax error near unexpected token `|'
bash-3.2$ cat <sarattdnewdruggene.txt> | uniq  <sarattdnewdruggene4.txt>
bash: syntax error near unexpected token `|'
bash-3.2$ cat <sarattdnewdruggene.txt>  uniq  <sarattdnewdruggene4.txt>
bash: syntax error near unexpected token `newline'
bash-3.2$

# 18  
Old 08-01-2012
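The code below should work. The angle brackets in my earlier example were only placeholders; bash treats < and > as redirection operators, which is what caused the syntax errors.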

Code:
cat sarattdnewdruggene.txt | uniq > sarattdnewdruggene4.txt

# 19  
Old 08-01-2012
Hi,

Using this, the output is exactly the same as the input; nothing changes.

I have got one more file like that.

If I have input like this:

Quote:
CHRM1 P11229 Pirenzepine DAP000492 Peptic ulcer disease Approved T2D
CHRM1 P11229 Glycopyrrolate DAP001116 Anesthetic Approved T2D
CHRM1 P11229 Clidinium DAP001117 Abdominal/stomach pain Approved T2D
CHRM1 P11229 Dicyclomine DAP001118 Irritable bowel syndrome Approved T2D
CHRM1 P11229 Ethopropazine DAP001119 Parkinson's disease Approved T2D
CHRM1 P11229 Cycrimine DAP001120 Parkinson's disease Approved T2D
CHRM1 P11229 Benztropine DAP001121 Parkinson's disease Approved T2D
CHRM1 P11229 Trihexyphenidyl DAP001122 Parkinson's disease Approved T2D
CHRM1 P11229 Propantheline DAP001123 Excessive sweating (hyperhidrosis) Approved T2D
CHRM1 P11229 Oxyphenonium DAP001124 Spasm Approved T2D
CHRM1 P11229 Biperiden DAP001125 Parkinson's disease Approved T2D
CHRM1 P11229 Talsaclidine isomer DCL000268 Alzheimer's disease Discontinued T2D
CHRM1 P11229 Sabcomeline hydrochloride DCL000279 Cardiovascular diseases Phase IIa T2D
CHRM1 P11229 Talsaclidine fumarate DCL000303 Alzheimer's disease Discontinued T2D
CHRM1 P11229 Xanomeline tartrate DCL000328 Alzheimer's disease Phase II T2D
CHRM1 P11229 GSK573719 DCL000381 Chronic Obstructive Pulmonary Disease (COPD) Phase II T2D
CHRM1 P11229 GSK961081 DCL000397 Chronic Obstructive Pulmonary Disease (COPD) Phase II completed T2D
CHRM1 P11229 GSK1034702 DCL000402 Schizophrenia, Dementia Phase I completed T2D
CHRM1 P11229 Darotropium DCL000514 COPD Suspended in Phase II in GSK 2009 Report T2D
CHRM1 P11229 Darotropium + 642444 DCL000515 COPD Phase III T2D
CHRM1 P11229 Revatropate DCL000957 Chronic obstructive pulmonary disease Discontinued in Phase I T2D
FLT1 P17948 Sorafenib DAP000006 Advanced renal cell carcinoma Launched CAD
FLT1 P17948 Sorafenib DAP000006 Hepatocellular carcinoma, NSCLC, melanoma Phase III CAD
FLT1 P17948 Sorafenib DAP000006 Myelodyspalstic syndrome, AML, head & neck cancer, breast, colon, ovarian, pancreatic cancer Phase II CAD
FLT1 P17948 Ranibizumab DAP001260 Age-related macular degeneration Approved CAD
FLT1 P17948 Ranibizumab DAP001260 Diabetic macular edema and retinal vein occlusion Phase III CAD
FLT1 P17
And I want output in which only the repetition in the first column is removed. Here the second column moves to the left after quoting, but I don't want that; I want the second column to remain where it is, in the second column.

Quote:
CHRM1 P11229 Pirenzepine DAP000492 Peptic ulcer disease Approved T2D
P11229 Glycopyrrolate DAP001116 Anesthetic Approved T2D
P11229 Clidinium DAP001117 Abdominal/stomach pain Approved T2D
P11229 Dicyclomine DAP001118 Irritable bowel syndrome Approved T2D
P11229 Ethopropazine DAP001119 Parkinson's disease Approved T2D
P11229 Cycrimine DAP001120 Parkinson's disease Approved T2D
P11229 Benztropine DAP001121 Parkinson's disease Approved T2D
P11229 Trihexyphenidyl DAP001122 Parkinson's disease Approved T2D
P11229 Propantheline DAP001123 Excessive sweating (hyperhidrosis) Approved T2D
P11229 Oxyphenonium DAP001124 Spasm Approved T2D
P11229 Biperiden DAP001125 Parkinson's disease Approved T2D
P11229 Talsaclidine isomer DCL000268 Alzheimer's disease Discontinued T2D
P11229 Sabcomeline hydrochloride DCL000279 Cardiovascular diseases Phase IIa T2D
P11229 Talsaclidine fumarate DCL000303 Alzheimer's disease Discontinued T2D
P11229 Xanomeline tartrate DCL000328 Alzheimer's disease Phase II T2D
P11229 GSK573719 DCL000381 Chronic Obstructive Pulmonary Disease (COPD) Phase II T2D
P11229 GSK961081 DCL000397 Chronic Obstructive Pulmonary Disease (COPD) Phase II completed T2D
P11229 GSK1034702 DCL000402 Schizophrenia, Dementia Phase I completed T2D
P11229 Darotropium DCL000514 COPD Suspended in Phase II in GSK 2009 Report T2D
P11229 Darotropium + 642444 DCL000515 COPD Phase III T2D
P11229 Revatropate DCL000957 Chronic obstructive pulmonary disease Discontinued in Phase I T2D
FLT1 P17948 Sorafenib DAP000006 Advanced renal cell carcinoma Launched CAD
P17948 Sorafenib DAP000006 Hepatocellular carcinoma, NSCLC, melanoma Phase III CAD
P17948 Sorafenib DAP000006 Myelodyspalstic syndrome, AML, head & neck cancer, breast, colon, ovarian, pancreatic cancer Phase II CAD
P17948 Ranibizumab DAP001260 Age-related macular degeneration Approved CAD
P17948 Ranibizumab DAP001260 Diabetic macular edema and retinal vein occlusion Phase III CAD
P17
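
A note on why uniq left the file unchanged: uniq only drops a line when it is identical, in full, to the line directly before it. Every row in this data differs somewhere after the first column, so nothing gets removed. Here is a minimal Python 3 sketch of that comparison; the function name `uniq_lines` is made up for illustration and is not part of any tool.

```python
def uniq_lines(lines):
    """Mimic plain `uniq`: keep a line unless it equals the previous line exactly."""
    kept = []
    previous = None
    for line in lines:
        if line != previous:
            kept.append(line)
        previous = line
    return kept


sample = [
    "CHRM1 P11229 Pirenzepine DAP000492 Peptic ulcer disease Approved T2D",
    "CHRM1 P11229 Glycopyrrolate DAP001116 Anesthetic Approved T2D",
    "FLT1 P17948 Sorafenib DAP000006 Advanced renal cell carcinoma Launched CAD",
]

# No two adjacent lines are identical as whole lines, so nothing is dropped.
print(uniq_lines(sample) == sample)  # prints True
```

Removing only the repeated first-column value therefore needs a field-by-field comparison rather than a whole-line one.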
# 20  
Old 08-01-2012
Can you please post your data source?

With your test data

Code:
CHRM1 P11229 Pirenzepine DAP000492 Peptic ulcer disease Approved T2D
CHRM1 P11229 Glycopyrrolate DAP001116 Anesthetic Approved T2D
CHRM1 P11229 Clidinium DAP001117 Abdominal/stomach pain Approved T2D
CHRM1 P11229 Dicyclomine DAP001118 Irritable bowel syndrome Approved T2D
CHRM1 P11229 Ethopropazine DAP001119 Parkinson's disease Approved T2D
CHRM1 P11229 Cycrimine DAP001120 Parkinson's disease Approved T2D
CHRM1 P11229 Benztropine DAP001121 Parkinson's disease Approved T2D
CHRM1 P11229 Trihexyphenidyl DAP001122 Parkinson's disease Approved T2D
CHRM1 P11229 Propantheline DAP001123 Excessive sweating (hyperhidrosis) Approved T2D
CHRM1 P11229 Oxyphenonium DAP001124 Spasm Approved T2D
CHRM1 P11229 Biperiden DAP001125 Parkinson's disease Approved T2D
CHRM1 P11229 Talsaclidine isomer DCL000268 Alzheimer's disease Discontinued T2D
CHRM1 P11229 Sabcomeline hydrochloride DCL000279 Cardiovascular diseases Phase IIa T2D
CHRM1 P11229 Talsaclidine fumarate DCL000303 Alzheimer's disease Discontinued T2D
CHRM1 P11229 Xanomeline tartrate DCL000328 Alzheimer's disease Phase II T2D
CHRM1 P11229 GSK573719 DCL000381 Chronic Obstructive Pulmonary Disease (COPD) Phase II T2D
CHRM1 P11229 GSK961081 DCL000397 Chronic Obstructive Pulmonary Disease (COPD) Phase II completed T2D
CHRM1 P11229 GSK1034702 DCL000402 Schizophrenia, Dementia Phase I completed T2D
CHRM1 P11229 Darotropium DCL000514 COPD Suspended in Phase II in GSK 2009 Report T2D
CHRM1 P11229 Darotropium + 642444 DCL000515 COPD Phase III T2D
CHRM1 P11229 Revatropate DCL000957 Chronic obstructive pulmonary disease Discontinued in Phase I T2D
FLT1 P17948 Sorafenib DAP000006 Advanced renal cell carcinoma Launched CAD
FLT1 P17948 Sorafenib DAP000006 Hepatocellular carcinoma, NSCLC, melanoma Phase III CAD
FLT1 P17948 Sorafenib DAP000006 Myelodyspalstic syndrome, AML, head & neck cancer, breast, colon, ovarian, pancreatic cancer Phase II CAD
FLT1 P17948 Ranibizumab DAP001260 Age-related macular degeneration Approved CAD
FLT1 P17948 Ranibizumab DAP001260 Diabetic macular edema and retinal vein occlusion Phase III CAD

my code resulted in this

Code:
CHRM1 P11229 Pirenzepine DAP000492 Peptic ulcer disease Approved T2D
Glycopyrrolate DAP001116 Anesthetic Approved T2D
Clidinium DAP001117 Abdominal/stomach pain Approved T2D
Dicyclomine DAP001118 Irritable bowel syndrome Approved T2D
Ethopropazine DAP001119 Parkinson's disease Approved T2D
Cycrimine DAP001120 Parkinson's disease Approved T2D
Benztropine DAP001121 Parkinson's disease Approved T2D
Trihexyphenidyl DAP001122 Parkinson's disease Approved T2D
Propantheline DAP001123 Excessive sweating (hyperhidrosis) Approved T2D
Oxyphenonium DAP001124 Spasm Approved T2D
Biperiden DAP001125 Parkinson's disease Approved T2D
Talsaclidine isomer DCL000268 Alzheimer's disease Discontinued T2D
Sabcomeline hydrochloride DCL000279 Cardiovascular diseases Phase IIa T2D
Talsaclidine fumarate DCL000303 Alzheimer's disease Discontinued T2D
Xanomeline tartrate DCL000328 Alzheimer's disease Phase II T2D
GSK573719 DCL000381 Chronic Obstructive Pulmonary Disease (COPD) Phase II T2D
GSK961081 DCL000397 Chronic Obstructive Pulmonary Disease (COPD) Phase II completed T2D
GSK1034702 DCL000402 Schizophrenia, Dementia Phase I completed T2D
Darotropium DCL000514 COPD Suspended in Phase II in GSK 2009 Report T2D
+ 642444 DCL000515 COPD Phase III T2D
Revatropate DCL000957 Chronic obstructive pulmonary disease Discontinued in Phase I T2D
FLT1 P17948 Sorafenib DAP000006 Advanced renal cell carcinoma Launched CAD
Hepatocellular carcinoma, NSCLC, melanoma Phase III CAD
Myelodyspalstic syndrome, AML, head & neck cancer, breast, colon, ovarian, pancreatic cancer Phase II CAD
Ranibizumab DAP001260 Age-related macular degeneration Approved CAD
Diabetic macular edema and retinal vein occlusion Phase III CAD

using this code

Code:
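#!/usr/bin/env python3
# Print each line with the leading fields it shares with the previous line removed.

import sys

if len(sys.argv) < 2:
    print("usage:", sys.argv[0], "<file_path>")
    sys.exit(69)

with open(sys.argv[1], 'r') as f:
    lines = f.readlines()

for count, line in enumerate(lines):
    if count == 0:
        # The first line has nothing to compare against; print it unchanged.
        print(line.rstrip())
        continue

    left = line.split()
    right = lines[count - 1].split()

    # Skip every leading field that matches the previous line, not only the
    # first column. The bounds checks avoid an IndexError on identical lines.
    index = 0
    while index < len(left) and index < len(right) and left[index] == right[index]:
        index += 1

    print(' '.join(left[index:]))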


Last edited by shitson; 08-01-2012 at 06:47 AM..
# 21  
Old 08-01-2012
Thanks for your reply.

Actually, everything in your output is fine.

I just want this part not to move to the left-hand side; it should remain in place as it was in the input for CHRM1, and the same for FLT1. That's it. Kindly let me know whether I have to attach a sample of the data.

Quote:
Glycopyrrolate DAP001116 Anesthetic Approved T2D
Clidinium DAP001117 Abdominal/stomach pain Approved T2D
Dicyclomine DAP001118 Irritable bowel syndrome Approved T2D
Ethopropazine DAP001119 Parkinson's disease Approved T2D
Cycrimine DAP001120 Parkinson's disease Approved T2D
Benztropine DAP001121 Parkinson's disease Approved T2D
Trihexyphenidyl DAP001122 Parkinson's disease Approved T2D
Propantheline DAP001123 Excessive sweating (hyperhidrosis) Approved T2D
Oxyphenonium DAP001124 Spasm Approved T2D
Biperiden DAP001125 Parkinson's disease Approved T2D
Talsaclidine isomer DCL000268 Alzheimer's disease Discontinued T2D
Sabcomeline hydrochloride DCL000279 Cardiovascular diseases Phase IIa T2D
Talsaclidine fumarate DCL000303 Alzheimer's disease Discontinued T2D
Xanomeline tartrate DCL000328 Alzheimer's disease Phase II T2D
GSK573719 DCL000381 Chronic Obstructive Pulmonary Disease (COPD) Phase II T2D
GSK961081 DCL000397 Chronic Obstructive Pulmonary Disease (COPD) Phase II completed T2D
GSK1034702 DCL000402 Schizophrenia, Dementia Phase I completed T2D
Darotropium DCL000514 COPD Suspended in Phase II in GSK 2009 Report T2D
+ 642444 DCL000515 COPD Phase III T2D
Revatropate DCL000957 Chronic obstructive pulmonary disease Discontinued in Phase I T2D
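
One way to get the layout asked for here, sketched in Python 3 under a few assumptions: fields are separated by whitespace, the input is already grouped by the first column, and a repeated first-column value should be replaced by the same number of spaces so that the second column keeps its position. The function name `blank_repeated_first_column` is hypothetical, and this is a sketch, not the script posted earlier in the thread.

```python
def blank_repeated_first_column(lines):
    """Replace a first-column value that repeats the previous row's with spaces."""
    result = []
    previous_key = None
    for line in lines:
        line = line.rstrip("\n")
        parts = line.split(None, 1)  # first field, then the untouched remainder
        if not parts:
            # Blank line: pass it through and reset the comparison.
            result.append(line)
            previous_key = None
            continue
        key = parts[0]
        rest = parts[1] if len(parts) > 1 else ""
        if key == previous_key:
            # Pad to the width of the key so the second column stays in place.
            result.append((" " * len(key) + " " + rest).rstrip())
        else:
            result.append((key + " " + rest).rstrip())
        previous_key = key
    return result


sample = [
    "CHRM1 P11229 Pirenzepine DAP000492 Peptic ulcer disease Approved T2D",
    "CHRM1 P11229 Glycopyrrolate DAP001116 Anesthetic Approved T2D",
    "FLT1 P17948 Sorafenib DAP000006 Advanced renal cell carcinoma Launched CAD",
    "FLT1 P17948 Ranibizumab DAP001260 Age-related macular degeneration Approved CAD",
]

for row in blank_repeated_first_column(sample):
    print(row)
```

To run it on a real file, pass an open file object in place of `sample`, since file objects iterate line by line. Note that the sketch writes a single space after the first field, so a tab-delimited input would come out space-delimited after that column.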