Request to check: remove duplicates only in first column


 
# 8  
Old 07-25-2012
Request to check

Thanks for the help!

But there are still some errors in the output.


The data is mixed up between columns: there is no clear indication of separation, even though there was previously.
I just have to remove duplicates in the first column. I don't have to change anything else, not even a single space.

Quote:
Muscarinic acetylcholine receptor Bethanechol DAP000263 Urinary retention Approved
Trospium DAP000342 Spasm Approved
Oxyphencyclimine DAP000835 Gastrointestinal disorders Approved
Tridihexethyl DAP000836 Acquired nystagmus Approved
Anisotropine Methylbromide DAP000837 Peptic ulcer disease Approved
Hyoscyamine DAP001108 Gastrointestinal disorders Approved
Methantheline DAP001109 Irritable bowel syndrome Approved
Procyclidine DAP001110 Parkinson's disease Approved
Cyclopentolate DAP001111 Pediatric eye examinations Approved
Ipratropium DAP001112 Obstructive lung diseases Approved
Pilocarpine DAP001113 Glaucoma Approved
Flavoxate DAP001114 Muscle Relaxant Approved
Mepenzolate DAP001115 Peptic ulcer disease Approved
Ispaghula DAP001486 Irritable bowel syndrome Approved
Mebeverine DAP001494 Irritable bowel syndrome Approved
Trihexyphenidyl HCl DAP001532 Parkinson's Disease Approved
Aclidinium bromide DCL000677 Chronic obstructive pulmonary disease Phase III
CHF 5407 DCL000750 Chronic obstructive pulmonary disease Phase I
GSK233705 DCL000823 Chronic obstructive pulmonary disease Phase II completed
NVA237 DCL000901 Chronic obstructive pulmonary disease Phase III
Org-23366 DCL000911 Schizophrenia No development reported
OrM3 DCL000913 Chronic obstructive pulmonary disease Phase IIb
M1 Pirenzepine DAP000492 Peptic ulcer disease Approved
Glycopyrrolate DAP001116 Anesthetic Approved
Clidinium DAP001117 Abdominal/stomach pain Approved
Dicyclomine DAP001118 Irritable bowel syndrome Approved
Ethopropazine DAP001119 Parkinson's disease Approved
Cycrimine DAP001120 Parkinson's disease Approved
Benztropine DAP001121 Parkinson's disease Approved
Trihexyphenidyl DAP001122 Parkinson's disease Approved
Propantheline DAP001123 Excessive sweating (hyperhidrosis) Approved
Oxyphenonium DAP001124 Spasm Approved
Biperiden DAP001125 Parkinson's disease Approved
Talsaclidine isomer DCL000268 Alzheimer's disease Discontinued
Sabcomeline hydrochloride DCL000279 Cardiovascular diseases Phase IIa
fumarate DCL000303 Alzheimer's disease Discontinued
Xanomeline tartrate DCL000328 Alzheimer's disease Phase II
GSK573719 DCL000381 Chronic Obstructive Pulmonary Disease (COPD) Phase II
GSK961081 DCL000397 Chronic Obstructive Pulmonary Disease (COPD) Phase II completed
GSK1034702 DCL000402 Schizophrenia, Dementia Phase I completed
Darotropium DCL000514 COPD Suspended in Phase II in GSK 2009 Report
+ 642444 DCL000515 COPD Phase III
Revatropate DCL000957 Chronic obstructive pulmo
# 9  
Old 07-25-2012
Code:
#!/usr/bin/python

import sys

if len(sys.argv) < 2:
        print "usage:",sys.argv[0],"<file_path>"
        sys.exit(69)

f = open(sys.argv[1], 'r')
lines = f.readlines()

count = 0
index = 0

for item in lines:
        if count != 0:
                left  = lines[count].split()    
                right = lines[count-1].split()

                while left[index] == right[index]:
                        index += 1

                print ' '.join(left[index:])
                index = 0
        else:
                print lines[count].rstrip() 

        count += 1

What about this?

To use this code, do the following:
  1. Copy and paste this into a file; I would recommend calling it duplicate.py.
  2. Run the command
    Code:
    chmod +x duplicate.py

  3. Then run the script
    Code:
    ./duplicate.py path_to_file


Edit: very basic error checking only; this was a rush job and I'm not very good with Python!

Last edited by shitson; 07-25-2012 at 09:36 AM..
# 10  
Old 07-26-2012
Request to check

Hi

Thanks for the reply.
But there are still errors: the data is mixed up and not in proper columns.

If the input is
Code:
Serine/threonine protein kinase 12    AZD1152    DCL000452    Myeloid Leukemia    Phase I/II
Serine/threonine protein kinase 12    AZD1152    DCL000452    Acute Myeloid Leukemia, Haematological malignancies    Phase II
Serine/threonine protein kinase 12    MK-5108    DCL000572    Cancer;  Neoplasms;  Tumors    Phase I
Serine/threonine protein kinase 12    TAK-901    DCL000657    Advanced malignancies    Phase I
Serine/threonine protein kinase 12    AT-9283    DCL001068    Adult solid tumours, NHL, AML, ALL, CML, MDS and myelofibrosis    Phase I/II
Serine/threonine protein kinase 12    CYC-116    DCL001070    Advanced solid tumours    Terminated in Phase I
Serine/threonine protein kinase 12    GSK1070916    DCL001072    Advanced solid tumours    Phase I
Serine/threonine protein kinase 12    PF-03814735    DCL001076    Advanced solid tumours    Phase I
Serine/threonine protein kinase 12    PHA-739358    DCL001078    CML that relapsed after imatinib or BCR-ABL-targeted therapy;   Metastatic Hormone Refractory Prostate Cancer (MHRPC)    Phase II
Serine/threonine protein kinase 12    VX-689    DCL001083    Cancer;  Neoplasms;  Tumors    Phase I
Phosphatidylinositol-4,5-bisphosphate 3-kinase catalytic subunit, gamma isoform    SF1126    DCL000228    Solid Tumors    Phase I
Phosphatidylinositol-4,5-bisphosphate 3-kinase catalytic subunit, gamma isoform    TG100-115    DCL000246    Angioedema, Myocardial infarction    Phase I/II
Phosphatidylinositol-4,5-bisphosphate 3-kinase catalytic subunit, gamma isoform    XL147    DCL000262    Endometrial Cancer    Phase II
Phosphatidylinositol-4,5-bisphosphate 3-kinase catalytic subunit, gamma isoform    XL765    DCL000264    Solid tumours;   non-small-cell lung cancer;   malignant gliomas    Phase I/II
Phosphatidylinositol-4,5-bisphosphate 3-kinase catalytic subunit, gamma isoform    AZD6482    DCL000476    Thrombosis    Phase II
Phosphatidylinositol-4,5-bisphosphate 3-kinase catalytic subunit, gamma isoform    LY294002    DCL000600    Cancer    Discontinued in Phase I
Phosphatidylinositol-4,5-bisphosphate 3-kinase catalytic subunit, gamma isoform    PI3K alpha    DCL000601    Cancer    Discontinued in Phase I
Phosphatidylinositol-4,5-bisphosphate 3-kinase catalytic subunit, gamma isoform    BEZ235    DCL001085    Advanced solid tumours;  Advanced breast cancer    Phase I/II
Phosphatidylinositol-4,5-bisphosphate 3-kinase catalytic subunit, gamma isoform    BGT226    DCL001086    Solid tumours;  Advanced breast cancer;  Cowden's syndrome    Phase I/II
Phosphatidylinositol-4,5-bisphosphate 3-kinase catalytic subunit, gamma isoform    BKM120    DCL001087    Metastatic Breast Cancer    Phase I/II
Phosphatidylinositol-4,5-bisphosphate 3-kinase catalytic subunit, gamma isoform    GDC0941    DCL001088    Advanced solid tumours;   non-Hodgkin's lymphoma    Phase I
Phosphatidylinositol-4,5-bisphosphate 3-kinase catalytic subunit, gamma isoform    GSK1059615    DCL001089    Advanced solid tumours;   metastatic breast cancer;   endometrial cancer;   lymphoma    Terminated in Phase I
Phosphatidylinositol-4,5-bisphosphate 3-kinase catalytic subunit, gamma isoform    PX-866    DCL001090    Advanced solid tumours    Phase I
Phosphatidylinositol-4,5-bisphosphate 3-kinase catalytic subunit, gamma isoform    CAL-101    DCL001091    Chronic lymphocytic leukaemia;   acute myeloid leukaemia;   non-Hodgkin's lymphoma    Phase I
Phosphatidylinositol-4,5-bisphosphate 3-kinase catalytic subunit, gamma isoform    GDC-0980    DCL001189    Advanced solid tumours, non-Hodgkin's lymphoma    Phase I
Hexokinase D    Lonidamine    DCL000153    Benign Prostatic Hyperplasia, Prostate Disorders    Terminated in Phase III
Hexokinase D    PSN-101    DCL000201    Diabetes Mellitus Type 1 and 2    Phase I
Hexokinase D    AZD1656    DCL000457    Type 2 Diabetes Mellitus    Phase II
Hexokinase D    AZD6370    DCL000475    Type 2 Diabetes    Phase I completed
Hexokinase D    R7201    DCL000614    Type 2 diabetes    Phase II
Hexokinase D    AZD5658    DCL001154    Obesity, Diabetes    Phase I
mRNA of Clusterin    OGX-011    DCL000186    Prostate Cancer, Breast Cancer, Lung Cancer    Phase III
Kinesin-like protein KIF11    ARRY-520    DCL000053    Cancer/Tumors    Phase I
Kinesin-like protein KIF11    Ispinesib    DCL000139    Pediatric    Phase I
Kinesin-like protein KIF11    Ispinesib    DCL000139    Head and Neck Cancer, Renal Cell Carcinoma, Ovarian Cancer, Solid Tumors    Phase II
Kinesin-like protein KIF11    Ispinesib    DCL000139    Lung Cancer    Phase II completed
Kinesin-like protein KIF11    SB-743921    DCL000224    Non-Hodgkin's Lymphoma, Cancer/Tumors    Phase I/II
Kinesin-like protein KIF11    4SC-205    DCL001129    Solid tumour and malignant lymphoma    Phase I
Neurotensin receptor type 1    CGX-1160    DCL000084    Acute or Chronic Pain    Phase I completed
Neurotensin receptor type 1    Meclinertant    DCL000163    Colorectal Cancer, Prostate Cancer, Schizophrenia, Schizoaffective Disorders, Psychosis, Depression, Lung Cancer    Discontinued in Phase III
Ribosomal protein S6 kinase    XL418    DCL000009    Solid Tumors    Suspended in Phase I
Interstitial collagenase    BMS 275291    DCL000003    Non-small Cell Lung Cancer, Hormone-refractory Prostate Cancer, Kaposi's Sarcoma    Discontinued in Phase III
Interstitial collagenase    Prinomastat    DCL000004    Brain Cancer    Discontinued in Phase III
Interstitial collagenase    Prinomastat    DCL000004    Lung Cancer, Prostate Cancer    Trial halted
Interstitial collagenase    Marimastat    DCL000005    Pancreatic Cancer, Lung Cancer    Discontinued in Phase III
Interstitial collagenase    BB-3644    DCL000014    Cancer/Tumors    Discontinued in Phase I
Interstitial collagenase    XL784    DCL001039    Diabetic nephropathy    Discontinued in Phase II
Interstitial collagenase    Batimastat    DPR000163    Cancers    Discontinued in Phase I
Integrin beta

the output I get is

Code:
AZD1152 DCL000452 Myeloid Leukemia Phase I/II
Acute Myeloid Leukemia, Haematological malignancies Phase II
MK-5108 DCL000572 Cancer; Neoplasms; Tumors Phase I
AT-9283 DCL001068 Adult solid tumours, NHL, AML, ALL, CML, MDS and myelofibrosis Phase I/II
CYC-116 DCL001070 Advanced solid tumours Terminated in Phase I
ENMD-2076 DCL001071 Ovarian Cancer, Fallopian Cancer, Peritoneal Cancer Phase II
PF-03814735 DCL001076 Advanced solid tumours Phase I
PHA-739358 DCL001078 CML that relapsed after imatinib or BCR-ABL-targeted therapy; Metastatic Hormone Refractory Prostate Cancer (MHRPC) Phase II
VX-689 DCL001083 Cancer; Neoplasms; Tumors Phase I
Toll-like receptor 3 HspE7 (TLR3 agonist adjuvant) DCL000129 Anal intraepithelial neoplasia Discontinued in Phase I/II
Human Papillomavirus (HPV) Infections Discontinued in Phase I/II
5-hydroxy-tryptamine 3B receptor Cilansetron DCL000087 Irritable Bowel Syndrome (IBS), Diarrhea Phase III, Positive phase III results
Serine/threonine-protein kinase Chk2 XL844 DCL000017 Advanced solid tumours or lymphoma Suspended in Phase I
Fibronectin AS1409 DCL000055 Kidney Cancer, Melanoma Phase I
Plasma kallikrein Ecallantide DCL000108 Hereditary angioedema Approved
Glucosylceramidase Isofagomine tartrate DCL000138 Metabolic Disease Phase II
Protein kinase C gamma type Midostaurin DCL000165 Breast & colorectal cancer Phase I
Colon, breast, CLL, AML, GIST, solid tumours & non-Hodgkin's lymphoma Phase II
Alpha-galactosidase A Migalastat DCL000166 Fabry Disease Phase III
Calcitonin gene-related peptide 1 Olcegepant DCL000187 Migraine and Cluster Headaches Discontinued in Phase I/II
Cizolirtine DCL000753 Neuropathic pain Phase II
Heat shock protein HSP 90 Alvespimycin hydrochloride DCL000035 Ovarian Cancer, Refractory Hematological Malignancies Phase I
Refractory acute myelogenous leukemia; HER2-positive Metastatic Breast Cancer and Leukaemia Terminated in Phase II
AT13387 DCL000057 Cancer/Tumors Phase I
CNF1010 DCL000089 Solid Tumors, Chronic Myelogenous Leukemia Terminated in Phase I
IPI-504 DCL000137 Gastrointestinal Stromal Tumors Phase I
Non-small Cell Lung Cancer Phase I/II
Solid Tumors Phase Ib
Prostate Cancer Phase II
SNX-5422 DCL000231 Hematological Malignancies Phase I
STA-9090 DCL000236 Solid Tumors Phase I
Tanespimycin DCL000242 Breast Cancer, Melanoma Phase II
Multiple Myeloma Suspended in Phase III
Cathepsin G Dermolastin DCL000019 Chronic Obstructive Pulmonary Disease Halted in Phase I
Atopic Dermatitis, Alpha 1 Antitrypsin Deficiency Phase II
Emphysema Halted in Phase I
Integrin alpha-5 JSM 6427 DCL000012 Macular Degeneration Phase I
mRNA of Myb proto-oncogene protein LR3001 DCL000154 Myeloid Leukemia Phase II
Lysosomal alpha-glucosidase Celgosivir DCL000082 Hepatitis C Phase II
Glucobay DCL000309 Diabetes Mellitus Type 2 Phase IV
Basic fibroblast growth factor receptor 1 FGF-1 DCL000113 Peripheral Vascular Disease, Ulcers Phase I
Severe Coronary Heart Disease Phase II
SU-6668 DCL000342 Advanced solid tumours Discontinued
Phospholipase A2, membrane associated Varespladib DCL000258 Coronary Artery Disease, Atherosclerosis Phase II
Interleukin-2 receptor subunit beta Medusa IL-2 DCL000164 Cancer/Tumors Phase I/II
Histidine decarboxylase BF-Derm1 DCL000066 Skin Infections/Disorders Phase II
Tissue kallikrein Dermolastin DCL000019 Chronic Obstructive Pulmonary Disease Halted in Phase I
Atopic Dermatitis, Alpha 1 Antitrypsin Deficiency Phase II
Emphysema Halted in Phase I
Atrial natriuretic peptide receptor B CD-NP DCL000081 Myocardial infarction, Heart Disease Phase Ia
P-glycoprotein LY335979 DCL000157 Acute Myeloid Leukemia Phase III completed
Integrin beta-7 RhuMAb Beta7 DCL000622 Ulcerative colitis Phase I
Vedolizmab DCL000662 Ulcerative colitis, Crohn's d

There is also one error message after all the entries are printed:


Code:
MSX-122 DCL000173 Late-stage Solid Tumors Suspended in Phase I
KRH-2731 DPR000144 HIV Infection Preclinical
C-C chemokine receptor type 2 CCX915 DCL000080 Multiple Sclerosis Phase I
INCB3284 DCL000135 Rheumatoid Arthritis Discontinued in Phase II
Obese Insulin-resistant Subjects Discontinued in Phase IIa
INCB8696 DCL000546 Multiple scierosis Phase I
INCB-3284 DCL000845 Rheumatoid arthritis Discontinued in Phase I
MLN1202 DCL000883 Multiple Sclerosis Phase II completed
Metastatic Cancer; Unspecified Adult Solid Tumor, Protocol Specific Phase II
MCP-1 DPR000072 Rheumatoid arthritis Preclinical
RS-504393 DPR000102 Chronic obstructive pulmonary disease Preclinical
Traceback (most recent call last):
  File "./duplicate.py", line 20, in ?
    while left[index] == right[index]:
IndexError: list index out of range
bash-3.2$ 

Kindly check it.

Last edited by Scrutinizer; 07-26-2012 at 08:36 AM..
# 11  
Old 07-26-2012
Hmm, I think the problem here is that the sample data we are using to build our code is not formatted the same way as the raw data you are parsing. Can you please upload your data sets?
# 12  
Old 07-26-2012
Request to check

Hmm,

here is the attached dataset!
Mani
# 13  
Old 07-26-2012
Yep, as I thought, it's tab-delimited.
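
To illustrate why that matters, here is a small sketch using one row from the sample input above. It assumes the columns in the real file are joined by single tab characters (the forum shows them as runs of spaces). Calling split() with no argument breaks on every run of whitespace, so multi-word fields such as "Hexokinase D" fall apart, and rejoining with single spaces loses the column boundaries. The IndexError comes from the same loop: if two consecutive lines match on every token of the shorter one, index runs past the end of the list.

```python
# One row from the sample input, with columns joined by tabs
# (assumed; the forum rendering shows runs of spaces instead).
row = "Hexokinase D\tAZD1656\tDCL000457\tType 2 Diabetes Mellitus\tPhase II"

# split() with no argument breaks on every run of whitespace,
# so multi-word fields are cut into separate tokens.
by_whitespace = row.split()

# split("\t") keeps each column in one piece.
by_tab = row.split("\t")

print(len(by_whitespace), by_whitespace[:3])
print(len(by_tab), by_tab[:3])
```

Comparing only the first tab-separated field, rather than walking token by token, avoids both problems.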
# 14  
Old 07-26-2012
Request to check

Can something be done?
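
One possible approach, given as a rough sketch rather than a tested fix for the attached dataset: treat everything before the first tab as the first column, and blank it out whenever it repeats the first column of the line directly above. The rest of the line, including the tab, is passed through untouched, so the other columns keep their spacing. The function name blank_repeated_first_column is made up for this example, and the sketch assumes duplicates are always on consecutive lines, as they are in the sample input.

```python
def blank_repeated_first_column(lines, sep="\t"):
    """Yield each line, emptying its first column when it repeats the
    first column of the previous line. Text after the first separator
    is never modified."""
    previous_key = None
    for line in lines:
        key, found_sep, rest = line.partition(sep)
        if not found_sep:
            # No separator (blank or truncated line): pass through as is.
            previous_key = None
            yield line
        elif key == previous_key:
            # Keep the separator so the remaining columns stay aligned.
            yield found_sep + rest
        else:
            previous_key = key
            yield line


# Three shortened rows from the sample input, tab-separated (assumed).
sample = [
    "Hexokinase D\tLonidamine\tDCL000153\n",
    "Hexokinase D\tPSN-101\tDCL000201\n",
    "mRNA of Clusterin\tOGX-011\tDCL000186\n",
]
result = list(blank_repeated_first_column(sample))
for line in result:
    print(line, end="")
```

To run it on a real file, an open file object can be passed in place of sample, writing each yielded line to standard output.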