Appending a column in xlsx file using Python

06-26-2017

Thank you! That makes sense. Now when I print dict_pos, it seems to have formed a dictionary (not posting all of it, as it's large):

Code:

/usr/bin/python2.7 /home/test/annotate.py
S12.xlsx
{'4300': '5', '3921': '1', '9072': '1', '16343': '1', '14007': '1', '13759': '1', '14911': '1', '14178': '1', '14179': '1', '16140': '1', '13359': '1', '4024': '1', '4025': '1'}

But when writing to the Excel worksheet, instead of writing the score, all values end up as 'Unknown_June2017'.

Also, is it possible to form a dictionary with multiple keys? For example, I need the first three columns in 'score.txt' to be associated with the score value, and that needs to be compared with columns 5, 6 and 7 of the worksheet.

Code:

dict_pos[x[0],[1],[2]] = x[3]

Last edited by nans; 06-26-2017 at 01:22 PM..

nans

06-26-2017

Quote:

Originally Posted by nans

...But when writing to the Excel worksheet, instead of writing the score, all values end up as 'Unknown_June2017'.

Also, is it possible to form a dictionary with multiple keys? For example, I need the first three columns in 'score.txt' to be associated with the score value, and that needs to be compared with columns 5, 6 and 7 of the worksheet.

Code:

dict_pos[x[0],[1],[2]] = x[3]

My hunch is that you are looking at the wrong column.
If your "pos_col_no" is "F" and "row_no" is 4, then the code will look at cells F4, F5, F6, F7, F8, .... and check if they are keys of dictionary "dpos".

Since you see 'Unknown_June2017' in cells V4, V5, V6, V7, V8, ... it means that the keys are not in column F but in some other column.
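
One quick way to test that hunch is to count, for each candidate column, how many of its values are actually keys of the dictionary. The sketch below is only an illustration: `cells` is a plain dict standing in for the worksheet, the data is made up, and `count_key_hits` is a hypothetical helper name. With openpyxl you would read `ws[ref].value` instead of `cells.get(ref)`.

```python
def count_key_hits(cells, scores, columns, first_row, last_row):
    """Return {column: number of cells in that column whose value is a key of scores}."""
    hits = {}
    for col in columns:
        n = 0
        for row in range(first_row, last_row + 1):
            value = cells.get(col + str(row))
            # Keys read from a text file are strings, so compare as strings.
            if value is not None and str(value) in scores:
                n += 1
        hits[col] = n
    return hits


# Hypothetical data: the positions live in column E, not in column F.
scores = {'4300': '5', '3921': '1', '9072': '1'}
cells = {
    'E4': 4300, 'F4': 'C',
    'E5': 3921, 'F5': 'T',
    'E6': 1234, 'F6': 'A',
}

hits = count_key_hits(cells, scores, ['E', 'F'], 4, 6)
print(hits)
```

The column with the most hits is the likely home of the keys; a column with zero hits would explain why every row falls through to 'Unknown_...'.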

Yes, it's possible to form a dictionary with multiple keys.
You can use a special Python data structure called a "tuple" for that.
A tuple is written with parentheses around its comma-separated elements, e.g.

Code:

('a', 'b', 'c')

is a tuple.
The code can work without parentheses (for the most part) but it's better to specify them in order to avoid ambiguity.
Like so:

Code:

>>> color_mix = {}
>>> color_mix['red', 'blue'] = 'purple'          # works without parentheses
>>> color_mix[('blue', 'yellow')] = 'green'      # although it's customary to use them
>>> color_mix['yellow', 'red'] = 'orange'
>>> for k in color_mix.keys():
...     print k
...
('blue', 'yellow')
('red', 'blue')
('yellow', 'red')

Comparing tuples is easy:

Code:

>>> tuple1 = ('cat', 'dog')
>>> tuple2 = ('dog', 'rat')
>>> tuple3 = ('cat', 'dog')
>>> tuple1 == tuple2
False
>>> tuple1 == tuple3
True
>>> ('hog', 'eel') == tuple1    # Works like this too
False

But this is where parentheses are important:

Code:

>>> 'hog', 'eel' == tuple3      # Nope! Not what you would expect!
('hog', False)
>>> ('hog', 'eel') == ('dog', 'rat')  # Use parentheses to avoid surprises
False
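
Applied to the scores file, the same idea might look like the sketch below. It assumes tab-separated lines with a header row and the columns in the order position, reference, alternate, score; that layout is a guess, and `build_score_dict` and the sample lines are purely illustrative.

```python
def build_score_dict(lines):
    """Map (pos, ref, alt) tuples to the score found in the fourth column."""
    scores = {}
    for i, line in enumerate(lines):
        if i == 0:
            continue  # skip the header row
        x = line.rstrip('\n').split('\t')
        if len(x) < 4:
            continue  # ignore blank or short lines
        # A tuple of the first three fields is the dictionary key.
        scores[(x[0], x[1], x[2])] = x[3]
    return scores


sample = [
    'POS\tREF\tALT\tSCORE\n',
    '497\tT\tC\t1\n',
    '2080\tA\tC\t1\n',
]
scores = build_score_dict(sample)
print(scores[('497', 'T', 'C')])
print(('497', 'C', 'T') in scores)   # order inside the tuple matters
```

Because the order of elements is part of a tuple's identity, the lookup key built from the worksheet has to list the three values in the same order as the key built from the text file.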

durden_tyler

06-27-2017

Thank you.
Yes, I was using the wrong column! Now this is my final code:

Code:

#!/usr/bin/python

import sys
import os
from openpyxl import load_workbook
from datetime import datetime
from pandas import read_table
import csv
from collections import namedtuple

# Variables
sheet_directory = r'/home/test'

# Subroutines
def get_text_data(txt_filename):
    dict_pos = {}
    Scores = namedtuple("Scores", ["POS", "ALT", "REF"])
    first_line = True
    with open('/home/test/scores.txt') as txt_filename:
        for line in txt_filename:
            if first_line:
                first_line = False
                continue
            line = line.rstrip('\n')
            x = line.split('\t')
            cpos = Scores(POS=x[0], ALT=x[2], REF=x[1])
            dict_pos[cpos] = x[3]
        print dict_pos          
        return dict_pos


def process_xl_sheets():
    for sheet_root, sheet_dirs, sheet_files in os.walk(sheet_directory):
        for sheet_file in sheet_files:
            if sheet_file.endswith('.xlsx'):
                print(sheet_file)
                dpos = get_text_data(sheet_file.replace('.xlsx', '.txt'))  ##what exactly is this part doing ? There is only one text file 'score.txt' to be referenced against several xlsx files named S12.xlsx , S13.xlsx etc
                sheet_xl_file = os.path.join(sheet_root, sheet_file)
                wb = load_workbook(sheet_xl_file, data_only=True)
                ws = wb.get_sheet_by_name('raw_data')
                pos_col_no = 'E'
                alt_col_no = 'G'
                ref_col_no = 'F'
                score_col_no = 'V'
                row_no = 4
                #compare = Scores(POS=pos_col_no, ALT=alt_col_no, REF=ref_col_no)
                #cell = ws[compare + str(row_no) ]
                cell = ws[pos_col_no + alt_col_no + ref_col_no + str(row_no)]
                print cell.value           ##doesn't print
                while cell.value:
                    if str(cell.value) in dpos:
                        ws[score_col_no + str(row_no)] = dpos[str(cell.value)]
                    else:
                        ws[score_col_no + str(row_no)] = 'Unknown_' + datetime.now().strftime("%B") + datetime.now().strftime("%Y")
                        row_no += 1
                        cell = ws[pos_col_no + str(row_no)]
                wb.save(sheet_xl_file)

# Main section
process_xl_sheets()

When I run it now, it doesn't print cell.value. What I'm attempting to do is make the code compare the three columns in the Excel file with the three columns in the text file, so that it can output the corresponding score.

Code:

/usr/bin/python2.7 /home/test/annotate.py
S12.xlsx
{Scores(POS='73', ALT='C', REF='CN'): 'A', Scores(POS='497', ALT='C', REF='T'): '1', Scores(POS='2196', ALT='T', REF='C'): '1', Scores(POS='2080', ALT='C', REF='A'): '1', Scores(POS='2456', ALT='C', REF='T'): '1'}
None

Last edited by nans; 06-27-2017 at 09:35 AM.. Reason: updated code

nans

06-27-2017

I haven't gone through the entire code yet, and it might be tricky to test your code since I don't have pandas, but let me answer your question about the function call.

Quote:

Originally Posted by nans

...

Code:

#!/usr/bin/python
...
...
# Subroutines
def get_text_data(txt_filename):
    dict_pos = {}
    Scores = namedtuple("Scores", ["POS", "ALT", "REF"])
    first_line = True
    with open('/home/test/scores.txt') as txt_filename:
        for line in txt_filename:
            if first_line:
                first_line = False
                continue
            line = line.rstrip('\n')
            x = line.split('\t')
            cpos = Scores(POS=x[0], ALT=x[2], REF=x[1])
            dict_pos[cpos] = x[3]
        print dict_pos          
        return dict_pos


def process_xl_sheets():
    for sheet_root, sheet_dirs, sheet_files in os.walk(sheet_directory):
        for sheet_file in sheet_files:
            if sheet_file.endswith('.xlsx'):
                print(sheet_file)
                dpos = get_text_data(sheet_file.replace('.xlsx', '.txt'))  ##what exactly is this part doing ? There is only one text file 'score.txt' to be referenced against several xlsx files named S12.xlsx , S13.xlsx etc
                sheet_xl_file = os.path.join(sheet_root, sheet_file)
...
...
# Main section
process_xl_sheets()

...
...

The replace() method returns a copy of the string sheet_file in which '.xlsx' has been replaced with '.txt'; sheet_file itself is not modified.
The string variable sheet_file holds the name of your Excel file.
So, let's say while looping through the sheet_directory, your Python program finds an Excel file called "S12.xlsx". Then sheet_file will equal "S12.xlsx".

Thereafter, this expression:

Code:

sheet_file.replace('.xlsx', '.txt')

replaces '.xlsx' with '.txt' and thereby returns 'S12.txt'.

And then this value 'S12.txt' is passed to the function get_text_data().
That is, the value of the string parameter txt_filename is 'S12.txt'.

You can see this very quickly by printing txt_filename the moment you enter the function.
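
For instance, this small sketch shows what the expression returns; strings are immutable in Python, so sheet_file is left untouched. The `os.path.splitext` lines are an extra shown for comparison only, not something from the posted script.

```python
import os

sheet_file = 'S12.xlsx'
txt_name = sheet_file.replace('.xlsx', '.txt')
print(txt_name)        # the new string
print(sheet_file)      # the original is unchanged

# Alternative that only touches the extension at the end of the name
base, ext = os.path.splitext(sheet_file)
alt_name = base + '.txt'
print(alt_name)
```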

In the "with" statement inside the function "get_text_data", however, you reuse the same name txt_filename. That rebinds txt_filename from the string parameter to a file object.

Thereafter, till the end of the function "get_text_data", txt_filename remains a file object.
So essentially, you are not using the txt_filename parameter in your function at all.

My suggestion: don't pass a parameter to a function if you are not using it at all. You already have the text file name ("scores.txt") and its location hard-coded anyway.
If something is not needed, discard it. Keep it simple.
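
If you would rather keep the parameter than drop it, one option is to give the file object its own name so the path argument is not overwritten. The sketch below is an illustration under assumptions: `load_scores`, the column order (position, reference, alternate, score) and the throwaway demo file are all made up for the example.

```python
import os
import tempfile


def load_scores(path):
    """Read a tab-separated scores file into a dict keyed by (pos, ref, alt)."""
    scores = {}
    with open(path) as fh:          # 'fh' is the file object; 'path' stays a string
        next(fh, None)              # skip the header line
        for line in fh:
            x = line.rstrip('\n').split('\t')
            if len(x) >= 4:
                scores[(x[0], x[1], x[2])] = x[3]
    return scores


# Demo with a temporary file so the sketch runs anywhere.
tmp = tempfile.NamedTemporaryFile(mode='w', suffix='.txt', delete=False)
tmp.write('POS\tREF\tALT\tSCORE\n497\tT\tC\t1\n')
tmp.close()

dpos = load_scores(tmp.name)        # one call is enough for all the .xlsx files
print(dpos)
os.remove(tmp.name)
```

Either way, the function now does what its signature says: with no parameter it reads the hard-coded file, with a parameter it reads whatever path it is given.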

durden_tyler

06-27-2017

Ah no worries.

If I remove that, then what would be the best way to proceed further?

Code:

#dpos = get_text_data(sheet_file.replace('.xlsx', '.txt'))
                sheet_xl_file = os.path.join(sheet_root, sheet_file)
                wb = load_workbook(sheet_xl_file, data_only=True)
                ws = wb.get_sheet_by_name('Unannotated')
                pos_col_no = 'E'
                alt_col_no = 'G'
                ref_col_no = 'F'
                score_col_no = 'V'
                row_no = 4
                #compare = Scores(POS=pos_col_no, ALT=alt_col_no, REF=ref_col_no)
                #cell = ws[compare + str(row_no) ]
                cell = ws[pos_col_no + alt_col_no + ref_col_no + str(row_no)]
                print cell.value
                while cell.value:
                    if str(cell.value) in dpos:
                        ws[score_col_no + str(row_no)] = dpos[str(cell.value)]
                    else:
                        ws[score_col_no + str(row_no)] = 'Unknown_' + datetime.now().strftime("%B") + datetime.now().strftime("%Y")
                        row_no += 1
                        cell = ws[pos_col_no + str(row_no)]

nans

06-27-2017

Quote:

Originally Posted by nans

...
If I remove that, then what would be the best way to proceed further
...

Code:

#dpos = get_text_data(sheet_file.replace('.xlsx', '.txt'))
                sheet_xl_file = os.path.join(sheet_root, sheet_file)
                wb = load_workbook(sheet_xl_file, data_only=True)
                ws = wb.get_sheet_by_name('Unannotated')
                pos_col_no = 'E'
                alt_col_no = 'G'
                ref_col_no = 'F'
                score_col_no = 'V'
                row_no = 4
                #compare = Scores(POS=pos_col_no, ALT=alt_col_no, REF=ref_col_no)
                #cell = ws[compare + str(row_no) ]
                cell = ws[pos_col_no + alt_col_no + ref_col_no + str(row_no)]
                print cell.value
                while cell.value:
                    if str(cell.value) in dpos:
                        ws[score_col_no + str(row_no)] = dpos[str(cell.value)]
                    else:
                        ws[score_col_no + str(row_no)] = 'Unknown_' + datetime.now().strftime("%B") + datetime.now().strftime("%Y")
                        row_no += 1
                        cell = ws[pos_col_no + str(row_no)]

1) Remove the parameter, not the function call. In other words, do call the function, but don't pass any parameter to it.

2) Change the signature of the function so that it does not accept any parameter.

3) Call the function only once. Calling the same function that reads the same text file and returns the same dictionary every time you find an Excel spreadsheet is extremely inefficient.

4) Check what you are trying to print.

Code:

pos_col_no = 'E'
alt_col_no = 'G'
ref_col_no = 'F'
row_no = 4

So,

Code:

pos_col_no + alt_col_no + ref_col_no + str(row_no)    # evaluates to 'EGF4'

That expression is the single cell reference "EGF4", not the three cells E4, G4 and F4.
I first wrote that no Excel spreadsheet has such a cell, but that was wrong, sorry: newer versions of Microsoft Excel do have a cell "EGF4".
It is probably not the cell you want to print, though. My guess is that cell "EGF4" in your spreadsheet is empty, so your program prints "None" rather than nothing.
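
Putting points 1 to 4 together, a sketch of the loop might look like this. It reads cells E, F and G of each row separately, builds a tuple from them, and looks that tuple up in the dictionary. `FakeSheet` and `FakeCell` are made-up stand-ins so the example runs without a spreadsheet; they only mimic the `ws['E4'].value` and `ws['V4'] = ...` access used in the posted code. Note also that the posted loop increments row_no only inside the else branch, so a matching row would never advance; here the increment happens on every pass.

```python
from datetime import datetime


class FakeCell(object):
    def __init__(self, value=None):
        self.value = value


class FakeSheet(object):
    """Hypothetical stand-in for a worksheet: ws['E4'].value and ws['V4'] = x."""
    def __init__(self, data):
        self._cells = dict((ref, FakeCell(v)) for ref, v in data.items())

    def __getitem__(self, ref):
        return self._cells.setdefault(ref, FakeCell())

    def __setitem__(self, ref, value):
        self[ref].value = value


def annotate(ws, scores, first_row=4):
    """Write the score for each (pos, ref, alt) row into column V."""
    unknown = 'Unknown_' + datetime.now().strftime('%B%Y')
    row = first_row
    while ws['E' + str(row)].value:
        # Three separate cells, not the single reference 'EGF4'.
        key = tuple(str(ws[col + str(row)].value) for col in ('E', 'F', 'G'))
        ws['V' + str(row)] = scores.get(key, unknown)
        row += 1                      # advance on every row, matched or not
    return row - first_row            # number of rows processed


scores = {('497', 'T', 'C'): '1'}     # built once, outside any file loop
ws = FakeSheet({'E4': 497, 'F4': 'T', 'G4': 'C',
                'E5': 73, 'F5': 'A', 'G5': 'G'})
n = annotate(ws, scores)
print(n, ws['V4'].value, ws['V5'].value)
```

The `str()` call matters because a spreadsheet may hand back the position as a number while the keys read from the text file are strings.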

Last edited by durden_tyler; 06-27-2017 at 06:15 PM..

durden_tyler

06-27-2017

Quote:

Originally Posted by durden_tyler

Code:

pos_col_no = 'E'
alt_col_no = 'G'
ref_col_no = 'F'
row_no = 4

So,

Code:

pos_col_no + alt_col_no + ref_col_no + str(row_no) = 'EGF4'

I'm using LibreOffice, an older version, not Microsoft Excel. So what would be the appropriate way of referring to E4:G4:F4 -> V4?

What do you mean by point 2? Will it then be
dpos = get_text_data(txt_filename)

Last edited by nans; 06-27-2017 at 07:40 PM..

nans
