Appending a column in xlsx file using Python

Tags: append, excel, openpyxl, overwrite, python

#15  nans, 06-26-2017
Thank you! That makes sense. So now when I print dict_pos, it seems to have formed a dictionary (not posting all of it, as it's large):


Code:
/usr/bin/python2.7 /home/test/annotate.py
S12.xlsx
{'4300': '5', '3921': '1', '9072': '1', '16343': '1', '14007': '1', '13759': '1', '14911': '1', '14178': '1', '14179': '1', '16140': '1', '13359': '1', '4024': '1', '4025': '1'}

But when writing to the Excel worksheet, instead of writing the score, all values end up as 'Unknown_June2017'.

Also, is it possible to form a dictionary with multiple keys? For example, I need the first three columns in 'score.txt' to be associated with the score value, and that combination needs to be compared with columns 5, 6 and 7 of the worksheet.


Code:
dict_pos[x[0],[1],[2]] = x[3]


Last edited by nans; 06-26-2017 at 01:22 PM..
#16  durden_tyler, 06-26-2017
Quote:
Originally Posted by nans View Post
...But when writing to the excel worksheet, instead of writing the score, all values end up as 'Unknown_June2017'

Also, is it possible to form a dictionary with multiple keys ? For example, I need the first three columns in the 'score.txt' to be associated with the score value and that needs to be compared with column 5,6,7 from the worksheet


Code:
dict_pos[x[0],[1],[2]] = x[3]

My hunch is that you are looking at the wrong column.
If your "pos_col_no" is "F" and "row_no" is 4, then the code looks at cells F4, F5, F6, F7, F8, ... and checks whether each value is a key of the dictionary "dpos".

Since you see 'Unknown_June2017' in cells V4, V5, V6, V7, V8, ..., the keys are not in column F but in some other column.

Yes, it's possible to form a dictionary whose keys are made up of several values.
You can use a Python data structure called a "tuple" for that.
A tuple is a sequence of comma-separated values, usually written with parentheses around it, e.g.
Code:
('a', 'b', 'c')

is a tuple.
The code can work without the parentheses (for the most part), but it's better to write them in order to avoid ambiguity.
Like so:


Code:
>>> color_mix = {}
>>> color_mix['red', 'blue'] = 'purple'          # works without parentheses
>>> color_mix[('blue', 'yellow')] = 'green'      # although it's customary to use them
>>> color_mix['yellow', 'red'] = 'orange'
>>> for k in color_mix.keys():
...     print k
...
('blue', 'yellow')
('red', 'blue')
('yellow', 'red')

Comparing tuples is easy:


Code:
>>> tuple1 = ('cat', 'dog')
>>> tuple2 = ('dog', 'rat')
>>> tuple3 = ('cat', 'dog')
>>> tuple1 == tuple2
False
>>> tuple1 == tuple3
True
>>> ('hog', 'eel') == tuple1    # Works like this too
False

But this is where parentheses are important:


Code:
>>> 'hog', 'eel' == tuple3      # Nope! Not what you would expect!
('hog', False)
>>> ('hog', 'eel') == ('dog', 'rat')  # Use parentheses to avoid surprises
False
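
Applied to the dictionary from the question, a tuple key could look like the sketch below. The sample rows are made up for illustration and only assume the layout described above (three key columns followed by the score, separated by tabs). Note that every element of the key has to be indexed from x: the expression x[0],[1],[2] builds a tuple that contains two one-element lists, and lists cannot be used in a dictionary key.

```python
# Hypothetical sample rows: three key columns, then the score, tab-separated.
lines = ["73\tCN\tC\t1\n", "497\tT\tC\t5\n"]

dict_pos = {}
for line in lines:
    x = line.rstrip('\n').split('\t')
    # Index every element from x; (x[0], [1], [2]) would raise
    # "TypeError: unhashable type: 'list'" when used as a key.
    dict_pos[(x[0], x[1], x[2])] = x[3]

# Look up a score with a tuple built in the same column order.
print(dict_pos[('497', 'T', 'C')])   # prints 5
```

The order of the elements matters: ('497', 'T', 'C') and ('497', 'C', 'T') are different keys.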

#17  nans, 06-27-2017
Thank you.
Yes, I was using the wrong column! Now this is my final code:


Code:
#!/usr/bin/python

import sys
import os
from openpyxl import load_workbook
from datetime import datetime
from pandas import read_table
import csv
from collections import namedtuple

# Variables
sheet_directory = r'/home/test'

# Subroutines
def get_text_data(txt_filename):
    dict_pos = {}
    Scores = namedtuple("Scores", ["POS", "ALT", "REF"])
    first_line = True
    with open('/home/test/scores.txt') as txt_filename:
        for line in txt_filename:
            if first_line:
                first_line = False
                continue
            line = line.rstrip('\n')
            x = line.split('\t')
            cpos = Scores(POS=x[0], ALT=x[2], REF=x[1])
            dict_pos[cpos] = x[3]
        print dict_pos          
        return dict_pos


def process_xl_sheets():
    for sheet_root, sheet_dirs, sheet_files in os.walk(sheet_directory):
        for sheet_file in sheet_files:
            if sheet_file.endswith('.xlsx'):
                print(sheet_file)
                dpos = get_text_data(sheet_file.replace('.xlsx', '.txt'))  ##what exactly is this part doing ? There is only one text file 'score.txt' to be referenced against several xlsx files named S12.xlsx , S13.xlsx etc
                sheet_xl_file = os.path.join(sheet_root, sheet_file)
                wb = load_workbook(sheet_xl_file, data_only=True)
                ws = wb.get_sheet_by_name('raw_data')
                pos_col_no = 'E'
                alt_col_no = 'G'
                ref_col_no = 'F'
                score_col_no = 'V'
                row_no = 4
                #compare = Scores(POS=pos_col_no, ALT=alt_col_no, REF=ref_col_no)
                #cell = ws[compare + str(row_no) ]
                cell = ws[pos_col_no + alt_col_no + ref_col_no + str(row_no)]
                print cell.value           ##doesn't print
                while cell.value:
                    if str(cell.value) in dpos:
                        ws[score_col_no + str(row_no)] = dpos[str(cell.value)]
                    else:
                        ws[score_col_no + str(row_no)] = 'Unknown_' + datetime.now().strftime("%B") + datetime.now().strftime("%Y")
                        row_no += 1
                        cell = ws[pos_col_no + str(row_no)]
                wb.save(sheet_xl_file)

# Main section
process_xl_sheets()

When I run it now, it doesn't print cell.value. What I'm attempting to do is make the code compare the three columns in the Excel file with the three columns in the text file, so that it can output the corresponding score.


Code:
/usr/bin/python2.7 /home/test/annotate.py
S12.xlsx
{Scores(POS='73', ALT='C', REF='CN'): 'A', Scores(POS='497', ALT='C', REF='T'): '1', Scores(POS='2196', ALT='T', REF='C'): '1', Scores(POS='2080', ALT='C', REF='A'): '1', Scores(POS='2456', ALT='C', REF='T'): '1'}
None


Last edited by nans; 06-27-2017 at 09:35 AM.. Reason: updated code
#18  durden_tyler, 06-27-2017
I haven't gone through the entire code yet, and it might be tricky to test your code since I don't have pandas, but let me answer your question about the function call.

Quote:
Originally Posted by nans View Post
...

Code:
#!/usr/bin/python
...
...
 # Subroutines
def get_text_data(txt_filename):
    dict_pos = {}
    Scores = namedtuple("Scores", ["POS", "ALT", "REF"])
    first_line = True
    with open('/home/test/scores.txt') as txt_filename:
        for line in txt_filename:
            if first_line:
                first_line = False
                continue
            line = line.rstrip('\n')
            x = line.split('\t')
            cpos = Scores(POS=x[0], ALT=x[2], REF=x[1])
            dict_pos[cpos] = x[3]
        print dict_pos          
        return dict_pos


def process_xl_sheets():
    for sheet_root, sheet_dirs, sheet_files in os.walk(sheet_directory):
        for sheet_file in sheet_files:
            if sheet_file.endswith('.xlsx'):
                print(sheet_file)
                dpos = get_text_data(sheet_file.replace('.xlsx', '.txt'))  ##what exactly is this part doing ? There is only one text file 'score.txt' to be referenced against several xlsx files named S12.xlsx , S13.xlsx etc
                sheet_xl_file = os.path.join(sheet_root, sheet_file)
...
 ...
# Main section
process_xl_sheets()

...
...

The replace() function replaces '.xlsx' with '.txt' in the string variable sheet_file.
The string variable sheet_file holds the name of your Excel file.
So, let's say that while looping through sheet_directory, your Python program finds an Excel file called "S12.xlsx". Then sheet_file will equal "S12.xlsx".

Thereafter, this expression:

Code:
sheet_file.replace('.xlsx', '.txt')

replaces '.xlsx' with '.txt' and thereby returns 'S12.txt'.

And then this value 'S12.txt' is passed to the function get_text_data().
That is, the value of the string parameter txt_filename is 'S12.txt'.

You can see this very quickly by printing txt_filename the moment you enter the function.

In the "with" statement inside the function "get_text_data", however, you reuse the same name, txt_filename. That rebinds txt_filename from the string argument to a file object.

From then on, until the end of the function "get_text_data", txt_filename remains a file object.
So essentially, you are not using the value passed in as txt_filename at all.

My suggestion: don't pass a parameter to a function if you are not using it. The text file name ("scores.txt") and its location are hard-coded anyway.
If something is not needed, discard it. Keep it simple.
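
If you want to see the rebinding for yourself, here is a minimal sketch. The function name show_rebinding and the temporary file are made up for the demonstration; only the "with open(...) as txt_filename" pattern comes from your code.

```python
import os
import tempfile


def show_rebinding(txt_filename):
    before = isinstance(txt_filename, str)       # still the string argument
    with open(txt_filename) as txt_filename:     # "as" rebinds the same name
        during = isinstance(txt_filename, str)   # now a file object
    after = isinstance(txt_filename, str)        # the name stays rebound
    return before, during, after


# Create an empty temporary file so that open() has something to open.
tmp = tempfile.NamedTemporaryFile(suffix='.txt', delete=False)
tmp.close()
print(show_rebinding(tmp.name))   # prints (True, False, False)
os.remove(tmp.name)
```

Using a different name for the file object (for example "as fh") avoids the confusion if you ever do want to use the parameter.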
#19  nans, 06-27-2017
Ah, no worries.

If I remove that, then what would be the best way to proceed further?


Code:
#dpos = get_text_data(sheet_file.replace('.xlsx', '.txt'))
                sheet_xl_file = os.path.join(sheet_root, sheet_file)
                wb = load_workbook(sheet_xl_file, data_only=True)
                ws = wb.get_sheet_by_name('Unannotated')
                pos_col_no = 'E'
                alt_col_no = 'G'
                ref_col_no = 'F'
                score_col_no = 'V'
                row_no = 4
                #compare = Scores(POS=pos_col_no, ALT=alt_col_no, REF=ref_col_no)
                #cell = ws[compare + str(row_no) ]
                cell = ws[pos_col_no + alt_col_no + ref_col_no + str(row_no)]
                print cell.value
                while cell.value:
                    if str(cell.value) in dpos:
                        ws[score_col_no + str(row_no)] = dpos[str(cell.value)]
                    else:
                        ws[score_col_no + str(row_no)] = 'Unknown_' + datetime.now().strftime("%B") + datetime.now().strftime("%Y")
                        row_no += 1
                        cell = ws[pos_col_no + str(row_no)]

#20  durden_tyler, 06-27-2017
Quote:
Originally Posted by nans View Post
...
If I remove that, then what would be the best way to proceed further
...

Code:
#dpos = get_text_data(sheet_file.replace('.xlsx', '.txt'))
                sheet_xl_file = os.path.join(sheet_root, sheet_file)
                wb = load_workbook(sheet_xl_file, data_only=True)
                ws = wb.get_sheet_by_name('Unannotated')
                pos_col_no = 'E'
                alt_col_no = 'G'
                ref_col_no = 'F'
                score_col_no = 'V'
                row_no = 4
                #compare = Scores(POS=pos_col_no, ALT=alt_col_no, REF=ref_col_no)
                #cell = ws[compare + str(row_no) ]
                cell = ws[pos_col_no + alt_col_no + ref_col_no + str(row_no)]
                print cell.value
                while cell.value:
                    if str(cell.value) in dpos:
                        ws[score_col_no + str(row_no)] = dpos[str(cell.value)]
                    else:
                        ws[score_col_no + str(row_no)] = 'Unknown_' + datetime.now().strftime("%B") + datetime.now().strftime("%Y")
                        row_no += 1
                        cell = ws[pos_col_no + str(row_no)]

1) Remove the parameter, not the function call. In other words, do call the function, but don't pass any parameter to it.

2) Change the signature of the function so that it does not accept any parameter.

3) Call the function only once. Calling the same function that reads the same text file and returns the same dictionary every time you find an Excel spreadsheet is extremely inefficient.

4) Check what you are trying to print.

Code:
pos_col_no = 'E'
alt_col_no = 'G'
ref_col_no = 'F'
row_no = 4

So,

Code:
pos_col_no + alt_col_no + ref_col_no + str(row_no) == 'EGF4'

I originally wrote here that there is no cell called "EGF4" in any Excel spreadsheet and that Python therefore cannot print it. That was a wrong statement. My bad, sorry. Newer versions of Microsoft Excel do have a cell "EGF4".
It is just not the cell you want to print. Your spreadsheet's cell "EGF4" is most likely empty, so my guess is that your program is printing "None" instead of nothing.
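
To compare three columns at once, read the three cells separately and combine their values into a tuple, rather than concatenating the column letters into one address. Below is a rough sketch, not a tested replacement for your script. The function name annotate and its parameters are my own invention; it relies only on ws['E4'].value for reading and ws['V4'] = ... for writing, which is how your code already uses the worksheet. The default column order ('E', 'G', 'F') is an assumption chosen to match the POS, ALT, REF order of your Scores keys.

```python
from datetime import datetime


def annotate(ws, scores, first_row=4, key_cols=('E', 'G', 'F'), score_col='V'):
    """Write scores[key] into score_col for each row, keyed on key_cols."""
    unknown = 'Unknown_' + datetime.now().strftime('%B%Y')
    row_no = first_row
    # Stop at the first row whose first key column is empty.
    while ws[key_cols[0] + str(row_no)].value is not None:
        # One cell per key column; str() because the text file holds strings
        # while the spreadsheet may hold numbers.
        key = tuple(str(ws[col + str(row_no)].value) for col in key_cols)
        ws[score_col + str(row_no)] = scores.get(key, unknown)
        row_no += 1   # advance on every row, matched or not
    return row_no - first_row   # number of rows processed

# Assumed usage with openpyxl (not run here):
#   dpos = get_text_data()        # no parameter, called once before the loop
#   ws = wb['raw_data']
#   annotate(ws, dpos)
```

Note that row_no is incremented on every pass. In the code you posted, the increment sits inside the else branch, so a row that does match a key would never advance.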

Last edited by durden_tyler; 06-27-2017 at 06:15 PM..
#21  nans, 06-27-2017
Quote:
Originally Posted by durden_tyler View Post
1) Remove the parameter, not the function call. In other words, do call the function, but don't pass any parameter to it.

2) Change the signature of the function so that it does not accept any parameter.

3) Call the function only once. Calling the same function that reads the same text file and returns the same dictionary every time you find an Excel spreadsheet is extremely inefficient.

4) Check what you are trying to print.

Code:
pos_col_no = 'E'
alt_col_no = 'G'
ref_col_no = 'F'
row_no = 4

So,

Code:
pos_col_no + alt_col_no + ref_col_no + str(row_no) = 'EGF4'

There is no cell called "EGF4" in any Excel spreadsheet.
Hence Python cannot print it.
>>> Ok, that's a wrong statement. My bad, sorry. Looks like newer versions of Microsoft Excel do have cell "EGF4".
>>> That may not be the cell you want to print. My guess is that your program is printing "None" instead of nothing.
>>> Your spreadsheet's cell "EGF4" is empty, most likely.

I'm using LibreOffice, an older version, not Microsoft Excel. So what would be the appropriate way of referring to E4, G4 and F4 together and writing the result to V4?

What do you mean by point 2? Will it then be
dpos = get_text_data(txt_filename)

Last edited by nans; 06-27-2017 at 07:40 PM..