Appending a column in xlsx file using Python

Tags
append, excel, openpyxl, overwrite, python

#1  06-21-2017  nans

Is there a way to update an existing xlsx worksheet with data from a text file?
I have an Excel file whose first worksheet I need to modify based on a text file.
I match the text file to the xlsx, write the 'Score' column in the xlsx sheet, and save the workbook.
For any 'Pos' not present in the text file, the Excel sheet should show 'Unknown' followed by the current date.


The tab-separated text file looks like this:



Code:
Pos    Ref    Alt    Score
44    a    bb    1 
57    c    ab    4 
64    d    d    5

and the Excel sheet has several columns:



Code:
Col1.. Col2..   Pos    Ref    Alt ... Score... Col26
id2.. 57    c    ab...    ... 
id3.. 64    d    d...  ... 
id4.. 103  e g ...    ...

So the output will look like:



Code:
Col1.. Col2..   Pos    Ref    Alt ... Score... Col26
id2.. 57    c    ab...    4 ... 
id3.. 64    d    d... 5 ... 
id4.. 103  e g ...   Unknown_June2017 ...

What I am currently doing is converting the worksheet to a text file, comparing the two text files, and then writing the resulting text file back to the Excel workbook. As there are several Excel files, this is a bit inefficient. Any help is appreciated. Thank you.
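
The matching rule described above can be sketched without any spreadsheet library. This is only an illustration, assuming that 'Pos' alone identifies a row; the names `read_scores` and `score_label` are hypothetical, and the date is fixed so that the label matches the example output.

```python
import csv
import io
from datetime import date

# Sample data copied from the tab-separated text file above.
SCORES_TXT = "Pos\tRef\tAlt\tScore\n44\ta\tbb\t1\n57\tc\tab\t4\n64\td\td\t5\n"


def read_scores(handle):
    """Map each Pos value (as a string) to its Score."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["Pos"].strip(): row["Score"].strip() for row in reader}


def score_label(pos, scores, today):
    """Return the score for pos, or 'Unknown_<Month><Year>' if pos is absent."""
    return scores.get(str(pos), "Unknown_" + today.strftime("%B%Y"))


scores = read_scores(io.StringIO(SCORES_TXT))
for pos in (57, 64, 103):
    # The three Pos values are the ones in the example Excel sheet.
    print(pos, score_label(pos, scores, date(2017, 6, 21)))
```

Keeping the lookup separate from the spreadsheet code makes it possible to check the matching on its own before any workbook is touched.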

#2  06-22-2017  durden_tyler
Could you post the code you attempted?
#3  06-22-2017  nans
The code is very long, so I have attached only the relevant part, where I am trying to copy the contents of the text file into the Excel worksheet. But even with this, I end up with a blank worksheet.



Code:
#!/usr/bin/python
 
import os
from openpyxl.reader.excel import load_workbook
import csv
from openpyxl.drawing.image import Image
import PIL
 
xl_directory = r'/home/test'
txt_directory = r'/home/test'
 
for xl_root, xl_dirs, xl_files in os.walk(xl_directory):
   for xl_file in xl_files:
       if xl_file.endswith('.xlsx'):
           xl_abs_file = os.path.join(xl_root, xl_file)
           wb = load_workbook(xl_abs_file, data_only=True)
           ws = wb.get_sheet_by_name('Unannotated')
           ##clear the contents of the file
           for row in ws['A4:U1000']:
               for cell in row:
                   cell.value = None
           image = Image('/home/logo3.jpg')
           ws.add_image(image, 'A1')
           ## go through text file and write data on worksheet
           for txt_root, txt_dirs, txt_files in os.walk(txt_directory):
               for txt_file in txt_files:
                   if txt_file == xl_file.replace('xlsx', 'txt'):
                       with open(os.path.join(txt_root, txt_file)) as fh:
                           reader = csv.reader(fh, delimiter='\t')
                           [next(reader) for skip in range(1)]
                           for row in reader:
                               ws.append(row)
                               wb.save(xl_abs_file)


#4  06-22-2017  durden_tyler
Here's some code that you may want to try.



Code:
#!python
import os
from openpyxl import load_workbook
from datetime import datetime
  
# Variables
sheet_directory = r'<path_of_Excel_files>'
text_directory = r'<path_of_text_files>'
  
# Subroutines
def get_text_data(txt_filename):
    dict_pos = {}
    first_line = True
    for text_root, text_dirs, text_files in os.walk(text_directory):
        for text_file in text_files:
            if text_file == txt_filename:
                # A matching text file was found
                # Use a context manager so the file is closed after reading
                with open(os.path.join(text_root, text_file)) as fh:
                    for line in fh:
                        # Skip the header; read the data into the dictionary
                        if first_line:
                            first_line = False
                            continue
                        line = line.rstrip('\n')
                        x = line.split('\t')
                        dict_pos[x[0]] = x[3]
    return dict_pos
  
def process_xl_sheets():
    for sheet_root, sheet_dirs, sheet_files in os.walk(sheet_directory):
        for sheet_file in sheet_files:
            if sheet_file.endswith('.xlsx'):
                # Read the corresponding text file from the text_directory and
                # populate a dictionary of "Pos" values.
                dpos = get_text_data(sheet_file.replace('.xlsx', '.txt'))
                sheet_xl_file = os.path.join(sheet_root, sheet_file)
                wb = load_workbook(sheet_xl_file, data_only=True)
                ws = wb['rawdata']  # subscript lookup; get_sheet_by_name() is deprecated in newer openpyxl
                # If you already know the columns that have the headers "Pos" and
                # "Score", set them here. Otherwise, iterate through the first row
                # to determine those columns.
                pos_col_no = 'C'
                score_col_no = 'F'
                row_no = 2
                cell = ws[pos_col_no + str(row_no)]
                while cell.value:
                    if str(cell.value) in dpos:
                        ws[score_col_no + str(row_no)] = dpos[str(cell.value)]
                    else:
                        ws[score_col_no + str(row_no)] = 'Unknown_' + datetime.now().strftime("%B%Y")
                    row_no += 1
                    cell = ws[pos_col_no + str(row_no)]
                wb.save(sheet_xl_file)
  
# Main section
process_xl_sheets()

A few things that come to mind as I look at the code again:
1) After "dpos" is assigned, you may want to do further processing only if dpos is not empty. Notice that dpos could be empty if a text file corresponding to an Excel file is not found. For such cases, it would be inefficient to work on the Excel spreadsheet at all.
2) In the "get_text_data" subroutine, you may want to process the first row and see if x[0] is "Pos" and x[3] is "Score". If not, then you can skip processing the text file entirely.
3) If there is no worksheet called "rawdata", then continue to the next iteration of the loop.
4) If there are a great many Excel and text files (say hundreds, thousands, or more), then you may want to first create a dictionary of Excel => text files and then iterate through the key/value pairs, processing them one by one. The existence of a file can be quickly checked using "os.path.isfile(<filename>)", which avoids looping through the directory unnecessarily. In fact, come to think of it, you can refactor the posted code and implement this concept to see if it improves the run time.
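
The header check and the "os.path.isfile" lookup suggested above could look roughly like the following. This is a sketch rather than a drop-in replacement: the helper names `read_scores_checked` and `pair_files` are made up for illustration, and it assumes the text file for `name.xlsx` is `name.txt` located directly in the text directory (not in a subdirectory).

```python
import os


def read_scores_checked(path):
    """Return {Pos: Score}, or {} if the header is not 'Pos ... Score'."""
    scores = {}
    with open(path) as fh:
        header = fh.readline().rstrip("\n").split("\t")
        if len(header) < 4 or header[0] != "Pos" or header[3] != "Score":
            return scores  # unexpected layout: skip this file entirely
        for line in fh:
            fields = line.rstrip("\n").split("\t")
            if len(fields) >= 4:
                scores[fields[0].strip()] = fields[3].strip()
    return scores


def pair_files(sheet_directory, text_directory):
    """Map each .xlsx path to its same-named .txt path, if that file exists."""
    pairs = {}
    for root, _dirs, files in os.walk(sheet_directory):
        for name in files:
            if name.endswith(".xlsx"):
                txt = os.path.join(text_directory, name[:-len(".xlsx")] + ".txt")
                if os.path.isfile(txt):  # direct check, no second directory walk
                    pairs[os.path.join(root, name)] = txt
    return pairs
```

Workbooks with no entry in the returned dictionary, or whose text file yields an empty dictionary, can then be skipped before openpyxl ever loads them.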

#5  06-23-2017  nans
Thank you.

But I think I've not explained myself properly. The code I provided was just a more complicated approach that did not seem efficient.

There is only one text file of scores, with 4 columns (scores.txt), and it is the constant reference for several Excel files. Its columns need to be compared with the 'raw data' worksheet (which also has constant headers and format), and the scores added accordingly.

I tried the code with a bit of manipulation, but that also generates a blank workbook. I have attached a small snippet of the input files and the expected output.

Your help is really appreciated, thank you.
Attached Files
File Type: zip test.zip (14.2 KB, 18 views)
#6  06-23-2017  durden_tyler
Just a quick observation: the name of the text file inside your zip file is "scores.txt". The name of the Excel file inside your zip file is "S12.xlsx".
Now, the "if txt_file == xl_file.replace('xlsx', 'txt'):" line in your Python code:



Code:
#!/usr/bin/python
...
...
xl_directory = r'/home/test'
txt_directory = r'/home/test'
...
...

           for txt_root, txt_dirs, txt_files in os.walk(txt_directory):
               for txt_file in txt_files:
                   if txt_file == xl_file.replace('xlsx', 'txt'):
                       with open(os.path.join(txt_root, txt_file)) as fh:
...
...

looks for a text file that is named the same as an Excel file.
So, if your code finds an Excel file: "/home/test/myfile.xlsx", then it will look for a text file called "/home/test/myfile.txt" and process it.
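
The name derivation can be checked in isolation. One caveat worth noting: "replace('xlsx', 'txt')" replaces every occurrence of the substring, not only the extension, so "os.path.splitext" is a safer way to build the expected name. The helper `expected_text_name` and the file name "xlsx_export.xlsx" are hypothetical, used only for illustration.

```python
import os


def expected_text_name(xl_file):
    """Swap only the final extension of an Excel file name for .txt."""
    stem, _ext = os.path.splitext(xl_file)
    return stem + ".txt"


print("myfile.xlsx".replace("xlsx", "txt"))       # myfile.txt
print("xlsx_export.xlsx".replace("xlsx", "txt"))  # txt_export.txt
print(expected_text_name("xlsx_export.xlsx"))     # xlsx_export.txt
print(expected_text_name("S12.xlsx"))             # S12.txt
```

For the attached files, the expected name for "S12.xlsx" is "S12.txt", which never equals "scores.txt", so the block that reads the text file is never entered.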

A similar check in my Python code is the "if text_file == txt_filename:" line below:



Code:
...
...
sheet_directory = r'<path_of_Excel_files>'
text_directory = r'<path_of_text_files>'
  
# Subroutines
def get_text_data(txt_filename):
    dict_pos = {}
    first_line = True
    for text_root, text_dirs, text_files in os.walk(text_directory):
        for text_file in text_files:
            if text_file == txt_filename:
                # A matching text file was found
...
...
   
def process_xl_sheets():
...
...
   
# Main section
process_xl_sheets()

As you can see, neither of the two pieces of code above will work if the Excel and text files are named differently.

1) When you tried my code with a bit of manipulation, did you ensure that it reads the file "scores.txt" and not "S12.txt" for the Excel file "S12.xlsx" ?
2) If you print dict_pos right before it is returned from the function "get_text_data", what do you see?
#7  06-23-2017  nans
1) When you tried my code with a bit of manipulation, did you ensure that it reads the file "scores.txt" and not "S12.txt" for the Excel file "S12.xlsx" ?
Yes. It's only one file, 'scores.txt', that is used as the reference to get the scores into all the Excel sheets.

2) If you print dict_pos right before it is returned from the function "get_text_data", what do you see?
It does not return anything

I have pasted the code I used below



Code:
import os
from openpyxl import load_workbook
from datetime import datetime
import csv
  
# Variables
sheet_directory = r'/home/test'
text_directory = r'/home/test'
  
# Subroutines
def get_text_data(txt_filename):
    dict_pos = {}
    first_line = True
    with open('scores.txt') as txt_filename:
        tab_reader = csv.reader(txt_filename, delimiter='\t')
        for line in tab_reader:
            if first_line:
                first_line = False
                continue
                line = line.rstrip('\n')
                x = line.split('\t')
                dict_pos[x[0]] = x[3]
                #print dict_pos
                return dict_pos


def process_xl_sheets():
    for sheet_root, sheet_dirs, sheet_files in os.walk(sheet_directory):
        for sheet_file in sheet_files:
            if sheet_file.endswith('.xlsx'):
                dpos = get_text_data(sheet_file.replace('.xlsx', '.txt'))
                sheet_xl_file = os.path.join(sheet_root, sheet_file)
                wb = load_workbook(sheet_xl_file, data_only=True)
                ws = wb.get_sheet_by_name('raw_data')
                pos_col_no = 'F'
                score_col_no = 'V'
                row_no = 4
                cell = ws[pos_col_no + str(row_no)]
                while cell.value:
                    if str(cell.value) in dpos:
                        ws[score_col_no + str(row_no)] = dpos[str(cell.value)]
                    else:
                        ws[score_col_no + str(row_no)] = 'Unknown_' + datetime.now().strftime("%B") + datetime.now().strftime("%Y")
                        row_no += 1
                        cell = ws[pos_col_no + str(row_no)]
                        wb.save(sheet_xl_file)

                # Main section
process_xl_sheets()
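
Reading the code above as posted: in "get_text_data", everything after "continue" is indented inside the "if first_line:" block, so it never runs and the function returns None, which fits the observation that nothing comes back. In "process_xl_sheets", "row_no" only advances in the "else" branch, so a matched row would loop forever. One possible restructuring is sketched below. It assumes the layout from the posted code ('Pos' in column F, scores written to column V, data starting at row 4, a sheet named 'raw_data', a single scores.txt); `annotate_sheet` and the constants are illustrative names, and the sketch relies only on openpyxl's `ws.cell(row=..., column=...)`, `wb[...]`, `load_workbook` and `wb.save`.

```python
import csv
import os
from datetime import datetime

POS_COL, SCORE_COL, FIRST_ROW = 6, 22, 4  # columns F and V, data from row 4


def get_text_data(txt_path):
    """Read the tab-separated scores file into {Pos: Score}."""
    dict_pos = {}
    with open(txt_path, newline="") as fh:
        reader = csv.reader(fh, delimiter="\t")
        next(reader, None)                 # skip the header row
        for fields in reader:              # csv.reader already yields lists
            if len(fields) >= 4:
                dict_pos[fields[0].strip()] = fields[3].strip()
    return dict_pos                        # returned once, after the loop


def annotate_sheet(ws, dict_pos, unknown_label):
    """Fill the score column for every row that has a Pos value."""
    row_no = FIRST_ROW
    while ws.cell(row=row_no, column=POS_COL).value is not None:
        pos = str(ws.cell(row=row_no, column=POS_COL).value)
        ws.cell(row=row_no, column=SCORE_COL).value = dict_pos.get(pos, unknown_label)
        row_no += 1                        # advance on every row, matched or not
    return row_no - FIRST_ROW              # number of rows written


def process_xl_sheets(sheet_directory, scores_path):
    from openpyxl import load_workbook     # third-party; only needed here
    dict_pos = get_text_data(scores_path)  # read the one scores file once
    label = "Unknown_" + datetime.now().strftime("%B%Y")
    for root, _dirs, files in os.walk(sheet_directory):
        for name in files:
            if name.endswith(".xlsx"):
                path = os.path.join(root, name)
                wb = load_workbook(path)
                annotate_sheet(wb["raw_data"], dict_pos, label)
                wb.save(path)              # save once per workbook
```

A call such as process_xl_sheets('/home/test', '/home/test/scores.txt') would then cover every workbook under that directory. "data_only=True" is left out here because saving a workbook that was loaded that way writes the cached values in place of any formulas; if the 'Pos' cells are themselves formulas, that trade-off would need revisiting.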
