Post 302922744 by nans on Tuesday 28th of October 2014 06:42:29 AM
Python script for extracting data using two files

Hello,
I have two files.
File 1 is a list of the IDs of interest:
Code:
Ex1
Ex2
Ex3

File 2 is the original file, with over 8000 columns and 20 million rows. It is gzip-compressed (.gz):
Code:
Ex1 xx xx xx xx ....
Ex2 xx xx xx xx ....
Ex2 xx xx xx xx ....

Now I need to extract from File 2 the rows for all the IDs of interest listed in File 1. I have a script that should do that:
Code:
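import argparse
import gzip

# Written for Python 2.7 (the version shown in the traceback).
if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--file', action='store', dest='file', help='FILE2')
    parser.add_argument('--IDs', action='store', dest='ids', help='FILE1')
    parser.add_argument('--header', action='store_true', dest='header',
                        help='set if FILE1 has a header line to skip')
    args = parser.parse_args()

    # Load the IDs of interest into a set for fast membership tests.
    idfile = open(args.ids, 'r')
    if args.header:
        next(idfile)
    ids = set(s.rstrip() for s in idfile)
    idfile.close()

    # Drop the last 7 characters (e.g. '.txt.gz') to build the output name.
    oname = args.file[:-7] + 'result.txt'
    infile = gzip.open(args.file, 'rb')
    o = open(oname, 'w')
    o.write(next(infile))  # copy the header line of FILE2
    for l in infile:
        # Only the first tab-separated field (the ID) is needed.
        if l.split('\t', 1)[0].rstrip() in ids:
            o.write(l)
    o.close()
    infile.close()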

But I get an error that I don't understand, since this script was used on the same file before and it worked. I'm not sure what is going on here. Can anyone help?

Code:
File "extract.py", line 24, in <module>
    for l in file:
  File "/usr/lib64/python2.7/gzip.py", line 450, in readline
    c = self.read(readsize)
  File "/usr/lib64/python2.7/gzip.py", line 256, in read
    self._read(readsize)
  File "/usr/lib64/python2.7/gzip.py", line 307, in _read
    uncompress = self.decompress.decompress(buf)
zlib.error: Error -3 while decompressing: invalid block type
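
The zlib.error above is raised while decompressing, before the line ever reaches the filtering code, so one thing worth ruling out is a damaged or incompletely copied .gz file. A minimal sketch of such a check follows. It is written with Python 3 in mind, and the helper name find_gzip_damage is hypothetical, not part of the original script. It simply reads the archive to the end and reports where decompression fails, if it does.

```python
import gzip
import zlib


def find_gzip_damage(path):
    """Read a .gz file to the end.

    Returns None if every line decompresses cleanly, otherwise a tuple
    (lines_read_before_failure, exception). The helper name and return
    convention are illustrative assumptions.
    """
    lines_read = 0
    try:
        with gzip.open(path, 'rb') as handle:
            for _ in handle:
                lines_read += 1
    except (zlib.error, EOFError, IOError) as exc:
        # zlib.error: corrupt compressed data
        # EOFError:   archive ends before the end-of-stream marker
        # IOError:    other read problems, e.g. a failed CRC check
        return lines_read, exc
    return None


if __name__ == '__main__':
    import sys
    if len(sys.argv) == 2:
        damage = find_gzip_damage(sys.argv[1])
        if damage is None:
            print('archive decompressed cleanly')
        else:
            print('failed after %d lines: %r' % damage)
```

If this reports a failure, the archive itself is the likely culprit and re-copying or re-creating File 2 would be the next step. From the shell, gzip -t on the same file performs a comparable integrity test.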

 
