How to parse a huge 600MB zipped file?


 
# 1  
07-09-2012

I'm new to Unix and am trying to parse a huge 600MB bzip2-compressed file.
I need to bzcat this file once and do some calculations (word count) on the lines based on certain criteria (see script).
The correct output should be:
column1=6
column2=4
The problem is that I'm getting column2=0 (see results).
Could you please help?
Thanks


source file: test.test.bz2
Code:
1,1,2,3
1,2,1,2
2,1,2,2
3,1,1,1
1,2,1,1
2,2,2,2
3,2,2,2

Script: test.bsh
Code:
 
#!/bin/bash
bzcat test.test.bz2 |
while read line
do
column1=$(awk '{FS=","} {print $1}' | uniq | wc -l)
                echo $column1
column2=$(awk '{FS=","} {print $2}' | uniq | wc -l)
                echo $column2
done

Results
Code:
bash -vx test.bsh

#!/bin/bash
bzcat test.test.bz2 |
while read line
do
column1=$(awk '{FS=","} {print $1}' | uniq | wc -l)
                echo $column1
column2=$(awk '{FS=","} {print $2}' | uniq | wc -l)
                echo $column2
done
+ bzcat test.test.bz2
+ read line
awk '{FS=","} {print $1}' | uniq | wc -l
++ awk '{FS=","} {print $1}'
++ uniq
++ wc -l
+ column1='       6'
+ echo 6
6
awk '{FS=","} {print $2}' | uniq | wc -l
++ awk '{FS=","} {print $2}'
++ uniq
++ wc -l
+ column2='       0'
+ echo 0
0
+ read line


Last edited by fpmurphy; 07-09-2012 at 08:33 PM.. Reason: code tags please!
# 2  
07-09-2012
Everything within the while-loop inherits the same standard input, the pipe with bzcat's output. The first line is read by the read command, stored in the $line parameter, and then never used for anything. Then the first awk pipeline is invoked in a subshell, consuming everything that bzcat has to offer. By the time the second awk pipeline runs, there's nothing left; it immediately encounters EOF (end of file). This is why you have that zero.
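
The shared-stdin behaviour described above can be reproduced on a tiny input. This is only a minimal sketch: three sample lines from printf stand in for the bzcat output, wc -l stands in for the awk pipelines, and the variable names are made up for the illustration.

```shell
#!/bin/bash
# Three sample lines stand in for the output of bzcat (assumption for the demo).
demo_output=$(
  printf '%s\n' a b c |
  while read -r line
  do
    # read took "a"; the first wc inherits the same pipe and drains "b" and "c".
    first=$(wc -l)
    # Nothing is left on the pipe, so the second wc hits EOF immediately.
    second=$(wc -l)
    # $(( )) strips the padding some wc implementations add.
    echo "line=$line first=$((first)) second=$((second))"
  done
)
echo "$demo_output"
```

The loop body runs only once, because the next read also finds the pipe empty. This is the same pattern as the 6 and 0 in the trace from the first post.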

The simple solution would be to pipe the contents of $line into those awk pipelines.

Regards and welcome to the forum,
Alister
# 3  
07-10-2012
Thanks alister. As I told you, I'm still learning Unix, so I'm not sure I get what you're trying to explain.
Could you please clarify or fix my script so it works properly?
Regards
# 4  
07-10-2012
Little correction

@DeltaComp:

As Alister says, the awk command in the while loop is not given any input of its own, so it reads the loop's standard input. Thus
Code:
column1=$(awk '{FS=","} {print $1}' | uniq | wc -l)

takes the complete output of
Code:
bzcat test.test.bz2

Thus you get "column1=6".

But after printing the value of column1, the second awk command
Code:
 column2=$(awk '{FS=","} {print $2}' | uniq | wc -l)

does not have any input to process. Thus it's printing "column2=0".

Our advice would be, use awk commands like below in the script.
Code:
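# -F, sets the field separator before the first line is read;
# FS="," inside the main block would only take effect from the second line on.
column1=$(bzcat test.test.bz2 | awk -F, '{print $1}' | uniq | wc -l)
                echo $column1
column2=$(bzcat test.test.bz2 | awk -F, '{print $2}' | uniq | wc -l)
                echo $column2

So, we are giving the lines of your file as inputs to each awk command separately here.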
# 5  
07-10-2012
pikk45, thanks for the clarification. I know that I could do bzcat twice,
but that way I'm decompressing the source file twice, which consumes a lot of the server's resources (high CPU utilization) on a prod server.
In my original request I mentioned that "I need to bzcat this file once and do some calculations", as each operation takes over 20 minutes.
Any better solution would be much appreciated.
I believe there must be a way to do it.
# 6  
07-10-2012
Wrong post!! Sorry!!
# 7  
07-10-2012
Quote:
Originally Posted by PikK45
Our advice would be, use awk commands like below in the script.
Code:
column1=$(bzcat test.test.bz2 | awk -F, '{print $1}' | uniq | wc -l)
                echo $column1
column2=$(bzcat test.test.bz2 | awk -F, '{print $2}' | uniq | wc -l)
                echo $column2

My simple suggestion was to do the following...
Code:
column1=$(printf "%s" "$line" | awk '{FS=","} {print $1}' | uniq | wc -l)
                echo $column1
column2=$(printf "%s" "$line" | awk '{FS=","} {print $2}' | uniq | wc -l)
                echo $column2

... but that wouldn't work: each awk pipeline would then see only one line at a time, so every count would be 1. Woops.

The most efficient way to accomplish this that comes to mind is to do it all with a single awk invocation:

Code:
bzcat test.test.bz2 |
awk -F, '$1 != o1 {x++} $2 != o2 {y++} {o1 = $1; o2 = $2} END {print "column1="x; print "column2="y}'
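
For readers new to awk, here is the same single-pass idea spelled out as a small sketch. Like uniq | wc -l, it counts runs of adjacent identical values in a column, not distinct values. The function name count_runs and the variable names are made up for this illustration, and the NR == 1 test is an added assumption: it makes sure the first line always starts a run, even if its field is empty or 0. The sample data from the first post is fed in with printf in place of bzcat.

```shell
#!/bin/bash
# count_runs is a hypothetical helper; it reads comma-separated lines on stdin.
count_runs() {
  awk -F, '
    # A new run starts on the first line or whenever the field changes.
    NR == 1 || $1 != prev1 { runs1++ }
    NR == 1 || $2 != prev2 { runs2++ }
    # Remember the current fields for the comparison on the next line.
    { prev1 = $1; prev2 = $2 }
    # The + 0 prints 0 instead of an empty string for empty input.
    END { print "column1=" (runs1 + 0); print "column2=" (runs2 + 0) }
  '
}

# Sample data from the first post, supplied by printf instead of bzcat.
result=$(printf '%s\n' 1,1,2,3 1,2,1,2 2,1,2,2 3,1,1,1 1,2,1,1 2,2,2,2 3,2,2,2 | count_runs)
echo "$result"
```

On the real file this would be invoked as bzcat test.test.bz2 | count_runs, so the file is decompressed only once and both counts come out of the same pass.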

Regards,
Alister