Help with a project. convert a txt to csv


 
Thread Tools Search this Thread
Top Forums UNIX for Dummies Questions & Answers Help with a project. convert a txt to csv
# 1  
Old 01-30-2012
Help with a project. convert a txt to csv

Hi people. I've finally converted to linux, and I'm starting to explore the amazing capabilities of the terminal. At the moment in trying to learn how to extract text using the "grep" and "sed" command. I decided to learn by trying to figure out how to solve a practical problem. I have a schedule in the form of a text file, and im trying to write a simple script that will convert it into a .csv file to be used in calendars like gcal (we get a schedule like this every semester, and manually editing it is a daunting task). My plan is to finally write a script that would do the job for me. I decided to post my "project" here instead of making multiple posts.

I've managed to do some extracting (im sure not in the most efficient way) and have come across some problems which i haven't been able to solve on my own. I absolutely do not expect someone to do the entire job for me, just would be nice on some tips if some of you guys have the time. I will post my progress so far, and the my main problems at the end.

So this is the headers for the .csv im planning to make
Code:
Start Date, End Date, Start Time, End Time, Description, Subject, Location

so like this
Code:
03.10.11,03.10.11,8:00,10:15,seminar,MCK16 Reproduct/Develop,821 Gynaecology

The text file comes in this format
Code:
       Academic Year 2011-2012                    date 15.09.11 
                       General Medicine / English 
Study: D                  Study Year: 4                 Term: First
                                                                                 
Study Week: 1     * 03.10.11 -  08.10.11 *         Study:D Year:4
Day:Monday  /03.10.11                            
 8:00  10:15  4/1,3,               seminar    MCK16 Reproduct/Develop   821 Gynaecology
 8:00  10:15  4/2,4,               seminar    MCC8 Dis.RenalFunc./Oedem 824 1stInternal Med
10:30  13:00  4/1,3,               practice   MCK16 Reproduct/Develop   821 Gynaecology
10:30  13:00  4/2,4,               seminar    MCC8 Dis.RenalFunc./Oedem 221 Seminar Room
14:30  17:00  4/10,                elect.cour Urgent proc. in burned    804 Burns Surgery
Day:Tuesday /04.10.11                            
 8:00   9:30  4/1,3,               seminar    MCK16 Reproduct/Develop   821 Gynaecology
 8:00  10:15  4/2,4,               seminar    MCC8 Dis.RenalFunc./Oedem 422 Seminar Room
 9:35  10:20  4/1,3,               individ.st MCK16 Reproduct/Develop   104 Individ.study
10:30  12:45  4/1,3,               individ.st MCK16 Reproduct/Develop   104 Individ.study
10:30  13:00  4/2,4,               seminar    MCC8 Dis.RenalFunc./Oedem 422 Seminar Room
14:00  15:30  4/1,2,3,4,           lecture    Psychology/pathopsychol.  220 Seminar Room
15:45  17:15  4/1,2,3,4,           practice   Psychology/pathopsychol.  220 Seminar Room
Day:Wednesday/05.10.11                            
 7:45  10:15  4/1,3,               seminar    MCK16 Reproduct/Develop   821 Gynaecology
 8:00  10:15  4/2,4,               seminar    MCC8 Dis.RenalFunc./Oedem 423 Seminar Room
10:30  13:00  4/1,3,               practice   MCK16 Reproduct/Develop   821 Gynaecology
10:30  12:00  4/2,4,               seminar    MCC8 Dis.RenalFunc./Oedem 423 Seminar Room
12:15  13:00  4/2,4,               seminar    MCC8 Dis.RenalFunc./Oedem 423 Seminar Room
14:00  15:30  4/10,                elect.cour Advanced Czech communic.  960 FNKV/pav.X/lang
Day:Thursday/06.10.11                            
 8:00  10:15  4/1,3,               practice   MCK16 Reproduct/Develop   821 Gynaecology
 8:00  13:00  4/2,4,               practice   MCC8 Dis.RenalFunc./Oedem 824 1stInternal Med
10:45  13:15  4/1,3,               practice   MCK16 Reproduct/Develop   821 Gynaecology
14:30  16:00  4/10,                elect.cour Endoscopic/robotic urol.  909 Thomayer Hosp.
14:30  15:15  4/10,                elect.cour Methods Nuclear Cardiol.  838 Urolog/NuclMed.
Day:Friday  /07.10.11  
 8:00  10:15  4/1,3,               seminar    MCK16 Reproduct/Develop   906 ÚPMD
 8:00   9:45  4/2,4,               seminar    MCC8 Dis.RenalFunc./Oedem 422 Seminar Room
10:00  13:15  4/2,4,               seminar    MCC8 Dis.RenalFunc./Oedem 422 Seminar Room
11:00  13:30  4/1,3,               practice   MCK16 Reproduct/Develop   220 Seminar Room

1. i did
Code:
grep -vi 'Study week' schedule.txt | grep -vi 'academic year' | grep -vi 'general medicine' | grep -vi 'Study: D'

to remove the header and recurring week.

2. Then i tried doing
Code:
sed 's/Day:.*day//g' schedule.txt

to remove the Day: string, but was unable to figure out a wildcard for spaces and was left with "/"s, so i decided to do
Code:
sed 's/Day:Monday  \///g' schedule.txt | sed 's/Day:Tuesday \///g' | sed 's/Day:Wednesday\///g' | sed 's/Day:Thursday\///g' | sed 's/Day:Friday  \///g'

instead.

3. I removed the entries that didnt fit my study group (im in group 4/2,4,). First by
Code:
grep -iv 4\/1,3, schedule.txt | grep -iv 4\/10

and then the remaining group numbers as i dont need them
Code:
sed 's/4\/....//g' schedule.txt |sed 's/3,4,//g'

This is my output so far:
Code:
03.10.11                            
 8:00  10:15                 seminar    MCC8 Dis.RenalFunc./Oedem 824 1stInternal Med
10:30  13:00                 seminar    MCC8 Dis.RenalFunc./Oedem 221 Seminar Room
04.10.11                            
 8:00  10:15                 seminar    MCC8 Dis.RenalFunc./Oedem 422 Seminar Room
10:30  13:00                 seminar    MCC8 Dis.RenalFunc./Oedem 422 Seminar Room
14:00  15:30             lecture    Psychology/pathopsychol.  220 Seminar Room
15:45  17:15             practice   Psychology/pathopsychol.  220 Seminar Room

And now comes my problems:

First i need to move the date from its line and infront of each subject belonging to this date and duplicate it so that the format becomes like this
Code:
03.10.11 03.10.11 8:00  10:15                 seminar    MCC8 Dis.RenalFunc./Oedem 824 1stInternal Med
03.10.11 03.10.11 10:30  13:00                 seminar    MCC8 Dis.RenalFunc./Oedem 221 Seminar Room                          
04.10.11 04.10.11 8:00  10:15                 seminar    MCC8 Dis.RenalFunc./Oedem 422 Seminar Room
04.10.11 04.10. 1110:30  13:00                 seminar    MCC8 Dis.RenalFunc./Oedem 422 Seminar Room

I have no idea on how to do this.

And then second i need to use
Code:
sed 's...

(i guess) to replace spaces with commas. This is no problem but as you can see some of the headers has a single space in it. This happens for subject and location. But im thinking maybe since there are multiple spaces between subject and location, and each location starts with a number this might be used.

Anyways this turned out to be a very long post. I apologize, and i hope i posted in the right forum. As i said before i do not expect anyone to do the work for me, but any tips regarding any of the sections are more than welcome

Best regards
# 2  
Old 01-30-2012
For more complex text processing, I like to use Perl. If you have it installed in your environment, I found this to work on your initial data file:
Code:
#!/usr/bin/perl -w

open (DATAFILE, $ARGV[0]) or die ("Could not open data file.");

foreach $line (<DATAFILE>){
        $line =~ s/\s+$//;
        if (length($line) > 0) {
                if ( $line =~ /^Study Week:/ ) {
                      @lineArray = split("\\*|\\-",$line);
                      $bdate = $lineArray[1];
                      $bdate =~ s/^\s+|\s+$//g;
                      $edate = $lineArray[2];
                      $edate =~ s/^\s+|\s+$//g;
                } elsif ( $line =~ /^Day:/ ) {
                      @lineArray = split("/",$line);
                      $bdate = $lineArray[1];
                } elsif ( $line =~ /^ ?[0-9]/ ) {
                      print(join(',',$bdate,$edate,substr($line,0,5),substr($line,7,5),substr($line,35,10),substr($line,46,25),substr($line,72)."\n"));
                }
        }
}

I am just assuming that the begin date and end date are in that first line "Study Week: 1 * 03.10.11 - 08.10.11 * ...", and that the begin date changes as the document progresses and the lines beginning with "Day" are encountered. Here is the program running on the text file provided:
Code:
./filter.pl ../dat.dat
03.10.11,08.10.11, 8:00,10:15,seminar   ,MCK16 Reproduct/Develop  ,821 Gynaecology
03.10.11,08.10.11, 8:00,10:15,seminar   ,MCC8 Dis.RenalFunc./Oedem,824 1stInternal Med
03.10.11,08.10.11,10:30,13:00,practice  ,MCK16 Reproduct/Develop  ,821 Gynaecology
03.10.11,08.10.11,10:30,13:00,seminar   ,MCC8 Dis.RenalFunc./Oedem,221 Seminar Room
03.10.11,08.10.11,14:30,17:00,elect.cour,Urgent proc. in burned   ,804 Burns Surgery
04.10.11,08.10.11, 8:00, 9:30,seminar   ,MCK16 Reproduct/Develop  ,821 Gynaecology
04.10.11,08.10.11, 8:00,10:15,seminar   ,MCC8 Dis.RenalFunc./Oedem,422 Seminar Room
04.10.11,08.10.11, 9:35,10:20,individ.st,MCK16 Reproduct/Develop  ,104 Individ.study
04.10.11,08.10.11,10:30,12:45,individ.st,MCK16 Reproduct/Develop  ,104 Individ.study
04.10.11,08.10.11,10:30,13:00,seminar   ,MCC8 Dis.RenalFunc./Oedem,422 Seminar Room
04.10.11,08.10.11,14:00,15:30,lecture   ,Psychology/pathopsychol. ,220 Seminar Room
04.10.11,08.10.11,15:45,17:15,practice  ,Psychology/pathopsychol. ,220 Seminar Room
05.10.11,08.10.11, 7:45,10:15,seminar   ,MCK16 Reproduct/Develop  ,821 Gynaecology
05.10.11,08.10.11, 8:00,10:15,seminar   ,MCC8 Dis.RenalFunc./Oedem,423 Seminar Room
05.10.11,08.10.11,10:30,13:00,practice  ,MCK16 Reproduct/Develop  ,821 Gynaecology
05.10.11,08.10.11,10:30,12:00,seminar   ,MCC8 Dis.RenalFunc./Oedem,423 Seminar Room
05.10.11,08.10.11,12:15,13:00,seminar   ,MCC8 Dis.RenalFunc./Oedem,423 Seminar Room
05.10.11,08.10.11,14:00,15:30,elect.cour,Advanced Czech communic. ,960 FNKV/pav.X/lang
06.10.11,08.10.11, 8:00,10:15,practice  ,MCK16 Reproduct/Develop  ,821 Gynaecology
06.10.11,08.10.11, 8:00,13:00,practice  ,MCC8 Dis.RenalFunc./Oedem,824 1stInternal Med
06.10.11,08.10.11,10:45,13:15,practice  ,MCK16 Reproduct/Develop  ,821 Gynaecology
06.10.11,08.10.11,14:30,16:00,elect.cour,Endoscopic/robotic urol. ,909 Thomayer Hosp.
06.10.11,08.10.11,14:30,15:15,elect.cour,Methods Nuclear Cardiol. ,838 Urolog/NuclMed.
07.10.11,08.10.11, 8:00,10:15,seminar   ,MCK16 Reproduct/Develop  ,906 ÚPMD
07.10.11,08.10.11, 8:00, 9:45,seminar   ,MCC8 Dis.RenalFunc./Oedem,422 Seminar Room
07.10.11,08.10.11,10:00,13:15,seminar   ,MCC8 Dis.RenalFunc./Oedem,422 Seminar Room
07.10.11,08.10.11,11:00,13:30,practice  ,MCK16 Reproduct/Develop  ,220 Seminar Room

# 3  
Old 01-31-2012
@Truk451: Thanks man, amazing script. I haven't heard of Perl but i'll definitely look in to it, looks like it might me more powerful than bash. Would you recommend me to stop learning bash and start trying to learn Perl instead?

I did find two problems though (I'm sorry if i didnt explain myself well enough).

The first one is that i have to remove the entire lines containing "4/1,3," and 4/10," as those classes are for another group (my classes are 4/2,4, and 4/1,2,3,4,)) (thats what i tried to do in my step 3). i tried running your script after removing them from the document, but that just messed it up.


The second: since the lesson starts and ends at the same date they need to be the same.
Code:
03.10.11,08.10.11, 8:00,10:15,seminar   ,MCK16 Reproduct/Develop  ,821 Gynaecology

will create an entry that starts at 08:00 on the 03.10.11 and end on 10:15 on the 08.10.11, thats a loong class even for medical school Smilie . However its easy for me to solve by importing the csv to a spreadsheet and just delete the column with the 2. date and insert a duplicate of the 1. date. Would be interesting to see the code though if you have time.

Anyways, thank you very much.

Best regards.
# 4  
Old 01-31-2012
Oh ok no problem. Here are the changes incorporated:
Code:
#!/usr/bin/perl -w

open (DATAFILE, $ARGV[0]) or die ("Could not open data file.");

print("Start Date, End Date, Start Time, End Time, Description, Subject, Location\n");

foreach $line (<DATAFILE>){
        $line =~ s/\s+$//;
        if (length($line) > 0) {
                if ( $line =~ /^Day:/ ) {
                      @lineArray = split("/",$line);
                      $date = $lineArray[1];
                } elsif ( $line =~ /^ ?[0-9]/ ) {
                      if ( $line =~ /(4\/1,3,)|(4\/10,)/ ) {
                           next;
                      } else {
                           print(join(',',$date,$date,substr($line,0,5),substr($line,7,5),substr($line,35,10),substr($line,46,25),substr($line,72)."\n"));
                      }
                }
        }
}

As for my recommendations, I think that Perl has more capabilities than bash shell scripting, but don't get me wrong: bash scripting is extremely powerful and useful. You should probably start with bash and if you are interested in interpreted programming languages you could learn Perl or Ruby or something. The reason why I used Perl here is the more complex line parsing you wanted to do. While it could definitely be done with a complex sed statement or something similar, ultimately the sed statement would use the same logic which can be seen in the Perl code, using the position of the data fields as the way to parse them from the line and printing them selectively in the output.

Your initial idea to replace the spaces with commas would not have worked because of the extra spaces in the fields for Subject and Location
# 5  
Old 02-01-2012
Thanks again man!

Ok i will follow your recommendations and keep continue learning bash and then later switching to a more advanced language (like pearl or python). Right now im trying to figure out your script and how it works as an exercise.

Thanks again mate!

Best regards
 
Login or Register to Ask a Question

Previous Thread | Next Thread

10 More Discussions You Might Find Interesting

1. UNIX for Beginners Questions & Answers

Convert a txt file to a CSV file

Hi , I have a Txt file which consist of 1000's of SOAP request and response and i want the file to be converted to a csv file like column a should have a soap request and column b should have the soap response . can someone assist me in achieving this please ? Thanks (2 Replies)
Discussion started by: kumarm8
2 Replies

2. Solaris

How to convert pdf file to txt?

Hello Unix gurus, I am learning unix. I have lots pdf data files. I need to convert them into txt files. Can you please guide me how to do that? Thanks in advance. Rao (1 Reply)
Discussion started by: raopatwari
1 Replies

3. Red Hat

How to convert TXT to PDF in RHEL 6?

Hello friends, I need to convert ASCII text to PDF on RHEL 6 so I did the below and could generate PDF but it has lot of junk/special characters. yum install enscript ghostscript enscript -p output.ps input.txt ps2pdf output.ps output.pdf So I download latest source of Ghostscript... (4 Replies)
Discussion started by: magnus29
4 Replies

4. Shell Programming and Scripting

Need script to convert TXT file into CSV

Hi Team, i have some script which give output in TXT format , need script to convert TXT file into CSV. Output.TXT 413. U-UU-LVDT-NOD-6002 macro_outcome_dist-8.0.0(v1_0_2) KK:1.2.494 (1234:333:aaa:2333:3:2:333:a) 414. U-UU-LVDT-NOD-6004 ... (10 Replies)
Discussion started by: Ganesh Mankar
10 Replies

5. Shell Programming and Scripting

txt file to CSV

hi.. I have a text file which looks likes this 2258 4569 1239 258 473 i need to convert it into comma seperated format eg:2258,4569,1239,258,437 pls help (8 Replies)
Discussion started by: born
8 Replies

6. Shell Programming and Scripting

Convert txt to csv

Hi - I am looking to convert the following text to csv. The columns may not always have data in them and they may have varying spaces but I still need to have a comma there anyway: Sample Data: ~~~~~~~ Name Email Location Phone Tom... (4 Replies)
Discussion started by: JPBovaird
4 Replies

7. Shell Programming and Scripting

Txt to csv convert

Hi, I was trying some split command to pull out values like "uid=abc,ou=INTERNAL,ou=PEOPLE" into a csv file. However because of erratic nature of occurrance of rows made me stopped. Could someone help me in this? and if someone has a one liner for this? The text file contain pattern like this... (14 Replies)
Discussion started by: john_prince
14 Replies

8. Shell Programming and Scripting

how to convert XLS to CSV and DOC/RTF to TXT

Hi, i don't know anything about PERL. Can anyone help me providing PERL scripts for 1. converting XLS to CSV (and vice-versa) 2. converting DOC/RTF to TXT Thanks much Prvn (1 Reply)
Discussion started by: prvnrk
1 Replies

9. Shell Programming and Scripting

AWK CSV to TXT format, TXT file not in a correct column format

HI guys, I have created a script to read 1 column in a csv file and then place it in text file. However, when i checked out the text file, it is not in a column format... Example: CSV file contains name,age aa,11 bb,22 cc,33 After using awk to get first column TXT file... (1 Reply)
Discussion started by: mdap
1 Replies

10. UNIX for Dummies Questions & Answers

How to convert PS files to txt file?

Hi, I need to convert PS files to txt file. I thought of using ps2ascii, but its not installed in my AIX box, any other option? (2 Replies)
Discussion started by: redlotus72
2 Replies
Login or Register to Ask a Question