Extract large list of substrings


 
# 1  
09-08-2008

I have a very long string (millions of characters).

I also have a file, thousands of rows long, that lists a start location and a length on each row:

Start Length
5 10
16 21
44 100
215 37
...

I'd like to extract the substring that corresponds to the start and length from each row of the list.

I tried using one large awk command:

awk '{print substr($1,5,10), "\n", substr($1,16,21) "\n", substr($1,44,100) "\n", substr($1,215,37)...}' infile > outfile &

But it seems to hang, likely because the Bash command line is too long.

Can you help me with a way to get the substrings out as rows?
# 2  
09-08-2008
Quote:
Originally Posted by dcfargo
I have a very long string (millions of characters).

Where do you have it? Is it in a file? In a variable?

Are there any newlines in the string?
Quote:
I also have a file, thousands of rows long, that lists a start location and a length on each row:

Start Length
5 10
16 21
44 100
215 37
...

I'd like to extract the substring that corresponds to the start and length from each row of the list.

I tried using one large awk command:

awk '{print substr($1,5,10), "\n", substr($1,16,21) "\n", substr($1,44,100) "\n", substr($1,215,37)...}' infile > outfile &

But it seems to hang, likely because the Bash command line is too long.

I have no problem extracting portions of a multimegabyte string using bash's parameter expansion:
Code:
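## Assuming the string is in 'infile'
string=$( < infile )
## Note: bash offsets count from 0 (awk's substr counts from 1);
## use "${string:start-1:length}" if the start positions are 1-based
while read -r start length
do
  printf "%s\n" "${string:$start:$length}"
done < /path/to/file/with/startpoints_and_lengths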
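
For comparison, the original awk idea can also be driven by the positions file instead of one huge command line. This is only a minimal sketch, not something tested on the poster's data: the scratch directory and the sample alphabet string are made up so that it runs on its own, and it assumes the start positions count from 1, as awk's substr() does.

```shell
# Hypothetical sample data in a scratch directory so the sketch runs on its own.
workdir=$(mktemp -d)
printf '%s\n' 'abcdefghijklmnopqrstuvwxyz' > "$workdir/filestring"
printf '%s\n' '1 3' '5 10' '24 3' > "$workdir/filenumbers"

# While reading the first file (NR == FNR), remember the string.
# For each row of the second file, print one substring per line.
# awk's substr() counts positions from 1.
awk 'NR == FNR { s = $0; next } { print substr(s, $1, $2) }' \
    "$workdir/filestring" "$workdir/filenumbers" > "$workdir/outfile"

cat "$workdir/outfile"
```

With the sample data this writes abc, efghijklmn and xyz, one per line. Because the positions are read from a file, the command line stays short no matter how many rows there are.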

# 3  
09-08-2008
Thanks. Let me see if I understand.

The string has no spaces or line breaks; it's just millions of characters one after the other. We'll call that file 'filestring'.

The list of numbers is in 'filenumbers' in the same directory.

Code:
string=$( <filestring )
while read start length
do
  printf "%s\n" "${string:$start:$length}"
done < filenumbers > outfile

Is that the correct command syntax?

Thanks so much.
# 4  
09-08-2008
Quote:
Originally Posted by dcfargo
Thanks. Let me see if I understand.

The string has no spaces or line breaks; it's just millions of characters one after the other. We'll call that file 'filestring'.

The list of numbers is in 'filenumbers' in the same directory.

Code:
string=$( <filestring )
while read start length
do
  printf "%s\n" "${string:$start:$length}"
done < filenumbers > outfile

Is that the correct command syntax?

That is correct if the string is in a file called filestring.

If it is already in a variable, use that variable instead of string.
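
A rough sketch of the variable case, assuming bash. The function name extract_substrings and the variable mystring are made up for illustration and do not come from the thread; the start/length rows are fed on standard input.

```shell
# extract_substrings is a hypothetical helper name, not from the thread.
# $1 is the string (already held in a variable); start/length rows arrive on stdin.
extract_substrings() {
  local str=$1 start length
  while read -r start length; do
    printf '%s\n' "${str:start:length}"   # bash offsets count from 0
  done
}

# Stand-in for the real multi-megabyte string.
mystring='abcdefghijklmnopqrstuvwxyz'
result=$(printf '%s\n' '0 3' '4 10' '23 3' | extract_substrings "$mystring")
printf '%s\n' "$result"
```

In real use the rows would come from the positions file, for example extract_substrings "$mystring" < filenumbers > outfile.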
# 5  
09-09-2008
I don't know what I'm doing wrong, but that syntax appears to be writing the entire string for each line in filenumbers instead of extracting the substring(s).
# 6  
09-09-2008
Quote:
Originally Posted by dcfargo
I don't know what I'm doing wrong, but that syntax appears to be writing the entire string for each line in filenumbers instead of extracting the substring(s).

No one else knows what you are doing wrong, either, because you didn't post the code you executed.

Nor did you make it clear whether you already have the string in a variable or whether it has to be read from a file.
# 7  
09-09-2008
Sorry. You know what I did? I had the wrong input file. Your code works great. I really appreciate all your help. My input file listed the start and the length of the whole string instead of the start and the length of the substring of interest.

Thanks again so much.
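
One way a mix-up like this could be caught before the extraction runs is to check each row against the length of the string. The following is only a sketch under assumptions: the scratch directory and sample data are invented, the offsets are taken to count from 0 as in bash, and a row is flagged only when start plus length runs past the end of the string, so it will not catch every bad row.

```shell
# Hypothetical sample data in a scratch directory; the second row repeats the
# mistake of giving the whole string's length instead of a substring length.
workdir=$(mktemp -d)
printf '%s\n' 'abcdefghijklmnopqrstuvwxyz' > "$workdir/filestring"
printf '%s\n' '0 3' '4 26' '20 6' > "$workdir/filenumbers"

string=$(cat "$workdir/filestring")

# Flag rows whose range runs past the end of the string
# (offsets counted from 0, as in bash's ${string:start:length}).
bad_rows=$(awk -v n="${#string}" '$1 + $2 > n { print FNR ": " $0 }' "$workdir/filenumbers")

if [ -n "$bad_rows" ]; then
  printf 'Rows running past the end of the %d-character string:\n%s\n' "${#string}" "$bad_rows"
fi
```

With the sample data only the second row (4 26) is reported, since 4 + 26 is more than the 26 characters available.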