split a file with unique sets


 
# 1  
Old 10-22-2008

This may sound like a trivial problem, but I still need some help:

I have a file with ids and I want to split it 'n' ways (could be any number) into files:

1
1
1
2
2
3
3
4
5
5

Let's assume 'n' is 3, and the same id cannot appear in two different partitions. The partitions might then look like (1,1,1), (2,2,3,3), (4,5,5).

Thanks guys,

- CB
# 2  
Old 10-22-2008
You're aware that a fixed set of ids allows only a limited number of combinations, right?
So, if you have 1 1 1 2 2 3 3 4 5 5 as in your example, and this happens more than once, the first partition could be (1,1,1), (2,2,3,3), (4,5,5) and a second occurrence of those ids could generate a partition like (1,1), (1,2,2,3,3), (4,5,5), right?
# 3  
Old 10-22-2008
The second combination you listed is incorrect because partitions 1 and 2 both contain id '1'.

Thanks,

- CB
# 4  
Old 10-22-2008
I don't understand what you want, then. Give me an example: if I have the same ID twice, how should the program handle it?
# 5  
Old 10-22-2008
I was hoping for a better solution, but here is a crude way that I thought of:

1. split the file 'n' ways (n=3 for this example):

part 1   part 2   part 3
1        2        3
1        2        4
1        3        5

2. If (size of orig file) % n = 10 % 3 > 0, then append the remaining ids to the last partition:

part 3
3
4
5
5

3. Compare part 1 with part 2 and see if any ids match. If they do, move the matching rows from part 2 to part 1. Then move on to the next pair of parts (part 2 and part 3) and do the same.

part 1
1
1
1

part 2
2
2
3
3

part 3
4
5
5
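
The three steps above can be sketched in Python. This is only a rough sketch, assuming the ids are already sorted and fit in memory as a list; the function name `split_unique` is made up for illustration.

```python
def split_unique(ids, n):
    """Split a sorted list of ids into n parts so no id sits in two parts."""
    size = len(ids) // n
    # Steps 1 and 2: n - 1 equal chunks, then a last chunk that also
    # takes whatever rows are left over.
    parts = [ids[i * size:(i + 1) * size] for i in range(n - 1)]
    parts.append(ids[(n - 1) * size:])
    # Step 3: walk the boundaries left to right. Rows at the front of a
    # part that repeat the last id of the previous non-empty part are
    # moved back into that previous part.
    last = 0  # index of the most recent non-empty part
    for k in range(1, n):
        while parts[k] and parts[last] and parts[k][0] == parts[last][-1]:
            parts[last].append(parts[k].pop(0))
        if parts[k]:
            last = k
    return parts


print(split_unique([1, 1, 1, 2, 2, 3, 3, 4, 5, 5], 3))
# [[1, 1, 1], [2, 2, 3, 3], [4, 5, 5]]
```

Note that a part can end up empty if one id fills more than a whole chunk, so the result is not guaranteed to be balanced.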
Hopefully, someone will present a sleeker solution with some syntax.

Thanks,

- CB
# 6  
Old 10-22-2008
By the way, the original file is sorted by id.
# 7  
Old 10-22-2008
The OP wants all the 1's in a single file, 2's in a single file possibly with all 3's in the same file as well.

The problem is you have to know the split count as well as the complete key list and count of unique keys and how to group them before you attempt a split. I would create a list of unique key fields, divide the count by 3 and let any extras fall into the last split.
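
A minimal sketch of that idea in Python, assuming the file is sorted by key and small enough to hold in a list. The name `split_by_keys` is hypothetical, and `n` stands in for the split count (3 in this thread).

```python
from itertools import groupby


def split_by_keys(ids, n):
    """Split sorted ids into n parts, dividing the unique keys evenly."""
    # One list of rows per distinct key, in input order.
    groups = [list(rows) for _, rows in groupby(ids)]
    per_part = len(groups) // n  # unique keys per split
    parts = []
    for i in range(n):
        start = i * per_part
        # The last split takes any extra keys.
        end = (i + 1) * per_part if i < n - 1 else len(groups)
        parts.append([row for grp in groups[start:end] for row in grp])
    return parts


print(split_by_keys([1, 1, 1, 2, 2, 3, 3, 4, 5, 5], 3))
# [[1, 1, 1], [2, 2], [3, 3, 4, 5, 5]]
```

With five keys and three splits, the two extra keys land in the last split, so the parts come out with 3, 2 and 5 rows.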

The problem with this is that you can get splits of enormously different sizes depending on how skewed the distribution of keys is in the data file. It defeats splitting altogether - IMO. And what happens when you ask for more splits than there are keys?

The only thing that makes sense to me is a one-to-one split (one distinct key per file), or leaving everything in one big file.
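
For what it's worth, a one-to-one split might look like the sketch below. It assumes the ids arrive sorted (otherwise a key's file would be overwritten when the key reappears); the function name and the `part_<key>.txt` naming are invented for the example.

```python
import os
from itertools import groupby


def split_one_key_per_file(ids, out_dir):
    """Write each distinct id of a sorted sequence to its own file."""
    paths = []
    for key, rows in groupby(ids):
        # Hypothetical naming scheme: part_<key>.txt inside out_dir.
        path = os.path.join(out_dir, f"part_{key}.txt")
        with open(path, "w") as fh:
            fh.writelines(f"{row}\n" for row in rows)
        paths.append(path)
    return paths
```

Calling `split_one_key_per_file([1, 1, 1, 2, 2, 3, 3, 4, 5, 5], "out")` on an existing directory `out` would create five files, one per id.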
 