Quote:
Originally Posted by
ilikecows
Even in a csv spreadsheet, it's still arbitrary, but it wouldn't make much sense to use anything other than a comma.
In a *c*sv file (which is called "comma-separated" for a reason) this is right. But then, in a comma-separated file there is no need to find out the delimiter, which is what the thread opener asked for.
Quote:
Originally Posted by
ilikecows
A delimiter is arbitrary in the sense that any character can be used, but it's not in the sense that if the character doesn't actually mark anything, it's not a delimiter.
This is a misunderstanding: "doesn't [actually] mark anything" means you second-guess what exactly establishes a meaning. Consider:
Let's say the pipe character is used as the delimiter and several of the fields delimited this way are empty. Do these empty fields carry useful information or not?
Furthermore (sorry, this gets somewhat philosophical), "meaning" is not an inherent quality at all. The string "abc" might have a meaning or not, depending on what we agree establishes meaning, on context, and so on.
Your argument comes down to "plausibility". While I agree with you that limiting your search to plausible or obvious solutions usually helps to solve real-world problems faster, it simply doesn't help if you are trying to find a generalized solution, as in "write a script to find the delimiter".
Consider the string "a||b||c": does this mean three fields, "a", "b" and "c", delimited by a double pipe char or does it mean 5 fields, two of them empty? Both variants would be plausible enough, both might be correct - or wrong, depending on the intention of the one who wrote the line. But this information cannot be discerned from the file alone at all. You will need some additional information - context - to do so.
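To make the ambiguity concrete, here is a small sketch in Python (my choice of language, nothing in this thread prescribes it). It splits the very same line both ways, and neither result is "wrong" as far as the tool is concerned:

```python
# The same line, read under two different assumptions about the delimiter.
line = "a||b||c"

# Assumption 1: a single pipe delimits, so two of the fields are empty.
single = line.split("|")

# Assumption 2: a double pipe delimits, so there are no empty fields.
double = line.split("||")

print(len(single), single)  # 5 ['a', '', 'b', '', 'c']
print(len(double), double)  # 3 ['a', 'b', 'c']
```

Which of the two readings is the intended one cannot be derived from the line itself. It has to come from whoever wrote the file.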
Quote:
The | is the only delimiter because it is the only character that is actually marking the beginning or end of a unit of data.
Again, this is an appeal to plausibility. Everything can be considered "data": "afie" or "D1" is (or can be) as much data as "afie|D1" or any other substring you extract from this line. Whether it is data or not depends on your ability to derive meaning from it. Again: context.
If I give you a succession of characters, say "R-O-T", is it data? In other words, does it have a meaning? As long as you don't have additional information you can't decide this question at all. For instance, if you know we are talking in English then this would constitute a word (a verb) and have a meaning. If you know that we are talking in German it would also have a meaning, but a different one ("rot" means "red" and is an adjective). And if you know we are talking in Italian it would have no meaning at all, as there is no word "rot" in Italian. It would be some garbled transmission in this case. This means you need to have (or need to assume) some context (the language) to decide if this string is data or not.
The human brain is very good at finding (or constructing) patterns, real ones or, in some pathological cases like that of the mathematician John Nash, imagined ones. Still, finding a pattern is not discovering some inherent quality of the presented data but imposing some organization on the received information. This organization is imposed on the information from outside and is therefore what I said: arbitrary. None of these organizations is "better" or "more correct" than any other.
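Here is a rough sketch of what a naive "find the delimiter" script runs into. The function name `candidate_delimiters` and the three sample lines are made up for illustration (only "afie|D1" comes from this thread), and the rule "a delimiter occurs equally often in every line" is just one assumed heuristic among many:

```python
def candidate_delimiters(lines):
    """Return every character that occurs equally often in each line.

    This is only one possible heuristic. It knows nothing about meaning.
    """
    candidates = []
    for ch in sorted(set("".join(lines))):
        # Collect how often this character appears in each line.
        counts = {line.count(ch) for line in lines}
        # One distinct count means the character is "consistent".
        if len(counts) == 1:
            candidates.append(ch)
    return candidates


# Hypothetical sample data, modelled on the "afie|D1" line.
lines = ["afie|D1", "bfie|D2", "cfie|D3"]
print(candidate_delimiters(lines))  # ['D', 'e', 'f', 'i', '|']
```

By this rule "D", "e", "f" and "i" qualify just as well as the pipe. Picking "|" out of that list is a decision you bring to the file, not something the script reads out of it.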
bakunin