본문 바로가기

진리는어디에

[C++/C#] Reading CSV file

I've been fiddling around trying to make a game myself at home these days. It was planned to manage the game's metadata (item information, monster stats, quests, etc.) in Excel and save the result as a csv file for use. But... as always, it doesn't work.

한국어 버전 보기

Summary

Take the harvest and start with the summary. Most CSV file reading programming examples found on the Internet omit escaping for commas, so commas cannot be used in the CSV body or there is an error that the number of columns increases if used. In this post, UTF8 encoded data written in Excel Let's take a look at the part that is often overlooked when making a program that reads csv files.

  • If you save a UTF8 csv file in Excel, it is saved as UTF8-BOM (3 bytes are added in front of the file).
  • If you simply separate columns with a comma (,), you cannot use text including commas.

You've probably heard of UTF-8, but what is BOM? In csv, which is an abbreviation of Comma Separated Value, if you can't separate them with commas, how do you tell them to be separated? No matter what you think, just hold on for 3 seconds and keep reading. 

In this post, we are talking about what the above problems are and how to solve them. But I'm not curious about that and if you just need a sample code that works properly  [here go to ]

Example.csv

Before I explain, let's assume you have a csv file with the following data. The part you should pay attention to is the contents of the Mail_Message column. The first is written in 'Hangul', the second is a comma (,) is included in the value, and the third is a complex set of strings that are difficult to process in which quotation marks, commas, Korean, and special characters are combined.

Index Mail_Message  Mail_Expire_Day   Reward_Index Reward_Count
1 한글 이름 여덟 글자 14 100000 10
2 Comma,Seperated,Value 7 100001 30
3 "Event","한글","123","!@#" 365 100002 5

At first..

At first, I thought that the program to parse the csv file would simply read the file, split it into individual lines, and separate each column with a comma (,) as in the C++ code below..but..

std::ifstream file(filePath);
if(true == file.fail())
{
    // If the file read fails, bla bla...
}

std::string cell;
std::string	line;

while(std::getline(file, line))
{
    std::stringstream lineStream(line);
    while (std::getline(lineStream, cell, ','))
    {
        // Code to read the file...
    }
}

But... things are not so easy in the world.

반응형

Issue 1. UTF-8 BOM issue

I read the file, but the first few characters kept breaking. For example, in the case of the csv above, when the first line is read, it should come out like "Index,Mail_Message,Mail_Expire_Day,Item_Index,Item_Count", but the first text is broken. The broken text as shown below.

"뷁쀆??ndex,Mail_Message. .."

The reason is that Microsoft Excel defaults to utf8-bom(Byte Order Mark) is exported.
In fact, byte order mark is required for utf8. However, it is automatically inserted into the ut8 document generated by MS products.
So, if the first 3 bytes of the csv file created using Excel are 0xEF, 0xBB, 0xBF, it is considered a mark indicating bom and needs to be ignored.

Problem 2. Comma and quotation marks

If you open the above csv data in a text editor, it is as follows. Something has more text than our csv file.

Index,Mail_Message,Mail_Expire_Day,Item_Index,Item_Count 
1,한글이름 여덟글자,14,100001,10 
2,"Comma,Seperated,Value",14,100002,10000 
3,"""Event"",""한글"",""123"",""!@#""",365,100003,10

If you think that only commas are the standard for dividing columns without thinking, the 2nd data will have 7 columns and the 3rd data will increase to 8. If a delimiter character is used in the body, the number of columns increases arbitrarily. Therefore, when creating a csv, Excel uses the following rules to escape the delimiter (comma).

  • Comma,Seperated,Value text is enclosed in quotation marks (") around it, which is a single column, not separate columns separated by commas. indicates that it is.
Comma,Sperated,Value -> "Comma,Seperated,Value"
  • "Event","한글","123","!@#" text that contains both quotation marks and commas , and at the same time add a single quotation mark as an escape character before the original quotation mark to indicate that it is not an end-quote quotation mark.
"Event","한글","123","!@#" -> """Event"",""한글"",""123"",""!@#"""

I changed the code as follows to handle the above rules and treat the comma that appears before opening and closing the quotation marks as plain text, and delete the quotation marks added as an escape character.< /p>

for(size_t i = pos; i<str.size(); i++)
{
    switch(str[i])
    {
        case '"' :
            ++quotes;
            break;
        case ',' :
        case '\n' :
            if(0 == quotes || ('"' == prev && 0 == quotes % 2))
            {
                if(2 <= quotes) // Remove leading and trailing quotes
                {
                    cell = cell.substr(1);
                    cell = cell.substr(0, cell.size() - 1);
                }
                if(2 < quotes) // Remove escaped quotes within text
                {
                    std::string::size_type findPos = 0;
                    std::string::size_type offset = 0;
                    const std::string pattern("\"\"");
                    const std::string replace("\"");
                    while((findPos = cell.find(pattern, offset)) != std::string::npos)
                    {
                        cell.replace(cell.begin() + findPos, cell.begin() + findPos + pattern.size(), "\"");
                        offset = findPos + replace.size();
                    }
                }
				
                // If you come this far, you have finished reading one column.
                row.push_back(cell);
                cell = "";
                prev = 0;
                quotes = 0;
                if ('\n' == str[i])
                {
                    // If you come this far, read one line
                    rows.push_back(row);
                    row.clear();
                }
                continue;
            }
            break;
        default :
            break;
        }
        cell += prev = str[i];
    }
}

End

In most CSV parsing codes easily found on the Internet, commas cannot be used in the body because the comma delimiter is not escaped, or the number of columns increases if used. has a In this post, we briefly looked at how to escape UTF-8 BOM and delimiter commas.

If you ever need to use CSV, please write it in Excel.

Appendix 1. Full code

C++

  • CSVReader.h
#ifndef _CSV_READER_H_
#define _CSV_READER_H_

#include <map>
#include <vector>
#include <string>

class CSVReader
{
public:
    class iterator
    {
    public:
        iterator(std::vector<std::vector<std::string>>::const_iterator itr);

        const std::vector<std::string>& operator * () const;
        iterator& operator ++ ();
        iterator& operator ++ (int);
        iterator* operator -> ();
        bool operator != (const iterator& itr) const;
        bool operator == (const iterator& itr) const;

        const std::string& GetValue(int index);
    private:
		std::vector<std::vector<std::string>>::const_iterator row_iterator;
	};

	typedef iterator Row;
public:
	bool ReadFile(const std::string& filePath);
	bool ReadStream(const std::string& stream);
		
	size_t GetRowCount() const;
	const std::vector<std::string>& GetRow(size_t rowIndex) const;
	const std::string& GetCell(int rowIndex, int columnIndex) const;
	const std::vector<std::string>& operator[](size_t rowIndex) const;

	iterator begin() const;
	iterator end() const;
private:
	std::vector<std::vector<std::string>> rows;
};

#endif
  • CSVReader.cpp
#include "CSVReader.h"
#include <fstream>
#include <streambuf>
#include <algorithm>
#include <sstream>

CSVReader::iterator::iterator(std::vector<std::vector<std::string>>::const_iterator itr)
    : row_iterator(itr)
{
}

const std::string& CSVReader::iterator::GetValue(int index)
{
    const std::vector<std::string>& row = (*row_iterator);
    if (0 > index || row.size() <= index)
    {
        throw std::out_of_range("invalid csv column index:" + std::to_string(index));
    }
    return row[index];
}

const std::vector<std::string>& CSVReader::iterator::operator * () const
{
    return *row_iterator;
}

CSVReader::iterator& CSVReader::iterator::operator ++ ()
{
    row_iterator++;
    return *this;
}

CSVReader::iterator& CSVReader::iterator::operator ++ (int)
{
    row_iterator++;
    return *this;
}

CSVReader::iterator* CSVReader::iterator::operator -> ()
{
    return this;
}

bool CSVReader::iterator::operator != (const CSVReader::iterator& itr) const
{
    if (row_iterator != itr.row_iterator)
    {
        return true;
    }
    return false;
}

bool CSVReader::iterator::operator == (const CSVReader::iterator& itr) const
{
    if (row_iterator == itr.row_iterator)
    {
        return true;
    }
    return false;
}

bool CSVReader::ReadFile(const std::string& filePath)
{
    std::ifstream file(filePath);
    if (true == file.fail())
    {
        throw std::ifstream::failure("fail to open file(path:" + filePath + ")");
    }

    std::string stream((std::istreambuf_iterator<char>(file)), std::istreambuf_iterator<char>());
    return ReadStream(stream);
}

bool CSVReader::ReadStream(const std::string& str)
{
    std::stringstream stream(str);
    std::string line;
    std::string cell;

    char utf8bom[3] = {};
    std::streamsize size = stream.readsome(utf8bom, 3);
    if(3 == size)
    {
        if ((char)0xEF == utf8bom[0] && (char)0xBB == utf8bom[1] && (char)0xBF == utf8bom[2])
        {
            stream.seekg(3);
        }
    }
    
    // read data
    {
        std::vector<std::string> row;
        std::streampos pos = stream.tellg();
        int quotes = 0;
        char prev = 0;

        std::string cell;
        for (size_t i = pos; i < str.size(); i++)
        {
            switch (str[i])
            {
            case '"':
                ++quotes;
                break;
            case ',':
            case '\n':
                if (0 == quotes || ('"' == prev && 0 == quotes % 2))
                {
                    if (2 <= quotes)
                    {
                        cell = cell.substr(1);
                        cell = cell.substr(0, cell.size() - 1);
                    }
                    if (2 < quotes)
                    {
                        std::string::size_type findPos = 0;
                        std::string::size_type offset = 0;
                        const std::string pattern("\"\"");
                        const std::string replace("\"");
                        while ((findPos = cell.find(pattern, offset)) != std::string::npos)
                        {
                            cell.replace(cell.begin() + findPos, cell.begin() + findPos + pattern.size(), "\"");
                            offset = findPos + replace.size();
                        }
                    }
                    row.push_back(cell);
                    cell = "";
                    prev = 0;
                    quotes = 0;
                    if ('\n' == str[i])
                    {
                        rows.push_back(row);
                        row.clear();
                    }
                    continue;
                }
                break;
            default:
                break;
            }
            cell += prev = str[i];
        }
    }
    
    return true;
}

size_t CSVReader::GetRowCount() const
{
    return rows.size();
}

CSVReader::iterator CSVReader::begin() const
{
    return iterator(rows.begin());
}

CSVReader::iterator CSVReader::end() const
{
    return iterator(rows.end());
}

const std::vector<std::string>& CSVReader::GetRow(size_t rowIndex) const
{
    if (0 > rowIndex || rows.size() <= rowIndex)
    {
        throw std::out_of_range("out of range row num:" + std::to_string(rowIndex));
    }
    return rows[rowIndex];
}

const std::vector<std::string>& CSVReader::operator [] (size_t rowIndex) const
{
    return GetRow(rowIndex);
}

const std::string& CSVReader::GetCell(int rowIndex, int columnIndex) const
{
    const std::vector<std::string>& row = GetRow(rowIndex);

    if (0 > columnIndex || row.size() <= columnIndex)
    {
        throw std::out_of_range("invalid csv column index:" + std::to_string(columnIndex));
    }
    return row[columnIndex];
}
  • Program.cpp
#include <iostream>
#include "CSVReader.h"
#ifdef _WIN32
#include <Windows.h>
#endif

int main()
{
#ifdef _WIN32
    SetConsoleOutputCP(CP_UTF8);
#endif
    /* Example.csv
        Index, Message, ExpireDay, RewardIndex, RewardCount
        1, contents, 14, 100000, 10
        2, "Comma,Seperated,Text", 7, 100001, 30
        3, """Event"",""contents"",""123"",""!@#""    ", 365, 100002, 5
    */

    CSVReader reader;
    reader.ReadFile("Example.csv");

    for(int i = 0; i<reader.GetRowCount(); i++)
    {
        auto& row = reader.GetRow(i);
        for(int j=0; j<row.size(); j++)
        {
            std::cout << row[j] << " ";
        }
        std::cout << std::endl;
    }

    for (auto& rows : reader)
    {
        for (auto& row : rows)
        {
            std::cout << row << " ";
        }
        std::cout << std::endl;
    }

    std::cout << reader.GetCell(1, 1) << std::endl;
}

C#

Note: In the linked C# code, the first line is the column name, the second line is the column type, and the third line contains data. If you use the format used as an example in the post, the first line of data will be ignored. If you use the code, we recommend that you take note of this and modify it to fit your own format.

Appendix 2. Good article to read together

유익한 글이었다면 공감(❤) 버튼 꾹!! 추가 문의 사항은 댓글로!!