[Solved] Read CSV and the delimiter problem of dog blood, with solutions!

hello, my name is zhenguo

Today, I would like to share with you a very bloody problem encountered, the problem of reading the csv file about the delimiter.

1

After using pandas to read in the csv file, I found that the columns were not split, so I adjusted the sep parameter to \t, and found that it was still not split, try the space again, and try again\s + , that is, various combinations of whitespace characters, some of which can be separated, but some columns cannot be separated.

Thinking about it, can’t figure it out. Then go to check whether the csv file has a tool to automatically parse the delimiter. In fact, this tool is not difficult to do. Find the delimiter rule of each line again, and give different delimiters according to different probabilities.

I haven’t found it on the Internet, and generally there are no answers to the questions on the Internet. There are mainly two types of questions. Either the related field of the question is extremely niche, or the question is not a problem at all, and it may be a problem caused by an extremely low-level mistake caused by stupidity. .

Obviously, reading a csv column that cannot be divided is not a niche problem, so it should be caused by stupidity.

Sure enough, after I observed again and again and discussed in the group, Brother Harvard reminded me, I realized that the read-in file was not divided, that is, the data format of one row and one column, so the problem was in the read-in file.

2

As an extension, I would like to say that a criticized problem of csv files is caused by the delimiter.

The following file a.csv, the delimiter is a comma, you pay attention to the Hi,pythoner cell, its value contains a comma

ed43b7184ff487f81b072fad914e8b72.png

What happens when I read in this file using pandas:

import pandas as pd
pd.__version__ # '1.2.4'
pd.read_csv('a.csv', index_col=False)

After reading in, the value of the Hi,pythoner cell is truncated to Hi

2aaa385348b2eca9f03341c7d8372891.png

If there are more than 1 comma in multiple cells, an exception will be thrown because the column cannot be used for it. For this reason, read_csv also provides a parameter error_bad_lines, which is specially discarded. Lines with multiple commas are especially prone to such errors when there is a large amount of data. In order to read the data in the first time, error_bad_lines is often set to False, that is, such lines with multiple commas are discarded.

If the delimiter of the csv file is \t or other, it also faces the same problem. If the delimiter happens to appear in the cell, this kind of error is inevitable.

3

If your data happens to have a large number of delimited lines, this needs attention.

A safer way to do this is to replace the delimiter of the csv file that appears in the cell with other symbols, such as the delimiter is a comma, and the comma in the replacement cell is a space; such as \t , replace the \t of the cell with a comma.

In this way, after a replacement process, the problem of missing data and some rows being filtered will no longer occur.

I am carefully creating a short video of learning Python from scratch on the video account. At present, the audience has exceeded 11,000, and it has been broadcast to the 80th episode. Please pay attention to me – programmer zhenguo, this is the 80th episode:

Click to watch