Python - Combine all CSV Files in Folder

In this tutorial, we will demonstrate different Python-based methods for merging or combining numerous CSV data into a single file (this method also applies to text and other types of files). There will be a bonus lesson on how to quickly merge numerous CSV files for Linux. Finally, you can combine thousands of files with complete control over the imported data by converting the all-CSV files into the Pandas DataFrame and afterwards marking each row with the CSV file that it came from.

How to use Python to combine many identical CSV files Please take note that we presumptively presume that all files contain the same columns and information. Concatenating the CSV files in the Downloads folder using a short code sample.

Dataset

Five dummy sales files that CSV format that I'll be using in the next tutorial step may be found in this folder. Each file has precisely the same number of columns and exactly 1 million rows. For instance, the sales_recors_n1.csv file has the following format when opened.

The listdir() method inside the os Python module can be used to list the files' contents after you have downloaded them into a certain directory on your laptop.

In order to use this method, you must provide the directory's exact path, as seen below:

Take note of the Python list that this method produces, which contains the all files inside the sales csv directory. This is advantageous since it enables iterative file reading with the object.

  • Merge Multiple CSV Files

The first step's objective is to use Python to combine five CSV files into a single dataset with 5 million rows. I'll utilise the os or pandas packages to accomplish it. The complete Python script is as follows to accomplish that

Python - Combine all CSV Files in Folder

Let me now walk you through the code's action step by step.

  1. I'm importing the necessary Python packages first, then I'm specifying the location of the CSV files. I advise you to keep the all-CSV files we intend to combine in one folder. Our life will be easier as a result.
  2. Next, I'm making a file list by adding the path to each record that adheres to a particular naming pattern-in this case, all the documents that begin with sales records n-and then combining those file lists into one. Sometimes it's vital to choose files based on name because our working directory could potentially contain a lot of irrelevant files. Take note of how file list was produced using a list comprehension.

When we print file list's contents, we get the following:

Output

['/Users/anbento/Documents/sales/sales_csv/sales_records_n5.csv', '/Users/anbento/Documents/sales/sales_csv/sales_records_n4.csv', '/Users/anbento/Documents/sales/sales_csv/sales_records_n1.csv', '/Users/anbento/Documents/sales/sales_csv/sales_records_n3.csv', '/Users/anbento/Documents/sales/sales_csv/sales_records_n2.csv']

All of the files are now parse-ready and have their paths associated.

Although the list is unordered, keep in mind that if the sequence in which the files were combined is crucial to then, we will need to utilise the sorted () function when iterating through to the list.

3. we now start by making an empty csv list. we then use pandas' read csv () method to continuously read any CSV file in the file list and add those datasets to the csv list.

It should be noted that when a CSV file is parsed using read csv (), the datasets are automatically transformed to pandas DFs, therefore csv_list now contains 5 distinct pandas DFs. However, we are also attaching the following command in addition to parsing the files.

This adds a new column with the title of the source CSV file to each DF so that, when files are combined, it is clear which file is which.

The files are eventually combined into a singular file pandas DF by applying concat() method into it. A new ordered index would be created for csv merged if ignore index is set to True.

The final step is to transform csv merged from a pandas DF situated in the identical directory. To accomplish this, use the to csv() command. When index=False is used, it indicates that the index-containing column must not be added.

Output

['sales_records_full.csv',
 'sales_records_n1.csv',
 'sales_records_n2.csv',
 'sales_records_n3.csv',
 'sales_records_n4.csv',
 'sales_records_n5.csv']

It just only 7 lines of code plus a few seconds of run-time, with the exception of the package imports, to get this outcome.






Latest Courses