I want to read some csv files from my folder and concatenate them into one big pandas dataframe. All of my csv file names end with a number, and I only want to read files whose numbers fall in the ranges 6~10, 16~20, and 26~30. My goal is to read the files iteratively. Here is my code so far:
import pandas as pd
data_one = pd.read_csv('Datafile-6.csv', header=None)
for i in range(7,11):
    data99 = pd.read_csv('Datafile-'+i+'*.csv', header=None) #this line needs work
    data_one = pd.concat([data_one, data99.iloc[:,1]],axis=1,ignore_index=True)
data_two = pd.read_csv('Datafile-16.csv', header=None)
for j in range(17,21):
    #Repeat similar process
What should I do about 'data99' such that 'data_one' contains columns from 'Datafile-6' through 'Datafile-10'?
The first five rows of data_one should look like this, after getting data from Datafiles 6-10.
0 1 2 3 4 5
0 -40.0 0.179836 0.179630 0.179397 0.179192 0.179031
1 -39.0 0.183696 0.183441 0.183204 0.182977 0.182795
2 -38.0 0.186720 0.186446 0.186191 0.185949 0.185762
3 -37.0 0.189490 0.189207 0.188935 0.188686 0.188475
4 -36.0 0.192154 0.191851 0.191569 0.191301 0.191086
Column 0 is included in all of the data files, so I'm only concatenating column 1 of all of the subsequent data files.
You need to use the glob module:
import glob, os
import pandas as pd

path = r'C:\YourFolder'  # path to folder with .csv files
all_files = glob.glob(path + "/*.csv")

list_ = []
for file_ in all_files:
    df = pd.read_csv(file_, index_col=None, header=0)
    # .any() collapses the one-row boolean Series; you can modify the list with the conditions you need
    if df['YourColumns'].tail(1).isin([6,7,8,9,10,16,17,18,19,20,26,27,28,29,30]).any():
        list_.append(df)
d_frame = pd.concat(list_)
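If you'd rather pick the files by the number at the end of the file name, as in the original question, here is a minimal sketch; it assumes the files are named Datafile-<n>.csv and sit in the working directory:

import pandas as pd

# one merged DataFrame per block of files: 6-10, 16-20, 26-30
merged = {}
for start in (6, 16, 26):
    base = pd.read_csv(f'Datafile-{start}.csv', header=None)
    for n in range(start + 1, start + 5):
        df = pd.read_csv(f'Datafile-{n}.csv', header=None)
        # column 0 is shared across files, so only keep column 1
        base = pd.concat([base, df.iloc[:, 1]], axis=1, ignore_index=True)
    merged[start] = base

data_one = merged[6]   # columns from Datafile-6 through Datafile-10
data_two = merged[16]  # columns from Datafile-16 through Datafile-20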
Related
I have a large number of .csv files in a folder, and all of them have the same column names. The code below merges all the .csv files, but I need to merge the first 10 .csv files into one DataFrame, then files 11 to 20 in the next step, and so on... Solutions 1 and 2 are suitable if the file names are numeric, but in my case the file names don't follow any pattern.
# Merge .csv files in one place
import glob
import os
import pandas as pd
path = r'D:\Course\Research\Data\2017-21'
print(path)
all_files = glob.glob(os.path.join(path, "*.csv"))
df_from_each_file = (pd.read_csv(f,encoding='utf8',error_bad_lines=False) for f in all_files)
merged_df = pd.concat(df_from_each_file)
Further to my comment above, here is a simpler solution.
All required CSV files are collected by glob. In its current state, the list is not sorted, but it can be sorted according to your requirements
The list of files is iterated in 10-file-chunks
Each chunk is read and concatenated together into the merged DataFrame: dfm
Do whatever you like with the DataFrame
The to_csv example uses a random 4-byte hex string to ensure uniqueness* over the output files
*Note: This is not guaranteed uniqueness, but will suffice with the 50 sample data files I was using.
Sample code:
import os
import pandas as pd
from glob import glob
dfm = pd.DataFrame()
files = glob(os.path.join('./csv2df', 'file*.csv')) # 50 CSV files
for i in range(0, len(files), 10):
    dfm = pd.concat(pd.read_csv(f) for f in files[i:i+10])
    # Do whatever you want with the merged DataFrame.
    print(dfm.head(10), dfm.shape)
    print('\n')
    # Write to CSV?
    dfm.to_csv(f'./csv2df/merged_{os.urandom(4).hex()}.csv', index=False)
Output:
The following is a sample output from the print statements:
col1 col2 col3 col4
0 file49 file49 file49 file49
1 data1.1 data1.2 data1.3 data1.4
2 data2.1 data2.2 data2.3 data2.4
3 data3.1 data3.2 data3.3 data3.4
4 data4.1 data4.2 data4.3 data4.4
5 data5.1 data5.2 data5.3 data5.4
0 file30 file30 file30 file30
1 data1.1 data1.2 data1.3 data1.4
2 data2.1 data2.2 data2.3 data2.4
3 data3.1 data3.2 data3.3 data3.4 (60, 4)
...
col1 col2 col3 col4
0 file14 file14 file14 file14
1 data1.1 data1.2 data1.3 data1.4
2 data2.1 data2.2 data2.3 data2.4
3 data3.1 data3.2 data3.3 data3.4
4 data4.1 data4.2 data4.3 data4.4
5 data5.1 data5.2 data5.3 data5.4
0 file42 file42 file42 file42
1 data1.1 data1.2 data1.3 data1.4
2 data2.1 data2.2 data2.3 data2.4
3 data3.1 data3.2 data3.3 data3.4 (60, 4)
CSV file list:
merged_5314ad49.csv
merged_5499929e.csv
merged_5f4e306a.csv
merged_74746bd8.csv
merged_b9def1d6.csv
Here's a suggestion that uses islice() from the standard library module itertools to fetch chunks of up to 10 files:
from pathlib import Path
from itertools import islice
import pandas as pd
csv_files = Path(r"D:\Course\Research\Data\2017-21").glob("*.csv")
while True:
    files = list(islice(csv_files, 10))
    if not files:
        break
    dfs = (pd.read_csv(file) for file in files)
    merged_df = pd.concat(dfs, ignore_index=True)
    # Do whatever you want to do with merged_df
    print(merged_df)
(I'm also using the standard library module pathlib because it's more convenient.)
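If the processing order matters, you can sort the paths first, but keep an iterator: if csv_files were a plain list, islice would return the same first 10 files on every pass and the loop would never finish. A drop-in replacement for the csv_files line above:

csv_files = iter(sorted(Path(r"D:\Course\Research\Data\2017-21").glob("*.csv")))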
I'm a newbie in Python and need help with this piece of code. I did a lot of searching to get to this stage but couldn't fix it on my own. Thanks in advance for your help.
What I'm trying to do: I have to compare 100+ csv files in a folder, and not all of them have the same number of columns or the same column names. So I'm trying to use Python to read the headers of each file and write them to a csv file in an output folder.
I got to this point but not sure if I'm on the right path even:
import pandas as pd
import glob
path = r'C:\Users\user1\Downloads\2016GAdata' # use your path
all_files = glob.glob(path + "/*.csv")
list1 = []
for filename in all_files:
    df = pd.read_csv(filename, index_col=None, header=0)
    list1.append(df)
frame = pd.concat(list1, axis=0, ignore_index=True)
print(frame)
thanks for your help!
You can create a dictionary whose keys are the filenames and whose values are the dataframe columns. Using this dictionary to create a dataframe gives you the filenames as the index and the column names as the values.
d = {}
for filename in all_files:
    df = pd.read_csv(filename, index_col=None, header=0)
    d[filename] = df.columns
frame = pd.DataFrame.from_dict(d, orient='index')
0 1 2 3
file1 Fruit Date Name Number
file2 Fruit Date Name None
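If you then want that overview written out to a csv file in your output folder, as the question asks, one line suffices (the output path below is just a placeholder):

frame.to_csv(r'C:\Users\user1\Downloads\2016GAdata\headers_overview.csv')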
I have many csv files in a directory, with two columns each
miRNA read_counts
miR1 10
miR1 5
miR2 2
miR2 3
miR3 100
I would like to sum read_counts if the miRNA id is the same.
Result:
miRNA read_counts
miR1 15
miR2 5
miR3 100
To do that I wrote a little script. However, I don't know how to loop it over all my csv files, so that I don't have to copy-paste file names and output names each time. Any help will be much appreciated. Thanks for the help!
import pandas as pd
df = pd.read_csv("modified_LC1a_miRNA_expressed.csv")
df_new = df.groupby('miRNA')['read_count'].sum()
print(df_new)
df_new.to_csv('sum_LC1a_miRNA_expressed.csv')
Try looking into the glob module.
from glob import glob
import os
import pandas as pd

path = "./your/path"
files = glob(os.path.join(path, "*.csv"))

dataframes = []
for file in files:
    df = pd.read_csv(file)
    # append each file's dataframe to the list
    dataframes.append(df)
Then, use pd.concat to join the dataframes and perform the groupby operation.
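For example, a minimal sketch of that second step, continuing from the dataframes list built in the loop above and assuming the column names from your script ('miRNA', 'read_count'); the output file name is just a placeholder:

combined = pd.concat(dataframes, ignore_index=True)      # join all files
summed = combined.groupby('miRNA')['read_count'].sum()   # sum per miRNA id
summed.to_csv('sum_all_miRNA_expressed.csv')              # placeholder output name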
EDIT 1:
Based on the request mentioned in the comment:
results = {}
for file in files:
    df = pd.read_csv(file)
    # perform operation
    df_new = df.groupby('miRNA')['read_count'].sum()
    results[file] = df_new
Not trying to steal the answer. I would have put this in a comment under @Asif Ali's answer if I had enough rep.
Assuming all input .csv files follow the format:
"modified_{rest_of_the_file_name}.csv"
And you want the outputs to be:
"sum_{same_rest_of_the_file_name}.csv"
import os
import glob
import pandas as pd

path = "./your/path"
files = glob.glob(os.path.join(path, "*.csv"))

for file in files:
    df = pd.read_csv(file)
    df_new = df.groupby('miRNA')['read_count'].sum()
    print(df_new)
    # swap the "modified" prefix for "sum" in the output file name
    df_new.to_csv(''.join(file.split('modified')[:-1]) +
                  'sum' +
                  ''.join(file.split('modified')[-1:]))
I'm reading multiple csv files and combining them into a single dataframe like below:
pd.concat([pd.read_csv(f, encoding='latin-1') for f in glob.glob('*.csv')],
ignore_index=False, sort=False)
Problem:
I want to add a column to the dataframe that doesn't exist in any of the csvs, based on the csv file name, for every csv file that gets concatenated to the dataframe. Any help will be appreciated.
glob.glob returns plain strings, so you can just add a column to every individual dataframe in a loop.
Assuming you have files df1.csv and df2.csv in your directory:
import glob
import pandas as pd
files = glob.glob('df*csv')
dfs = []
for file in files:
    df = pd.read_csv(file)
    df['filename'] = file
    dfs.append(df)
df = pd.concat(dfs, ignore_index=True)
df
a b filename
0 1 2 df1.csv
1 3 4 df1.csv
2 5 6 df2.csv
3 7 8 df2.csv
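If you only want the bare file name in that column rather than any directory prefix, a small variation of the loop above (same assumed df1.csv and df2.csv files) would be:

import glob
import os
import pandas as pd

dfs = []
for file in glob.glob('df*csv'):
    df = pd.read_csv(file)
    df['filename'] = os.path.basename(file)  # drops any leading directory path
    dfs.append(df)
df = pd.concat(dfs, ignore_index=True)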
I have multiple csv files in my local directory. Each filename contains some numbers, and some of those numbers identify the year the file covers. I need to add a year column to each file as I concatenate it, and while doing so I want to extract the year from the filename and insert it into that column. I'm using a regex to extract the year and concatenate it like 20 + 11 = 2011. Then I set the column's data type to int32.
import glob
import re
import pandas as pd

pd.concat(
    [
        pd.read_csv(f)
          .assign(year='20' + re.search('[a-z]+(?P<year>[0-9]{2})', f).group('year'))
          .astype({'year': 'int32'})
        for f in glob.glob('stateoutflow*[0-9].csv')
    ],
    ignore_index=True
)
I have tried to work on several csv files using glob, for example:
import glob
import pandas as pd
import numpy as np
import csv
# Read all csv files with same file name in the folder
filenames = sorted(glob.glob('./16_2018-02*.csv'))
for f in filenames:
    df = pd.read_csv(f, names=['Date','RSSI','Data','Code'],
                     index_col=None)
    # Slicing information
    df["ID"] = df["Data"].str.slice(0,2)
    df["X"] = df["Data"].str.slice(2,4)
    # Save the output data to csv with different name
    df.to_csv(f'{f[:-4]}-train.csv', index=False)
At the end of the code, I save each dataframe into a new csv file with a different name. Since I now have so many csv files to work with, I want to concatenate them without first writing each one to its own csv file. How should I do that?
Original dataset first 5 rows:
Date RSSI Data Code
2018-02-20T00:00:20.886+09:00 -99 1068ffd703d101ec77f425ea98b201 F2D5
2018-02-20T00:00:21.904+09:00 -95 103cffbc032901ee77f49dea98b301 F2D5
2018-02-20T00:00:22.415+09:00 -97 103cffbc032901ee77f49dea98b301 F2D5
2018-02-20T00:00:46.580+09:00 -96 10fdfda803ff01f477f49dfd98cb03 F2D1
2018-02-20T00:00:48.593+09:00 -96 101bfed3037401f577f49dfe98cd03 F2D6
After:
Date RSSI Data Code ID X
2018-02-20T00:00:20.886+09:00 -99 1068ffd703d101ec77f425ea98b201 F2D5 16 0.065384
2018-02-20T00:00:21.904+09:00 -95 103cffbc032901ee77f49dea98b301 F2D5 16 0.065340
2018-02-20T00:00:22.415+09:00 -97 103cffbc032901ee77f49dea98b301 F2D5 16 0.065340
2018-02-20T00:00:46.580+09:00 -96 10fdfda803ff01f477f49dfd98cb03 F2D1 16 0.065021
2018-02-20T00:00:48.593+09:00 -96 101bfed3037401f577f49dfe98cd03 F2D6 16 0.065051
Try the code below [for appending all the files into one]:
filenames = sorted(glob.glob('./16_2018-02*.csv'))
appended_data=[] #create a list
for f in filenames:
    df = pd.read_csv(f, names=['Date','RSSI','Data','Code'],
                     index_col=None)
    # Slicing information
    df["ID"] = df["Data"].str.slice(0,2)
    df["X"] = df["Data"].str.slice(2,4)
    appended_data.append(df) #append to the list
appended_data = pd.concat(appended_data, axis=1) #concat them together
#remove axis=1 if need to append vertically
appended_data is now a dataframe with all the files appended together, which you can then export to csv/excel.
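For example (the output file name below is just a placeholder; to_excel additionally needs openpyxl installed):

appended_data.to_csv('all_files_combined.csv', index=False)
# or: appended_data.to_excel('all_files_combined.xlsx', index=False)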