I have several pandas DataFrames of the same format, with five columns.
I would like to sum the values in each of these DataFrames using df.sum(). This creates a Series for each DataFrame, with one entry per column (five in total).
My problem is how to take these Series and create another DataFrame, with one column holding the filename and the other five columns holding the sums from df.sum().
import pandas as pd
import glob
batch_of_dataframes = glob.glob("*.txt")
newdf = []
for filename in batch_of_dataframes:
    df = pd.read_csv(filename)
    df['filename'] = str(filename)
    df = df.sum()
    newdf.append(df)
newdf = pd.concat(newdf, ignore_index=True)
This approach doesn't work, unfortunately. 'df['filename'] = str(filename)' throws a TypeError, and the new DataFrame newdf isn't built correctly.
How would one do this correctly?
How do you take a number of pandas.Series objects and create a DataFrame?
Try in this order:
Create an empty list, say list_of_series.
For every file:
load into a data frame, then save the sum in a series s
add an element to s: s['filename'] = your_filename
append s to list_of_series
Finally, concatenate (and transpose if needed):
final_df = pd.concat(list_of_series, axis = 1).T
Code
Preparation:
import numpy as np
import pandas as pd
l_df = [pd.DataFrame(np.random.rand(3,5), columns = list("ABCDE")) for _ in range(5)]
for i, df in enumerate(l_df):
    df.to_csv(str(i)+'.txt', index = False)
Files *.txt are comma separated and contain headers.
! cat 1.txt
A,B,C,D,E
0.18021800981245173,0.29919271590063656,0.09527248614484807,0.9672038093199938,0.07655003742768962
0.35422759068109766,0.04184770882952815,0.682902924462214,0.9400817219440063,0.8825581077493059
0.3762875793116358,0.4745731412494566,0.6545473610147845,0.7479829630649761,0.15641907539706779
And, indeed, the rest is quite similar to what you did. The difference is that I add the file name to the Series, not to the DataFrame; otherwise sum() would concatenate the file-name strings, once per row:
files = glob.glob('*.txt')
print(files)
['3.txt', '0.txt', '4.txt', '2.txt', '1.txt']
list_of_series = []
for f in files:
    df = pd.read_csv(f)
    s = df.sum()
    s['filename'] = f
    list_of_series.append(s)
final_df = pd.concat(list_of_series, axis = 1).T
print(final_df)
          A         B          C        D        E  filename
0    1.0675   2.20957    1.65058  1.80515  2.22058     3.txt
1  0.642805   1.36248  0.0237625  1.87767  1.63317     0.txt
2   1.68678   1.26363   0.835245  2.05305  1.01829     4.txt
3   1.22748   2.09256   0.785089  1.87852  2.05043     2.txt
4  0.910733  0.815614    1.43272  2.65527  1.11553     1.txt
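One side effect of adding a string to each Series is that every column of the transposed result has object dtype. A minimal sketch of one way to keep the sums numeric, assuming the frames are already loaded into a dict keyed by file name (the helper name summarize and the sample data are made up for illustration):

```python
import pandas as pd


def summarize(frames_by_name):
    """Return one row of column sums per frame, with the source name as the first column."""
    # One Series of sums per file; the dict keys become the columns of the frame.
    sums = {name: frame.sum(numeric_only=True) for name, frame in frames_by_name.items()}
    # Transpose so that each file is a row, then move the file names into a column.
    summary = pd.DataFrame(sums).T
    return summary.rename_axis("filename").reset_index()


# Hypothetical in-memory stand-ins for the *.txt files.
frames = {
    "0.txt": pd.DataFrame({"A": [1.0, 2.0], "B": [3.0, 4.0]}),
    "1.txt": pd.DataFrame({"A": [10.0, 20.0], "B": [30.0, 40.0]}),
}
summary = summarize(frames)
```

Because the file names are only attached after the sums are computed, the sum columns stay float64 and the filename column holds the strings.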
To answer this specific question:
#ThomasTu How do I go from a list of Series with 'Filename' as a
column to a dataframe? I think that's the problem---I don't understand
this
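It's essentially what you have now, but instead of appending to an empty list, you append to an empty dataframe. DataFrame.append has no inplace option, so newdf has to be reassigned on each iteration. Note also that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, where pd.concat is the replacement.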
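import pandas as pd
import glob
batch_of_dataframes = glob.glob("*.txt")
newdf = pd.DataFrame()
for filename in batch_of_dataframes:
    df = pd.read_csv(filename)
    s = df.sum()               # sum first, so the file name is not summed as well
    s['filename'] = filename   # then tag the Series with its source file
    # DataFrame.append was removed in pandas 2.0; concatenate a one-row frame instead
    newdf = pd.concat([newdf, s.to_frame().T], ignore_index=True)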
Related
I am trying to select a specific column, with the header "Average", from multiple csv files. Then take the "Average" column from each of those multiple csv files and merge them into a new csv file.
I left the comments in to show the other ways I tried to accomplish this:
procdir = r"C:\Users\ChromePnP\Desktop\exchange\processed"
collected = os.listdir(procdir)
flist = list(collected)
flist.sort()
#exclude first files in list
rest_of_files = flist[1:]
for f in rest_of_files:
    get_averages = pd.read_csv(f, usecols = ['Average'])
    #df1 = pd.DataFrame(f)
    # df2 = pd.DataFrame(rundata_file)
    #get_averages = pd.read_csv(f)
    #for col in ['Average']:
        #get_averages[col].to_csv(f_template)
    got_averages = pd.merge(get_averages, right_on = 'Average')
got_averages.to_csv("testfile.csv", index=False)
EDIT:
I was able to get the columns I wanted, and they will print. However now the saved file only has a single average column from the loop, instead of saving all the columns selected in the loop.
rest_of_files = flist[1:]
#f.sort()
print(rest_of_files)
for f in rest_of_files:
    get_averages = pd.read_csv(f)
    df1 = pd.DataFrame(get_averages)
    got_averages = df1.loc[:, ['Average']]
    print(got_averages)
f2_temp = pd.read_csv(rundata_file)
df2 = pd.DataFrame(f2_temp)
merge_averages = pd.concat([df2, got_averages], axis=1)
merge_averages.to_csv(rundata_file, index=False)
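Either use pd.merge, passing both the left and the right frame. Note that right_on must be paired with left_on (or left_index=True), otherwise pandas raises a MergeError:
got_averages = pd.merge(got_averages, get_averages, left_index=True, right_index=True, how='outer')
Or use the DataFrame.merge method:
got_averages = got_averages.merge(get_averages, left_index=True, right_index=True, how='outer')
Keep in mind that you need to initialize got_averages (as an empty dataframe, for instance) before using it in your for loop. Each 'Average' column should also be renamed to something unique (the file name, say) before merging, because repeated merges of identically named columns produce clashing _x/_y suffixes.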
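Since the goal is to place the 'Average' columns next to each other rather than to join on their values, pd.concat along axis=1 is a possible alternative to repeated merging. A minimal sketch, assuming the files have already been read into a dict keyed by file name (collect_column, the run1.csv/run2.csv names, and the sample values are hypothetical):

```python
import pandas as pd


def collect_column(frames_by_name, column="Average"):
    """Place the chosen column of each frame side by side, one output column per name."""
    picked = {name: frame[column] for name, frame in frames_by_name.items()}
    # With a dict as input, concat uses the keys as the new column labels.
    return pd.concat(picked, axis=1)


# Hypothetical stand-ins for the frames returned by pd.read_csv.
demo = {
    "run1.csv": pd.DataFrame({"Average": [1.0, 2.0], "Other": [0, 0]}),
    "run2.csv": pd.DataFrame({"Average": [3.0, 4.0], "Other": [0, 0]}),
}
averages = collect_column(demo)
```

The result can then be written once, after the loop, with averages.to_csv(..., index=False), which avoids overwriting the output on every iteration.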
I want to:
1. Read a file into a dataframe
2. Do some data manipulation, etc.
3. Copy one column from the dataframe
4. Append that column to a second dataframe
5. Repeat 1-4 until all files are read
My implementation is:
all_data = [[]] #list to store each set of values
for i in file_list:
    filepath = path + i
    df=pd.read_csv(filepath,sep='\t',header=None,names=colsList)
    #various data manipulation, melt, etc, etc, etc.
    all_data.append(df['value'])
df_all = pd.DataFrame(all_data)
df_all=df_all.T #Transpose
df_all.set_axis(name_list, axis=1, inplace=True) #fix the column names
How could this have been better implemented?
Problems:
the data in the python list is transposed (appended by rows not columns)
I couldn't find a way to append by columns or transpose the list (with python list or with pandas) that would work without an error :(
Thanks in advance...
If you keep the data in a dictionary, each entry becomes a column.
Every column needs a unique name, e.g. col1, col2, etc.
import pandas as pd
all_data = {}
all_data['col1'] = [1,2,3]
all_data['col2'] = [4,5,6]
all_data['col3'] = [7,8,9]
new_df = pd.DataFrame(all_data)
print(new_df)
Result:
   col1  col2  col3
0     1     4     7
1     2     5     8
2     3     6     9
The same with a for-loop.
I use io.StringIO only to simulate files in memory; in real code, pass the path to the file directly.
import pandas as pd
import io
file_data = {
    'file1.csv': '1\t101\n2\t102\n3\t103',
    'file2.csv': '4\t201\n5\t202\n6\t202',
    'file3.csv': '7\t301\n8\t301\n9\t201',
}
file_list = [
    'file1.csv',
    'file2.csv',
    'file3.csv',
]
# ---
all_data = {}
for number, i in enumerate(file_list, 1):
    df = pd.read_csv( io.StringIO(file_data[i]), sep='\t', header=None, names=['value', 'other'] )
    all_data[f'col{number}'] = df['value']
new_df = pd.DataFrame(all_data)
print(new_df)
You can also assign each new column directly:
new_df['column1'] = old_df['value']
import pandas as pd
import io
file_data = {
    'file1.csv': '1\t101\n2\t102\n3\t103',
    'file2.csv': '4\t201\n5\t202\n6\t202',
    'file3.csv': '7\t301\n8\t301\n9\t201',
}
file_list = [
    'file1.csv',
    'file2.csv',
    'file3.csv',
]
# ---
new_df = pd.DataFrame()
for number, i in enumerate(file_list, 1):
    df = pd.read_csv( io.StringIO(file_data[i]), sep='\t', header=None, names=['value', 'other'] )
    new_df[f'col{number}'] = df['value']
print(new_df)
I am trying to save multiple dataframes to csv in a loop using pandas, while keeping the name of the dataframe.
import pandas as pd
df1 = pd.DataFrame({'Col1':range(1,5), 'Col2':range(6,10)})
df2 = pd.DataFrame({'Col1':range(1,5), 'Col2':range(11,15)})
frames = [df1,df2]
for data in frames:
    data['New'] = data['Col1']+data['Col2']
for data in frames:
    data.to_csv('C:/Users/User/Desktop/{}.csv'.format(data))
This doesn't work, but the outcome I am looking for is for both dataframes to be saved in CSV format, to my desktop.
df1.csv
df2.csv
Thanks.
You just need to set the names of the CSV files, like so:
names = ["df1", "df2"]
for name, data in zip(names, frames):
    data.to_csv('C:/Users/User/Desktop/{}.csv'.format(name))
Hope this helps. Note that I did not use the format function, and that the files are written to the current working directory.
import pandas as pd
df1 = pd.DataFrame({'Col1':range(1,5), 'Col2':range(6,10)})
df2 = pd.DataFrame({'Col1':range(1,5), 'Col2':range(11,15)})
frames = [df1,df2]
for data in frames:
    data['New'] = data['Col1']+data['Col2']
n = 0
for data in frames:
    n = n + 1
    data.to_csv('df' + str(n) + ".csv")
In this loop:
for data in frames:
    data.to_csv('C:/Users/User/Desktop/{}.csv'.format(data))
You are looping over a list of DataFrame objects, so format(data) inserts each DataFrame's printed contents into the path rather than its variable name; a DataFrame does not know the name of the variable it is bound to.
Instead you could use enumerate() to get the indexes as well as the objects. Then you can use the indexes to format the string.
for idx, data in enumerate(frames):
    # add 1 so that the numbers start from 1 instead of 0
    data.to_csv('df{}.csv'.format(idx + 1))
Otherwise you can loop through your list just using the index like this:
for i in range(len(frames)):
    frames[i].to_csv('df{}.csv'.format(i + 1))
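If the names matter more than the position in the list, another option is to keep the frames in a dict keyed by the desired file name. A minimal sketch under that assumption (save_frames is a hypothetical helper, and a temporary directory stands in for the Desktop path):

```python
import tempfile
from pathlib import Path

import pandas as pd


def save_frames(frames_by_name, out_dir):
    """Write each frame to <out_dir>/<name>.csv and return the paths written."""
    out_dir = Path(out_dir)
    written = []
    for name, frame in frames_by_name.items():
        path = out_dir / f"{name}.csv"
        frame.to_csv(path, index=False)
        written.append(path)
    return written


frames = {
    "df1": pd.DataFrame({"Col1": range(1, 5), "Col2": range(6, 10)}),
    "df2": pd.DataFrame({"Col1": range(1, 5), "Col2": range(11, 15)}),
}
# A temporary directory is used here only so the example cleans up after itself.
with tempfile.TemporaryDirectory() as tmp:
    saved_names = [p.name for p in save_frames(frames, tmp)]
```

The dict keeps each name attached to its frame, so there is no second list to keep in sync.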
So I have 366 CSV files, and I want to copy the second column of each and write them all into a new CSV file. I need code for this job. I tried some code available here, but nothing works. Please help.
Assuming all the 2nd columns are the same length, you could simply loop through all the files. Read them, save the 2nd column to memory and construct a new df along the way.
filenames = ['test.csv', ....]
new_df = pd.DataFrame()
for filename in filenames:
    df = pd.read_csv(filename)
    second_column = df.iloc[:, 1]
    new_df[f'SECOND_COLUMN_{filename.upper()}'] = second_column
    del(df)
new_df.to_csv('new_csv.csv', index=False)
# Variant: collect the file names with glob (needs "import glob") instead of listing them by hand
filenames = glob.glob(r'D:/CSV_FOLDER' + "/*.csv")
new_df = pd.DataFrame()
for filename in filenames:
    df = pd.read_csv(filename)
    second_column = df.iloc[:, 1]
    new_df[f'SECOND_COLUMN_{filename.upper()}'] = second_column
    del(df)
new_df.to_csv('new_csv.csv', index=False)
This can be accomplished with glob and pandas:
import glob
import pandas as pd
mylist = [f for f in glob.glob("*.csv")]
df = pd.read_csv(mylist[0]) #create the dataframe from the first csv
df = pd.DataFrame(df.iloc[:,1]) #only keep 2nd column
for x in mylist[1:]: #loop through the rest of the csv files doing the same
    t = pd.read_csv(x)
    colName = pd.DataFrame(t.iloc[:,1]).columns
    df[colName] = pd.DataFrame(t.iloc[:,1])
df.to_csv('output.csv', index=False)
import glob
import pandas as pd
mylist = [f for f in glob.glob("*.csv")]
df = pd.read_csv(mylist[0]) #create the dataframe from the first csv
df = pd.DataFrame(df.iloc[:,0]) #index 0 keeps the 1st column; use 1 for the 2nd
for x in mylist[1:]: #loop through the rest of the csv files doing the same
    t = pd.read_csv(x)
    colName = pd.DataFrame(t.iloc[:,0]).columns
    df[colName] = pd.DataFrame(t.iloc[:,0])
df.to_csv('output.csv', index=False)
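Two things the loops above leave open are column-name clashes (files that share a header overwrite each other's column) and the cost of parsing columns that are then thrown away. A rough sketch of one way to handle both, assuming each file has a header row; second_columns is a hypothetical helper, and io.StringIO stands in for real files:

```python
import io
from pathlib import Path

import pandas as pd


def second_columns(sources):
    """sources maps a file name to a path or file-like object; returns one column per file."""
    columns = {}
    for name, source in sources.items():
        # usecols=[1] asks read_csv to parse only the second column.
        col = pd.read_csv(source, usecols=[1]).iloc[:, 0]
        # Name each output column after the file (without extension) to avoid clashes.
        columns[Path(name).stem] = col
    return pd.DataFrame(columns)


# In-memory stand-ins for two CSV files of different lengths.
sources = {
    "a.csv": io.StringIO("x,y\n1,10\n2,20\n"),
    "b.csv": io.StringIO("x,y\n3,30\n4,40\n5,50\n"),
}
combined = second_columns(sources)
```

Columns of unequal length are aligned on the row index, so the shorter ones are padded with NaN rather than raising an error.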
I'm reading multiple csv files and combining them into a single dataframe like below:
pd.concat([pd.read_csv(f, encoding='latin-1') for f in glob.glob('*.csv')],
ignore_index=False, sort=False)
Problem:
I want to add a column that doesn't exist in any csv (to the dataframe) based on the csv file name for every csv file that is getting concatenated to the dataframe. Any help will be appreciated.
glob.glob returns a list of plain strings, so you can just add a column to every individual dataframe in a loop.
Assuming you have files df1.csv and df2.csv in your directory:
import glob
import pandas as pd
files = glob.glob('df*csv')
dfs = []
for file in files:
    df = pd.read_csv(file)
    df['filename'] = file
    dfs.append(df)
df = pd.concat(dfs, ignore_index=True)
df
   a  b filename
0  1  2  df1.csv
1  3  4  df1.csv
2  5  6  df2.csv
3  7  8  df2.csv
I have multiple csv files in my local directory. Each filename contains some numbers, and some of those numbers identify the year the file covers. I need to add a year column to each file as I concatenate them, taking the year from the filename. I'm using a regex to extract the two-digit year and prepend '20' to it, so that '20' + '11' gives '2011'. Then, I'm setting the column's data type to int32.
import glob
import re
import pandas as pd

pd.concat(
    [
        pd.read_csv(f)
        .assign(year = '20' + re.search('[a-z]+(?P<year>[0-9]{2})', f).group('year'))
        .astype({'year' : 'int32'})
        for f in glob.glob('stateoutflow*[0-9].csv')
    ],
    ignore_index = True
)
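The year extraction can be checked on its own, without reading any files. A small sketch using the same pattern, with the assumptions that all years fall in the 2000s and that a missing match should raise an error (the helper name year_from_filename and the sample file names are made up for illustration):

```python
import re

# Same pattern as above: lowercase letters followed by a two-digit year.
YEAR_PATTERN = re.compile(r"[a-z]+(?P<year>[0-9]{2})")


def year_from_filename(filename):
    """Return the four-digit year encoded in a file name, assuming the 2000s."""
    match = YEAR_PATTERN.search(filename)
    if match is None:
        # Without this check, .group() on None would raise a less helpful AttributeError.
        raise ValueError(f"no two-digit year found in {filename!r}")
    return 2000 + int(match.group("year"))


# Hypothetical file names in the stateoutflow*.csv style.
years = [year_from_filename(f) for f in ["stateoutflow1112.csv", "stateoutflow0910.csv"]]
```

Returning an int from the helper also means the column needs no string concatenation before the cast to int32.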