Adding a column to dataframe while reading csv files [pandas] - python

I'm reading multiple csv files and combining them into a single dataframe like below:
pd.concat([pd.read_csv(f, encoding='latin-1') for f in glob.glob('*.csv')],
ignore_index=False, sort=False)
Problem:
I want to add a column (one that doesn't exist in any of the csv files) to the dataframe, filled with the source csv file name, for every csv file that gets concatenated. Any help will be appreciated.

glob.glob returns plain strings, so you can just add a column to every individual dataframe in a loop.
Assuming you have files df1.csv and df2.csv in your directory:
import glob
import pandas as pd

files = glob.glob('df*csv')
dfs = []
for file in files:
    df = pd.read_csv(file)
    df['filename'] = file
    dfs.append(df)

df = pd.concat(dfs, ignore_index=True)
df
   a  b filename
0  1  2  df1.csv
1  3  4  df1.csv
2  5  6  df2.csv
3  7  8  df2.csv
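For completeness, the same idea folds back into the asker's original one-liner via DataFrame.assign; a minimal sketch, assuming the same latin-1 encoded files:
import glob
import pandas as pd

df = pd.concat(
    [pd.read_csv(f, encoding='latin-1').assign(filename=f) for f in glob.glob('*.csv')],
    ignore_index=True
)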

I have multiple csv files in my local directory. Each filename contains some numbers, and some of those numbers identify the year the file covers. I need to add a year column to each file as I concatenate it, extracting the year information from the filename with a regex and prefixing it with '20' (so 11 becomes 2011). Then I set the column's data type to int32.
import glob
import re
import pandas as pd

df = pd.concat(
    [
        pd.read_csv(f)
          .assign(year='20' + re.search('[a-z]+(?P<year>[0-9]{2})', f).group('year'))
          .astype({'year': 'int32'})
        for f in glob.glob('stateoutflow*[0-9].csv')
    ],
    ignore_index=True
)
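One caveat: re.search returns None when a filename doesn't match the pattern, which surfaces as a cryptic AttributeError inside the comprehension. A minimal sketch of a more defensive helper (year_from_name is a hypothetical name, not part of the original answer):
import re

def year_from_name(fname):
    # fail with a clear message instead of AttributeError on odd names
    m = re.search(r'[a-z]+(?P<year>[0-9]{2})', fname)
    if m is None:
        raise ValueError(f'no two-digit year found in {fname!r}')
    return int('20' + m.group('year'))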

Related

merge excel files into one based on specific columns

I need to merge multiple excel files based on a specific column. Every file has two columns, id and value, and I need to merge the values from all files into one file, next to each other. I tried this code, but it merged all the columns:
cwd = os.path.abspath('/path/')
files = os.listdir(cwd)
df = pd.DataFrame()
for file in files:
    if file.endswith('.xlsx'):
        df = df.append(pd.read_excel('/path/' + file), ignore_index=True)
df.head()
df.to_excel('/path/merged.xlsx')
but got all values into a single column like
1 504.0303111
2 1587.678968
3 1437.759643
4 1588.387983
5 1059.194416
1 642.4925851
2 459.3774304
3 1184.210851
4 1660.24336
5 1321.414708
and I need values stored like
1 504.0303111 1 670.9609316
2 1587.678968 2 459.3774304
3 1437.759643 3 1184.210851
4 1588.387983 4 1660.24336
5 1059.194416 5 1321.414708
One way is to append the DataFrames to a list in the loop and concatenate along the columns after the loop:
cwd = os.path.abspath('/path/')
files = os.listdir(cwd)
tmp = []
for file in files:  # iterate over all files, not files[1:]
    if file.endswith('.xlsx'):
        tmp.append(pd.read_excel('/path/' + file))
df = pd.concat(tmp, axis=1)
df.to_excel('/path/merged.xlsx')
But I feel like the following code would work better for you, since it doesn't duplicate the id columns and only adds the value columns as new columns to a DataFrame df in the loop:
cwd = os.path.abspath('/path/')
files = [file for file in os.listdir(cwd) if file.endswith('.xlsx')]
df = pd.read_excel('/path/' + files[0])
for i, file in enumerate(files[1:], 1):
    df[f'value{i}'] = pd.read_excel('/path/' + file).iloc[:, 1]
df.to_excel('/path/merged.xlsx')
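Note that both snippets above align rows purely by position, which silently misaligns values if the ids appear in a different order in some file. A minimal sketch that joins on the id column instead, assuming every file really has columns named id and value:
from functools import reduce
import os
import pandas as pd

cwd = os.path.abspath('/path/')
files = [f for f in os.listdir(cwd) if f.endswith('.xlsx')]

# give each file's value column a distinct name, then merge on id
frames = [
    pd.read_excel(os.path.join(cwd, f)).rename(columns={'value': f'value{i}'})
    for i, f in enumerate(files)
]
df = reduce(lambda left, right: left.merge(right, on='id'), frames)
df.to_excel(os.path.join(cwd, 'merged.xlsx'), index=False)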

How do I bring the filename into the data frame with read_excel?

I have a directory of excel files that will continue to grow with weekly snapshots of the same data fields. Each file has a date stamp added to the file name (e.g. "_2021_09_30").
I have figured out how to read all of the excel files into a python data frame using the code below:
import os
import pandas as pd

cwd = os.path.abspath('NETWORK DRIVE DIRECTORY')
files = os.listdir(cwd)
df = pd.DataFrame()
for file in files:
    if file.endswith('.xlsx'):
        df = df.append(pd.read_excel(cwd + "/" + file), ignore_index=True)
df.head()
Since these files are snapshots of the same data fields, I want to be able to track how the underlying data changes from week to week. So I would like to add/include a column that has the filename so I can incorporate the date stamp in downstream analysis.
Any thoughts? Thank you in advance.
Welcome to StackOverflow! I agree with the comments that it's not exactly clear what you're looking for, so maybe clearing that up will help us be more helpful.
For example, with the filename "A_FILENAME_2020-01-23", do you want to use the name "A_FILENAME", or "A_FILENAME_2020-01-23"? Or are you not sure, because you're trying to think through how to track this downstream?
If the latter, this is what you would do to add a new column:
for file in files:
    if file.endswith('.xlsx'):
        tmp = pd.read_excel(cwd + "/" + file)
        tmp['filename'] = file
        df = df.append(tmp, ignore_index=True)
This would allow you to search the table by the start of the 'filename' column, and pull the discrete data of each snapshot of the file side by side. Unfortunately, this is a LOT of data.
If you ONLY want to store differences, you can use the .drop_duplicates function, dropping based on a unique value that tells you whether a row is new, modified, or deleted: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop_duplicates.html
But if you don't have a unique identifier for rows, that makes this a very tough engineering problem. Do you have a unique identifier you can use as your diffing strategy?
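A minimal sketch of that differences-only idea, assuming an id column uniquely identifies rows across snapshots (old and new are hypothetical stand-ins for two weekly frames):
import pandas as pd

# hypothetical weekly snapshots sharing an 'id' key
old = pd.DataFrame({'id': [1, 2, 3], 'value': [10, 20, 30]})
new = pd.DataFrame({'id': [1, 2, 3], 'value': [10, 25, 30]})

# keep=False drops rows present in both snapshots, leaving only
# rows that were added, removed, or modified between the two
changed = pd.concat([old, new]).drop_duplicates(keep=False)
print(changed)
#    id  value
# 1   2     20
# 1   2     25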
Extra code to split the filename up into separate columns for easier filtering later on (no harm in adding these; the more columns the better, I think):
import os
from datetime import datetime

# assumes filenames end with a "_YYYY_MM_DD" stamp before the extension
stem = os.path.splitext(file)[0]
tmp['filename_stripped'] = stem[:-11]  # name without the date stamp
tmp['filename_date'] = datetime.strptime(stem[-10:], "%Y_%m_%d")
You could add an additional column to the dataframe, modifying your code like this:
temp = pd.read_excel(cwd + "/" + file)
temp['date'] = file[-11:]
df = df.append(temp, ignore_index=True)
You can use glob to easily combine xlsx or csv files into one dataframe. You just have to paste your folder's absolute path where it says "/xlsx_path". You can also change read_excel to read_csv if you have csv files.
import pandas as pd
import glob
all_files = glob.glob(r'/xlsx_path' + "/*.xlsx")
file_list = [pd.read_excel(f) for f in all_files]
all_df = pd.concat(file_list, axis=0, ignore_index=True)
Alternatively you can use the one-liner below:
all_df = pd.concat(map(pd.read_excel, glob.glob('/xlsx_path/*.xlsx')))
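And if you also want the filename column discussed above in this style, a minimal sketch combining the two:
all_df = pd.concat(
    (pd.read_excel(f).assign(filename=f) for f in glob.glob('/xlsx_path/*.xlsx')),
    ignore_index=True
)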
Not sure what you really want, but regarding tracking changes: say you have 2 excel files; you can track changes between them as follows:
df1 = pd.read_excel("file-1.xlsx")
df1
  values
0     aa
1     bb
2     cc
3     dd
4     ee

df2 = pd.read_excel("file-2.xlsx")
df2
  values
0     aa
1     bb
2     cc
3    ddd
4      e
...and generate a new dataframe holding the rows that changed between your 2 files:
df = pd.concat([df1, df2])
df = df.reset_index(drop=True)
new_df = df.groupby(list(df.columns))
diff = [x[0] for x in new_df.groups.values() if len(x) == 1]
df.reindex(diff)
Output:
  values
3     dd
8    ddd
9      e
4     ee
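Since pandas 1.1 there is also DataFrame.compare, which reports changed cells directly (both frames must have the same shape and index):
df1.compare(df2)
#   values
#     self other
# 3     dd   ddd
# 4     ee     e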

Python: read through multiple but not all csv files in my folder

I want to read some csv files from my folder and concatenate them into a big pandas dataframe. All of my csv files end with a number, and I only want to read files whose trailing number is in the ranges 6-10, 16-20, or 26-30. My goal is to read the files iteratively. Attached is my code so far:
import pandas as pd

data_one = pd.read_csv('Datafile-6.csv', header=None)
for i in range(7, 11):
    data99 = pd.read_csv('Datafile-' + i + '*.csv', header=None)  # this line needs work
    data_one = pd.concat([data_one, data99.iloc[:, 1]], axis=1, ignore_index=True)

data_two = pd.read_csv('Datafile-16.csv', header=None)
for j in range(17, 21):
    # repeat a similar process
What should I do about 'data99' such that 'data_one' contains columns from 'Datafile-6' through 'Datafile-10'?
The first five rows of data_one should look like this, after getting data from Datafiles 6-10.
0 1 2 3 4 5
0 -40.0 0.179836 0.179630 0.179397 0.179192 0.179031
1 -39.0 0.183696 0.183441 0.183204 0.182977 0.182795
2 -38.0 0.186720 0.186446 0.186191 0.185949 0.185762
3 -37.0 0.189490 0.189207 0.188935 0.188686 0.188475
4 -36.0 0.192154 0.191851 0.191569 0.191301 0.191086
Column 0 is included in all of the data files, so I'm only concatenating column 1 of all of the subsequent data files.
You need to use the glob module:
import glob, os
import pandas as pd

path = r'C:\YourFolder'  # path to the folder with .csv files
all_files = glob.glob(path + "/*.csv")
list_ = []
for file_ in all_files:
    df = pd.read_csv(file_, index_col=None, header=0)
    # .item() pulls the single boolean out of the one-row Series;
    # modify the list with the conditions you need
    if df['YourColumns'].tail(1).isin([6, 7, 8, 9, 10, 16, 17, 18, 19, 20, 26, 27, 28, 29, 30]).item():
        list_.append(df)
d_frame = pd.concat(list_)
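For what the asker actually describes (building data_one from Datafiles 6 through 10 by filename), no wildcard is needed at all; a minimal sketch that fixes the string-building problem with an f-string:
import pandas as pd

parts = [pd.read_csv('Datafile-6.csv', header=None)]
for i in range(7, 11):
    # an f-string avoids the str + int TypeError from the question
    df = pd.read_csv(f'Datafile-{i}.csv', header=None)
    parts.append(df.iloc[:, 1])  # column 0 is shared, so keep column 1 only
data_one = pd.concat(parts, axis=1, ignore_index=True)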

Find if a column value exists in multiple dataframes

I have 4 excel files - 'a1.xlsx','a2.xlsx','a3.xlsx','a4.xlsx'
The format of the files are same
for eg a1.xlsx looks like:
id code name
1 100 abc
2 200 zxc
... ... ...
I have to read these files into pandas dataframes and check whether the same value of the code column exists in multiple excel files or not.
Something like this:
if code=100 exists in 'a1.xlsx' and 'a3.xlsx', and code=200 exists only in 'a1.xlsx',
the final dataframe should look like:
code filename
100 a1.xlsx,a3.xlsx
200 a1.xlsx
... ....
and so on
I have all the files in a directory and tried to iterate over them in a loop:
import pandas as pd
import os

x = next(os.walk('path/to/files/'))[2]  # list all files in directory
os.chdir('path/to/files/')
for i in range(0, len(x)):
    df = pd.read_excel(x[i])
How to proceed? any leads?
Use:
import os
import glob
import pandas as pd

# get all filenames
files = glob.glob('path/to/files/*.xlsx')
# list comprehension, assigning a new column with each source filename
dfs = [pd.read_excel(fp).assign(filename=os.path.basename(fp).split('.')[0]) for fp in files]
# one big df from the list of dfs
df = pd.concat(dfs, ignore_index=True)
# join all filenames sharing the same code
df1 = df.groupby('code')['filename'].apply(', '.join).reset_index()
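With the example from the question (code 100 in a1.xlsx and a3.xlsx, code 200 only in a1.xlsx), df1 would come out roughly as below; note the extensions are gone because of split('.')[0]:
   code filename
0   100   a1, a3
1   200       a1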

Create a new dataframe out of dozens of df.sum() series

I have several pandas DataFrames of the same format, with five columns.
I would like to sum the values of each one of these dataframes using df.sum(). This creates a Series per DataFrame, still with the same 5 labels.
My problem is how to take these Series and create another DataFrame, with one column holding the filename and the other five columns holding the sums from df.sum().
import pandas as pd
import glob

batch_of_dataframes = glob.glob("*.txt")
newdf = []
for filename in batch_of_dataframes:
    df = pd.read_csv(filename)
    df['filename'] = str(filename)
    df = df.sum()
    newdf.append(df)
newdf = pd.concat(newdf, ignore_index=True)
Unfortunately this approach doesn't work: df['filename'] = str(filename) throws a TypeError, and the new dataframe newdf doesn't come out correctly.
How would one do this correctly?
How do you take a number of pandas.Series objects and create a DataFrame?
Try this, in order:
1. Create an empty list, say list_of_series.
2. For every file:
   - load it into a data frame, then save the sum in a series s
   - add an element to s: s['filename'] = your_filename
   - append s to list_of_series
3. Finally, concatenate (and transpose if needed):
final_df = pd.concat(list_of_series, axis = 1).T
Code
Preparation:
import numpy as np
import pandas as pd

l_df = [pd.DataFrame(np.random.rand(3, 5), columns=list("ABCDE")) for _ in range(5)]
for i, df in enumerate(l_df):
    df.to_csv(str(i) + '.txt', index=False)
Files *.txt are comma separated and contain headers.
! cat 1.txt
A,B,C,D,E
0.18021800981245173,0.29919271590063656,0.09527248614484807,0.9672038093199938,0.07655003742768962
0.35422759068109766,0.04184770882952815,0.682902924462214,0.9400817219440063,0.8825581077493059
0.3762875793116358,0.4745731412494566,0.6545473610147845,0.7479829630649761,0.15641907539706779
And, indeed, the rest is quite similar to what you did (I append the file name to the series, not to the data frame; otherwise sum() concatenates the filename string once per row):
import glob
import pandas as pd

files = glob.glob('*.txt')
print(files)
# ['3.txt', '0.txt', '4.txt', '2.txt', '1.txt']

list_of_series = []
for f in files:
    df = pd.read_csv(f)
    s = df.sum()
    s['filename'] = f
    list_of_series.append(s)

final_df = pd.concat(list_of_series, axis=1).T
print(final_df)
A B C D E filename
0 1.0675 2.20957 1.65058 1.80515 2.22058 3.txt
1 0.642805 1.36248 0.0237625 1.87767 1.63317 0.txt
2 1.68678 1.26363 0.835245 2.05305 1.01829 4.txt
3 1.22748 2.09256 0.785089 1.87852 2.05043 2.txt
4 0.910733 0.815614 1.43272 2.65527 1.11553 1.txt
To answer this specific question:
@ThomasTu: "How do I go from a list of Series with 'Filename' as a column to a dataframe? I think that's the problem---I don't understand this."
It's essentially what you have now, but instead of appending to an empty list, you append to an empty dataframe. Note that DataFrame.append returns a new DataFrame (there is no inplace option), so you do have to reassign newdf on each iteration; append was removed entirely in pandas 2.0 in favor of pd.concat.
import pandas as pd
import glob

batch_of_dataframes = glob.glob("*.txt")
newdf = pd.DataFrame()
for filename in batch_of_dataframes:
    df = pd.read_csv(filename)
    df['filename'] = str(filename)
    df = df.sum()  # caveat: sum() concatenates the string column, repeating the filename once per row
    newdf = newdf.append(df, ignore_index=True)
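Since DataFrame.append was removed in pandas 2.0, a minimal modern sketch of the same idea, collecting the summed Series and letting the DataFrame constructor assemble the rows:
import glob
import pandas as pd

rows = []
for filename in glob.glob("*.txt"):
    s = pd.read_csv(filename).sum(numeric_only=True)
    s['filename'] = filename  # attach the name after summing
    rows.append(s)
newdf = pd.DataFrame(rows)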
