Combine data frames and add file name with PANDAS - python

I have a question about pandas.
I have many DataFrames like the ones below.
I would like to combine these DataFrames and add the file name as a new column, as in the figure on the right.
Does anyone know how to do that?

I think you need concat on a list of DataFrames with the keys parameter for the df names, then remove the MultiIndex and create a new column File:
dfs = [df1, df2, df3]
df = (pd.concat(dfs, keys=range(1, len(dfs) + 1))
        .reset_index(level=1, drop=True)
        .rename_axis('File')
        .reset_index())
Sample:
df1 = pd.DataFrame({'Product':['a','b','c'],
                    'Price':[4,5,6]})
print (df1)
Price Product
0 4 a
1 5 b
2 6 c
df2 = pd.DataFrame({'Product':['d','e','g'],
                    'Price':[9,8,7]})
print (df2)
Price Product
0 9 d
1 8 e
2 7 g
df3 = pd.DataFrame({'Product':['f','z','h'],
                    'Price':[1,2,4]})
print (df3)
Price Product
0 1 f
1 2 z
2 4 h
dfs = [df1, df2, df3]
df = pd.concat(dfs, keys=range(1, len(dfs) + 1)) \
       .reset_index(level=1, drop=True) \
       .rename_axis('File').reset_index()
print (df)
File Price Product
0 1 4 a
1 1 5 b
2 1 6 c
3 2 9 d
4 2 8 e
5 2 7 g
6 3 1 f
7 3 2 z
8 3 4 h
You can also use custom names from a list:
dfs = [df1,df2,df3]
names = ['file1','file2','file3']
df = pd.concat(dfs, keys=names)
df = df.reset_index(level=1, drop=True).rename_axis('File').reset_index()
print (df)
File Price Product
0 file1 4 a
1 file1 5 b
2 file1 6 c
3 file2 9 d
4 file2 8 e
5 file2 7 g
6 file3 1 f
7 file3 2 z
8 file3 4 h
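If the DataFrames actually come from CSV files, the same idea extends to reading the files and passing their names as the keys. A minimal sketch, assuming hypothetical files under a data/ folder:
import glob
import os
import pandas as pd

# hypothetical location of the CSV files
paths = sorted(glob.glob('data/*.csv'))
names = [os.path.splitext(os.path.basename(p))[0] for p in paths]

df = (pd.concat([pd.read_csv(p) for p in paths], keys=names)
        .reset_index(level=1, drop=True)
        .rename_axis('File')
        .reset_index())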

Related

Delete pandas column if column name begins with a number

I have a pandas DataFrame with about 200 columns. Roughly, I want to do this
for col in df.columns:
    if col begins with a number:
        df.drop(col)
I'm not sure what the best practices are when it comes to handling pandas DataFrames; how should I handle this? Will my pseudocode work, or is it not recommended to modify a pandas DataFrame in a for loop?
I think the simplest is to select all columns which do not start with a number using filter with a regex - ^ is the start of the string and \D is a non-digit:
df1 = df.filter(regex=r'^\D')
Similar alternative:
df1 = df.loc[:, df.columns.str.contains(r'^\D')]
Or invert the condition and select the numeric ones:
df1 = df.loc[:, ~df.columns.str.contains(r'^\d')]
df1 = df.loc[:, ~df.columns.str[0].str.isnumeric()]
If you want to use your pseudocode:
for col in df.columns:
    if col[0].isnumeric():
        df = df.drop(col, axis=1)
Sample:
df = pd.DataFrame({'2A':list('abcdef'),
                   '1B':[4,5,4,5,5,4],
                   'C':[7,8,9,4,2,3],
                   'D3':[1,3,5,7,1,0],
                   'E':[5,3,6,9,2,4],
                   'F':list('aaabbb')})
print (df)
1B 2A C D3 E F
0 4 a 7 1 5 a
1 5 b 8 3 3 a
2 4 c 9 5 6 a
3 5 d 4 7 9 b
4 5 e 2 1 2 b
5 4 f 3 0 4 b
df1 = df.filter(regex=r'^\D')
print (df1)
C D3 E F
0 7 1 5 a
1 8 3 3 a
2 9 5 6 a
3 4 7 9 b
4 2 1 2 b
5 3 0 4 b
An alternative can be this:
columns = [x for x in df.columns if not x[0].isdigit()]
df = df[columns]
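If you would rather drop than select, the same check works with drop(columns=...) (a sketch; the columns= keyword needs pandas 0.21 or newer):
# drop every column whose name starts with a digit
cols_to_drop = [c for c in df.columns if c[0].isdigit()]
df1 = df.drop(columns=cols_to_drop)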

Pandas: Concatenate files but skip the headers except the first file

I have 3 files representing the same dataset split in 3 parts, and I need to concatenate them:
import pandas
df1 = pandas.read_csv('path1')
df2 = pandas.read_csv('path2')
df3 = pandas.read_csv('path3')
df = pandas.concat([df1,df2,df3])
But this will keep the headers in the middle of the dataset; I need to remove the headers (column names) from the 2nd and 3rd files. How do I do that?
I think you need numpy.concatenate with DataFrame constructor:
df = pd.DataFrame(np.concatenate([df1.values, df2.values, df3.values]), columns=df1.columns)
Another solution is to replace the column names in df2 and df3:
df2.columns = df1.columns
df3.columns = df1.columns
df = pd.concat([df1,df2,df3], ignore_index=True)
Samples:
np.random.seed(100)
df1 = pd.DataFrame(np.random.randint(10, size=(2,3)), columns=list('ABF'))
print (df1)
A B F
0 8 8 3
1 7 7 0
df2 = pd.DataFrame(np.random.randint(10, size=(1,3)), columns=list('ERT'))
print (df2)
E R T
0 4 2 5
df3 = pd.DataFrame(np.random.randint(10, size=(3,3)), columns=list('HTR'))
print (df3)
H T R
0 2 2 2
1 1 0 8
2 4 0 9
print (np.concatenate([df1.values, df2.values, df3.values]))
[[8 8 3]
[7 7 0]
[4 2 5]
[2 2 2]
[1 0 8]
[4 0 9]]
df = pd.DataFrame(np.concatenate([df1.values, df2.values, df3.values]), columns=df1.columns)
print (df)
A B F
0 8 8 3
1 7 7 0
2 4 2 5
3 2 2 2
4 1 0 8
5 4 0 9
df = pd.concat([df1,df2,df3], ignore_index=True)
print (df)
A B F
0 8 8 3
1 7 7 0
2 4 2 5
3 2 2 2
4 1 0 8
5 4 0 9
You have to use the skiprows argument of read_csv for the second and third files, together with header=None and names so that their first data row is not promoted to a header, like here:
import pandas
df1 = pandas.read_csv('path1')
df2 = pandas.read_csv('path2', skiprows=1, header=None, names=df1.columns)
df3 = pandas.read_csv('path3', skiprows=1, header=None, names=df1.columns)
df = pandas.concat([df1,df2,df3])
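The same pattern scales to any number of files with a short loop; a sketch reusing the placeholder paths from the question:
import pandas as pd

paths = ['path1', 'path2', 'path3']  # placeholder file paths

first = pd.read_csv(paths[0])
# skip the header line of every remaining file and reuse the first file's columns
rest = [pd.read_csv(p, skiprows=1, header=None, names=first.columns)
        for p in paths[1:]]
df = pd.concat([first] + rest, ignore_index=True)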
Been working on this recently myself, here's the most compact/elegant thing I came up with:
import pandas as pd

frame_list = [df1, df2, df3]
# note: iloc[0:] selects every row, so no rows are actually skipped here
frame_mod = [frame.iloc[0:] for frame in frame_list]
frame_frame = pd.concat(frame_mod)
Use:
df = pd.merge(df1, df2, how='outer')
Merge rows that appear in either or both df1 and df2 (union).

Assigning dataframe data to dataframe label using a loop

I have a list of dataframe names that I would like to assign different dataframe data to.
filenames = []
for i in np.arange(1,7):
    a = "C:\Users\...........\Python code\Cp error for MPE MR%s.csv" %(i)
    filenames.append(a)
dfs = [df1,df2,df3,df4,df5,df6]
for i, j in enumerate(filenames):
    dfs[j] = pd.DataFrame.from_csv(i, header=0, index_col=None)
However, the following error code occurs:
NameError: name 'df1' is not defined
Is there something wrong with the way I am defining the list of values? Why can't a value in a list be assigned as a variable?
How can I put the following code in a loop?
df1 = pd.DataFrame.from_csv(filenames[0],header=0, index_col=None)
df2 = pd.DataFrame.from_csv(filenames[1],header=0, index_col=None)
df3 = pd.DataFrame.from_csv(filenames[2],header=0, index_col=None)
df4 = pd.DataFrame.from_csv(filenames[3],header=0, index_col=None)
df5 = pd.DataFrame.from_csv(filenames[4],header=0, index_col=None)
df6 = pd.DataFrame.from_csv(filenames[5],header=0, index_col=None)
It seems you need a dict comprehension; one possible way to get the list of files is to use glob:
Sample files:
a.csv, b.csv, c.csv.
import glob
import os

files = glob.glob('files/*.csv')
#windows solution for file names - os.path.splitext(os.path.split(fp)[1])
dfs = {os.path.splitext(os.path.split(fp)[1])[0]: pd.read_csv(fp) for fp in files}
print (dfs)
{'b': a b c d
0 0 9 6 5
1 1 6 4 2, 'a': a b c d
0 0 1 2 5
1 1 5 8 3, 'c': a b c d
0 0 7 1 7
1 1 3 2 6}
print (dfs['a'])
a b c d
0 0 1 2 5
1 1 5 8 3
If the columns are the same in each file, it is possible to create one big df by concat:
df = pd.concat(dfs)
print (df)
a b c d
a 0 0 1 2 5
1 1 5 8 3
b 0 0 9 6 5
1 1 6 4 2
c 0 0 7 1 7
1 1 3 2 6
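Since the dict keys become the outer index level, the file name can also be turned into a regular File column by dropping the inner level and resetting the index; a small sketch:
df = (pd.concat(dfs)
        .reset_index(level=1, drop=True)
        .rename_axis('File')
        .reset_index())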
EDIT:
Better than pd.DataFrame.from_csv is to use read_csv:
Solution with global variables:
#for df0, df1, df2...
for i, fp in enumerate(files):
    print (fp)
    df = pd.read_csv(fp, header=0, index_col=None)
    globals()['df' + str(i)] = df
print (df1)
a b c d
0 0 9 6 5
1 1 6 4 2
Better solution for list of DataFrames and selecting by positions:
#for dfs[0], dfs[1], dfs[2]...
dfs = [pd.read_csv(fp, header=0, index_col=None) for fp in files]
print (dfs[1])
a b c d
0 0 9 6 5
1 1 6 4 2
dfs = [df1,df2,df3,df4,df5,df6]?
Why this line? Shouldn't it be:
dfs = []
And yes, I think you swapped i and j; it should be something like:
dfs.append(pd.DataFrame.from_csv(j, header=0, index_col=None))
And enumerate is redundant:
for f in filenames:
    dfs.append(pd.DataFrame.from_csv(f, header=0, index_col=None))

How to extract rows in a pandas dataframe NOT in a subset dataframe

I have two dataframes. DF and SubDF. SubDF is a subset of DF. I want to extract the rows in DF that are NOT in SubDF.
I tried the following:
DF2 = DF[~DF.isin(SubDF)]
The number of rows is correct and most rows are correct, i.e. the number of rows in SubDF + the number of rows in DF2 = the number of rows in DF, but I get rows with NaN values that do not exist in the original DF.
Not sure what I'm doing wrong.
Note: the original DF does not have any NaN values, and to double check I did DF.dropna() before and the result still produced NaN
You need merge with an outer join and boolean indexing, because DataFrame.isin needs the values and index to match:
DF = pd.DataFrame({'A':[1,2,3],
                   'B':[4,5,6],
                   'C':[7,8,9],
                   'D':[1,3,5],
                   'E':[5,3,6],
                   'F':[7,4,3]})
print (DF)
A B C D E F
0 1 4 7 1 5 7
1 2 5 8 3 3 4
2 3 6 9 5 6 3
SubDF = pd.DataFrame({'A':[3],
                      'B':[6],
                      'C':[9],
                      'D':[5],
                      'E':[6],
                      'F':[3]})
print (SubDF)
A B C D E F
0 3 6 9 5 6 3
#no values match (isin aligns on index), so nothing is filtered out
DF2 = DF[~DF.isin(SubDF)]
print (DF2)
A B C D E F
0 1 4 7 1 5 7
1 2 5 8 3 3 4
2 3 6 9 5 6 3
DF2 = pd.merge(DF, SubDF, how='outer', indicator=True)
DF2 = DF2[DF2._merge == 'left_only'].drop('_merge', axis=1)
print (DF2)
A B C D E F
0 1 4 7 1 5 7
1 2 5 8 3 3 4
Another way, borrowing the setup from #jezrael:
df = pd.DataFrame({'A':[1,2,3],
                   'B':[4,5,6],
                   'C':[7,8,9],
                   'D':[1,3,5],
                   'E':[5,3,6],
                   'F':[7,4,3]})
sub = pd.DataFrame({'A':[3],
                    'B':[6],
                    'C':[9],
                    'D':[5],
                    'E':[6],
                    'F':[3]})
extract_idx = list(set(df.index) - set(sub.index))
df_extract = df.loc[extract_idx]
The rows may not be sorted in the original df order. If matching order is required:
extract_idx = list(set(df.index) - set(sub.index))
idx_dict = dict(enumerate(df.index))
order_dict = dict(zip(idx_dict.values(), idx_dict.keys()))
df_extract = df.loc[sorted(extract_idx, key=order_dict.get)]
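An index-based variant of the same idea is Index.difference, which also returns the labels in sorted order; a sketch that, like the set approach, assumes sub kept the row labels it had in df:
df_extract = df.loc[df.index.difference(sub.index)]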

How to append several data frame into one

I have written some code to append several dummy DataFrames into one. After appending, the expected DataFrame.shape would be (9, 3), but my code produces an unexpected output (6, 3). How can I rectify the error in my code?
import pandas as pd
a = [[1,2,4],[1,3,4],[2,3,4]]
b = [[1,1,1],[1,6,4],[2,9,4]]
c = [[1,3,4],[1,1,4],[2,0,4]]
d = [[1,1,4],[1,3,4],[2,0,4]]
df1 = pd.DataFrame(a,columns=["a","b","c"])
df2 = pd.DataFrame(b,columns=["a","b","c"])
df3 = pd.DataFrame(c,columns=["a","b","c"])
for df in (df1, df2, df3):
    df = df.append(df, ignore_index=True)
print df
I don't want to use pd.concat because in this case I have to store all the DataFrames in memory, and my real data set contains hundreds of DataFrames with huge shapes. I just want code which can open one CSV file at a time in a loop and update the final DF as the loop progresses.
thanks
Firstly, use concat to concatenate a bunch of dfs; it's quicker:
In [308]:
df = pd.concat([df1,df2,df3], ignore_index=True)
df
Out[308]:
a b c
0 1 2 4
1 1 3 4
2 2 3 4
3 1 1 1
4 1 6 4
5 2 9 4
6 1 3 4
7 1 1 4
8 2 0 4
Secondly, you're reusing the loop variable name in your loop, which is why it gets overwritten; if you did this it would work:
In [307]:
a = [[1,2,4],[1,3,4],[2,3,4]]
b = [[1,1,1],[1,6,4],[2,9,4]]
c = [[1,3,4],[1,1,4],[2,0,4]]
d = [[1,1,4],[1,3,4],[2,0,4]]

df1 = pd.DataFrame(a,columns=["a","b","c"])
df2 = pd.DataFrame(b,columns=["a","b","c"])
df3 = pd.DataFrame(c,columns=["a","b","c"])

df = pd.DataFrame()

for d in (df1, df2, df3):
    df = df.append(d, ignore_index=True)
df
Out[307]:
a b c
0 1 2 4
1 1 3 4
2 2 3 4
3 1 1 1
4 1 6 4
5 2 9 4
6 1 3 4
7 1 1 4
8 2 0 4
Here I changed the loop variable to be d and declared an empty df outside the loop:
df = pd.DataFrame()

for d in (df1, df2, df3):
    df = df.append(d, ignore_index=True)
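As for the memory concern in the question (opening one CSV at a time instead of holding every frame), one option is to append each file straight to a single output CSV as it is read; a sketch with hypothetical paths:
import glob
import pandas as pd

paths = sorted(glob.glob('parts/*.csv'))  # hypothetical input files
out = 'combined.csv'

for i, path in enumerate(paths):
    chunk = pd.read_csv(path)          # only one file is in memory at a time
    # write the header only for the first chunk, then append without it
    chunk.to_csv(out, mode='w' if i == 0 else 'a',
                 header=(i == 0), index=False)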
