Python: Read all sheets and combine - python

I tried to concatenate all the sheets in the Excel file, but the result has NaN values where the other sheets' columns should line up:
import pandas as pd
excel_file = "C:/Users/User/Documents/UiPath/Endo Bot/endoProcess/NEW ENDO PASTE HERE/-r1- (07-23-2020).xlsx"
fil = pd.ExcelFile(excel_file)
names = fil.sheet_names
df = pd.concat([fil.parse(name) for name in names])
print(df)
Looks like it only appends the sheets to the first sheet.
The result:
COUNT NAME Number count2
0 4.0 kiko NaN NaN
1 5.0 esmer NaN NaN
2 6.0 jason NaN NaN
0 NaN NaN 9.0 23.0
1 NaN NaN 10.0 13.0
2 NaN NaN 11.0 14.0
The result that I want:
COUNT NAME Number count2
0 4.0 kiko 9.0 23.0
1 5.0 esmer 10.0 13.0
2 6.0 jason 11.0 14.0

Concatenate on axis 1 (columns) instead of axis 0 (index, the default), like so: df = pd.concat([fil.parse(name) for name in names], axis=1).
Code
import pandas as pd
excel_file = "C:/Users/User/Documents/UiPath/Endo Bot/endoProcess/NEW ENDO PASTE HERE/-r1- (07-23-2020).xlsx"
fil = pd.ExcelFile(excel_file)
names = fil.sheet_names
# concatenated
df = pd.concat([fil.parse(name) for name in names], axis=1)
print(df)
Output
COUNT NAME Number count2
0 4 kiko 9 23
1 5 esmer 10 13
2 6 jason 11 14
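The axis=1 fix can be checked without an Excel file by concatenating two in-memory frames that stand in for the sheets; this is a minimal sketch using hypothetical data mirroring the question:

```python
import pandas as pd

# Stand-ins for the two sheets (hypothetical data from the question)
sheet1 = pd.DataFrame({"COUNT": [4, 5, 6], "NAME": ["kiko", "esmer", "jason"]})
sheet2 = pd.DataFrame({"Number": [9, 10, 11], "count2": [23, 13, 14]})

# axis=1 aligns the sheets side by side on the shared row index,
# instead of stacking them vertically (the default axis=0)
df = pd.concat([sheet1, sheet2], axis=1)
print(df)
```

With matching row indices across the sheets, no NaN padding appears.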

Related

Python DataFrame duplicated columns while merging multiple times

I have a main dataframe and a sub dataframe. I want to merge each column of the sub dataframe into the main dataframe, using the main dataframe's column as a reference. I have successfully arrived at my desired answer, except that I see duplicated copies of the main dataframe's column. Below are my present and expected answers.
Present solution:
df = pd.DataFrame({'Ref':[1,2,3,4]})
df1 = pd.DataFrame({'A':[2,3],'Z':[1,2]})
df = [df.merge(df1[col_name],left_on='Ref',right_on=col_name,how='left') for col_name in df1.columns]
df = pd.concat(df,axis=1)
df =
Ref A Ref Z
0 1 NaN 1 1.0
1 2 2.0 2 2.0
2 3 3.0 3 NaN
3 4 NaN 4 NaN
Expected Answer:
df =
Ref A Z
0 1 NaN 1.0
1 2 2.0 2.0
2 3 3.0 NaN
3 4 NaN NaN
Update
Use duplicated:
>>> df.loc[:, ~df.columns.duplicated()]
Ref A Z
0 1 NaN 1.0
1 2 2.0 2.0
2 3 3.0 NaN
3 4 NaN NaN
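The dedup step can be checked end to end with the frames from the question; a minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({'Ref': [1, 2, 3, 4]})
df1 = pd.DataFrame({'A': [2, 3], 'Z': [1, 2]})

# One merge per sub-dataframe column, as in the question
parts = [df.merge(df1[[c]], left_on='Ref', right_on=c, how='left')
         for c in df1.columns]
merged = pd.concat(parts, axis=1)        # columns: Ref, A, Ref, Z

# duplicated() marks the second and later occurrences of a label,
# so the mask keeps only the first Ref column
deduped = merged.loc[:, ~merged.columns.duplicated()]
print(deduped)
```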
Old answer
You can use:
# Your code
...
df = pd.concat(df, axis=1)
# Use pop and insert to clean up your dataframe
df.insert(0, 'Ref', df.pop('Ref').iloc[:, 0])
Output:
>>> df
Ref A Z
0 1 NaN 1.0
1 2 2.0 2.0
2 3 3.0 NaN
3 4 NaN NaN
What about setting the 'Ref' column as the index while building the dataframe list, and then resetting the index to get 'Ref' back as a column?
df = pd.DataFrame({'Ref':[1,2,3,4]})
df1 = pd.DataFrame({'A':[2,3],'Z':[1,2]})
df = [df.merge(df1[col_name],left_on='Ref',right_on=col_name,how='left').set_index('Ref') for col_name in df1.columns]
df = pd.concat(df,axis=1)
df = df.reset_index()
Ref A Z
1 NaN 1.0
2 2.0 2.0
3 3.0 NaN
4 NaN NaN
This is a reduction process. Instead of the list comprehension, use a for loop, or even functools.reduce:
from functools import reduce
reduce(lambda x, y : x.merge(df1[y],left_on='Ref',right_on=y,how='left'), df1.columns, df)
Ref A Z
0 1 NaN 1.0
1 2 2.0 2.0
2 3 3.0 NaN
3 4 NaN NaN
The above is similar to:
for y in df1.columns:
    df = df.merge(df1[y], left_on='Ref', right_on=y, how='left')
df
Ref A Z
0 1 NaN 1.0
1 2 2.0 2.0
2 3 3.0 NaN
3 4 NaN NaN

Reset index without multiple headers after pivot in pandas

I have this DataFrame
df = pd.DataFrame({'store':[1,1,1,2],'upc':[11,22,33,11],'sales':[14,16,11,29]})
which gives this output
store upc sales
0 1 11 14
1 1 22 16
2 1 33 11
3 2 11 29
I want something like this
store upc_11 upc_22 upc_33
1 14.0 16.0 11.0
2 29.0 NaN NaN
I tried this
newdf = df.pivot(index='store', columns='upc')
newdf.columns = newdf.columns.droplevel(0)
and the output looks like this with multiple headers
upc 11 22 33
store
1 14.0 16.0 11.0
2 29.0 NaN NaN
I also tried
newdf = df.pivot(index='store', columns='upc').reset_index()
This also gives multiple headers
store sales
upc 11 22 33
0 1 14.0 16.0 11.0
1 2 29.0 NaN NaN
Try the columns attribute with an f-string and a list comprehension:
newdf = df.pivot(index='store', columns='upc')
newdf.columns=[f"upc_{y}" for x,y in newdf.columns]
newdf=newdf.reset_index()
OR
In 2 steps:
newdf = df.pivot(index='store', columns='upc').reset_index()
newdf.columns=[f"upc_{y}" if y!='' else f"{x}" for x,y in newdf.columns]
Another option, which is longer than #Anurag's:
(df.pivot(index='store', columns='upc')
.droplevel(axis=1, level=0)
.rename(columns=lambda c: f"upc_{c}")
.rename_axis(index=None, columns=None)
)
upc_11 upc_22 upc_33
1 14.0 16.0 11.0
2 29.0 NaN NaN
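A slight variation worth noting: passing values='sales' to pivot avoids the MultiIndex entirely, so there is nothing to drop. A minimal sketch with the question's data:

```python
import pandas as pd

df = pd.DataFrame({'store': [1, 1, 1, 2],
                   'upc': [11, 22, 33, 11],
                   'sales': [14, 16, 11, 29]})

# values='sales' yields a flat column index (just the upc values)
newdf = df.pivot(index='store', columns='upc', values='sales')
newdf.columns = [f"upc_{c}" for c in newdf.columns]  # flat upc_* names
newdf = newdf.reset_index()
print(newdf)
```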

'DataFrame' object has no attribute 'string_column'

I am trying to count the delimiters in each row of my CSV file using this piece of code:
import pandas as pd
df = pd.read_csv(path,sep=',')
df['comma_count'] = df.string_column.str.count(',')
print (df)
But I keep getting this error:
'DataFrame' object has no attribute 'string_column'.
Trying to iterate through my dataframe was to no avail.
I am trying to achieve this:
id val new comma_count
id val new 2
0 a 2.0 234.0 2
1 a 5.0 432.0 2
2 a 4.0 234.0 2
3 a 2.0 23423.0 2
4 a 9.0 324.0 2
5 a 7.0 NaN 1
6 NaN 234.0 NaN 1
7 a 6.0 NaN 1
8 4 NaN NaN 0
My file:
id,val,new
a,2,234
a,5,432
a,4,234
a,2,23423
a,9,324
a,7
,234
a,6,
4
Read the file again with a separator that does not occur in the data, so each raw line lands in the first column, then count the commas:
df1 = pd.read_csv(path, sep='|')
df['comma_count'] = df1.iloc[:, 0].str.count(',')
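The trick can be demonstrated with an in-memory file in place of path; a sketch using a shortened stand-in for the question's data:

```python
import io
import pandas as pd

# Shortened stand-in for the file in the question
raw = "id,val,new\na,2,234\na,7\n4\n"

# '|' never occurs in the data, so each line stays in one column
df1 = pd.read_csv(io.StringIO(raw), sep='|')
comma_count = df1.iloc[:, 0].str.count(',')
print(comma_count.tolist())
```

The first line becomes the (single) column name, and each remaining line is one string value whose commas can be counted with .str.count.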

Add Series to DataFrame with additional index values

I have a DataFrame which looks like this:
Value
1 23
2 12
3 4
And a Series which looks like this:
1 24
2 12
4 34
Is there a way to add the Series to the DataFrame to obtain a result which looks like this:
Value New
1 23 24
2 12 12
3 4 0
4 0 34
Using concat(..., axis=1) and .fillna():
import pandas as pd
df = pd.DataFrame([23,12,4], columns=["Value"], index=[1,2,3])
s = pd.Series([24,12,34],index=[1,2,4], name="New")
df = pd.concat([df,s],axis=1)
print(df)
df = df.fillna(0) # or df.fillna(0, inplace=True)
print(df)
Output:
Value New
1 23.0 24.0
2 12.0 12.0
3 4.0 NaN
4 NaN 34.0
# If replacing NaNs with 0:
Value New
1 23.0 24.0
2 12.0 12.0
3 4.0 0.0
4 0.0 34.0
You can use join between a series and a dataframe:
my_df.join(my_series, how='outer').fillna(0)
Example:
>>> df
Value
1 23
2 12
3 4
>>> s
0
1 24
2 12
4 34
>>> type(df)
<class 'pandas.core.frame.DataFrame'>
>>> type(s)
<class 'pandas.core.series.Series'>
>>> df.join(s, how='outer').fillna(0)
Value 1
1 23.0 24.0
2 12.0 12.0
3 4.0 0.0
4 0.0 34.0
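To get the 'New' column label from the question rather than the Series' default name, the Series can be renamed before joining; a minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({'Value': [23, 12, 4]}, index=[1, 2, 3])
s = pd.Series([24, 12, 34], index=[1, 2, 4])

# Name the Series so join can use it as a column label;
# how='outer' keeps the union of both indexes (1, 2, 3, 4)
out = df.join(s.rename('New'), how='outer').fillna(0)
print(out)
```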

Unmelt Pandas DataFrame

I have a pandas dataframe with two id variables:
df = pd.DataFrame({'id': [1,1,1,2,2,3],
'num': [10,10,12,13,14,15],
'q': ['a', 'b', 'd', 'a', 'b', 'z'],
'v': [2,4,6,8,10,12]})
id num q v
0 1 10 a 2
1 1 10 b 4
2 1 12 d 6
3 2 13 a 8
4 2 14 b 10
5 3 15 z 12
I can pivot the table with:
df.pivot(index='id', columns='q', values='v')
And end up with something close:
q a b d z
id
1 2 4 6 NaN
2 8 10 NaN NaN
3 NaN NaN NaN 12
However, what I really want is (the original unmelted form):
id num a b d z
1 10 2 4 NaN NaN
1 12 NaN NaN 6 NaN
2 13 8 NaN NaN NaN
2 14 NaN 10 NaN NaN
3 15 NaN NaN NaN 12
In other words:
'id' and 'num' are my indices (normally I've only seen either 'id' or 'num' as the index, but I need both since I'm trying to retrieve the original unmelted form)
'q' are my columns
'v' are my values in the table
Update
I found a close solution from Wes McKinney's blog:
df.pivot_table(index=['id','num'], columns='q')
v
q a b d z
id num
1 10 2 4 NaN NaN
12 NaN NaN 6 NaN
2 13 8 NaN NaN NaN
14 NaN 10 NaN NaN
3 15 NaN NaN NaN 12
However, the format is not quite the same as what I want above.
You could use set_index and unstack
In [18]: df.set_index(['id', 'num', 'q'])['v'].unstack().reset_index()
Out[18]:
q id num a b d z
0 1 10 2.0 4.0 NaN NaN
1 1 12 NaN NaN 6.0 NaN
2 2 13 8.0 NaN NaN NaN
3 2 14 NaN 10.0 NaN NaN
4 3 15 NaN NaN NaN 12.0
You're really close, slaw. Just rename your column index to None and you've got what you want.
df2 = df.pivot_table(index=['id','num'], columns='q')
df2.columns = df2.columns.droplevel().rename(None)
df2.reset_index().fillna("null").to_csv("test.csv", sep="\t", index=None)
Note that the 'v' column is expected to be numeric by default so that it can be aggregated. Otherwise, Pandas will error out with:
DataError: No numeric types to aggregate
To resolve this, you can specify your own aggregation function by using a custom lambda function:
df2 = df.pivot_table(index=['id','num'], columns='q', aggfunc=lambda x: x)
You can remove the columns name q:
df1.columns=df1.columns.tolist()
Zero's answer + remove q =
df1 = df.set_index(['id', 'num', 'q'])['v'].unstack().reset_index()
df1.columns=df1.columns.tolist()
id num a b d z
0 1 10 2.0 4.0 NaN NaN
1 1 12 NaN NaN 6.0 NaN
2 2 13 8.0 NaN NaN NaN
3 2 14 NaN 10.0 NaN NaN
4 3 15 NaN NaN NaN 12.0
This might work just fine:
Pivot
df2 = df.pivot_table(index=['id', 'num'], columns='q').reset_index()
Concatenate the 1st-level column names with the 2nd:
df2.columns = [s1 + str(s2) for (s1, s2) in df2.columns.tolist()]
Came up with a close solution
df2 = df.pivot_table(index=['id','num'], columns='q')
df2.columns = df2.columns.droplevel()
df2.reset_index().fillna("null").to_csv("test.csv", sep="\t", index=None)
Still can't figure out how to drop 'q' from the dataframe
It can be done in three steps:
#1: Prepare auxiliary column 'id_num':
df['id_num'] = df[['id', 'num']].apply(tuple, axis=1)
df = df.drop(columns=['id', 'num'])
#2: 'pivot' is almost an inverse of melt:
df, df.columns.name = df.pivot(index='id_num', columns='q', values='v').reset_index(), ''
#3: Bring back 'id' and 'num' columns:
df['id'], df['num'] = zip(*df['id_num'])
df = df.drop(columns=['id_num'])
This is a result, but with different order of columns:
a b d z id num
0 2.0 4.0 NaN NaN 1 10
1 NaN NaN 6.0 NaN 1 12
2 8.0 NaN NaN NaN 2 13
3 NaN 10.0 NaN NaN 2 14
4 NaN NaN NaN 12.0 3 15
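The three steps combined into one runnable script, using the question's data:

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 1, 2, 2, 3],
                   'num': [10, 10, 12, 13, 14, 15],
                   'q': ['a', 'b', 'd', 'a', 'b', 'z'],
                   'v': [2, 4, 6, 8, 10, 12]})

# 1: auxiliary tuple column standing in for a MultiIndex
df['id_num'] = df[['id', 'num']].apply(tuple, axis=1)
df = df.drop(columns=['id', 'num'])

# 2: pivot is almost an inverse of melt
df = df.pivot(index='id_num', columns='q', values='v').reset_index()
df.columns.name = ''

# 3: bring back 'id' and 'num' from the tuples
df['id'], df['num'] = zip(*df['id_num'])
df = df.drop(columns=['id_num'])
print(df)
```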
Alternatively with proper order:
def multiindex_pivot(df, columns=None, values=None):
    # inspired by: https://github.com/pandas-dev/pandas/issues/23955
    names = list(df.index.names)
    df = df.reset_index()
    list_index = df[names].values
    tuples_index = [tuple(i) for i in list_index]  # hashable
    df = df.assign(tuples_index=tuples_index)
    df = df.pivot(index="tuples_index", columns=columns, values=values)
    tuples_index = df.index  # reduced
    index = pd.MultiIndex.from_tuples(tuples_index, names=names)
    df.index = index
    df = df.reset_index()  # me
    df.columns.name = ''  # me
    return df
df = df.set_index(['id', 'num'])
df = multiindex_pivot(df, columns='q', values='v')
