Change column names except for certain columns - python

Assuming I have the following dataframe
import pandas as pd

df = pd.DataFrame(
    {
        'ID': ['AB01'],
        'Col A': ["Yes"],
        'Col B': ["L"],
        'Col C': ["Yes"],
        'Col D': ["L"],
        'Col E': ["Yes"],
        'Col F': ["L"],
        'Type': [85]
    }
)
I want to change all column names by making them lowercase, replacing spaces with underscores, and appending the string _filled to the end of each name, except for the columns named in the list skip = ['ID', 'Type'].
How can I achieve this? I want the resulting dataframe to have the column names ID, col_a_filled, col_b_filled, ..., Type.

You can use df.rename along with a dict comprehension to get a nice one-liner:
df = df.rename(columns={col: col.lower().replace(" ", "_") + "_filled" for col in df.columns if col not in skip})
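For the example dataframe above, with skip = ['ID', 'Type'], the result comes out exactly as requested:
print(df.columns.tolist())
# ['ID', 'col_a_filled', 'col_b_filled', 'col_c_filled', 'col_d_filled', 'col_e_filled', 'col_f_filled', 'Type']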

Related

Exporting a python dataframe + text element to an image

I want to print a data frame as a PNG image, and I followed the approach below.
import pandas as pd
import dataframe_image as dfi
data = {'Type': ['Type 1', 'Type 2', 'Type 3', 'Total'], 'Value': [20, 21, 19, 60]}
df = pd.DataFrame(data)
dfi.export(df, 'table.png')
I however also want to print a date stamp above the table on the image, with the intention of creating a series of images on consecutive days. If possible, I would also like to format the table with a horizontal line above the final 'Total' row to indicate the summation of values.
Is this possible with the above package? Or is there a better approach to do this?
You can add the line df.index.name = pd.Timestamp('now').replace(microsecond=0) so the timestamp appears in the header row above the table.
To add the horizontal line you can use .style.set_table_styles:
data = {'Type': ['Type 1', 'Type 2', 'Type 3'], 'Value': [20, 21, 19]}
df = pd.DataFrame(data)
df.index.name = pd.Timestamp('now').replace(microsecond=0)
df.loc[len(df)] = ['Total', df['Value'].sum()]  # append the Total row
# '.row3' targets the fourth body row, i.e. the appended 'Total' row
test = df.style.set_table_styles([{'selector': '.row3',
                                   'props': [('border-top', '3px solid black')]}])
dfi.export(test, 'table.png')

If a match is found then add to dictionary, otherwise perform a process.extract fuzzy match

I have 2 dataframes that I'm comparing and then adding the results to a dictionary.
I can get the first batch of results to work, but when I add in the else statement, that's when things go bad.
Right now, it appears to run forever. I'm new to dictionaries and looping through dataframes.
Here's my code so far (and please note that it doesn't work):
Also please note that I'm using the tuple output of process.extract (address, score, index), and I created a separate dataframe that I'm matching on the index, taking the value at that index and putting it as an item in my dictionary.
Here are my variables:
import numpy as np
import pandas as pd
from collections import defaultdict
from thefuzz import process  # assuming thefuzz/fuzzywuzzy, given the process.extract calls

df1 = pd.DataFrame(np.random.randint(0, 100, size=(100, 2)), columns=['Address_1', 'Type'])
df2 = pd.DataFrame(np.random.randint(0, 100, size=(100, 3)), columns=['Address_2', 'Type', 'ID'])  # I want to grab the ID from this DF
id_match = df2['ID'].to_dict()
resultstest = defaultdict(list)
matched = [process.extract(i, df2['Address_2'], limit=1)[0] for i in df1['Address_1']]
df2 has approx. 80k rows and
df1 has ~50 rows
Here's the code:
for lrow in df1.itertuples():
    for vrow in df2.itertuples():
        if lrow.Address_1 == vrow.Address_2:
            resultstest[lrow.Address_1].append({'ID': vrow._1, 'Address_match': vrow.Address_2,
                                                'Sales': vrow._6, 'Calls': vrow._7, 'Target': vrow._8,
                                                'Type': vrow._5})
            break  # match found, done
    else:
        for z, y in id_match.items():
            for m in matched:
                if z == m[2]:  # matching on *indexes*
                    print(z)
                    resultstest[lrow.Address_1].append({'ID': y, 'Address_match': m[0],
                                                        'Fuzz_Score': m[1]})
                    break
# m[0] is the address, m[1] is the score
My output would be something like this:
defaultdict(list,
            {'address xyz': [{'ID': '1111111',
                              'Address_match': 'address xyz',
                              'Sales': nan,
                              'Calls': nan,
                              'Target': 0.0,
                              'ID_Type': 'X'}],
             'address abc': [{'ID': '11112222',
                              'Address_match': 'address abc',
                              'Sales': nan,
                              'Calls': nan,
                              'Target': 0.0,
                              'ID_Type': 'Y'}],
             'address xyz12345': [{'ID': '1231569',
                                   'Address_match': 'address xyz12345',
                                   'Fuzz_Score': 97}]})
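For reference, here is a minimal sketch of just the fuzzy branch, run after the exact-match pass, assuming process.extract comes from thefuzz/fuzzywuzzy, where extracting against a pandas Series yields (match, score, index) tuples:
for addr, m in zip(df1['Address_1'], matched):
    if addr not in resultstest:  # no exact match was recorded for this address
        match, score, idx = m    # best match, similarity score, df2 row index
        resultstest[addr].append({'ID': id_match[idx],
                                  'Address_match': match,
                                  'Fuzz_Score': score})
This looks the df2 index up in id_match directly instead of looping over id_match and matched for every row.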

How to iterate several dataframes row by row and modify one of them

I am stuck with a loop that is not working for me. What I want is to extract values from other dataframes, based on a condition, into my final dataframe.
I have:
Final Dataframe:
final = {'code': ['A001','A002','A003'],
'reg': ['2234','3432', '6578'],
'name': ['Solutions BS', 'Flying 23', 'Fast Co'],
'df2_code': ['','',''],
'df2_name': ['', '', ''],
'df3_code': ['','',''],
'df3_name': ['', '', '']}
This dataframe must be filled; specifically, the columns with the prefixes df2_, df3_, ...
They must be filled with the 'code' and 'name' columns of other dataframes that share the first three column names of the final dataframe (code, reg, name). One condition applies to the fill: the 'reg' number must be the same in both dataframes.
An example of the others:
df2 = {'code': ['P001','A002','P003'],
'reg': ['2234','3432', '9978'],
'name': ['Chips 23', 'Flying 23', 'American99']}
So, until now, the product of this logic would be:
final = {'code': ['A001','A002','A003'],
'reg': ['2234','3432', '6578'],
'name': ['Solutions BS', 'Flying 23', 'Fast Co'],
'df2_code': ['P001','A002',''],
'df2_name': ['Chips 23', 'Flying 23', '']}
But the problem is a little more complex: there are duplicates of the 'reg' numbers in df2, which serve as the matching condition. So df2 actually is:
df2 = {'code': ['P001','A002','P003', 'B004'],
'reg': ['2234','3432', '9978', '2234'],
'name': ['Chips 23', 'Flying 23', 'American99', NaN]}
And this must be taken into account by concatenating the 'code' and the 'name' of both rows into the same cells. The product would be:
final = {'code': ['A001','A002','A003'],
'reg': ['2234','3432', '6578'],
'name': ['Solutions BS', 'Flying 23', 'Fast Co'],
'df2_code': ['P001&B004','A002',''],
'df2_name': ['Chips 23', 'Flying 23', '']}
Until now, I have written this code for only one dataframe (df2), and it takes too much time as the final df has 200,000+ rows (I have 5 dataframes to scan, but those are smaller):
for i, row in final.iterrows():
    for j, inrow in df2.iterrows():
        if row['reg'] == inrow['reg']:
            if final['df2_code'].iloc[i] == '':
                final['df2_code'].iloc[i] = str(inrow['code'])
            else:
                final['df2_code'].iloc[i] += '&' + str(inrow['code'])
            if inrow['name'] is None:
                continue
            else:
                if final['df2_name'].iloc[i] == '':
                    final['df2_name'].iloc[i] = str(inrow['name'])
                else:
                    final['df2_name'].iloc[i] += '&' + str(inrow['name'])
Consider Series.str.cat + groupby:
df2 = pd.DataFrame({'code': ['P001','A002','P003', 'B004'],
'reg': ['2234','3432', '9978', '2234'],
'name': ['Chips 23', 'Flying 23', 'American99', float('nan')]})
agg_df = (df2.assign(name=lambda x: x["name"].fillna(""))
             .groupby(['reg'])
             .agg({'code': lambda g: g.str.cat(sep="&"),
                   'name': 'max'})
             .add_prefix("df2_"))
agg_df
# df2_code df2_name
# reg
# 2234 P001&B004 Chips 23
# 3432 A002 Flying 23
# 9978 P003 American99
To apply this across dataframes, build a list of aggregated dataframes (zipping each dataframe with its prefix in a list comprehension), then merge them horizontally:
# LIST OF AGGREGATED DATA FRAMES
dfs = [
    (df.assign(name=lambda x: x["name"].fillna(""))
       .groupby(['reg'])
       .agg({'code': lambda g: g.fillna("").str.cat(sep="&"),
             'name': 'max'})
       .add_prefix(f"{nm}_"))
    for df, nm in zip([df2, df3, df4, df5], ["df2", "df3", "df4", "df5"])
]
# HORIZONTAL MERGE ON "reg"
final_df = pd.concat(dfs, axis=1)
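To attach these aggregated columns back onto final, a left merge on 'reg' should suffice (a sketch, assuming final is a DataFrame carrying only its original code/reg/name columns rather than the empty placeholder columns):
out = (final.merge(final_df, left_on='reg', right_index=True, how='left')
            .fillna(''))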

How to create new columns using .loc syntax?

I have a list of names of columns (cols) that exist in one dataframe.
I want to insert columns by those names in another dataframe.
So I am using a for loop to iterate the list and create the columns one by one:
cols = ['DEPTID', 'ACCOUNT', 'JRNL LINE DESCRIPTION', 'JRNL DATE', 'BASE AMOUNT', 'FOREIGN CURRENCY', 'FOREIGN AMOUNT', 'JRNL SOURCE']
for col in cols:
    # "summary" and "obiee" are dataframes
    summary.loc[obiee['mapid'], col] = obiee[col].tolist()
I would like to get rid of the for loop, however.
So I have tried multiple column assignment using the .loc syntax:
cols = ['DEPTID', 'ACCOUNT', 'JRNL LINE DESCRIPTION', 'JRNL DATE', 'BASE AMOUNT', 'FOREIGN CURRENCY', 'FOREIGN AMOUNT', 'JRNL SOURCE']
summary.loc[obiee['mapid'], cols] = obiee[cols]
but Pandas will throw an error:
KeyError: "['DEPTID' 'ACCOUNT' 'JRNL LINE DESCRIPTION' 'JRNL DATE' 'BASE AMOUNT'\n 'FOREIGN CURRENCY' 'FOREIGN AMOUNT' 'JRNL SOURCE'] not in index"
Is it not possible with this syntax? How can I do this otherwise?
join
You can create a new dataframe and then join. From your problem description and sample code, 'mapid' represents index values in the summary dataframe, and join merges on index. So by setting obiee's index to 'mapid' and then taking the appropriate columns, we can just use join.
summary.join(obiee.set_index('mapid')[cols])
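Note that join returns a new dataframe rather than modifying summary in place, so assign the result back:
summary = summary.join(obiee.set_index('mapid')[cols])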
Say you have a dataframe df1 with some columns, and you want those in a df2. All you need to do is assign them, as shown below:
import numpy as np
import pandas as pd

df2 = pd.DataFrame({'A': 1.,
                    'B': pd.Timestamp('20130102'),
                    'C': pd.Series(1, index=list(range(4)), dtype='float32'),
                    'D': np.array([3] * 4, dtype='int32'),
                    'E': pd.Categorical(["test", "train", "test", "train"]),
                    'F': 'foo'})
df1 = pd.DataFrame({'G': 1.,
                    'H': pd.Timestamp('20130102'),
                    'I': pd.Series(1, index=list(range(4)), dtype='float32'),
                    'J': np.array([3] * 4, dtype='int32'),
                    'K': pd.Categorical(["test", "train", "test", "train"]),
                    'L': 'foo'})
df2['G'], df2['H'] = df1['G'], df1['H']
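Keep in mind that this kind of assignment aligns on index labels; if df1 and df2 had different indexes, you would get NaN wherever the labels do not line up. To copy values positionally instead, you can assign the underlying array:
df2['G'] = df1['G'].to_numpy()  # positional copy, ignores index alignment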

How can I force Pandas dataframe column query to return only str and not interpret what it is seeing?

I have an Excel spreadsheet that has four columns named "Column One", "Column Two", "Jun-17", and "Column Three".
When I display my column names after reading in the data I get something very different from the "Jun-17" text I was hoping to receive. What should I be doing differently?
import pandas as pd
df = pd.read_excel('Sample.xlsx', sheet_name='Sheet1')
print("Column headings:")
print(df.columns.tolist())
Column headings:
['Column One', 'Column Two', datetime.datetime(2018, 6, 17, 0, 0), 'Column Three']
One of your column names is a datetime object. You can rename it to a string using datetime.strftime. Example below.
import datetime
import pandas as pd, numpy as np
df = pd.DataFrame(columns=['Column One', 'Column Two',
                           datetime.datetime(2018, 6, 17, 0, 0), 'Column Three'])
df.columns.values[2] = df.columns[2].strftime('%b-%d')
# alternatively:
# df = df.rename(columns={df.columns[2]: df.columns[2].strftime('%b-%d')})
df.columns
# Index(['Column One', 'Column Two', 'Jun-17', 'Column Three'], dtype='object')
If you see this problem repeatedly, wrap the fix in a function and pass your dataframe through it:
def normalise_columns(df):
    df.columns = [i.strftime('%b-%d') if isinstance(i, datetime.datetime)
                  else i for i in df.columns]
    return df
normalise_columns(df).columns
