Problems manipulating dataframes in a dictionary from pd.read_excel - python

Good day all,
I have an Excel file with multiple sheets which I load via pd.read_excel into a dict object.
Each sheet contains some column names which are the same:
Like sheet1: ['IDx','IDy'], sheet2: ['IDx','IDy']
I usually use pd.concat to load everything into one data frame; this way, identical column names also get merged correctly.
But this is not robust when there is accidental extra whitespace in one of the ID columns, even though those columns are meant to be merged by pd.concat,
like: sheet1: ['IDx','IDy'], sheet2: ['IDx',' IDy']
To tackle this problem I thought I would iterate over each data frame in the excel dict, strip() the whitespace from the column names, and then pd.concat afterwards.
excel = pd.read_excel(r'D:\ExcelFile.xlsx', sheet_name=None)  # raw string so the backslash isn't treated as an escape
new_excel = {}
for name, sheet in excel.items():
    sheet = sheet.columns.str.strip()
    #print(sheet)
    new_excel[name] = sheet
print(new_excel)
output:
{'Sheet1': Index(['IDx', 'IDy', ..... ])}
At this point I am stuck: I can't do anything with the new_excel dict. It seems I am accessing each data frame incorrectly and only getting back the Index object. I can't get my head around this issue.
When trying to concat with new_excel:
TypeError: cannot concatenate object of type '<class 'pandas.core.indexes.base.Index'>'; only Series and DataFrame objs are valid
Many thanks in advance!

Does this work?
new_excel = {}
for name, sheet in excel.items():
    new_columns = [str(col).strip() for col in sheet.columns]
    sheet.columns = new_columns
    new_excel[name] = sheet

No, sadly not. I have integers in my column names (AttributeError: 'int' object has no attribute 'strip'). When trying to change the type of the columns I ran into trouble with a KeyError and the indexes being messed up. I tried to circumvent this with another for loop but have failed so far to get it to work. I am not very good with this.
for name, sheet in excel.items():
    for i in sheet:
        if type(i) != int:
            new_columns = [col.strip() for col in sheet.columns]
        else:
            new_columns = sheet.columns
        print(new_columns)
    sheet.columns = new_columns
    new_excel[name] = sheet
Still getting AttributeError: 'int' object has no attribute 'strip'.
The error is clear to me, but the solution is not.
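A sketch that should sidestep the integer columns, assuming the excel dict from above: strip only the names that are actually strings, leave everything else untouched, then concat.
new_excel = {}
for name, sheet in excel.items():
    # strip whitespace from string column names only; ints pass through unchanged
    sheet.columns = [col.strip() if isinstance(col, str) else col for col in sheet.columns]
    new_excel[name] = sheet

# identically named columns now line up, so concat merges them as intended
combined = pd.concat(new_excel.values(), ignore_index=True)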

Related

Creating lists of dataframes to iterate over those dataframes

In Python, I'm reading an Excel file with multiple sheets, to create each sheet as its own dataframe. I then want to create a list of those dataframes' names so I can use it in a list comprehension.
The code below lets me create all the sheets into their own dataframe, and because some of the sheet names have special characters, I use regex to replace them with an underscore:
df = pd.read_excel('Book1.xlsx', sheet_name=None)
df = {re.sub(r"[-+\s]", "_", k): v for k, v in df.items()}
for key in df.keys():
    globals()[key] = df[key]
Then, I can get a list of the dataframes by: all_dfs = %who_ls DataFrame
giving me: ['GB_SF_NZ', 'GF_1', 'H_2_S_Z']
When I try the following code:
for df in all_dfs:
    df.rename(columns=df.iloc[2], inplace=True)
I get the error 'AttributeError: 'str' object has no attribute 'rename'
But, if I manually type out the names of the dataframes into a list (note here that they are not in quotation marks):
manual_dfs = [GB_SF_NZ, GF_1, H_2_S_Z]
for df in manual_dfs:
    df.rename(columns=df.iloc[2], inplace=True)
I don't get an error and the code works as expected.
Can anyone help me with why one works (without quotation marks) but the other (extracted, with quotation marks) doesn't, and
how I can create a list of dataframes without the quote marks?
Following your code, you can do:
for df in all_dfs:
    globals()[df].rename(columns=globals()[df].iloc[2], inplace=True)
Note that df here is a string, so the row of header values has to be looked up through globals() as well.
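That said, globals() is easy to trip over; keeping the frames in a dict avoids the string-to-variable lookup entirely. A sketch, reusing the Book1.xlsx workbook from the question:
dfs = pd.read_excel('Book1.xlsx', sheet_name=None)
dfs = {re.sub(r"[-+\s]", "_", k): v for k, v in dfs.items()}
for name, frame in dfs.items():
    # frame is the DataFrame itself, so it can be renamed directly
    frame.rename(columns=frame.iloc[2], inplace=True)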

if str.contains() condition on a dataframe's content; AttributeError: 'str' object has no attribute 'str'

def store_press_release_links(all_sublinks_df, column_names):
    all_press_release_links_df = pd.DataFrame(columns=column_names)
    for i in range(len(all_sublinks_df)):
        if (len(all_sublinks_df.loc[i,'sub_link'].str.contains('press|releases|Press|Releases|PRESS|RELEASES'))>0):
            all_press_release_links_df.loc[i, 'link'] = all_sublinks_df[i,'link']
            all_press_release_links_df.loc[i, 'sub_link'] = all_sublinks_df[i,'sublink']
        else:
            continue
    all_press_release_links_df = all_press_release_links_df.drop_duplicates()
    all_press_release_links_df.reset_index(drop=True, inplace=True)
    return all_press_release_links_df
store_press_release_links() is a function that accepts a dataframe, all_sublinks_df, which has two columns: link and sub_link.
The contents of both columns are link names.
I want to go through the link names in the sub_link column of the all_sublinks_df dataframe one by one and check whether the link contains the keywords 'press|releases|Press|Releases|PRESS|RELEASES'.
If it does, I want to store that entire row of the all_sublinks_df dataframe in a new dataframe, all_press_release_links_df.
But when I run this function it gives the error : AttributeError: 'str' object has no attribute 'str'
Where am I going wrong?
I think what you want to do, instead of the loop, is just:
all_press_release_links_df = all_sublinks_df[all_sublinks_df['sub_link'].str.contains('press|releases|Press|Releases|PRESS|RELEASES')]
There are many things wrong here. There are almost no situations where you want to loop through a pandas dataframe and equally there are almost no situations where you want to build a pandas dataframe row by row.
Here, the entire operation can be done with a single operation (split over two lines for readability), like so:
def store_press_release_links(all_sublinks_df):
    check_str = 'press|releases|Press|Releases|PRESS|RELEASES'
    return all_sublinks_df[all_sublinks_df.sub_link.str.contains(check_str)]
The reason you got that error message is that you were selecting individual cells of the dataframe, which are plain strings rather than pandas.Series objects; the .str accessor only exists on a pd.Series.
Also note how the columns field is no longer needed.
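As a side note, a case-insensitive match covers all those capitalizations without spelling them out. A sketch of the same filter, with na=False guarding against missing values:
def store_press_release_links(all_sublinks_df):
    # case-insensitive match; rows with a missing sub_link are treated as non-matches
    mask = all_sublinks_df['sub_link'].str.contains('press|releases', case=False, na=False)
    return all_sublinks_df[mask].drop_duplicates().reset_index(drop=True)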

Pandas. How to call a data frame from a list of data frames?

I have a list of data frames:
all_df = ['df_0','df_1','df_2','df_3','df_4','df_5','df_6']
How can I call them from this list to do something like that:
for (df, names) in zip(all_df, names):
    df.to_csv('output/{}.csv'.format(names))
When executing this I expectedly get the error 'str' object has no attribute 'to_csv', since I'm handing a string to to_csv.
How can I save several data-frames (or perform other actions on them) in the for loop?
Thanks!
Could you please also give an idea on how to create the 'right' list of data frames from this:
path = 'inp/multysheet_excel_01.xlsx'
xl = pd.ExcelFile(path)
sh_name = xl.sheet_names
n = 0
for i in sh_name:
    exec('df_{} = pd.read_excel("{}", sheet_name="{}")'.format(n, path, i))
    n += 1
so basically I'm trying to get the each sheet of an input excel as a separate dataframe, perform some actions on them, and save each output dataframe in separate excels.
You're quite close, but I see some mistakes in that for loop. Say you have a list of dataframes dfs and their corresponding names as a list of strings names; you can then save those dataframes under those names as:
dfs = [df_0, df_1, df_2]
names = ['df_0','df_1','df_2']
for df, name in zip(dfs, names):
    df.to_csv('output\\{}.csv'.format(name))
Though if you only had a list of names, you could also do something like:
names = ['df_0','df_1','df_2']
for name in names:
    globals()[name].to_csv('output\\{}.csv'.format(name))
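For the second part, you can skip exec() entirely: pd.read_excel with sheet_name=None already returns a dict of dataframes keyed by sheet name. A sketch, reusing the path from your snippet:
dfs = pd.read_excel('inp/multysheet_excel_01.xlsx', sheet_name=None)  # {sheet name: DataFrame}
for name, frame in dfs.items():
    # ...perform some actions on frame here...
    frame.to_csv('output/{}.csv'.format(name))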

Add data from JSON to an existing Pandas dataframe

I'm trying to add a new row at the top of my existing dataframe (df_PRED). The data come from a JSON. The keys of the JSON (df_NEW) have exactly the same names as the columns in the existing dataframe.
df_NEW = pd.read_json(dataJSON, lines=True)
df_PRED[-1] = df_NEW
Error: Wrong number of items passed 36, placement implies 1
What's going wrong? Thank you for your hints.
You can concatenate df_PRED and df_NEW:
df_PRED = pd.concat([df_NEW,df_PRED])
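If the row numbering shouldn't carry over from the two sources, ignore_index rebuilds it; a small variation on the same call:
# df_NEW ends up on top; ignore_index renumbers the rows 0..n-1
df_PRED = pd.concat([df_NEW, df_PRED], ignore_index=True)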

Iteration through dictionary pulling the wrong value for searched key

I've used Pandas to read an Excel sheet that has two columns, used to create a key-value dictionary. When run, the code will search for a key and produce its value. Ex: WSO-Exchange will be equal to 52206.
However, when I search for 59904-FX's value, it returns 35444 when I need it to return 22035; it only shows this issue when a key is also a value later on. Any ideas on how I can fix this? I'll attach my code below, thanks!
MapDatabase = {}
for i in Mapdf.index:
    MapDatabase[Mapdf['General Code'][i]] = Mapdf['Upload Code'][i]
df["AccountID"][i] is reading another excel sheet to search if that cell is in the dictionary's keys, and if it is, then to change it to it's value.
for i in df.index:
    for key, value in MapDatabase.items():
        if str(df['AccountId'][i]) == str(key):
            df['AccountId'][i] = value
I would just use the xlrd library to do this:
from xlrd import open_workbook

# note: xlrd 2.x dropped .xlsx support, so reading data.xlsx needs xlrd < 2.0 (or openpyxl instead)
workbook = open_workbook("data.xlsx")
sheet = workbook.sheet_by_index(0)
data = {sheet.cell(row, 0).value: sheet.cell(row, 1).value for row in range(sheet.nrows)}
print(data)
Which gives the following dictionary:
{'General Code': 'Upload Code', '59904-FX': 22035.0, 'WSO-Exchange': 52206.0, 56476.0: 99875.0, 22035.0: 35444.0}
Check the Index for Duplicates in Your Excel File
Most likely the problem is that you are iterating over a non-unique index of the DataFrame Mapdf. Check that the first column in the Excel file you are using to build Mapdf is unique per row.
Don't Iterate Over DataFrame Rows
However, rather than trying to iterate row-wise over a DataFrame (which is almost always the wrong thing to do), you can build the dictionary with a call to the dict constructor, handing it an iterable of (key, value) pairs:
MapDatabase = dict(zip(Mapdf["General Code"], Mapdf["Upload Code"]))
Consider Merging Rather than Looping
Better yet, what you are doing seems like an ideal candidate for DataFrame.merge.
It looks like what you want to do is overwrite the values of AccountId in df with the values of Upload Code in Mapdf if AccountId has a match in General Code in Mapdf. That's a mouthful, but let's break it down.
First, merge Mapdf onto df by the matching columns (df["AccountId"] to Mapdf["General Code"]):
columns = ["General Code", "Upload Code"] # only because it looks like there are more columns you don't care about
merged = df.merge(Mapdf[columns], how="left", left_on="AccountId", right_on="General Code")
Because this is a left join, rows in merged where column AccountId does not have a match in Mapdf["General Code"] will have missing values for Upload Code. Copy the non-missing values to overwrite AccountId:
matched = merged["Upload Code"].notnull()
merged.loc[matched, "AccountId"] = merged.loc[matched, "Upload Code"]
Then drop the extra columns if you'd like:
merged.drop(["Upload Code", "General Code"], axis="columns", inplace=True)
EDIT: Turns out I didn't need a nested for loop. The solution was to go from a for loop to an if statement.
for i in df.index:
    if str(df['AccountId'][i]) in str(MapDatabase.items()):
        df.at[i, 'AccountId'] = MapDatabase[df['AccountId'][i]]
