I'm using the pandas library in Python.
I've read an Excel file and stored the contents as follows:
path = r"filepath"
sheets_dict = pd.read_excel(path,sheet_name=None)
As there were multiple sheets, each containing a table of data with identical columns, I used pd.read_excel(path, sheet_name=None). This stored all the individual sheets in a dictionary, with the key for each value/sheet being the sheet name.
I now want to unpack the dictionary and place every sheet into a single data frame. I want to use the key of each sheet in the dictionary either as part of a multiindex, so I know which key/sheet each table came from, or appended as a new column which gives me the key/sheet name for each unique subset of the dataframe.
I've tried the following:
for k, df in sheets_dict.items():
    df = pd.concat([pd.DataFrame(df)])
    df['extract'] = k
However I'm not getting the results I want.
Any suggestions?
You can use the keys argument of pd.concat, which will set the keys of your dict as the index:
df = pd.concat(sheets_dict.values(), keys=sheets_dict.keys())
By default, pd.concat(sheets_dict) will set the dict keys as the index.
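For example (a minimal sketch, assuming sheets_dict came from read_excel as above), you can also name the key level, or turn it into an ordinary column if you prefer that over a multiindex:

import pandas as pd

# The dict keys become the outer level of a MultiIndex named 'sheet_name'.
df = pd.concat(sheets_dict, names=['sheet_name'])

# Or move the sheet name into a regular column instead:
df = df.reset_index(level='sheet_name')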
I'm looping through a list of JSONs and storing them in a dataframe. For each iteration I want to write the dataframe to Excel as a different sheet. How can I achieve this?
for item in data:
    # removing empty columns in raw data
    drop_none = lambda path, key, value: key is not None and value is not None
    cleaned = remap(item, visit=drop_none)
    new_data = flatten(cleaned)
    #my_df = new_data.dropna(axis='columns', how='all')  # Drops columns with all NA values
    dfFromRDD2 = spark.createDataFrame(new_data)
I want to save the dataframe dfFromRDD2 to Excel with a different sheet on each iteration.
Is there a way to do it using Python?
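One common approach (a sketch, not a tested solution: it assumes each dfFromRDD2 fits in memory and that generic names like sheet_0, sheet_1, ... are acceptable) is to open a single pd.ExcelWriter before the loop and write one sheet per iteration:

import pandas as pd

# One writer for the whole loop, so every iteration adds a sheet to the same workbook.
with pd.ExcelWriter('output.xlsx') as writer:
    for i, item in enumerate(data):
        drop_none = lambda path, key, value: key is not None and value is not None
        cleaned = remap(item, visit=drop_none)
        new_data = flatten(cleaned)
        dfFromRDD2 = spark.createDataFrame(new_data)
        # Spark DataFrames have no to_excel; convert to pandas first.
        dfFromRDD2.toPandas().to_excel(writer, sheet_name=f"sheet_{i}")

Here data, remap, flatten and spark are taken from the question's own code.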
Given a dictionary with multiple dataframes in it, how can I add a column to each dataframe with all the rows in that df filled with the key name?
I tried this code:
for key, df in sheet_to_df_map.items():
    df['sheet_name'] = key
This code does add the key column in each dataframe inside the dictionary, but also creates an additional dataframe.
Can't this be done without creating an additional dataframe?
Furthermore, I want to separate the dataframes in the dictionary by number of columns: all the dataframes that have 10 columns concatenated together, the ones with 9 concatenated together, and so on. I don't know how to do this.
You could do it with the assign() method on the DataFrames and then replace the whole value in the dictionary, but I don't know if that is in fact what you want...
for key, df in myDictDf.items():
    # assign() broadcasts the scalar key to every row of the new column
    myDictDf[key] = df.assign(sheet_name=key)
To sort your dictionary, I think you can use an OrderedDict with the columns property of the DataFrames.
By using len(df.columns) you can get the quantity of columns for each frame.
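For the second part of the question, a sketch (assuming myDictDf already carries the sheet_name column): bucket the frames by len(df.columns) and concatenate each bucket:

from collections import defaultdict
import pandas as pd

# Bucket the dataframes by their number of columns.
grouped = defaultdict(list)
for key, df in myDictDf.items():
    grouped[len(df.columns)].append(df)

# Concatenate each bucket, e.g. concatenated[10] holds all 10-column frames.
concatenated = {n: pd.concat(dfs, ignore_index=True) for n, dfs in grouped.items()}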
I think these links can be useful for you:
https://note.nkmk.me/en/python-pandas-len-shape-size/
https://www.geeksforgeeks.org/python-sort-python-dictionaries-by-key-or-value/
I've found a related question too:
Adding new column to existing DataFrame in Python pandas
I am reading an Excel file with around 2000 sheets using pandas. The file is loaded as an ordered dictionary since I have used the following:
test = pd.read_excel('test.xlsx', sheet_name=None)
I would like to modify the names of the sheets and save the ordered dictionary to an Excel file again. The names of the sheets are stored as the keys of the dictionary, so basically I would like to just modify the keys and save to Excel again.
Each key name ends with a year, i.e. 2020, 2022, etc. I would like all the keys to be modified such that the year is reduced by 1, so the keys now end with 2019, 2021, etc. I would also like to make sure that the content does not change, meaning that the dataframe that used to be assigned to A.AA.XX2020 is now assigned to A.AA.XX2019. The "General" sheet does not have to be modified.
Since there are many sheets in the excel file, I would prefer an automated procedure.
I hope the following will suit your needs:
import pandas as pd

# read Excel file
test = pd.read_excel('test.xlsx', sheet_name=None)

# get keys from dict without 'General'
keys = list(test.keys())
keys.remove('General')

# iterate over keys
for key in keys:
    # get the old year
    year_old = int(key[-4:])
    # make the new year
    year_new = year_old - 1
    # create name for new key
    key_new = key[:-4] + str(year_new)
    # copy values from old key to new key and delete old key
    test[key_new] = test.pop(key)

# write dataframes from dict to one Excel file with new sheet names
with pd.ExcelWriter('test_new.xlsx') as writer:
    for sheet_name, df in test.items():
        df.to_excel(writer, sheet_name=sheet_name)
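One caveat: test.pop() re-inserts each renamed key at the end of the dict, so the sheet order in test_new.xlsx can differ from the original file. If the order matters, a sketch that rebuilds the dict in one ordered pass instead:

# Rebuild the dict in the original sheet order, decrementing the trailing
# year of every key except 'General'.
test = {
    (key if key == 'General' else key[:-4] + str(int(key[-4:]) - 1)): df
    for key, df in test.items()
}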
I've used pandas to read an Excel sheet with two columns that are used to create a key/value dictionary. When run, the code will search for a key and produce its value. Ex: WSO-Exchange will be equal to 52206.
However, when I search for 59904-FX's value, it returns 35444 when I need it to return 22035; it only has this issue when a key also appears as a value later on. Any ideas on how I can fix this error? I'll attach my code below, thanks!
MapDatabase = {}
for i in Mapdf.index:
    MapDatabase[Mapdf['General Code'][i]] = Mapdf['Upload Code'][i]
df["AccountID"][i] is reading another excel sheet to search if that cell is in the dictionary's keys, and if it is, then to change it to it's value.
for i in df.index:
    for key, value in MapDatabase.items():
        if str(df['AccountId'][i]) == str(key):
            df['AccountId'][i] = value
I would just use the xlrd library to do this (note that xlrd 2.x reads only legacy .xls files, so for .xlsx you would need xlrd 1.2 or another reader such as openpyxl):
from xlrd import open_workbook
workbook = open_workbook("data.xlsx")
sheet = workbook.sheet_by_index(0)
data = {sheet.cell(row, 0).value: sheet.cell(row, 1).value for row in range(sheet.nrows)}
print(data)
Which gives the following dictionary:
{'General Code': 'Upload Code', '59904-FX': 22035.0, 'WSO-Exchange': 52206.0, 56476.0: 99875.0, 22035.0: 35444.0}
Check the Index for Duplicates in Your Excel File
Most likely the problem is that you are iterating over a non-unique index of the DataFrame Mapdf. Check that the first column in the Excel file you are using to build Mapdf is unique per row.
Don't Iterate Over DataFrame Rows
However, rather than trying to iterate row-wise over a DataFrame (which is almost always the wrong thing to do), you can build the dictionary with a call to the dict constructor, handing it an iterable of (key, value) pairs:
MapDatabase = dict(zip(Mapdf["General Code"], Mapdf["Upload Code"]))
Consider Merging Rather than Looping
Better yet, what you are doing seems like an ideal candidate for DataFrame.merge.
It looks like what you want to do is overwrite the values of AccountId in df with the values of Upload Code in Mapdf if AccountId has a match in General Code in Mapdf. That's a mouthful, but let's break it down.
First, merge Mapdf onto df by the matching columns (df["AccountId"] to Mapdf["General Code"]):
columns = ["General Code", "Upload Code"] # only because it looks like there are more columns you don't care about
merged = df.merge(Mapdf[columns], how="left", left_on="AccountId", right_on="General Code")
Because this is a left join, rows in merged where column AccountId does not have a match in Mapdf["General Code"] will have missing values for Upload Code. Copy the non-missing values to overwrite AccountId:
matched = merged["Upload Code"].notnull()
merged.loc[matched, "AccountId"] = merged.loc[matched, "Upload Code"]
Then drop the extra columns if you'd like:
merged.drop(["Upload Code", "General Code"], axis="columns", inplace=True)
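As a compact alternative, the same replace-where-matched logic can be written with Series.map (a sketch using the column names above):

# Build the mapping once, then map AccountId through it,
# keeping the original value wherever there is no match.
mapping = dict(zip(Mapdf["General Code"], Mapdf["Upload Code"]))
df["AccountId"] = df["AccountId"].map(mapping).fillna(df["AccountId"])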
EDIT: Turns out I didn't need to do a nested for loop. The solution was to go from a for loop to an if statement.
for i in df.index:
    if str(df['AccountId'][i]) in str(MapDatabase.items()):
        df.at[i, 'AccountId'] = MapDatabase[df['AccountId'][i]]
I am new to python.
I need to read a csv file which has various columns.
One column in the csv contains data as key/value pairs.
Using pandas, how can I extract the keys and values of that column from the csv?
Ex: column name: fruit
Data in that column:
{"apple": "1,2,3,4", "orange": "5,6,7,8"}
How can I get the keys and their values of the fruit column from the csv file?
Any suggestions?
To read the .csv, I use pandas.
import pandas as pd
fruit_df = pd.read_csv('directory_where_csv_is_saved/file_name.csv')
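One caveat (assuming the fruit column cells are string representations of dicts, as in the question): read_csv loads them as plain strings, so parse them into real dicts first, for example with ast.literal_eval:

import ast

# Convert each cell from the string "{...}" into an actual dict.
fruit_df['Fruit'] = fruit_df['Fruit'].apply(ast.literal_eval)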
I would widen the dataframe (add more columns) by unpacking the dictionary column 'Fruit': first get the keys from the dictionaries within the 'Fruit' column, then convert that set of keys to a list.
key_set = set()
for i in range(len(fruit_df)):
    for key in fruit_df['Fruit'][i].keys():
        key_set.add(key)
key_set_list = list(key_set)
Then unpack the dictionary:
for i in range(len(key_set_list)):
    fruit_df[key_set_list[i]] = [d.get(key_set_list[i]) for d in fruit_df['Fruit']]
Your dataframe should now be wider (more columns), with each new column named after a dictionary key and holding the respective values in the applicable rows.
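As a side note, pandas can do this unpacking in one step (a sketch, assuming the 'Fruit' column already holds dicts and the frame has a default RangeIndex):

import pandas as pd

# Expand the dicts into one column per key, then join them back onto the frame.
fruit_df = fruit_df.join(pd.json_normalize(fruit_df['Fruit'].tolist()))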