I've used Pandas to read an Excel sheet that has two columns, which I use to create a key/value dictionary. When run, the code searches for a key and produces its value. Ex: WSO-Exchange will be equal to 52206.
However, when I search for 59904-FX's value, it returns 35444 when I need it to return 22035. This only happens when a key's value also appears as a key later on. Any ideas on how I can fix this error? I'll attach my code below, thanks!
MapDatabase = {}
for i in Mapdf.index:
    MapDatabase[Mapdf['General Code'][i]] = Mapdf['Upload Code'][i]
df['AccountId'] comes from another Excel sheet. The loop below checks whether each cell is one of the dictionary's keys and, if it is, changes the cell to the corresponding value.
for i in df.index:
    for key, value in MapDatabase.items():
        if str(df['AccountId'][i]) == str(key):
            df['AccountId'][i] = value
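I would just use the xlrd library to do this (note that xlrd 2.0 and later reads only .xls files, so an .xlsx workbook needs an older xlrd release or a different reader such as openpyxl):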
from xlrd import open_workbook
workbook = open_workbook("data.xlsx")
sheet = workbook.sheet_by_index(0)
data = {sheet.cell(row, 0).value: sheet.cell(row, 1).value for row in range(sheet.nrows)}
print(data)
Which gives the following dictionary:
{'General Code': 'Upload Code', '59904-FX': 22035.0, 'WSO-Exchange': 52206.0, 56476.0: 99875.0, 22035.0: 35444.0}
Check the Index for Duplicates in Your Excel File
Most likely the problem is that you are iterating over a non-unique index of the DataFrame Mapdf. Check that the first column in the Excel file you are using to build Mapdf is unique per row.
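A quick way to run that check is sketched below. The three-row Mapdf is made-up sample data (the real one comes from the Excel file), and the names dupes and index_is_unique are only for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the mapping sheet; "A" appears twice on purpose.
Mapdf = pd.DataFrame({
    "General Code": ["A", "B", "A"],
    "Upload Code": [1, 2, 3],
})

# Is the row index itself unique? (It is for a default RangeIndex.)
index_is_unique = Mapdf.index.is_unique

# Rows whose key occurs more than once; keep=False marks every occurrence.
dupes = Mapdf[Mapdf["General Code"].duplicated(keep=False)]

print(index_is_unique)
print(dupes)
```

If dupes is not empty, later rows silently overwrite earlier ones when the dictionary is built.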
Don't Iterate Over DataFrame Rows
However, rather than trying to iterate row-wise over a DataFrame (which is almost always the wrong thing to do), you can build the dictionary with a call to the dict constructor, handing it an iterable of (key, value) pairs:
MapDatabase = dict(zip(Mapdf["General Code"], Mapdf["Upload Code"]))
Consider Merging Rather than Looping
Better yet, what you are doing seems like an ideal candidate for DataFrame.merge.
It looks like what you want to do is overwrite the values of AccountId in df with the values of Upload Code in Mapdf if AccountId has a match in General Code in Mapdf. That's a mouthful, but let's break it down.
First, merge Mapdf onto df by the matching columns (df["AccountId"] to Mapdf["General Code"]):
columns = ["General Code", "Upload Code"] # only because it looks like there are more columns you don't care about
merged = df.merge(Mapdf[columns], how="left", left_on = "AccountId", right_on="General Code")
Because this is a left join, rows in merged where column AccountId does not have a match in Mapdf["General Code"] will have missing values for Upload Code. Copy the non-missing values to overwrite AccountId:
matched = merged["Upload Code"].notnull()
merged.loc[matched, "AccountId"] = merged.loc[matched, "Upload Code"]
Then drop the extra columns if you'd like:
merged.drop(["Upload Code", "General Code"], axis="columns", inplace=True)
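Put together, the three steps might look like the sketch below. The sample rows are assumptions modelled on the codes in the question, stored as strings so that both key columns share one dtype; the real frames come from the Excel files.

```python
import pandas as pd

# Made-up sample data based on the codes mentioned in the question.
Mapdf = pd.DataFrame({
    "General Code": ["59904-FX", "WSO-Exchange", "56476", "22035"],
    "Upload Code": ["22035", "52206", "99875", "35444"],
})
df = pd.DataFrame({"AccountId": ["59904-FX", "WSO-Exchange", "not-mapped", "22035"]})

# 1. Left join: every row of df is kept, matched or not.
merged = df.merge(
    Mapdf[["General Code", "Upload Code"]],
    how="left",
    left_on="AccountId",
    right_on="General Code",
)

# 2. Overwrite AccountId only where a match was found.
matched = merged["Upload Code"].notna()
merged.loc[matched, "AccountId"] = merged.loc[matched, "Upload Code"]

# 3. Drop the helper columns.
merged = merged.drop(columns=["Upload Code", "General Code"])

print(merged)
```

Each AccountId is looked up exactly once, so "59904-FX" becomes "22035" and is not mapped a second time to "35444".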
EDIT: It turns out I didn't need a nested for loop. The solution was to replace the inner for loop with an if statement.
for i in df.index:
    if str(df['AccountId'][i]) in str(MapDatabase.items()):
        df.at[i, 'AccountId'] = MapDatabase[df['AccountId'][i]]
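The nested loop appears to have failed because it kept scanning after the first replacement: "59904-FX" became 22035, and 22035 then matched a later key and became 35444. A minimal sketch of a single lookup per cell follows; the dictionary contents are taken from the question, while lookup and the sample df are hypothetical.

```python
import pandas as pd

# Mapping from the question, with mixed string and numeric keys.
MapDatabase = {"59904-FX": 22035, "WSO-Exchange": 52206, 56476: 99875, 22035: 35444}

# Made-up sample column; the real one is read from Excel.
df = pd.DataFrame({"AccountId": ["59904-FX", "WSO-Exchange", "unmapped", 22035]})

# Normalise keys to strings once, mirroring the str() comparison in the question.
lookup = {str(key): value for key, value in MapDatabase.items()}

# One dictionary lookup per original value; unmatched values are kept unchanged.
df["AccountId"] = [lookup.get(str(value), value) for value in df["AccountId"]]

print(df["AccountId"].tolist())
```

Because every replacement is computed from the original cell value, a mapped value can never be mapped again.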
If I have a simple csv like this:
name,age,color,team,completed
tim,34,green,5
jim,31,blue,6
kim,33,yellow,5
In Python (pandas is fine if a third-party module is needed), I want to find a row by its id, in this case name, and then update the value under 'completed' with a YES. The names will always be unique. The sheet may not always be in the same order, but the header names will always be the same.
Is there a way to find the cell coordinates at name == "jim" and 'completed'?
Good evening,
Importing CSV
While I understand you may desire only using core Python modules, I recommend using Pandas for this task.
import pandas as pd
df = pd.read_csv('csv_file.csv')
Conditional Variable Assignment
One way is to use .loc[row, column]: select the rows where df['name'] == 'jim' with a boolean mask and assign "YES" to the "completed" column. Rows where name is not "jim" are left unchanged, so their "completed" values stay missing.
df.loc[df['name'] == 'jim', 'completed'] = 'YES'
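Here is a self-contained sketch using the CSV from the question, read from a string instead of a file. Casting the empty column to object first is my own precaution: it is read as all-NaN floats, and newer pandas versions complain when a string is written into a float column.

```python
import io
import pandas as pd

csv_text = """name,age,color,team,completed
tim,34,green,5
jim,31,blue,6
kim,33,yellow,5
"""

df = pd.read_csv(io.StringIO(csv_text))

# The column is empty, so it is read as float NaN; make it hold arbitrary objects.
df["completed"] = df["completed"].astype(object)

# Boolean mask for the row(s) of interest, then label-based assignment.
is_jim = df["name"] == "jim"
df.loc[is_jim, "completed"] = "YES"

# The "coordinates" are the row label(s) plus the column name.
row_labels = df.index[is_jim]

print(df)
print(list(row_labels), "completed")
```

The result can be written back with df.to_csv if the file needs updating.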
I'm using the pandas library in Python.
I've taken an excel file and stored the contents in a data frame by doing the following:
path = r"filepath"
sheets_dict = pd.read_excel(path,sheet_name=None)
As there were multiple sheets, each containing a table of data with identical columns, I used pd.read_excel(path, sheet_name=None). This stored all the individual sheets in a dictionary, with the sheet name as the key for each sheet.
I now want to unpack the dictionary and place all the sheets into a single data frame. I want to use the key of each sheet either as part of a MultiIndex, so I know which sheet each table came from, or as a new column that gives the sheet name for each subset of the data frame.
I've tried the following:
for k, df in sheets_dict.items():
    df = pd.concat([pd.DataFrame(df)])
    df['extract'] = k
However I'm not getting the results I want.
Any suggestions?
You can use the keys argument of pd.concat, which uses the keys of your dict as the outer level of the index.
df = pd.concat(sheets_dict.values(),keys=sheets_dict.keys())
If you pass the dictionary itself, pd.concat(sheets_dict), the keys are used as the outer index level by default.
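Both variants the question asks for (a MultiIndex level or a plain column) can be sketched as follows. The two small frames stand in for the sheets returned by pd.read_excel, and the level names "sheet" and "row" are my own choice.

```python
import pandas as pd

# Stand-in for pd.read_excel(path, sheet_name=None): sheet name -> DataFrame.
sheets_dict = {
    "Sheet1": pd.DataFrame({"x": [1, 2], "y": [3, 4]}),
    "Sheet2": pd.DataFrame({"x": [5], "y": [6]}),
}

# Variant 1: sheet name as the outer level of a MultiIndex.
combined = pd.concat(sheets_dict, names=["sheet", "row"])

# Variant 2: sheet name as an ordinary column with a fresh 0..n-1 index.
flat = combined.reset_index(level="sheet").reset_index(drop=True)

print(combined)
print(flat)
```

With the MultiIndex variant, combined.loc["Sheet2"] returns just the rows that came from that sheet.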
I am trying to insert or add from one dataframe to another dataframe. I am going through the original dataframe looking for certain words in one column. When I find one of these terms I want to add that row to a new dataframe.
I get the row by using.
entry = df.loc[df['A'] == item]
But when I try to add this row to another dataframe using .add, .insert, .update or other methods, I just get an empty dataframe.
I have also tried adding the column to a dictionary and turning that into a dataframe but it writes data for the entire row rather than just the column value. So is there a way to add one specific row to a new dataframe from my existing variable ?
So the entry is a dataframe containing the rows you want to add?
You can simply concatenate the two dataframes with the concat function, provided both have the same column names.
import pandas as pd
entry = df.loc[df['A'] == item]
concat_df = pd.concat([new_df,entry])
pandas.concat reference:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html
The append function expects a list of rows in this form:
[row_1, row_2, ..., row_N]
where each row is a list representing the value for each column.
So, assuming you're trying to add one row, you should use:
entry = df.loc[df['A'] == item]
df2 = df2.append([entry])
Note that, unlike Python's list.append, DataFrame.append returns a new object rather than modifying the object it is called on. DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so on current versions use pd.concat instead.
Not sure how large your operations will be, but from an efficiency standpoint, you're better off adding all of the found rows to a list, and then concatenating them together at once using pandas.concat, and then using concat again to combine the found entries dataframe with the "insert into" dataframe. This will be much faster than using concat each time. If you're searching from a list of items search_keys, then something like:
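entries = []
for item in search_keys:
    # Collect the matching rows for each search key
    entry = df.loc[df['A'] == item]
    entries.append(entry)
# Concatenate once at the end instead of once per key
found_df = pd.concat(entries)
result_df = pd.concat([old_df, found_df])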
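If every search key is an exact match on column A, the loop can be avoided altogether with Series.isin. This is a swapped-in alternative rather than the approach above, and the sample frames are made up for illustration.

```python
import pandas as pd

# Made-up data; column "A" holds the terms being searched for.
df = pd.DataFrame({"A": ["apple", "pear", "fig", "apple"], "B": [1, 2, 3, 4]})
old_df = pd.DataFrame({"A": ["kiwi"], "B": [0]})
search_keys = ["apple", "fig"]

# One vectorised membership test replaces the per-key loop.
found_df = df[df["A"].isin(search_keys)]

# A single concat adds the found rows to the existing frame.
result_df = pd.concat([old_df, found_df], ignore_index=True)

print(result_df)
```

One difference to be aware of: isin keeps the rows in their original order, whereas the loop groups them by search key.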
country = []
for i in df_temp['Customer Name'].iloc[:]:
    if i in gui_broker['EXACT_DDI_CUSTOMER_NAME'].tolist():
        country.append(gui_broker["Book"].values[gui_broker['EXACT_DDI_CUSTOMER_NAME'].tolist().index(i)])
    else:
        country.append("No Book Defined")
df_temp["Country"] = country
I have currently a large DataFrame (df_temp) with one column ('Customer Name') and am trying to match it with a small DataFrame (gui_broker) which has 3 columns - one of which has all unique values of the large DataFrame ('EXACT_DDI_CUSTOMER_NAME').
After matching each row of df_temp, I want to create a new column in df_temp holding the 'Book' value from my small DataFrame (gui_broker) for that match. I tried every apply/lambda approach I could think of, but I am out of ideas. The code above gives me a solution, but it's slow and not very pandas-like.
How exactly could I proceed?
You can use pandas merge to do that, like this:
df_temp = df_temp.merge(gui_broker[['EXACT_DDI_CUSTOMER_NAME','Book']], left_on='Customer Name', right_on='EXACT_DDI_CUSTOMER_NAME', how='left' )
df_temp['Book'] = df_temp['Book'].fillna('No Book Defined')
Looks like you are looking for DataFrame.join.
It joins DataFrame with the other by matching the selected column(s) in the first with the index in the second.
So
df_temp.join(gui_broker
             .set_index("EXACT_DDI_CUSTOMER_NAME")
             .loc[:, ["Book"]],
             on="Customer Name")
I believe this should do it, using map to map the Book column of gui_broker, indexed by EXACT_DDI_CUSTOMER_NAME, onto Customer Name in df_temp:
df_temp['Country'] = (df_temp['Customer Name']
                      .map(gui_broker.set_index('EXACT_DDI_CUSTOMER_NAME').Book)
                      .fillna('No Book Defined'))
Though I would need some example data to test it with!
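For anyone who wants to try it, here is a small made-up data set that exercises the map approach. The customer names, Book values and the extra column Other are all invented; only the column names come from the question.

```python
import pandas as pd

# Invented example data using the column names from the question.
df_temp = pd.DataFrame({"Customer Name": ["Acme", "Globex", "Initech", "Acme"]})
gui_broker = pd.DataFrame({
    "EXACT_DDI_CUSTOMER_NAME": ["Acme", "Globex"],
    "Book": ["UK", "DE"],
    "Other": [1, 2],
})

# Series indexed by customer name; map() uses it as a lookup table.
book_by_customer = gui_broker.set_index("EXACT_DDI_CUSTOMER_NAME")["Book"]

df_temp["Country"] = (
    df_temp["Customer Name"]
    .map(book_by_customer)
    .fillna("No Book Defined")  # customers with no match in gui_broker
)

print(df_temp)
```

Note that map with a Series requires the lookup index to be unique, which the question says is the case for EXACT_DDI_CUSTOMER_NAME.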
I have a JSON object inside a pandas dataframe column, which I want to pull apart and put into other columns. In the dataframe, the JSON object looks like a string containing an array of dictionaries. The array can be of variable length, including zero, or the column can even be null. I've written some code, shown below, which does what I want. The column names are built from two components, the first being the keys in the dictionaries, and the second being a substring from a key value in the dictionary.
This code works okay, but it is very slow when running on a big dataframe. Can anyone offer a faster (and probably simpler) way to do this? Also, feel free to pick holes in what I have done if you see something which is not sensible/efficient/pythonic. I'm still a relative beginner. Thanks heaps.
# Import libraries
import pandas as pd
from IPython.display import display # Used to display df's nicely in jupyter notebook.
import json
# Set some display options
pd.set_option('max_colwidth',150)
# Create the example dataframe
print("Original df:")
df = pd.DataFrame.from_dict({'ColA': {0: 123, 1: 234, 2: 345, 3: 456, 4: 567},\
'ColB': {0: '[{"key":"keyValue=1","valA":"8","valB":"18"},{"key":"keyValue=2","valA":"9","valB":"19"}]',\
1: '[{"key":"keyValue=2","valA":"28","valB":"38"},{"key":"keyValue=3","valA":"29","valB":"39"}]',\
2: '[{"key":"keyValue=4","valA":"48","valC":"58"}]',\
3: '[]',\
4: None}})
display(df)
# Create a temporary dataframe to append results to, record by record
dfTemp = pd.DataFrame()
# Step through all rows in the dataframe
for i in range(df.shape[0]):
    # Check whether record is null, or doesn't contain any real data
    if pd.notnull(df.iloc[i,df.columns.get_loc("ColB")]) and len(df.iloc[i,df.columns.get_loc("ColB")]) > 2:
        # Convert the json structure into a dataframe, one cell at a time in the relevant column
        x = pd.read_json(df.iloc[i,df.columns.get_loc("ColB")])
        # The last bit of this string (after the last =) will be used as a key for the column labels
        x['key'] = x['key'].apply(lambda x: x.split("=")[-1])
        # Set this new key to be the index
        y = x.set_index('key')
        # Stack the rows up via a multi-level column index
        y = y.stack().to_frame().T
        # Flatten out the multi-level column index
        y.columns = ['{1}_{0}'.format(*c) for c in y.columns]
        # Give the single record the same index number as the parent dataframe (for the merge to work)
        y.index = [df.index[i]]
        # Append this dataframe on sequentially for each row as we go through the loop
        dfTemp = dfTemp.append(y)
# Merge the new dataframe back onto the original one as extra columns, with index matching original dataframe
df = pd.merge(df,dfTemp, how = 'left', left_index = True, right_index = True)
print("Processed df:")
display(df)
First, a general piece of advice about pandas. If you find yourself iterating over the rows of a dataframe, you are most likely doing it wrong.
With this in mind, we can re-write your current procedure using pandas 'apply' method (this will likely speed it up to begin with, as it means far fewer index lookups on the df):
# Check whether record is null, or doesn't contain any real data
def do_the_thing(row):
    if pd.notnull(row) and len(row) > 2:
        # Convert the json structure into a dataframe, one cell at a time in the relevant column
        x = pd.read_json(row)
        # The last bit of this string (after the last =) will be used as a key for the column labels
        x['key'] = x['key'].apply(lambda x: x.split("=")[-1])
        # Set this new key to be the index
        y = x.set_index('key')
        # Stack the rows up via a multi-level column index
        y = y.stack().to_frame().T
        # Flatten out the multi-level column index
        y.columns = ['{1}_{0}'.format(*c) for c in y.columns]
        #we don't need to re-index
        # Give the single record the same index number as the parent dataframe (for the merge to work)
        #y.index = [df.index[i]]
        #we don't need to add to a temp df
        # Append this dataframe on sequentially for each row as we go through the loop
        return y.iloc[0]
    else:
        return pd.Series()
df2 = df.merge(df.ColB.apply(do_the_thing), how = 'left', left_index = True, right_index = True)
Notice that this returns exactly the same result as before; we haven't changed the logic. The apply method takes care of the indexes, so we can merge directly.
I believe that answers your question in terms of speeding it up and being a bit more idiomatic.
I think you should consider however, what you want to do with this data structure, and how you might better structure what you're doing.
Given that ColB can be of arbitrary length, you will end up with a dataframe with an arbitrary number of columns. Whatever you later want to do with these values, that will make them painful to access.
Are all entries in ColB important? Could you get away with just keeping the first one? Do you need to know the index of a certain valA val?
These are questions you should ask yourself, then decide on a structure which will allow you to do whatever analysis you need, without having to check a bunch of arbitrary things.
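One possible alternative structure is a "long" table with one row per (ColA, key, field) combination, so the number of columns stays fixed however many entries ColB holds. The sketch below uses the example data from the question; the column names key, field and value and the use of json.loads instead of pd.read_json are my own assumptions, not something the question prescribes.

```python
import json
import pandas as pd

# Example data from the question, written as plain lists.
df = pd.DataFrame({
    "ColA": [123, 234, 345, 456, 567],
    "ColB": [
        '[{"key":"keyValue=1","valA":"8","valB":"18"},{"key":"keyValue=2","valA":"9","valB":"19"}]',
        '[{"key":"keyValue=2","valA":"28","valB":"38"},{"key":"keyValue=3","valA":"29","valB":"39"}]',
        '[{"key":"keyValue=4","valA":"48","valC":"58"}]',
        '[]',
        None,
    ],
})

records = []
for col_a, raw in zip(df["ColA"], df["ColB"]):
    if pd.isna(raw):
        continue  # null cell: nothing to unpack
    for entry in json.loads(raw):  # an empty list simply yields no rows
        key = entry["key"].split("=")[-1]  # text after the last "="
        for field, value in entry.items():
            if field != "key":
                records.append(
                    {"ColA": col_a, "key": key, "field": field, "value": value}
                )

long_df = pd.DataFrame(records, columns=["ColA", "key", "field", "value"])
print(long_df)
```

From this shape, ordinary filtering answers questions such as "which rows have a valA for key 2", and a wide layout can still be produced later with pivot if it is really needed.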