Merge dfs in a dictionary based on a column key - python

I have a dictionary like so: {key_1: pd.DataFrame, key_2: pd.DataFrame, ...}.
Each of these dfs within the dictionary has a column called 'ID'.
Not every ID appears in every dataframe, so the dataframes have different sizes.
Is there any way I could combine these into one large dataframe?
Here's a minimal reproducible example of the data:
import pandas as pd
data1 = [{'ID': 's1', 'country': 'Micronesia', 'Participants':3},
{'ID':'s2', 'country': 'Thailand', 'Participants': 90},
{'ID':'s3', 'country': 'China', 'Participants': 36},
{'ID':'s4', 'country': 'Peru', 'Participants': 30}]
data2 = [{'ID': '1', 'country': 'Micronesia', 'Kids_per_participant':3},
{'ID':'s2', 'country': 'Thailand', 'Kids_per_participant': 9},
{'ID':'s3', 'country': 'China', 'Kids_per_participant': 39}]
data3= [{'ID': 's1', 'country': 'Micronesia', 'hair_style_rank':3},
{'ID':'s2', 'country': 'Thailand', 'hair_style_rank': 9}]
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
df3 = pd.DataFrame(data3)
dict_example={'df1_key':df1,'df2_key':df2,'df3_key':df3}
pd.merge(dict_example.values(), on="ID", how="outer")

For a dict with an arbitrary number of keys, you could do this:
i = list(dict_example.keys())
newthing = dict_example[i[0]]  # start from the first DataFrame
for j in range(1, len(i)):
    # outer-merge each remaining DataFrame on the shared 'ID' column
    newthing = newthing.merge(dict_example[i[j]], on='ID', how='outer')
First, make a list of the dictionary's keys. Second, take the first DataFrame as the starting point. Then iterate through the remaining DataFrames and merge each one into the result. I did notice that every DataFrame also has a country column, but it isn't listed in your on argument, so merging on 'ID' alone leaves suffixed duplicates such as country_x and country_y. Do you want to join on country as well? If so, replace the merge above with this one, which changes the join criteria to a list that includes country:
newthing = newthing.merge(dict_example[i[j]],on=['ID','country'], how = 'outer')
See the pandas documentation on merge.
If you don't mind overwriting one of your original DataFrames, the code could be shorter, like this:
for j in range(1, len(i)):
    df1 = df1.merge(dict_example[i[j]], on=['ID', 'country'], how='outer')
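The same chain of merges can also be written as a fold with functools.reduce. This is only a minimal sketch, not the answerer's code: the three small frames are cut-down stand-ins for the question's data, and the names frames and merged are illustrative.

```python
from functools import reduce

import pandas as pd

# Cut-down stand-ins for the question's DataFrames (illustrative data).
frames = {
    'df1_key': pd.DataFrame({'ID': ['s1', 's2', 's3'],
                             'country': ['Micronesia', 'Thailand', 'China'],
                             'Participants': [3, 90, 36]}),
    'df2_key': pd.DataFrame({'ID': ['s2', 's3'],
                             'country': ['Thailand', 'China'],
                             'Kids_per_participant': [9, 39]}),
    'df3_key': pd.DataFrame({'ID': ['s1'],
                             'country': ['Micronesia'],
                             'hair_style_rank': [3]}),
}

# Fold the frames left to right, outer-merging on the shared key columns.
merged = reduce(
    lambda left, right: left.merge(right, on=['ID', 'country'], how='outer'),
    frames.values(),
)
print(merged)
```

IDs that are missing from a frame end up with NaN in that frame's columns, and none of the original DataFrames is modified.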

Related

Convert a multi-valued dict into a pandas dataframe

I want to convert this dict into a pandas dataframe where each key becomes a column and values in the list become the rows:
my_dict:
{'Last updated': ['2021-05-18T15:24:19.000Z', '2021-05-18T15:24:19.000Z'],
'Symbol': ['BTC', 'BNB', 'XRP', 'ADA', 'BUSD'],
'Name': ['Bitcoin', 'Binance Coin', 'XRP', 'Cardano', 'Binance USD'],
'Rank': [1, 3, 7, 4, 25],
}
The lists in my_dict can also have some missing values, which should appear as NaNs in dataframe.
This is how I'm currently trying to append it into my dataframe:
df = pd.DataFrame(columns=['Last updated',
                           'Symbol',
                           'Name',
                           'Rank'])
df = df.append(my_dict, ignore_index=True)
#print(df)
df.to_excel(r'\walletframe.xlsx', index = False, header = True)
But my output only has a single row containing all the values.
The answer was pretty simple. Instead of using
df = df.append(my_dict)
I used
df = pd.DataFrame.from_dict(my_dict, orient='index').T
With orient='index', each key becomes a row and the shorter lists are padded with NaN; transposing then turns the keys into columns.
Credits to @Ank, who helped me find the solution!
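Another way to get one column per key with NaN padding is to wrap each list in a pd.Series before building the frame. The sketch below assumes the my_dict from the question and is not the approach the asker settled on.

```python
import pandas as pd

my_dict = {
    'Last updated': ['2021-05-18T15:24:19.000Z', '2021-05-18T15:24:19.000Z'],
    'Symbol': ['BTC', 'BNB', 'XRP', 'ADA', 'BUSD'],
    'Name': ['Bitcoin', 'Binance Coin', 'XRP', 'Cardano', 'Binance USD'],
    'Rank': [1, 3, 7, 4, 25],
}

# Each Series gets its own 0..n-1 index; pandas aligns the Series on that
# index and fills the positions a shorter list does not cover with NaN.
df = pd.DataFrame({key: pd.Series(values) for key, values in my_dict.items()})
print(df)
```

Because no transpose is involved, columns whose lists are complete (such as Rank) keep their own dtype.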

Create a dictionary of unique values of a column in a dataframe in pandas

I have a dataframe:
import pandas as pd
df = pd.DataFrame({
'ID': ['ABC', 'ABC', 'ABC', 'XYZ', 'XYZ', 'XYZ'],
'value': [100, 120, 130, 200, 190, 210],
'value2': [2100, 2120, 2130, 2200, 2190, 2210],
'state': ['init','mid', 'final', 'init', 'mid', 'final'],
})
I want to create a dictionary of the unique values of the column 'ID'. I can extract the unique values with:
df.ID.unique()
But that gives me an array. I want the output to be a dictionary, which looks like this:
dict = {0:'ABC', 1: 'XYZ'}
If the number of unique entries in the column is n, then the keys should start at 0 and go up to n-1. The values should be the names of the unique entries in the column.
The actual dataframe has thousands of rows and is often updated, so I cannot maintain the dict manually.
Try this:
dict(enumerate(df.ID.unique()))
{0: 'ABC', 1: 'XYZ'}
If you want the unique values of a particular column as a dict, try:
val_dict = {idx: value for idx, value in enumerate(df["ID"].unique())}
Output when printing val_dict:
{0: 'ABC', 1: 'XYZ'}
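If the integer code for each row is needed as well as the dictionary, pd.factorize returns both in one call. This is a hedged sketch rather than part of either answer; the column name ID_code and the variable id_map are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'ID': ['ABC', 'ABC', 'ABC', 'XYZ', 'XYZ', 'XYZ'],
    'value': [100, 120, 130, 200, 190, 210],
})

# codes holds one integer per row; uniques holds the distinct values
# in order of first appearance.
codes, uniques = pd.factorize(df['ID'])

id_map = dict(enumerate(uniques))
print(id_map)  # {0: 'ABC', 1: 'XYZ'}

# Hypothetical extra column that stores the code next to each row.
df['ID_code'] = codes
```

Rerunning these lines after the dataframe is updated rebuilds the mapping, so nothing has to be maintained by hand.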

How to create a dict of dicts from pandas dataframe?

I have a dataframe df
id price date zipcode
u734 8923944 2017-01-05 AERIU87
uh72 9084582 2017-07-28 BJDHEU3
u029 299433 2017-09-31 038ZJKE
I want to create a dictionary with the following structure
{'id': xxx, 'data': {'price': xxx, 'date': xxx, 'zipcode': xxx}}
What I have done so far
ids = df['id']
prices = df['price']
dates = df['date']
zips = df['zipcode']
d = {'id':idx, 'data':{'price':p, 'date':d, 'zipcode':z} for idx,p,d,z in zip(ids,prices,dates,zips)}
>>> SyntaxError: invalid syntax
but I get the error above.
What would be the correct way to do this, using either a list comprehension or pandas .to_dict()?
Bonus points: what is the complexity of the algorithm, and is there a more efficient way to do this?
I'd suggest the list comprehension.
v = df.pop('id')  # removes the 'id' column from df and returns it
data = [
    {'id': i, 'data': j}
    for i, j in zip(v, df.to_dict(orient='records'))
]
Or a compact version:
data = [dict(id=i, data=j) for i, j in zip(df.pop('id'), df.to_dict(orient='records'))]
Note that if you're popping id inside the expression, it has to be the first argument to zip, so that the column is removed from df before to_dict runs.
print(data)
[{'data': {'date': '2017-09-31',
'price': 299433,
'zipcode': '038ZJKE'},
'id': 'u029'},
{'data': {'date': '2017-01-05',
'price': 8923944,
'zipcode': 'AERIU87'},
'id': 'u734'},
{'data': {'date': '2017-07-28',
'price': 9084582,
'zipcode': 'BJDHEU3'},
'id': 'uh72'}]
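If df should keep its id column, the pop can be applied to each record dict instead of to the dataframe. This is a minimal sketch under that assumption, with the question's table typed in by hand; it makes a single pass over the rows, so the work grows linearly with the number of rows.

```python
import pandas as pd

df = pd.DataFrame({
    'id': ['u734', 'uh72', 'u029'],
    'price': [8923944, 9084582, 299433],
    'date': ['2017-01-05', '2017-07-28', '2017-09-31'],  # kept as strings
    'zipcode': ['AERIU87', 'BJDHEU3', '038ZJKE'],
})

# to_dict(orient='records') builds a fresh dict per row, so popping 'id'
# from each record leaves df itself untouched.
data = [
    {'id': rec.pop('id'), 'data': rec}
    for rec in df.to_dict(orient='records')
]
print(data[0])
```

The dict literal evaluates 'id' first, so by the time rec is stored under 'data' it no longer contains the id key.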

Separate pd DataFrame Rows that are dictionaries into columns

I am extracting some data from an API and having challenges transforming it into a proper dataframe.
The resulting DataFrame df is arranged as such:
Index Column
0 {'email#email.com': [{'action': 'data', 'date': 'date'}, {'action': 'data', 'date': 'date'}]}
1 {'different-email#email.com': [{'action': 'data', 'date': 'date'}]}
I am trying to split the emails into one column and the list into a separate column:
Index Column1 Column2
0 email#email.com [{'action': 'data', 'date': 'date'}, {'action': 'data', 'date': 'date'}]
Ideally, each 'action'/'date' pair would have its own separate row; however, I believe I can do the further unpacking myself.
After looking around, I tried lots of solutions without success, such as:
df.apply(pd.Series) # does nothing
pd.DataFrame(df['column'].values.tolist()) # makes each dictionary key a separate column,
# where most of the rows are NaN except the one that has the paired value
Edit:
Since many of the comments asked about the initial format of the data from the API, it's a list of dictionaries:
[{'email#email.com': [{'action': 'data', 'date': 'date'}, {'action': 'data', 'date': 'date'}]},{'different-email#email.com': [{'action': 'data', 'date': 'date'}]}]
Thanks
One naive way of doing this is as follows:
import pandas as pd
inp = [{'email#email.com': [{'action': 'data', 'date': 'date'}, {'action': 'data', 'date': 'date'}]}
       , {'different-email#email.com': [{'action': 'data', 'date': 'date'}]}]
rows = []
for each in inp:  # iterate through the list of dicts
    for k, v in each.items():  # take each key/value pair
        for eachv in v:  # the value is a list, so iterate through it
            # DataFrame.set_value was removed in pandas 1.0, so collect the rows instead
            rows.append({'Column1': k, 'Column2': str(eachv)})
df = pd.DataFrame(rows)
I am sure there might be a better way of writing this. Hope this helps :)
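Assuming you have already read it into a dataframe, you can use the following:
import ast
# parse the strings into dicts (skip this step if the column already holds dicts)
df['Column'] = df['Column'].apply(ast.literal_eval)
# each dict has a single key (the email) and a single value (the list of actions);
# dict views are not subscriptable in Python 3, so take the first item via iter()
df['email'] = df['Column'].apply(lambda x: next(iter(x)))
df['value'] = df['Column'].apply(lambda x: next(iter(x.values())))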
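To get one row per action/date pair, as the question ideally wanted, the list returned by the API can be flattened before a DataFrame is built. This is a sketch assuming the list-of-dicts format shown in the question's edit; the column name email is an illustrative choice.

```python
import pandas as pd

inp = [
    {'email#email.com': [{'action': 'data', 'date': 'date'},
                         {'action': 'data', 'date': 'date'}]},
    {'different-email#email.com': [{'action': 'data', 'date': 'date'}]},
]

# One output row per event: the email key is repeated and the event's own
# keys ('action', 'date') become columns.
rows = [
    {'email': email, **event}
    for record in inp
    for email, events in record.items()
    for event in events
]
df = pd.DataFrame(rows)
print(df)
```

Flattening in plain Python first avoids the sparse, mostly-NaN frame that appears when each email becomes its own column.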

Sorting array data by common date

I have a .csv file with many rows and 3 columns: Date, Rep, and Sales. I would like to use Python to generate a new array that groups the data by Date and, for the given date, sorts the Reps by Sales. As an example, my input data looks like this:
salesData = [[201703,'Bob',3000], [201703,'Sarah',6000], [201703,'Jim',9000],
[201704,'Bob',8000], [201704,'Sarah',7000], [201704,'Jim',12000],
[201705,'Bob',15000], [201705,'Sarah',14000], [201705,'Jim',8000],
[201706,'Bob',10000], [201706,'Sarah',18000]]
My desired output would look like this:
sortedData = [[201703,'Jim', 'Sarah', 'Bob'], [201704,'Jim', 'Bob',
'Sarah'], [201705,'Bob', 'Sarah', 'Jim'], [201706, 'Sarah', 'Bob']]
I am new to Python, but I have searched quite a bit for a solution with no success. Most of my search results lead me to believe there may be an easy way to do this using pandas (which I have not used) or numpy (which I have used).
Any suggestions would be greatly appreciated. I am using Python 3.6.
Use Pandas!
import pandas as pd
salesData = [[201703, 'Bob', 3000], [201703, 'Sarah', 6000], [201703, 'Jim', 9000],
[201704, 'Bob', 8000], [201704, 'Sarah', 7000], [201704, 'Jim', 12000],
[201705, 'Bob', 15000], [201705, 'Sarah', 14000], [201705, 'Jim', 8000],
[201706, 'Bob', 10000], [201706, 'Sarah', 18000]]
sales_df = pd.DataFrame(salesData)
result = []
for name, group in sales_df.groupby(0):  # column 0 is the date
    sorted_df = group.sort_values(2, ascending=False)  # column 2 is the sales figure
    result.append([name] + list(sorted_df[1]))  # column 1 is the rep
print(result)
Without pandas, you can try this one line answer:
sortedData = [[i]+[item[1] for item in salesData if item[0]==i] for i in sorted(set([item[0] for item in salesData]))]
EDIT:
You can do this to order each inner list by sales:
sortedData = [[i]+[item[1] for item in sorted(salesData, key=lambda x: -x[2]) if item[0]==i] for i in sorted(set([item[0] for item in salesData]))]
Note that the sorted(salesData, key=lambda x: -x[2]) part performs the ordering.
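The one-liners above rescan salesData once per date. A standard-library alternative, sketched here with itertools.groupby, sorts the rows once and then groups adjacent rows; it is offered as one possible variant, not as the answerers' code.

```python
from itertools import groupby
from operator import itemgetter

salesData = [[201703, 'Bob', 3000], [201703, 'Sarah', 6000], [201703, 'Jim', 9000],
             [201704, 'Bob', 8000], [201704, 'Sarah', 7000], [201704, 'Jim', 12000],
             [201705, 'Bob', 15000], [201705, 'Sarah', 14000], [201705, 'Jim', 8000],
             [201706, 'Bob', 10000], [201706, 'Sarah', 18000]]

# Sort by date ascending, then by sales descending within each date.
ordered = sorted(salesData, key=lambda row: (row[0], -row[2]))

# groupby only merges adjacent rows, which is why the sort comes first.
sortedData = [
    [date] + [rep for _, rep, _ in rows]
    for date, rows in groupby(ordered, key=itemgetter(0))
]
print(sortedData)
```

For the sample data this produces the desired output from the question, with each date followed by its reps in descending order of sales.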
