Join dataframe iteratively - python

Say I have a long list and I want to iteratively join the pieces to produce a final dataframe.
The data is originally in a dict, so I need to iterate over the dictionary first.
header = ['apple', 'pear', 'cocoa']
for key, value in data.items():
    for idx in header:
        # Flatten the dictionary to a dataframe
        data_df = pd.json_normalize(data[key][idx])
        # Here I start to get lost...
How can I iteratively join the dataframes?
Manually it can be done like this:
data_df = pd.json_normalize(data["ParentKey"]['apple'])
data_df1 = pd.json_normalize(data["ParentKey"]['pear'])
final_df = data_df1.join(data_df, lsuffix='_left')
# or
final_df = pd.concat([data_df, data_df1], axis=1, sort=False)
Since the list will be large, I want to iterate over it instead. How can I achieve this?

Is this what you're looking for? You can use k as a counter to indicate whether or not it's the first iteration, and for all later iterations join to that same dataframe:
header = ['apple', 'pear', 'cocoa']
k = 0
for key, value in data.items():
    for idx in header:
        data_df = pd.json_normalize(data[key][idx])
        if k == 0:
            final_df = data_df
        else:
            final_df = final_df.join(data_df, lsuffix='_left')
        k += 1
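Since repeated join calls get more expensive as the frame grows, another option is to collect the flattened frames in a list and concatenate once at the end. A minimal sketch, with a made-up nested dict standing in for the asker's data:

```python
import pandas as pd

# Hypothetical nested data shaped like the question's data[key][idx]
data = {
    "ParentKey": {
        "apple": {"a": 1, "b": 2},
        "pear": {"c": 3},
        "cocoa": {"d": 4},
    }
}
header = ['apple', 'pear', 'cocoa']

frames = []
for key in data:
    for idx in header:
        # Flatten each sub-dict to a one-row dataframe
        frames.append(pd.json_normalize(data[key][idx]))

# One concat at the end instead of a join on every iteration
final_df = pd.concat(frames, axis=1, sort=False)
print(final_df)
```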

Related

Extract key value pairs from dict in pandas column using list items in another column

I'm trying to create a new column containing the key/value pairs extracted from the dict in one column, using the list items in a second column.
Sample Data:
names name_dicts
['Mary', 'Joe'] {'Mary':123, 'Ralph':456, 'Joe':789}
Expected Result:
names name_dicts new_col
['Mary', 'Joe'] {'Mary':123, 'Ralph':456, 'Joe':789} {'Mary':123, 'Joe':789}
I have attempted to use AST to convert the name_dicts column to a column of true dictionaries.
This function errored out with a "cannot convert string" error.
# col here is the df['name_dicts'] column
def get_name_pairs(col):
    for k, v in col.items():
        if k.isin(df['names']):
            return
Using a list comprehension and operator.itemgetter:
from operator import itemgetter

df['new_col'] = [dict(zip(l, itemgetter(*l)(d)))
                 for l, d in zip(df['names'], df['name_dicts'])]
output:
names name_dicts new_col
0 [Mary, Joe] {'Mary': 123, 'Ralph': 456, 'Joe': 789} {'Mary': 123, 'Joe': 789}
used input:
df = pd.DataFrame({'names': [['Mary', 'Joe']],
                   'name_dicts': [{'Mary':123, 'Ralph':456, 'Joe':789}]
                  })
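One caveat worth noting (my addition, not part of the answer above): an itemgetter built from a single key returns a bare value rather than a tuple, so the zip would fail for one-name lists. A small demonstration:

```python
from operator import itemgetter

d = {'Mary': 123, 'Ralph': 456, 'Joe': 789}

# With two or more keys itemgetter returns a tuple...
assert itemgetter('Mary', 'Joe')(d) == (123, 789)
# ...but with a single key it returns a bare scalar, so zip() would fail
assert itemgetter('Mary')(d) == 123

# A dict comprehension handles both cases uniformly
l = ['Mary']
new = {k: d[k] for k in l if k in d}
print(new)
```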
You can apply a lambda function with a dictionary comprehension at row level to get the values from the dict in the second column based on the keys in the list of the first column:
# If the column values are stored as strings:
import ast
for col in df:
    df[col] = df[col].apply(ast.literal_eval)

df['new_col'] = df.apply(lambda x: {k: x['name_dicts'].get(k, 0) for k in x['names']},
                         axis=1)
# Replace the above lambda by
# lambda x: {k: x['name_dicts'][k] for k in x['names'] if k in x['name_dicts']}
# if you want to include only key/value pairs for keys that appear in
# both the list and the dictionary
names ... new_col
0 [Mary, Joe] ... {'Mary': 123, 'Joe': 789}
[1 rows x 3 columns]
PS: ast.literal_eval runs without error for the sample data you have posted for above code.
Your function needs only a small change, and then you can use it with .apply():
import pandas as pd

df = pd.DataFrame({
    'names': [['Mary', 'Joe']],
    'name_dicts': [{'Mary':123, 'Ralph':456, 'Joe':789}],
})

def filter_data(row):
    result = {}
    for key, val in row['name_dicts'].items():
        if key in row['names']:
            result[key] = val
    return result

df['new_col'] = df.apply(filter_data, axis=1)
print(df.to_string())
Result:
names name_dicts new_col
0 [Mary, Joe] {'Mary': 123, 'Ralph': 456, 'Joe': 789} {'Mary': 123, 'Joe': 789}
EDIT:
If name_dicts holds the string "{'Mary':123, 'Ralph':456, 'Joe':789}", you can replace ' with " to get valid JSON, which you can convert to a dictionary using json.loads:
import json
df['name_dicts'] = df['name_dicts'].str.replace("'", '"').apply(json.loads)
Or parse it directly as a Python literal:
import ast
df['name_dicts'] = df['name_dicts'].apply(ast.literal_eval)
Or eventually (note that eval runs arbitrary code and is unsafe on untrusted input):
df['name_dicts'] = df['name_dicts'].apply(eval)
Full code:
import pandas as pd

df = pd.DataFrame({
    'names': [['Mary', 'Joe']],
    'name_dicts': ["{'Mary':123, 'Ralph':456, 'Joe':789}",],  # strings
})

#import json
#df['name_dicts'] = df['name_dicts'].str.replace("'", '"').apply(json.loads)

#df['name_dicts'] = df['name_dicts'].apply(eval)

import ast
df['name_dicts'] = df['name_dicts'].apply(ast.literal_eval)

def filter_data(row):
    result = {}
    for key, val in row['name_dicts'].items():
        if key in row['names']:
            result[key] = val
    return result

df['new_col'] = df.apply(filter_data, axis=1)
print(df.to_string())

How to use zip function to associate column number with value of excel cell using openpyxl

I'm creating a dictionary where the keys are row numbers and the values are lists of column numbers, ordered by that row's cell values sorted in descending order.
My code below is:
from openpyxl import load_workbook

vWB = load_workbook(filename="voting.xlsx")
vSheet = vWB.active

d = {}
for idx, row in enumerate(vSheet.values, start=1):
    row = sorted(row, reverse=True)
    d[idx] = row
output:
{1: [0.758968208500514, 0.434362232763003, 0.296177589742431, 0.0330331941352554], 2: [0.770423104229537, 0.770423104229537, 0.559322784244604, 0.455791535747786] etc..}
What I want:
{1: [4,2,1,3], 2: [3,4,1,2], etc..}
I've been trying to create a number to represent the column number of each value:
genKey = [i for i in range(values.max_column)]
And then use the zip function to associate the column number with each value:
dict(zip(column_key_list, values_list))
So I would have a dict with columns 1, 2, ..., n as keys and the cell values as values; then I could sort by value in descending order and iterate over the rows, zipping again with the row number as the key.
I'm unsure how to use this zip function and get to my desired endpoint. Any help is welcomed.
Simply use enumerate:
dictionary = {
    row_id: [cell.value for cell in row]
    for row_id, row in enumerate(vSheet.rows)
}
enumerate can be applied to iterables and returns an iterator over tuples of index and value. For example:
x = ['a', 'b', 'c']
print(list(enumerate(x)))
yields [(0, 'a'), (1, 'b'), (2, 'c')].
If you want to start at 1, use enumerate(x, 1).
You can use enumerate:
dictionary = {
    i: [cell.value for cell in row]
    for i, row in enumerate(vSheet.rows)
}
If you just want the cell values this is easy:
d = {}
for idx, row in enumerate(ws.values, start=1):
    row_s = sorted(row)  # create a sorted copy for comparison
    d[idx] = [row_s.index(v) + 1 for v in row]
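If the goal is the column numbers ordered by descending cell value, as in the expected {1: [4, 2, 1, 3]}, one way is to sort the 1-based column indices by their values. A minimal sketch on plain lists (made-up numbers, no openpyxl needed):

```python
# Hypothetical row values; column numbers are 1-based positions
rows = {
    1: [0.296, 0.434, 0.033, 0.759],
    2: [0.5, 0.4, 0.9, 0.7],
}

# For each row, sort the column indices 1..n by that column's value, largest first
d = {
    row_id: sorted(range(1, len(values) + 1),
                   key=lambda col: values[col - 1],
                   reverse=True)
    for row_id, values in rows.items()
}
print(d)  # → {1: [4, 2, 1, 3], 2: [3, 4, 1, 2]}
```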

Append name of dataframes to list python

I have 10 dataframes (ex: dfc, df1, df2, df3, df4, dft1, dft2, dft3, dft4, dft5). I want to check the length of each dataframe. If the length of a dataframe is less than 2, I want to add the name of that dataframe to an empty list. How can I do this?
You can store the dataframes in a dictionary using their names as keys and then iterate over the dictionary:
dic = {'df1': df1, 'df2': df2, 'df3': df3, 'df4': df4}
d = []
for k, v in dic.items():
    if len(v) < 2:
        d.append(k)
print(d)
You can also use a list comprehension instead of the for loop:
dic = {'df1': df1, 'df2': df2, 'df3': df3, 'df4': df4}
d = [k for k, v in dic.items() if len(v) < 2]
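A runnable version of the same idea, with toy dataframes standing in for the real ones:

```python
import pandas as pd

# Toy stand-ins for the real dataframes
df1 = pd.DataFrame({'a': [1]})         # length 1
df2 = pd.DataFrame({'a': [1, 2, 3]})   # length 3
df3 = pd.DataFrame({'a': []})          # length 0

dic = {'df1': df1, 'df2': df2, 'df3': df3}
short = [name for name, frame in dic.items() if len(frame) < 2]
print(short)  # → ['df1', 'df3']
```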
If I understand you correctly, you want to create a list of the short dataframes.
I would do it like this:
dataframes = ['d','df1','df2','df3','df4','dft1','dft2','dft3','dft4','dft5']
short_dataframe = []  # the empty list
for frame in dataframes:
    if len(frame) < 2:
        short_dataframe.append(frame)  # adds the frame to the empty list
print(short_dataframe)
result of the print = ['d']

How to capture the results of searching a python dictionary

I have a dataframe like this. I am trying to search the Description column to see if it contains any of the strings in my dictionary, using for loops. The results look good to me, but I do not know how to save them to a dataframe or list, or any sort of file I can export:
import pandas as pd

data = {'ID': ['1', '2'],
        'Description': ['there is a good book which is best for kids.', 'there is a bad book which worst for kids.'],
       }
df = pd.DataFrame(data, columns=['ID', 'Description'])
myDict = {'A': {'best', 'good'}, 'D': {'bad', 'worst'}}
for i in range(len(df)):
    for key, val in myDict.items():
        for item in val:
            if item in df['Description'][i]:
                print(item)
                print(i)
good
0
best
0
bad
1
worst
1
# The output should look like this. How do I create a dataframe or list to capture the results?
# 0 good best
# 1 bad worst
Instead of printing the matches, append them to a list containing the matches for the current row of the dataframe. Then append that row's list to the overall results list.
result = []
for i in range(len(df)):
    row = [i]
    for key, val in myDict.items():
        for item in val:
            if item in df['Description'][i]:
                row.append(item)
    result.append(row)
If I understand correctly, you want to aggregate the results in some data structure, potentially even a list of tuples? I added two lines to your code snippet:
import pandas as pd

data = {'ID': ['1', '2'],
        'Description': ['there is a good book which is best for kids.', 'there is a bad book which worst for kids.'],
       }
df = pd.DataFrame(data, columns=['ID', 'Description'])
myDict = {'A': {'best', 'good'}, 'D': {'bad', 'worst'}}

results = []  # aggregate results into a list
for i in range(len(df)):
    for key, val in myDict.items():
        for item in val:
            if item in df['Description'][i]:
                print(item)
                print(i)
                results.append((item, i))  # results = [("good", 0), ("best", 0), ...]

# You can print them out like this
for x, y in results:
    print("{} {}".format(x, y))
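To get one row of matches per Description, as in the desired output, the matches can also be collected into a new column. A minimal sketch of that idea (the terms set and the matches column name are my own, not from the question):

```python
import pandas as pd

df = pd.DataFrame({'ID': ['1', '2'],
                   'Description': ['there is a good book which is best for kids.',
                                   'there is a bad book which worst for kids.']})
myDict = {'A': {'best', 'good'}, 'D': {'bad', 'worst'}}

# Flatten the dict values into one set of search terms
terms = set().union(*myDict.values())

# For each row, keep the terms that occur in the Description
df['matches'] = df['Description'].apply(
    lambda text: [t for t in terms if t in text])
print(df[['ID', 'matches']])
```

Note that set iteration order is arbitrary, so the order of terms within each list may vary between runs.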

How to remove one dictionary from dataframe

I have a dataframe, and I made a dictionary of dataframes, one per unique appId, with this command:
dfs = dict(tuple(timeseries.groupby('appId')))
After that I want to remove all groups with fewer than 30 rows from my dataframe. I removed those entries from my dictionary (dfs) and then tried this code:
pd.concat([dfs]).drop_duplicates(keep=False)
but it doesn't work.
I believe you need to compute the group sizes with transform and then filter by boolean indexing:
df = pd.concat(dfs)
df = df[df.groupby('appId')['appId'].transform('size') >= 30]

# alternative 1
# df = df[df.groupby('appId')['appId'].transform('size').ge(30)]

# alternative 2 (slower on large data)
# df = df.groupby('appId').filter(lambda x: len(x) >= 30)
Another approach is to filter the dictionary:
dfs = {k: v for k, v in dfs.items() if len(v) >= 30}
EDIT: or filter the original dataframe before grouping:
timeseries = timeseries[timeseries.groupby('appId')['appId'].transform('size') >= 30]
dfs = dict(tuple(timeseries.groupby('appId')))
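A runnable sketch of the transform-based filter, using made-up timeseries data and a threshold of 3 instead of 30 so the effect is visible on a small sample:

```python
import pandas as pd

# Hypothetical data: app 'a' has 4 rows, app 'b' only 2
timeseries = pd.DataFrame({
    'appId': ['a', 'a', 'a', 'a', 'b', 'b'],
    'value': [1, 2, 3, 4, 5, 6],
})

# Keep only apps with at least 3 rows
filtered = timeseries[
    timeseries.groupby('appId')['appId'].transform('size') >= 3]

# Rebuild the per-app dictionary from the filtered frame
dfs = dict(tuple(filtered.groupby('appId')))
print(list(dfs))  # → ['a']
```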
