original df
list1 = ['apple','lemon']
list2 = [[('taste','sweet'),('sweetness','5')],[('taste','sour'),('sweetness','0')]]
df = pd.DataFrame(list(zip(list1,list2)), columns=['fruit', 'description'])
df.head()
desired output
list3 = ['apple','lemon']
list4 = ['sweet','sour']
list5 = ['5','0']
df2 = pd.DataFrame(list(zip(list3,list4,list5)), columns=['fruit', 'taste', 'sweetness'])
df2.head()
Here is what I tried, but it seems 'weird': removing the punctuation one piece at a time, and only then converting to rows:
df['description'] = df['description'].astype(str)
df['description'] = df['description'].str[1:-1]
df['description'] = df['description'].str.replace("(", "", regex=False)
df.head()
Is there a better way to convert the list into the desired rows and columns?
Thanks
Create dictionaries from the tuples, extract the column with DataFrame.pop, pass the result to the DataFrame constructor, and finally append it to the original with DataFrame.join:
L = [dict(y) for y in df.pop('description')]
df = df.join(pd.DataFrame(L, index=df.index))
print (df)
fruit taste sweetness
0 apple sweet 5
1 lemon sour 0
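An alternative sketch, assuming every description entry is a list of (key, value) tuples: turn each list into a dict with apply and expand the dicts into columns via pd.Series.
import pandas as pd

list1 = ['apple', 'lemon']
list2 = [[('taste', 'sweet'), ('sweetness', '5')],
         [('taste', 'sour'), ('sweetness', '0')]]
df = pd.DataFrame(list(zip(list1, list2)), columns=['fruit', 'description'])

# each list of tuples becomes a dict, each dict is expanded into columns
expanded = df['description'].apply(dict).apply(pd.Series)
out = df.drop(columns='description').join(expanded)
print (out)
   fruit  taste sweetness
0  apple  sweet         5
1  lemon   sour         0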
Related
I have a dictionary of dataframes df_dict. I then have a substring "blue". I want to identify the name of the dataframe in my dictionary of dataframes that has at least one column that has a name containing the substring "blue".
I am thinking of trying something like:
for df in df_dict:
    if df.columns.contains('blue'):
        return df
    else:
        pass
However, I am not sure if a for loop is necessary here. How can I find the name of the dataframe I am looking for in my dictionary of dataframes?
I think a loop is necessary to iterate over the items of the dictionary:
df1 = pd.DataFrame({"aa_blue": [1,2,3],
'col':list('abc')})
df2 = pd.DataFrame({"f": [1,2,3],
'col':list('abc')})
df3 = pd.DataFrame({"g": [1,2,3],
'bluecol':list('abc')})
df_dict = {'df1_name' : df1, 'df2_name' : df2, 'df3_name' : df3}
out = [name for name, df in df_dict.items() if df.columns.str.contains('blue').any()]
print (out)
['df1_name', 'df3_name']
Or:
out = [name for name, df in df_dict.items() if any('blue' in y for y in df.columns)]
print (out)
['df1_name', 'df3_name']
For a list of DataFrames use:
out = [df for name, df in df_dict.items() if df.columns.str.contains('blue').any()]
out = [df for name, df in df_dict.items() if any('blue' in y for y in df.columns)]
print (out)
[ aa_blue col
0 1 a
1 2 b
2 3 c, g bluecol
0 1 a
1 2 b
2 3 c]
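If only the first matching name is needed, a minimal sketch reusing the same df_dict: a generator expression with next stops at the first hit instead of building the whole list (None is returned when nothing matches).
out = next((name for name, df in df_dict.items()
            if df.columns.str.contains('blue').any()), None)
print (out)
df1_name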
I just asked a similar question, rename columns according to list, which has a correct answer for how to add suffixes to column names. But I have a new issue: I want to rename the actual columns index name per dataframe. I have three lists of dataframes (some of the dataframes share duplicate columns index names, and actual dataframe names as well, but that's not the issue; the issue is the duplicated original columns.name). I simply want to append a suffix to each dataframe's columns.name within each list, using the entry of the suffix list that matches the list's position.
Here is an example of the data and the output I would like:
# add string to end of x in list of dfs
df1, df2, df3, df4 = (pd.DataFrame(np.random.randint(0,10,size=(10, 2)), columns=('a', 'b')),
                      pd.DataFrame(np.random.randint(0,10,size=(10, 2)), columns=('c', 'd')),
                      pd.DataFrame(np.random.randint(0,10,size=(10, 2)), columns=('e', 'f')),
                      pd.DataFrame(np.random.randint(0,10,size=(10, 2)), columns=('g', 'h')))
df1.columns.name = 'abc'
df2.columns.name = 'abc'
df3.columns.name = 'efg'
df4.columns.name = 'abc'
cat_a = [df2, df1]
cat_b = [df3, df2, df1]
cat_c = [df1]
dfs = [cat_a, cat_b, cat_c]
suffix = ['group1', 'group2', 'group3']
# expected output =
#for df in cat_a: df.columns.name = df.columns.name + 'group1'
#for df in cat_b: df.columns.name = df.columns.name + 'group2'
#for df in cat_c: df.columns.name = df.columns.name + 'group3'
And here is some code I have written that doesn't work: where columns.name values are duplicated across dataframes, multiple suffixes get appended.
for x, df in enumerate(dfs):
    for i in df:
        n = [(i.columns.name + '_' + str(suffix[x])) for out in i.columns.name]
        i.columns.name = n[x]
Thank you for looking, I really appreciate it.
Your current code is not working because you have multiple references to the same df in your lists, so only the last change matters. You need to make copies.
Assuming you want to change the columns index name for each df in dfs, you can use a list comprehension:
dfs = [[d.rename_axis(suffix[i], axis=1) for d in group]
       for i, group in enumerate(dfs)]
output:
>>> dfs[0][0]
group1 c d
0 5 0
1 9 3
2 3 9
3 4 2
4 1 0
5 7 6
6 5 2
7 8 0
8 1 2
9 7 2
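If the intent is instead to append the suffix to the existing columns.name (giving e.g. abc_group1, as in the expected-output comments) rather than replace it, a minimal sketch starting from the original dfs list and assuming every columns.name is already set:
# append the suffix for the list's position to each frame's existing columns.name
dfs = [[d.rename_axis(f'{d.columns.name}_{suffix[i]}', axis=1) for d in group]
       for i, group in enumerate(dfs)]
print (dfs[0][0].columns.name)
abc_group1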
Dataframe 1
df1 = pd.DataFrame([[1221,"aptq",[{'id': 100051}, {'id': 100050}]]], columns = ["offid","name","sub_ids"])
offid name sub_ids
0 1221 aptq [{'id': 100051}, {'id': 100050}]
Dataframe 2
df2 = pd.DataFrame([[100051, "zonal"], [100050, "upper"],
[100056, "capital | national"]], columns=["id", "name"])
id name
0 100051 zonal
1 100050 upper
2 100056 capital | national
Result DataFrame
offid name sub_ids
1221 aptq [zonal, upper]
I want to replace the values in the sub_ids column of DataFrame 1 with the name of each id from DataFrame 2, to achieve a result like the Result DataFrame above. Any help will be appreciated.
Use Series.explode on sub_ids, then Series.str.get to extract the value of the key id from each dictionary, then map the ids to names in df2 with Series.map, and finally group by level=0 and aggregate with list:
names = (
    df1['sub_ids'].explode().str.get('id')
        .map(df2.set_index('id')['name'])
        .groupby(level=0).agg(list)
)
df = df1.assign(sub_ids=names)
Result:
print(df)
offid name sub_ids
0 1221 aptq [zonal, upper]
You can use the following:
df1 = pd.DataFrame([[1221,"aptq",[{'id': 100051}, {'id': 100050}]]], columns = ["offid","name","sub_ids"])
df2 = pd.DataFrame([[100051, "zonal"],
                    [100050, "upper"],
                    [100056, "capital | national"]], columns=["id", "name"])
df2 = df2.set_index("id").T.to_dict(orient='records')[0]
Now we just create a list and look for it in the dictionary:
df1["sub_ids"] = df1["sub_ids"].apply(lambda row: [item for sublist in [list(row[i].values()) for i in range(len(row))] for item in sublist] if len(row) > 0 else "-")
df1["sub_ids"] = df1["sub_ids"].apply(lambda row: [df2[row[i]] for i in range(len(row))] if len(row)>0 else "-")
df1
offid name sub_ids
0 1221 aptq [zonal, upper]
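A further sketch, starting again from the original df1 and df2: build an id -> name lookup once and resolve each list of dicts in a single apply (this assumes every dict in sub_ids has an 'id' key that exists in df2):
import pandas as pd

df1 = pd.DataFrame([[1221, "aptq", [{'id': 100051}, {'id': 100050}]]],
                   columns=["offid", "name", "sub_ids"])
df2 = pd.DataFrame([[100051, "zonal"], [100050, "upper"],
                    [100056, "capital | national"]], columns=["id", "name"])

# one dict lookup instead of two passes over the column
lookup = df2.set_index('id')['name'].to_dict()
df1['sub_ids'] = df1['sub_ids'].apply(lambda lst: [lookup[d['id']] for d in lst])
print(df1)
   offid  name         sub_ids
0   1221  aptq  [zonal, upper]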
I am attempting to concatenate four dataframes with identical indexes but different Multi-indexed columns.
When I use:
df = pd.concat([df_wa, df_avg, df_sum, df_count], sort=False)
and then write to Excel, the DataFrame prints the MultiIndex with the first level on the top row of Excel and the other level on the row below.
cash ... fees ... apr
WA ... count ... avg
However, the axis needs to be set to axis=1 so that the indexes don't repeat for every new dataframe. So I use:
df = pd.concat([df_wa, df_avg, df_sum, df_count], axis=1, sort=False)
But then I lose the MultiIndex, as it is converted to tuples.
(free_cash_flow, WA) ... (fees_received, count)....(apr, avg)
I have tried to use levels, keys, and unstack with no success. Any idea how to un-flatten the index?
If you set the MultiIndex explicitly with df.columns = <the MultiIndex>, it works out fine. The frames just have to have the same number of levels.
example:
array1 = ['animal', 'animal']
array2 = ['has wings', 'no wings']
tuples = list(zip(array1, array2))
col = pd.MultiIndex.from_tuples(tuples)
df = pd.DataFrame([['bird', 'monkey'], ['dragon','horse']])
df.columns = col
array1 = ['foo', 'foo']
array2 = ['baz', 'bar']
tuples = list(zip(array1, array2))
col = pd.MultiIndex.from_tuples(tuples)
df2 = pd.DataFrame([['bird', 'monkey'], ['dragon','horse']])
df2.columns = col
new_df = pd.concat([df, df2], axis=1)
print(new_df)
output:
animal foo
has wings no wings baz bar
0 bird monkey bird monkey
1 dragon horse dragon horse
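If the concatenated frame has already ended up with flat tuple labels like (free_cash_flow, WA), the MultiIndex can usually be rebuilt directly from those tuples. A minimal sketch on a hypothetical flattened frame, assuming every column label is a tuple of the same length:
import pandas as pd

# hypothetical frame whose MultiIndex came out flattened into tuple labels
flat = pd.DataFrame([[1, 2]])
flat.columns = pd.Index([('cash', 'WA'), ('fees', 'count')], tupleize_cols=False)

# rebuild the two-level MultiIndex from the tuples
flat.columns = pd.MultiIndex.from_tuples(flat.columns)
print (flat.columns.nlevels)
2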
I have a dataframe with unique values in each column:
df1 = pd.DataFrame([["Phys","Shane","NY"],["Chem","Mark","LA"],
["Maths","Jack","Mum"],["Bio","Sam","CT"]],
columns = ["cls1","cls2","cls3"])
print(df1)
cls1 cls2 cls3
0 Phys Shane NY
1 Chem Mark LA
2 Maths Jack Mum
3 Bio Sam CT
And a list l1:
l1=["Maths","Bio","Shane","Mark"]
print(l1)
['Maths', 'Bio', 'Shane', 'Mark']
Now I want to retrieve the columns of the dataframe that contain elements from the list, together with those matching elements.
Expected Output:
{'cls1' : ['Maths','Bio'], 'cls2': ['Shane','Mark']}
The code I have:
cls = []
for cols in df1.columns:
    mask = df1[cols].isin(l1)
    if mask.any():
        cls.append(cols)
print(cls)
The output of above code:
['cls1', 'cls2']
I'm struggling to get the common elements between the dataframe and the list and to convert them into a dictionary.
Any suggestions are welcome.
Thanks.
Use DataFrame.isin for the mask, keep only the matching values by boolean indexing (non-matches become NaN), and reshape with DataFrame.stack, which drops them:
df = df1[df1.isin(l1)].stack()
print (df)
0 cls2 Shane
1 cls2 Mark
2 cls1 Maths
3 cls1 Bio
dtype: object
Last, create the lists with a dict comprehension:
d = {k:v.tolist() for k,v in df.groupby(level=1)}
print(d)
{'cls2': ['Shane', 'Mark'], 'cls1': ['Maths', 'Bio']}
Another solution:
d = {}
for cols in df1.columns:
    mask = df1[cols].isin(l1)
    if mask.any():
        d[cols] = df1.loc[mask, cols].tolist()
print(d)
{'cls2': ['Shane', 'Mark'], 'cls1': ['Maths', 'Bio']}
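The same result can also be written as a single dict comprehension; a compact sketch equivalent to the loop above (it evaluates isin twice per column, which is fine for small frames):
d = {col: df1.loc[df1[col].isin(l1), col].tolist()
     for col in df1.columns if df1[col].isin(l1).any()}
print (d)
{'cls1': ['Maths', 'Bio'], 'cls2': ['Shane', 'Mark']}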