Can anyone advise how to loop over every Nth item in a dictionary?
Essentially, I have a dictionary of dataframes, and I want to create a new dictionary from every 3rd dataframe item (including the first), based on the index positioning of the original. Once I have this, I would like to concatenate the dataframes together.
So, for example, if I have 12 dataframes, I would like the new dataframe to contain the first, fourth, seventh, tenth, etc.
Thanks in advance!
If the dict is required, you can use a tuple of the dict keys:
custom_dict = {
    'first': 1,
    'second': 2,
    'third': 3,
    'fourth': 4,
    'fifth': 5,
    'sixth': 6,
    'seventh': 7,
    'eighth': 8,
    'ninth': 9,
    'tenth': 10,
    'eleventh': 11,
    'twelfth': 12,
}
for key in tuple(custom_dict)[::3]:
    print(custom_dict[key])
Then you can call pandas.concat:
df = pd.concat(
    [
        custom_dict[key]
        for key in tuple(custom_dict)[::3]
    ],
    # =====================================================================
    # axis=0  # to append one DataFrame to another vertically
    # =====================================================================
    axis=1  # to append one DataFrame to another horizontally
)
This assumes custom_dict[key] returns a pandas.DataFrame, not an int as in my example above.
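For completeness, a minimal self-contained sketch with placeholder DataFrames (the names df1..df12 and the val column are only illustrative):

import pandas as pd

dfs = {f'df{i}': pd.DataFrame({'val': [i]}) for i in range(1, 13)}  # 12 toy DataFrames
keys = tuple(dfs)[::3]  # ('df1', 'df4', 'df7', 'df10')
result = pd.concat([dfs[k] for k in keys], axis=0)  # stack them vertically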
What you ask is a bit unusual. Anyway, you have two main options.
Convert your dictionary values to a list and slice that:
out = pd.concat(list(dfs.values())[::3])
output:
a b c
0 x x x
0 x x x
0 x x x
0 x x x
Or slice your dictionary keys and generate a sub-dictionary:
out = pd.concat({k: dfs[k] for k in list(dfs)[::3]})
output:
a b c
df1 0 x x x
df4 0 x x x
df7 0 x x x
df10 0 x x x
Used input:
dfs = {f'df{i+1}': pd.DataFrame([['x']*3], columns=['a', 'b', 'c']) for i in range(12)}
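If you would rather not materialize the key tuple at all, itertools.islice gives the same every-3rd selection lazily; a hedged alternative, using the dfs defined above:

from itertools import islice

out = pd.concat(dict(islice(dfs.items(), 0, None, 3)))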
I have the following dataframe:
c1 c2 freq
0 a [u] [4]
1 b [x, z, v] [8, 3, 15]
I want to get another column "dict" such that
c1 c2 freq dict
0 a [u] [4] {'u':4}
1 b [x, z, v] [8, 3, 15] {'x':8, 'z':3, 'v':15}
I'm trying this code: d["dict"] = d.apply(lambda row: dict(zip(row["c2"], row["freq"]))) but this gives the error:
KeyError: ('c2', u'occurred at index c1')
Not sure what I'm doing wrong. The whole exercise is that I have a global dictionary defined like this: {"u":4, "v":15, "x":8, "z":3} and my initial dataframe is:
c1 c2
0 a u
1 b [x, z, v]
where the [x, z, v] is a numpy array. For each row, I want to retain the top 2 elements (if it's an array) with the highest values from the global dictionary, so for the second row I'll retain x and v. To that end, I converted each element of c2 column into a list, created a new column with their respective frequencies and now want to convert into a dictionary so that I can sort it by values. Then I'll retain the top 2 keys of the dictionary of that row.
d["c2"] = d["c2"].apply(lambda x: list(set(x)))
d["freq"] = d["c2"].apply(lambda x: [c[j] for j in x])
d["dict"] = d.apply(lambda row: dict(zip(row["c2"], row["freq"])))
The third line is causing a problem. Also, if there's a more efficient procedure to do the whole thing, I'd be glad for any advice. Thanks!
Use a list comprehension:
df['dict'] = [dict(zip(a,b)) for a, b in zip(df['c2'], df['freq'])]
print (df)
c1 c2 freq dict
0 a [u] [4] {'u': 4}
1 b [x, z, v] [8, 3, 15] {'x': 8, 'z': 3, 'v': 15}
Or, in your solution, add axis=1 to process by rows:
df["dict"] = df.apply(lambda row: dict(zip(row["c2"], row["freq"])), axis=1)
You can solve your core problem more easily by using the key and reverse arguments of the sorted built-in. You simply prepare a partial func and map it over the column along with your preferred subsetting func in method-chaining style:
import pandas as pd
from functools import partial
df = pd.DataFrame({'c1': ['a', 'b'], 'c2': ['u', ['x','z','v']]})
c = {"u":4, "v":15, "x":8, "z":3}
sorter = partial(sorted, key=lambda x: c[x], reverse=True)
def subset(l):
    return l[:2]
df['highest_two'] = df['c2'].map(sorter).map(subset)
print(df)
"""
Out:
c1 c2 highest_two
0 a u [u]
1 b [x, z, v] [v, x]
"""
In the code below, I am iterating over the groups of a groupby object and printing the first item in column b of each group.
import pandas as pd
d = {
    'a': [1, 2, 3, 4, 5, 6],
    'b': [10, 20, 30, 10, 20, 30],
}
df = pd.DataFrame(d)
groups = df.groupby('b')
for name, group in groups:
    first_item_in_b = group['b'].tolist()[0]
    print(first_item_in_b)
Because the groupby result has a hierarchical index, in order to pick the first element in b I need to convert b to a list first.
How can I avoid such overhead?
I cannot just remove tolist() like so:
first_item_in_b = group['b'][0]
because it will give KeyError.
You can use Index.get_loc to get the position of column b and then use DataFrame.iat, or select the first value of the index together with the column name via DataFrame.at.
It is also possible to select by position with Series.iat or Series.iloc after selecting column b by label:
for name, group in groups:
    # first value by position from column names
    first_item_in_b = group.iat[0, group.columns.get_loc('b')]
    # first value by labels from the index
    first_item_in_b = group.at[group.index[0], 'b']
    # fast select of the first value
    first_item_in_b = group['b'].iat[0]
    # alternative
    first_item_in_b = group['b'].iloc[0]
    print(first_item_in_b)
10
20
30
Using iloc:
import pandas as pd
d = {
    'a': [1, 2, 3, 4, 5, 6],
    'b': [10, 20, 30, 10, 20, 30],
}
df = pd.DataFrame(d)
groups = df.groupby('b')
for name, group in groups:
    first_item_in_b = group['b'].iloc[0]
    print(first_item_in_b)
OUTPUT:
10
20
30
EDIT:
Or use iat, the fast integer location scalar accessor.
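If the per-group loop is not actually needed, a hedged alternative is to let pandas collect the first b value of every group in one vectorized call (a sketch based on the df defined above):

first_items = df.groupby('b')['b'].first()
print(first_items.tolist())  # [10, 20, 30]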
I have a pandas variable X which has a shape of (14931, 381).
That's 14,931 examples, each with 381 features. I want to add 483 features (each with a zero value) to each example, except I want them to come before the 381 existing ones.
How can this be done?
Create a DataFrame of zeros and call pd.concat.
v = pd.DataFrame(0, index=df.index, columns=range(483))
df = pd.concat([v, df], axis=1)
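Applied to the X from the question, this would look roughly like the following (a sketch; the new columns keep the default integer labels 0..482, which assumes they do not clash with X's existing column labels):

zeros = pd.DataFrame(0, index=X.index, columns=range(483))
X = pd.concat([zeros, X], axis=1)
print(X.shape)  # (14931, 381 + 483) == (14931, 864)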
For demonstration purposes let's set up a smaller DataFrame
(7 rows and 2 columns, with feature (column) names f1, f2, ...):
df = pd.DataFrame(data={'f1': [1, 4, 6, 5, 7, 2, 3],
                        'f2': [4, 6, 5, 0, 2, 3, 2]})
Then, let's create a DataFrame filled with zeroes, to be
prepended to df (3 columns instead of your 483):
import numpy as np

zz = pd.DataFrame(data=np.zeros((df.shape[0], 3), dtype=int),
                  columns=['p' + str(n + 1) for n in range(3)],
                  index=df.index)
As you can see:
- I named the "new" columns p1, p2 and so on,
- the index is a copy of the index in df (it will be important at the next stage).
And the last step is to join these 2 DataFrames and assign the result back to df:
df = zz.join(df)
The only change left on your side is to set the number of added columns to the proper value, as sketched below.
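For the sizes in the question that would be, roughly (a sketch, assuming X holds the (14931, 381) data and that the p-names do not collide with existing feature names):

zz = pd.DataFrame(data=np.zeros((X.shape[0], 483), dtype=int),
                  columns=['p' + str(n + 1) for n in range(483)],
                  index=X.index)
X = zz.join(X)  # shape becomes (14931, 864)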
I am using pandas to incrementally find new elements, i.e. for every row I check whether the values in the list have been seen before. If they have, we ignore them. If not, we select them.
I was able to do this using iterrows(), but I have >1M rows, so I believe a vectorized apply might be better.
Here's sample data and code. Once you run this code, you will get the expected output:
import collections.abc
from numpy import nan as NA
import pandas as pd

df = pd.DataFrame({'ID': ['A','B','C','A','B','A','A','A','D','E','E','E'],
                   'Value': [1,2,3,4,3,5,2,3,7,2,3,9]})

# wrap all elements by group in a list
Changed_df = df.groupby('ID')['Value'].apply(list).reset_index()
Changed_df = Changed_df.rename(columns={'Value': 'Elements'})
Changed_df = Changed_df.reset_index(drop=True)

def flatten(l):
    for el in l:
        if isinstance(el, collections.abc.Iterable) and not isinstance(el, (str, bytes)):
            yield from flatten(el)
        else:
            yield el

Changed_df["Elements_s"] = Changed_df['Elements'].shift()

# attempt 1: for loop
Changed_df["Diff"] = NA
Changed_df["count"] = 0
Elements_so_far = []

# replace NA with an empty list in columns that will go through list operations
for col in ["Elements", "Elements_s", "Diff"]:
    Changed_df[col] = Changed_df[col].apply(lambda d: d if isinstance(d, list) else [])

for idx, row in Changed_df.iterrows():
    diff = list(set(row['Elements']) - set(Elements_so_far))
    Changed_df.at[idx, "Diff"] = diff
    Elements_so_far.append(row['Elements'])
    Elements_so_far = flatten(Elements_so_far)
    Elements_so_far = list(set(Elements_so_far))  # keep unique elements
    Changed_df.loc[idx, "count"] = len(diff)
Commentary about the code:
I am not a fan of this code because it's clunky and inefficient.
I am saying inefficient because I have created Elements_s, which holds shifted values. Another reason for inefficiency is the for loop through the rows.
Elements_so_far keeps track of all the elements we have discovered for every row. If there is a new element that shows up, we count that in Diff column.
We also keep track of the length of new elements discovered in count column.
I'd appreciate if an expert could help me with a vectorized version of the code.
I did try the vectorized version, but I couldn't go too far.
#attempt 2:
Changed_df.apply(lambda x: [i for i in x['Elements'] if i in x['Elements_s']], axis=1)
I was inspired from How to compare two columns both with list of strings and create a new column with unique items? to do above, but I couldn't do it. The linked SO thread does row-wise difference among columns.
I am using Python 3.6.7 via Anaconda. The pandas version is 0.23.4.
You could sort, then use numpy to get the unique indexes, and then construct your groupings, e.g.:
In []:
df = df.sort_values(by='ID').reset_index(drop=True)
_, i = np.unique(df.Value.values, return_index=True)
df.iloc[i].groupby(df.ID).Value.apply(list)
Out[]:
ID
A [1, 2, 3, 4, 5]
D [7]
E [9]
Name: Value, dtype: object
Or to get close to your current output:
In []:
df = df.sort_values(by='ID').reset_index(drop=True)
_, i = np.unique(df.Value.values, return_index=True)
s1 = df.groupby(df.ID).Value.apply(list).rename('Elements')
s2 = df.iloc[i].groupby(df.ID).Value.apply(list).rename('Diff').reindex(s1.index, fill_value=[])
pd.concat([s1, s2, s2.apply(len).rename('Count')], axis=1)
Out[]:
Elements Diff Count
ID
A [1, 4, 5, 2, 3] [1, 2, 3, 4, 5] 5
B [2, 3] [] 0
C [3] [] 0
D [7] [7] 1
E [2, 3, 9] [9] 1
One alternative using drop_duplicates and groupby:
# Groupby and apply list func.
df1 = df.groupby('ID')['Value'].apply(list).to_frame('Elements')
# Sort values, drop duplicates by the Value column, then use groupby.
df1['Diff'] = df.sort_values(['ID','Value']).drop_duplicates('Value').groupby('ID')['Value'].apply(list)
# Use str.len for count.
df1['Count'] = df1['Diff'].str.len().fillna(0).astype(int)
# To fill NaN with empty list
df1['Diff'] = df1.Diff.apply(lambda x: x if type(x)==list else [])
Elements Diff Count
ID
A [1, 4, 5, 2, 3] [1, 2, 3, 4, 5] 5
B [2, 3] [] 0
C [3] [] 0
D [7] [7] 1
E [2, 3, 9] [9] 1
I've a list like this one:
from numpy import array

categories_list = [
    ['a', array([12994, 1262824, 145854, 92469]),
     'b', array([273300]),
     'c', array([341395, 32857711])],
    ['a', array([356424311, 165573412, 2032850784]),
     'b', array([2848105, 228835]),
     'c', array([])],
    ['a', array([1431689, 30655043, 1739919]),
     'b', array([597, 251911, 246600]),
     'c', array([35590])]
]
where each array belongs to the letter before.
Example: a -> array([ 12994, 1262824, 145854, 92469]), b -> array([273300]), 'a' -> array([1431689, 30655043, 1739919]) and so on...
So, is it possible to retrieve the total items number for each letter?
Desiderata:
----------
a 10
b 6
c 3
All suggestions are welcome
pd.DataFrame(
    [dict(zip(x[::2], [len(y) for y in x[1::2]])) for x in categories_list]
).sum()
a 10
b 6
c 3
dtype: int64
I'm aiming at creating a list of dictionaries, so I have to fill in the ...... with something that parses each sub-list into a dictionary:
[ ...... for x in categories_list]
If I use dict on a list or generator of tuples, it will magically turn that into a dictionary with keys as the first value in the tuple and values as the second value in the tuple.
dict(...list of tuples...)
zip will give me that generator of tuples
zip(list one, list two)
I know that in each sub-list, my keys are at the even indices [0, 2, 4...] and values are at the odd indices [1, 3, 5, ...]
# even odd
zip(x[::2], x[1::2])
but x[1::2] will be arrays, and I don't want the arrays. I want the length of the arrays.
# even odd
zip(x[::2], [len(y) for y in x[1::2]])
pandas.DataFrame will take a list of dictionaries and create a dataframe.
Finally, use sum to count the lengths.
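Putting those pieces together step by step (a sketch; records is only an illustrative intermediate name):

records = [dict(zip(x[::2], [len(y) for y in x[1::2]])) for x in categories_list]
# records == [{'a': 4, 'b': 1, 'c': 2}, {'a': 3, 'b': 2, 'c': 0}, {'a': 3, 'b': 3, 'c': 1}]
print(pd.DataFrame(records).sum())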
I use groupby to group the keys in columns 0, 2 and 4 (which hold the keys a, b and c respectively) and then count the number of distinct items in the next column. The number for each group in this case is len(set(group)) (or len(group) if you just want the total length of the group). See the code below:
from itertools import groupby, chain

count_distincts = []
cols = [0, 2, 4]
for c in cols:
    for gid, group in groupby(categories_list, key=lambda x: x[c]):
        group = list(chain(*[list(g[c + 1]) for g in group]))
        count_distincts.append([gid, len(set(group))])
Output: [['a', 10], ['b', 6], ['c', 3]]