In pandas I have a dataframe of two columns and x rows, and I want to add a column that holds, for each row x, the rolling count of times that the value in col1 appears in either column from the first row up to row x-1.
The df is like this:
col1 col2
0 B A
1 B C
2 A B
3 A B
4 A C
5 B A
The desired output is
col1 col2 freq
0 B A 0
1 B C 1
2 A B 1
3 A B 2
4 A C 3 #A appears 3 times in the two columns from row 0 to 3
5 B A 4 #B appears 4 times in the two columns from row 0 to 4
Thanks in advance from a beginner,
G
Let's use some dataframe reshaping, groupby and cumcount:
dfs = df.stack()
df['freq'] = dfs.groupby(dfs).cumcount().unstack()['col1']
print(df)
Output:
col1 col2 freq
0 B A 0
1 B C 1
2 A B 1
3 A B 2
4 A C 3
5 B A 4
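Here stack interleaves col1 and col2 into a single Series in row order, cumcount numbers each occurrence of a value starting from 0, and unstack brings the counts back into the original shape, so the col1 column holds how many times that row's col1 value has already been seen. An equivalent sketch that selects the col1 entries with xs instead of unstack:
# Equivalent: cumcount over the stacked values, then keep only the entries
# that came from col1.
dfs = df.stack()
df['freq'] = dfs.groupby(dfs).cumcount().xs('col1', level=1)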
This will work irrespective of the number of columns in the df:
import pandas as pd
import numpy as np

def add(d1, d2):
    # merge the counts of two dictionaries
    for i in d2.keys():
        if i in d1.keys():
            d1[i] = d1[i] + d2[i]
        else:
            d1[i] = d2[i]
    return d1

if __name__ == '__main__':
    counts = {}
    df = pd.DataFrame({"a": [1, 2, 3, 1, 2], "b": [2, 1, 2, 3, 1]})
    col = list(df)
    for ind, it in df.iterrows():
        # count the values in the current row and fold them into the running totals
        unique, count = np.unique(it, return_counts=True)
        unique_dict = dict(zip(unique, count))
        counts = add(counts, unique_dict)
        df.loc[ind, "freq"] = counts[it[col[0]]]
    # subtract the current row's own occurrence of the first column's value
    df["freq"] = df["freq"] - 1
from collections import defaultdict

def fn():
    # running counts of the values already seen in col1 and col2 (previous rows only)
    d1, d2 = defaultdict(int), defaultdict(int)
    x = yield
    while True:
        freq = d1[x.col1] + d2[x.col1]
        d1[x.col1] += 1
        d2[x.col2] += 1
        x = yield freq

f = fn()
next(f)
df['freq'] = df[['col1', 'col2']].apply(lambda x: f.send(x), axis=1)
print(df)
Prints:
col1 col2 freq
0 B A 0
1 B C 1
2 A B 1
3 A B 2
4 A C 3
5 B A 4
EDIT (solution for arbitrary number of columns):
from collections import defaultdict

def fn(cols):
    # one running counter per column, updated only after the current row's
    # frequency has been computed
    dd = [defaultdict(int) for _ in cols]
    x = yield
    while True:
        freq = sum(d[x[0]] for d in dd)
        for i, d in enumerate(dd):
            d[x[i]] += 1
        x = yield freq

cols = ['col1', 'col2']
f = fn(cols)
next(f)
df['freq'] = df[cols].apply(lambda x: f.send(x), axis=1)
print(df)
Related
I have a df with a number of columns, and I would like to rename a selected range of them with incremental numbers by choosing the starting and ending column.
For the dataframe given in the EDIT below, I would like to select only columns B-E and rename them to 1-4, so the resulting data frame would have columns A, 1, 2, 3, 4, F.
So basically: select the headers by their positions and replace them with incremental numbers.
EDIT: The dataframe can be built with:
data = [['a','b','c','d','e','f'], ['a','b','c','d','e','f'], ['a','b','c','d','e','f'],['a','b','c','d','e','f']]
df = pd.DataFrame(data, columns = ['A','B','C','D','E','F'])
Use rename with selected columns by DataFrame.loc - here between B and E:
c = df.loc[:, 'B':'E'].columns
df = df.rename(columns=dict(zip(c, range(1, len(c) + 1))))
print (df)
A 1 2 3 4 F
0 a b c d e f
1 a b c d e f
2 a b c d e f
3 a b c d e f
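If you prefer to pick the range by position rather than by label (the question mentions selecting the headers by index numbers), a small sketch slicing df.columns of the original frame works the same way:
# Positional variant (a sketch): columns at positions 1..4 are B..E here.
c = df.columns[1:5]
df = df.rename(columns=dict(zip(c, range(1, len(c) + 1))))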
If this is something you have to do frequently, you can write a custom function for that:
def col_rename(df, start, stop, inplace=False):
    cols = list(df.loc[:, start:stop].columns)
    new_cols = df.columns.map(lambda x: {k: v for v, k in
                                         enumerate(cols, start=1)}.get(x, x))
    if inplace:
        df.columns = new_cols
    else:
        return new_cols

df.columns = col_rename(df, 'B', 'E')
# or
# col_rename(df, 'B', 'E', inplace=True)
output:
A 1 2 3 4 F
0 0 1 2 3 4 5
used input:
df = pd.DataFrame([range(6)], columns=list('ABCDEF'))
# A B C D E F
# 0 0 1 2 3 4 5
I've found how to remove columns that are zero for all rows using the command df.loc[:, (df != 0).any(axis=0)], and I need to do the same but for a given row number.
For example, for the following df
In [75]: df = pd.DataFrame([[1,1,0,0], [1,0,1,0]], columns=['a','b','c','d'])
In [76]: df
Out[76]:
a b c d
0 1 1 0 0
1 1 0 1 0
For row 0, give me only the columns with non-zero values; I would expect the result:
a b
0 1 1
And for row 1:
a c
1 1 1
I tried a lot of combinations of commands but I couldn't find a solution.
UPDATE:
I have a 300x300 matrix and I need to visualize its contents better.
Below is pseudo-code showing what I need:
for i in range(len(df[rows])):
    _df = df.iloc[i]
    _df = _df.filter(remove_zeros_columns)
    print('Row: ', i)
    print(_df)
Result:
Row: 0
a b
0 1 1
Row: 1
a c f
1 1 5 10
Row: 2
e
2 20
Best Regards.
Kleyson Rios.
You can change the data structure:
df = df.reset_index().melt('index', var_name='columns').query('value != 0')
print (df)
index columns value
0 0 a 1
1 1 a 1
2 0 b 1
5 1 c 1
If you need a new column with the non-zero column names joined by ', ', compare values for not equal with DataFrame.ne and use matrix multiplication via DataFrame.dot:
df['new'] = df.ne(0).dot(df.columns + ', ').str.rstrip(', ')
print (df)
a b c d new
0 1 1 0 0 a, b
1 1 0 1 0 a, c
EDIT:
for i in df.index:
    row = df.loc[[i]]
    a = row.loc[:, (row != 0).any()]
    print ('Row {}'.format(i))
    print (a)
Or:
def f(x):
    print ('Row {}'.format(x.name))
    print (x[x != 0].to_frame().T)

df.apply(f, axis=1)
Row 0
a b
0 1 1
Row 1
a c
1 1 1
df = pd.DataFrame([[1, 1, 0, 0], [1, 0, 1, 0]], columns=['a', 'b', 'c', 'd'])

def get(row):
    return list(df.columns[row.ne(0)])

df['non zero column'] = df.apply(lambda x: get(x), axis=1)
print(df)
Also, if you want a one-liner, use this:
df['non zero column'] = [list(df.columns[i]) for i in df.ne(0).values]
output
a b c d non zero column
0 1 1 0 0 [a, b]
1 1 0 1 0 [a, c]
I think this answers your question more strictly.
Just change the value of given_row as needed.
given_row = 1
mask_all_rows = df.apply(lambda x: x!=0, axis=0)
mask_row = mask_all_rows.loc[given_row]
cols_to_keep = mask_row.index[mask_row == True].tolist()
df_filtered = df[cols_to_keep]
# And if you only want to keep the given row
df_filtered = df_filtered[df_filtered.index == given_row]
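A more compact equivalent of the same selection (a sketch):
# Keep only the given row and the columns where that row is non-zero.
df_filtered = df.loc[[given_row], df.loc[given_row].ne(0)]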
So far, I have this code that adds a row of zeros every other row (from this question):
import pandas as pd
import numpy as np
def Add_Zeros(df):
    zeros = np.where(np.empty_like(df.values), 0, 0)
    data = np.hstack([df.values, zeros]).reshape(-1, df.shape[1])
    df_ordered = pd.DataFrame(data, columns=df.columns)
    return df_ordered
Which results in the following data frame:
A B
0 a a
1 0 0
2 b b
3 0 0
4 c c
5 0 0
6 d d
But I need it to add the row of zeros every 2nd row instead, like this:
A B
0 a a
1 b b
2 0 0
3 c c
4 d d
5 0 0
I've tried altering the code, but each time, I get an error that says that zeros and df don't match in size.
I should also point out that I have a lot more rows and columns than I wrote here.
How can I do this?
Option 1
Using groupby
s = pd.Series(0, df.columns)
f = lambda d: d.append(s, ignore_index=True)
grp = np.arange(len(df)) // 2
df.groupby(grp, group_keys=False).apply(f).reset_index(drop=True)
A B
0 a a
1 b b
2 0 0
3 c c
4 d d
5 0 0
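Note that Series.append was removed in pandas 2.0; the same grouping idea can be written with pd.concat instead (a sketch):
# Turn the zero Series into a one-row frame and concatenate it after each pair of rows.
f = lambda d: pd.concat([d, s.to_frame().T], ignore_index=True)
grp = np.arange(len(df)) // 2
df.groupby(grp, group_keys=False).apply(f).reset_index(drop=True)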
Option 2
from itertools import repeat, chain

# a single row of zeros to interleave after every second data row
z = np.zeros(df.shape[1], dtype=int)

v = df.values
pd.DataFrame(
    np.row_stack(list(chain(*zip(v[0::2], v[1::2], repeat(z))))),
    columns=df.columns
)
A B
0 a a
1 b b
2 0 0
3 c c
4 d d
5 0 0
I have the following dataframe, where '' is considered as empty:
df = pd.DataFrame({1: ['a', 'b', 'c']+ ['']*2, 2: ['']*2+ ['d','e', 'f']})
1 2
0 a ''
1 b ''
2 c d
3 '' e
4 '' f
How can I merge/join/combine (I don't know the correct term) col2 into col1 so that I have:
1 2
0 a ''
1 b ''
2 c d
3 e ''
4 f ''
or if I decide to merge col1 into col2:
1 2
0 '' a
1 '' b
2 c d
3 '' e
4 '' f
I would like to be able to decide which column to merge into; the other column should keep the conflicting values.
Thank you in advance
You can also use the combine_first method for a vectorized (and simpler) version:
df[1].replace('', np.nan).combine_first(df[2])
results in:
0 a
1 b
2 c
3 e
4 f
You could also get both columns at once:
df.replace('', np.nan).combine_first(df.rename(columns={1: 2, 2: 1}))
results in:
1 2
0 a a
1 b b
2 c d
3 e e
4 f f
You can do this using the dataframe method apply():
Sample data:
df
   1  2
0  a
1  b
2  c  d
3     e
4     f
Define arbitrary variables:
merge_to_column = 2
other_column = 1
Use apply:
df['output'] = df.apply(lambda x: x[other_column] if x[merge_to_column] == '' else x[merge_to_column], axis=1)
Output:
df
   1  2 output
0  a         a
1  b         b
2  c  d      d
3     e      e
4     f      f
def merge(col1, col2):
    for x in range(len(col1)):
        if col1[x] == '':
            col1[x] = col2[x]
            col2[x] = ''
This function will merge values from col2 into col1 wherever it finds an empty string, assuming both columns are the same size. You can handle different sizes as needed.
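A minimal usage sketch (assuming the df from the question); working on plain lists avoids pandas chained-assignment issues, and the results are then written back:
# Convert the columns to lists, merge in place, then write them back.
c1, c2 = df[1].tolist(), df[2].tolist()
merge(c1, c2)
df[1], df[2] = c1, c2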
You can use .fillna():
df[1] = df[1].fillna(df[2])
then you take out the values from df[2] that collide:
df[2] = [None if r[1] == r[2] else r[2] for _, r in df.iterrows()]
output:
1 2
0 a None
1 b None
2 c d
3 e None
4 f None
Note that instead of using '' for empty values, you have to use None in this case:
df = pd.DataFrame({1: ['a', 'b', 'c']+[None]*2, 2: [None]*2+['d','e', 'f']})
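Alternatively, if the frame does use '' for blanks, it can be normalized first (a sketch reusing the replace trick from the combine_first answer):
import numpy as np

# Turn empty strings into NaN so that fillna treats them as missing.
df = df.replace('', np.nan)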
I have two pandas dataframes, and I would like to combine each second dataframe row with each first dataframe row like this:
First:
val1 val2
1 2
0 0
2 1
Second:
l1 l2
a a
b c
Result (expected result size = len(first) * len(second)):
val1 val2 l1 l2
1 2 a a
1 2 b c
0 0 a a
0 0 b c
2 1 a a
2 1 b c
They share no common index.
Regards,
Secau
Create a surrogate key to do a cartesian join between them...
import pandas as pd
df1 = pd.DataFrame({'A': [1, 0, 2],
'B': [2, 0, 1],
'tmp': 1})
df2 = pd.DataFrame({'l1': ['a', 'b'],
'l2': ['a', 'c'],
'tmp': 1})
print(pd.merge(df1, df2, on='tmp', how='outer'))
Result:
A B tmp l1 l2
0 1 2 1 a a
1 1 2 1 b c
2 0 0 1 a a
3 0 0 1 b c
4 2 1 1 a a
5 2 1 1 b c
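On pandas 1.2+ the surrogate key is not needed, since merge supports a cross join directly (a sketch, dropping the tmp column from the frames above):
# Cartesian product without a helper key (requires pandas >= 1.2).
print(df1.drop(columns='tmp').merge(df2.drop(columns='tmp'), how='cross'))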
Here's an alternate solution:
import pandas as pd

df1 = pd.DataFrame({'val1': [1, 0, 2], 'val2': [2, 0, 1]})
df2 = pd.DataFrame({'l1': ['a', 'b'], 'l2': ['a', 'c']})

df_list = []
for x in df1.index:
    # repeat the current df1 row once per df2 row, then glue df2 alongside it
    series = df1.iloc[x, :]
    series_list = [series for _ in range(len(df2.index))]
    temp_df = pd.DataFrame(series_list, index=range(len(df2.index)))
    df_list.append(pd.concat((temp_df, df2), axis=1, join='inner'))
final_df = pd.concat(df_list)
Which produces:
final_df
val1 val2 l1 l2
0 1 2 a a
1 1 2 b c
0 0 0 a a
1 0 0 b c
0 2 1 a a
1 2 1 b c