Pandas change rows in DataFrame without assignment - python

Say we have a Python Pandas DataFrame:
In[1]: df = pd.DataFrame({'A': [1, 1, 2, 3, 5],
                          'B': [5, 6, 7, 8, 9]})
In[2]: print(df)
   A  B
0  1  5
1  1  6
2  2  7
3  3  8
4  5  9
I want to change rows that match a certain condition. I know that this can be done via direct assignment:
In[3]: df[df.A==1] = pd.DataFrame([{'A': 0, 'B': 5},
                                   {'A': 0, 'B': 6}])
In[4]: print(df)
   A  B
0  0  5
1  0  6
2  2  7
3  3  8
4  5  9
My question is: Is there an equivalent solution to the above assignment that would return a new DataFrame with the rows changed, i.e. a stateless solution? I'm looking for something like pandas.DataFrame.assign but which acts on rows instead of columns.

DataFrame.copy
df2 = df.copy()
df2[df.A == 1] = pd.DataFrame([{'A': 0, 'B': 5}, {'A': 0, 'B': 6}])

DataFrame.mask + fillna
m = df.A == 1
fill_df = pd.DataFrame([{'A': 0, 'B': 5}, {'A': 0, 'B': 6}], index=df.index[m])
df2 = df.mask(m).fillna(fill_df)
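If this comes up often, the copy-then-assign pattern can be wrapped in a small helper so the original frame is never mutated (a minimal sketch; `update_rows` is a hypothetical name, not a pandas API):

```python
import pandas as pd

def update_rows(df, cond, new_rows):
    """Return a new DataFrame with the rows matching `cond` replaced.

    `new_rows` is aligned on index and columns, exactly like the
    direct assignment in the question; `df` itself is left untouched.
    """
    out = df.copy()
    out[cond] = new_rows
    return out

df = pd.DataFrame({'A': [1, 1, 2, 3, 5], 'B': [5, 6, 7, 8, 9]})
df2 = update_rows(df, df.A == 1,
                  pd.DataFrame([{'A': 0, 'B': 5}, {'A': 0, 'B': 6}]))
```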

Related

Select a column based on name of a row (pandas) [duplicate]

This question already has an answer here:
Create a column that's a combination of other columns
(1 answer)
Closed 3 years ago.
I'm trying to select values of a column based on a value of a row using pandas.
For example:
names  A  B  C  D
A      1  0  2  0
A      2  1  0  0
C      1  0  4  0
A      2  0  0  0
D      0  1  2  5
As output I want something like:
name  value
A     1
A     2
C     4
A     2
D     5
Is there a very fast (efficient) way to do such operation?
Take a look at lookup:
df.lookup(df.index, df.names)
Out[390]: array([1, 2, 4, 2, 5], dtype=int64)
# df['value'] = df.lookup(df.index, df.names)
You can, for example, use the multi-purpose apply function:
import pandas

data = {
    'names': ['A', 'A', 'C', 'A', 'D'],
    'A': [1, 2, 1, 2, 0],
    'B': [0, 1, 0, 0, 1],
    'C': [2, 0, 4, 0, 2],
    'D': [0, 0, 0, 0, 5],
}
df = pandas.DataFrame(data)
print(df)

def process_it(row):
    return row[row['names']]

df['selection'] = df.apply(process_it, axis=1)
print(df[['names', 'selection']])
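Note that DataFrame.lookup was deprecated in pandas 1.2 and removed in 2.0. On current versions, the same row-wise lookup can be written with factorize plus NumPy fancy indexing (a sketch of the replacement suggested by the pandas docs):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'names': ['A', 'A', 'C', 'A', 'D'],
                   'A': [1, 2, 1, 2, 0],
                   'B': [0, 1, 0, 0, 1],
                   'C': [2, 0, 4, 0, 2],
                   'D': [0, 0, 0, 0, 5]})
# For each row, pick the value from the column named in `names`
idx, cols = pd.factorize(df['names'])
values = df.reindex(cols, axis=1).to_numpy()[np.arange(len(df)), idx]
df['value'] = values
```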

Pandas: how to remove duplicate rows, but keep ALL rows with max value [duplicate]

This question already has answers here:
Get the row(s) which have the max value in groups using groupby
(15 answers)
Closed 4 years ago.
How can I remove duplicate rows, but keep ALL rows with the max value? For example, I have a dataframe with 4 rows:
data = [{'a': 1, 'b': 2, 'c': 3}, {'a': 7, 'b': 10, 'c': 2},
        {'a': 7, 'b': 2, 'c': 20}, {'a': 7, 'b': 2, 'c': 20}]
df = pd.DataFrame(data)
From this dataframe, I want a dataframe like this (3 rows: group by 'a' and keep all rows that have the max value in 'c'):
data = [{'a': 1, 'b': 2, 'c': 3}, {'a': 7, 'b': 2, 'c': 20}, {'a': 7, 'b': 2, 'c': 20}]
df = pd.DataFrame(data)
You can use GroupBy + transform with Boolean indexing:
res = df[df['c'] == df.groupby('a')['c'].transform('max')]
print(res)
   a  b   c
0  1  2   3
2  7  2  20
3  7  2  20
You can calculate the max c per group using groupby and transform, and then keep the rows where c equals that max:
df['max_c'] = df.groupby('a')['c'].transform('max')
df[df['c'] == df['max_c']].drop(['max_c'], axis=1)
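Both answers rely on the same idea: transform('max') broadcasts the per-group maximum back onto every row, so a plain Boolean comparison does the filtering. A self-contained sketch with the question's data:

```python
import pandas as pd

data = [{'a': 1, 'b': 2, 'c': 3}, {'a': 7, 'b': 10, 'c': 2},
        {'a': 7, 'b': 2, 'c': 20}, {'a': 7, 'b': 2, 'c': 20}]
df = pd.DataFrame(data)
# transform('max') returns a Series aligned with df: [3, 20, 20, 20]
res = df[df['c'] == df.groupby('a')['c'].transform('max')]
```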

Duplicating rows with certain value in a column

I have to duplicate rows that have a certain value in a column and replace the value with another value.
For instance, I have this data:
import pandas as pd
df = pd.DataFrame({'Date': [1, 2, 3, 4], 'B': [1, 2, 3, 2], 'C': ['A','B','C','D']})
Now, I want to duplicate the rows that have 2 in column 'B', then change the 2 to 4:
df = pd.DataFrame({'Date': [1, 2, 2, 3, 4, 4], 'B': [1, 2, 4, 3, 2, 4], 'C': ['A','B','B','C','D','D']})
Please help me on this one. Thank you.
You can use append to append the rows where B == 2 (extracted with boolean indexing), reassigning B to 4 via assign. If order matters, you can then sort by C to reproduce your desired frame:
>>> df.append(df[df.B.eq(2)].assign(B=4)).sort_values('C')
   B  C  Date
0  1  A     1
1  2  B     2
1  4  B     2
2  3  C     3
3  2  D     4
3  4  D     4
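DataFrame.append was deprecated in pandas 1.4 and removed in 2.0; on current versions the same idea is spelled with pd.concat (a sketch):

```python
import pandas as pd

df = pd.DataFrame({'Date': [1, 2, 3, 4], 'B': [1, 2, 3, 2], 'C': ['A', 'B', 'C', 'D']})
# Duplicate the B == 2 rows with B changed to 4, then restore the order via C
out = pd.concat([df, df[df.B.eq(2)].assign(B=4)]).sort_values('C')
```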

Pandas: Compare every two rows and output result to a new dataframe

import pandas as pd
df1 = pd.DataFrame({'ID': ['i1', 'i2', 'i3'],
                    'A': [2, 3, 1],
                    'B': [1, 1, 2],
                    'C': [2, 1, 0],
                    'D': [3, 1, 2]})
df1 = df1.set_index('ID')
df1.head()
    A  B  C  D
ID
i1  2  1  2  3
i2  3  1  1  1
i3  1  2  0  2
df2 = pd.DataFrame({'ID': ['i1-i2', 'i1-i3', 'i2-i3'],
                    'A': [2, 1, 1],
                    'B': [1, 1, 1],
                    'C': [1, 0, 0],
                    'D': [1, 1, 1]})
df2 = df2.set_index('ID')
df2
       A  B  C  D
ID
i1-i2  2  1  1  1
i1-i3  1  1  0  1
i2-i3  1  1  0  1
Given a data frame like df1, I want to compare every pair of distinct rows, take the smaller value in each column, and output the result to a new data frame like df2.
For example, comparing row i1 and row i2 gives the new row i1-i2 as 2, 1, 1, 1.
Please advise what the best way is to do that in pandas.
Try this:
from itertools import combinations
import numpy as np

v = df1.values
r = pd.DataFrame([np.minimum(v[t[0]], v[t[1]])
                  for t in combinations(np.arange(len(df1)), 2)],
                 columns=df1.columns,
                 index=list(combinations(df1.index, 2)))
Result:
In [72]: r
Out[72]:
          A  B  C  D
(i1, i2)  2  1  1  1
(i1, i3)  1  1  0  2
(i2, i3)  1  1  0  1
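If you want the hyphenated labels from the desired df2 ('i1-i2', and so on) rather than tuples, the index can be joined while iterating over the same pairs (a sketch along the same lines):

```python
from itertools import combinations

import numpy as np
import pandas as pd

df1 = pd.DataFrame({'A': [2, 3, 1], 'B': [1, 1, 2], 'C': [2, 1, 0], 'D': [3, 1, 2]},
                   index=['i1', 'i2', 'i3'])
pairs = list(combinations(range(len(df1)), 2))
v = df1.to_numpy()
# Element-wise minimum of each pair of rows, labelled "i1-i2" etc.
r = pd.DataFrame([np.minimum(v[i], v[j]) for i, j in pairs],
                 columns=df1.columns,
                 index=['-'.join((df1.index[i], df1.index[j])) for i, j in pairs])
```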

Converting column names to variables in python

I am importing data from a csv file and storing it in a pandas dataframe. Here is the image from the csv file:
For each row I want a string like this:
Here is the code I am using to import data from the csv file and storing it in the dataframe:
import csv
import pandas as pd

filename = "../Desktop/venkat.csv"
df = pd.read_table(filename, sep=" ")
How can I achieve this?
Consider the dataframe df:
import numpy as np
df = pd.DataFrame(np.arange(10).reshape(-1, 5), columns=list('ABCDE'))
   A  B  C  D  E
0  0  1  2  3  4
1  5  6  7  8  9
You can get a series of json strings, one per row:
df.apply(pd.Series.to_json, 1)
0    {"A":0,"B":1,"C":2,"D":3,"E":4}
1    {"A":5,"B":6,"C":7,"D":8,"E":9}
I think it's better to use a dict to save your data with to_dict:
df = pd.DataFrame({'A': [1, 2, 3],
                   'B': [4, 5, 6],
                   'C': [7, 8, 9],
                   'D': [1, 3, 5],
                   'E': [5, 3, 6],
                   'F': [7, 4, 3]})
print(df)
   A  B  C  D  E  F
0  1  4  7  1  5  7
1  2  5  8  3  3  4
2  3  6  9  5  6  3
# select some row - e.g. with index 2
print(df.loc[2])
A    3
B    6
C    9
D    5
E    6
F    3
Name: 2, dtype: int64
d = df.loc[2].to_dict()
print(d)
{'E': 6, 'B': 6, 'F': 3, 'A': 3, 'C': 9, 'D': 5}
print(d['A'])
3
If ordering is important, use OrderedDict:
from collections import OrderedDict
print (OrderedDict(df.loc[2]))
OrderedDict([('A', 3), ('B', 6), ('C', 9), ('D', 5), ('E', 6), ('F', 3)])
If you need all values in columns use DataFrame.to_dict:
d = df.to_dict(orient='list')
print (d)
{'E': [5, 3, 6], 'B': [4, 5, 6], 'F': [7, 4, 3],
'A': [1, 2, 3], 'C': [7, 8, 9], 'D': [1, 3, 5]}
print (d['A'])
[1, 2, 3]
d = df.to_dict(orient='index')
print (d)
{0: {'E': 5, 'B': 4, 'F': 7, 'A': 1, 'C': 7, 'D': 1},
1: {'E': 3, 'B': 5, 'F': 4, 'A': 2, 'C': 8, 'D': 3},
2: {'E': 6, 'B': 6, 'F': 3, 'A': 3, 'C': 9, 'D': 5}}
# get value in row 2 and column A
print (d[2]['A'])
3
import csv

csvfile = open('test.csv', 'r')
csvFileArray = []
for row in csv.reader(csvfile, delimiter='\t'):
    csvFileArray.append(row)
header = csvFileArray[0][0].split(',')
str_list = []
for each in csvFileArray[1:]:
    data = each[0].split(',')
    local_list = []
    for i in range(len(data)):
        str_data = ' '.join([header[i], '=', data[i]])
        local_list.append(str_data)
    str_list.append(local_list)
print(str_list)
Assume your csv is a comma-delimited file.
import pandas as pd

filename = "../Desktop/venkat.csv"
df = pd.read_csv(filename, sep=",")
output = df.to_dict(orient='records')
print(output)  # this yields a list of dictionaries, one per row
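Since the exact target string is only shown as an image in the question, here is one guess at a "col = value" format built on top of to_dict(orient='records') (the separator and spacing are assumptions, not taken from the question):

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 5], 'B': [2, 6]})
# One "col = value, col = value" string per row; the exact format is
# an assumption, since the question's target string is only an image
rows = [', '.join(f'{col} = {val}' for col, val in rec.items())
        for rec in df.to_dict(orient='records')]
```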
