I am cleaning some dataframes and want to replace a set of values with different values as shown below.
import pandas as pd
dftmp = pd.DataFrame({
    'a': ['yes','true','false','no','na', 'NA', 'TRUE'],
    'b': ['yes','true','false','no','FALSE','ofcourse','yes we can'],
    'c': ['any','other','random','column','in', 'the', 'db']
})
a b c
0 yes yes any
1 true true other
2 false false random
3 no no column
4 na FALSE in
5 NA ofcourse the
6 TRUE yes we can db
# replace with Y, N, NA (the actual combinations of old and replacement values, and the columns in which to replace them, are imported from another dataframe and will change across dataframes and over time)
# the next 3 variables are populated from another database and can change
cols = ['a','b']
lstold = [['Yes, True'], ['No, False'], ['NA']]
lstnew = ['Y', 'N', 'NA']
for col in cols:
    dlsts = dict(zip(lstnew, lstold))
    for key, val in dlsts.items():
        try:
            valsold = val[0].split(', ')  # each sublist holds one comma-joined string
        except AttributeError:
            print('single item list. continue')
        for valold in valsold:
            dftmp[col] = dftmp[col].replace(f'(?i){valold}', key, regex=True)
I've almost got the desired result - the issue is in 6b, where it should remain 'yes we can' instead of 'Y we can':
a b c
0 Y Y any
1 Y Y other
2 N N random
3 N N column
4 NA N in
5 NA ofcourse the
6 Y Y we can db
How do I stop the 'yes' in 'yes we can' from being replaced?
Can this be done without using 3 for loops? I fear it will take a lot more time with my bigger datasets.
Thanks
You could try this (with each value of lstold split into its own string, so the flattened list lines up with lstnew):
lstold = [['Yes', 'True'], ['No', 'False'], ['NA']]
lstold = [a.lower() for x in lstold for a in x]  # -> ['yes', 'true', 'no', 'false', 'na']
lstnew = ['Y', 'Y', 'N', 'N', 'NA']
for i in dftmp.columns:
    dftmp[i] = dftmp[i].str.lower().replace(lstold, lstnew)
This is the output:
a b c
0 Y Y any
1 Y Y other
2 N N random
3 N N column
4 NA N in
5 NA ofcourse the
6 Y yes we can db
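Note this lowercases every column it touches, including c; restricting the loop to the cols list from the question keeps the other columns untouched (a sketch):
for i in cols:  # only the columns that actually need replacement
    dftmp[i] = dftmp[i].str.lower().replace(lstold, lstnew)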
The sublists in your lstold aren't actually lists of separate values; each holds a single comma-joined string. I changed that in the sample I'm showing so that each string is its own element of the list. Assuming you can do that, perhaps this is what you are looking for.
import pandas as pd
dftmp = pd.DataFrame({
    'a': ['yes','true','false','no','na', 'NA', 'TRUE'],
    'b': ['yes','true','false','no','FALSE','ofcourse','yes we can'],
    'c': ['any','other','random','column','in', 'the', 'db']
})
cols = ['a','b']
lstold = [['Yes', 'True'], ['No', 'False'], ['NA']]
lstnew = ['Y', 'N', 'NA']
m = {}
for c, l in enumerate(lstold):
    for s in l:
        m[s.lower()] = lstnew[c]  # e.g. {'yes': 'Y', 'true': 'Y', ...}
for col in cols:
    # map exact (lowercased) matches; update() ignores the NaNs that map() leaves for non-matches
    dftmp[col].update(dftmp[col].str.lower().map(m))
Output
a b c
0 Y Y any
1 Y Y other
2 N N random
3 N N column
4 NA N in
5 NA ofcourse the
6 Y yes we can db
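If the lists really do arrive as single comma-joined strings, as in the question, a small preprocessing step produces the shape used above (a sketch):
lstold = [['Yes, True'], ['No, False'], ['NA']]
lstold = [sub[0].split(', ') for sub in lstold]  # -> [['Yes', 'True'], ['No', 'False'], ['NA']]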
Couldn't reduce it to fewer than three for loops, but thanks to a response to my question here I was able to stop the replacement of substrings:
cols = ['a','b']
lstold = [['Yes, True'], ['No, False'], ['NA']]
lstnew = ['Y', 'N', 'NA']
for col in cols:
    dlsts = dict(zip(lstnew, lstold))
    for key, val in dlsts.items():
        valsold = val[0].split(', ')  # each sublist holds one comma-joined string
        for valold in valsold:
            # ^ and $ anchor the pattern, so 'yes we can' is no longer touched
            df[col] = df[col].replace(rf'(?i)^{valold}$', key, regex=True)
If str.replace's case parameter covers the case-insensitivity and substring replacement isn't a concern, then one of the for loops can be dropped, although this only works when each sublist of lstold holds a single value:
cols = ['a','b']
lstold = [['Yes'], ['No'], ['NA']]  # single-value sublists; comma-joined strings would need splitting first
lstnew = ['Y', 'N', 'NA']
for col in cols:
    dlsts = dict(zip(lstnew, lstold))
    for key, val in dlsts.items():
        df[col] = df[col].str.replace(val[0], key, case=False, regex=False)
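For what it's worth, the per-value loops can also be collapsed into one exact-match mapping per column, much like the map-based answer above (a sketch; it works with the original comma-joined lstold too):
# one lowercase exact-match mapping: {'yes': 'Y', 'true': 'Y', 'no': 'N', ...}
mapping = {v.lower(): new for new, old in zip(lstnew, lstold) for v in old[0].split(', ')}
for col in cols:
    # exact, case-insensitive matches are replaced; everything else is kept as-is
    df[col] = df[col].str.lower().map(mapping).fillna(df[col])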
Related
I have a dataframe like this:
  r_id c_id
0    x    1
1    y    1
2    z    2
3    u    3
4    v    3
5    w    4
6    x    4
which you can reproduce like this:
import pandas as pd
r1 = ['x', 'y', 'z', 'u', 'v', 'w', 'x']
r2 = ['1', '1', '2', '3', '3', '4', '4']
df = pd.DataFrame([r1,r2]).T
df.columns = ['r_id', 'c_id']
Where a row has a duplicate r_id, I want to relabel all cases of that c_id with the first c_id value that was given for the duplicate r_id.
(Edit: maybe this is somewhat subtle, but I therefore want to relabel 'w's c_id as '1', as well as that belonging to the second case of 'x'. The duplication of 'x' shows me that all instances where c_id == '1' and c_id == '4' should have the same label.)
For a small dataframe, this works:
import networkx as nx
# build a graph linking each r_id to its c_id; duplicated r_ids merge components
g = nx.from_pandas_edgelist(df, 'r_id', 'c_id')
subgraphs = [g.subgraph(c) for c in nx.connected_components(g)]
# label every c_id node with the smallest node of its component
translator = {n: sorted(list(g.nodes))[0] for g in subgraphs for n in g.nodes if n in df.c_id.values}
df['simplified'] = df.c_id.apply(lambda x: translator[x])
so that I get this:
  r_id c_id simplified
0    x    1          1
1    y    1          1
2    z    2          2
3    u    3          3
4    v    3          3
5    w    4          1
6    x    4          1
But I'm trying to do this for a table with 2.5 million rows and my computer is struggling... There must be a more efficient way to do something like this.
Okay: if I optimize my initial answer by using the memory id() as a unique label for a connected set (or rather, for a subgraph, since I'm using networkx to find these), and don't check any condition while generating the dictionary but just use .get() so that it passes gracefully over values that have no key, then this seems to work:
def simplify(original_df):
    df = original_df.copy()
    g = nx.from_pandas_edgelist(df, 'r_id', 'c_id')
    subgraphs = [g.subgraph(c) for c in nx.connected_components(g)]
    translator = {n: id(g) for g in subgraphs for n in g.nodes}
    df['simplified'] = df.c_id.apply(lambda x: translator.get(x, x))
    return df
Manages to do what I want for 2,840,759 rows in 14.49 seconds on my laptop, which will do fine.
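A quick sanity check on the small frame from the question (the labels are memory addresses, so they are arbitrary but consistent within a component):
result = simplify(df)
# rows 0, 1, 5 and 6 now share one 'simplified' label, rows 3 and 4 share
# another, and row 2 keeps its own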
Suppose I have a dataframe with (for example) 10 columns: a,b,c,d,e,f,g,h,i,j
I want to bucket these columns as follows: a,b,c into x, d,f,g into y, e,h,i into z and j into j.
Each row of the output will have the x column value equal to the non-NaN a or b or c value of the original df. In case of multiple non-NaN values for a,b,c columns for a particular row in the original df, the output df will just contain a list of those non-NaN values.
To give an example, if the original df is (- just means NaN to save typing effort):
a b c d e f g h i j
0 1 - - - 2 - 4 3 - -
1 - 6 - 0 4 - - - - 2
2 - 3 2 - - - - 1 - 9
The output will be:
x y z j
0 1 4 [2,3] -
1 6 0 4 2
2 [3,2] - 1 9
Is there an efficient way of doing this? I'm not even able to get started using conventional methods.
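For reference, a minimal construction of the sample frame with real NaN values (the '-' in the tables stands for NaN; note the groupby(axis=1) answer below instead keeps the literal '-' placeholders):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'a': [1, np.nan, np.nan],
    'b': [np.nan, 6, 3],
    'c': [np.nan, np.nan, 2],
    'd': [np.nan, 0, np.nan],
    'e': [2, 4, np.nan],
    'f': [np.nan, np.nan, np.nan],
    'g': [4, np.nan, np.nan],
    'h': [3, np.nan, 1],
    'i': [np.nan, np.nan, np.nan],
    'j': [np.nan, 2, 9],
})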
One way is to create a dictionary with your mappings, rename your columns with it, stack, apply your groupby operation, and unstack back to the original shape.
I couldn't see any logic in your mappings, so it will have to be a manual operation, I'm afraid.
buckets = {'x': ['a', 'b', 'c'], 'y': ['d', 'f', 'g'], 'z': ['e', 'h', 'i'], 'j': 'j'}
df.columns = df.columns.map({i: x for x, y in buckets.items() for i in y})  # rename each column to its bucket name
out = df.stack().groupby(level=[0, 1]).agg(list).unstack(1)[buckets.keys()]
print(out)
x y z j
0 [1] [4] [2, 3] NaN
1 [6] [0] [4] [2]
2 [3, 2] NaN [1] [9]
First create the dict for the mapping, then groupby on axis=1 (note this variant keeps the literal '-' placeholders and filters them out; also, groupby with axis=1 is deprecated in recent pandas):
d = {'a':'x','b':'x','c':'x','d':'y','f':'y','g':'y','e':'z','h':'z','i':'z','j':'j'}
out = df.groupby(d, axis=1).agg(lambda x: [y[y != '-'] for y in x.values])
Out[138]:
j x y z
0 [] [1] [4] [2, 3]
1 [2] [6] [0] [4]
2 [9] [3, 2] [] [1]
Starting with a very basic approach, let's define our buckets and simply iterate, then clean up:
buckets = {
    'x': ['a', 'b', 'c'],
    'y': ['d', 'f', 'g'],
    'z': ['e', 'h', 'i'],
    'j': ['j']
}

def clean(val):
    # drop NaNs, then unwrap singletons and map empty lists back to NaN
    val = [x for x in val if not np.isnan(x)]
    if len(val) == 0:
        return np.nan
    elif len(val) == 1:
        return val[0]
    else:
        return val

new_df = pd.DataFrame()
for new_col, old_cols in buckets.items():
    new_df[new_col] = pd.Series(df[old_cols].values.tolist(), index=df.index).apply(clean)
Here's how you can do it.
First, we define a method to perform the row-wise bucketing operation.
def bucket_rows(row):
    row = row.dropna().to_list()
    if len(row) == 0:
        row = [np.nan]
    return row
Then, we can use the pandas.DataFrame.apply method to map this function onto each row of a dataframe (here, a sub-dataframe, if you will, since we'll get the sub-df using the column names).
I have implemented everything in the following code snippet.
import numpy as np
import pandas as pd

bucket_cols = [["a", "b", "c"], ["d", "f", "g"], ["e", "h", "i"], ["j"]]
bucket_names = ["x", "y", "z", "j"]
buckets = {}

def bucket_rows(row):
    row = row.dropna().to_list()  # pd.Series.dropna removes the NaN values
    # if the list is empty, populate it with NaN
    if len(row) == 0:
        row = [np.nan]
    # return the bucketed row
    return row

# looping through buckets and performing the bucketing operation
for idx, cols in enumerate(bucket_cols):
    bucket = df[cols].apply(bucket_rows, axis=1).to_list()
    buckets[bucket_names[idx]] = bucket  # key by bucket name so the columns come out as x, y, z, j

# creating the bucketed df from the buckets dict
df_bucketed = pd.DataFrame(buckets)
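A quick look at one cell of the result on the sample frame (NaN makes the source columns float, hence the decimals):
# each cell holds a list, e.g. row 0's 'z' bucket gathers columns e, h, i
print(df_bucketed.loc[0, 'z'])  # [2.0, 3.0]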
I have some measurements (as dicts) and a list of labels. I need to check whether the labels appear in each measurement and write the result to an Excel file.
My output Excel file needs to look like this:
list1 = ['A', 'B', 'C', 'D']
measurement1 = {'A':1, 'B':1}
measurement2 = {'C':3, 'D':4}
#Output
'A' 'B' 'C' 'D'
measurement1 1 1 0 0
measurement2 0 0 1 1
I have no idea how to build the matrix of 0s and 1s.
Hope you can help me.
EDIT
Finally I got a solution. First I iterated over all measurements and wrote all missing labels into the measurements dict.
Then I built a dataframe of ones and, with 3 loops, put zeros at the missing positions with .loc:
d = pd.DataFrame(1, index=measurements.keys(), columns=list1)
for y in measurements.keys():
    for z in measurements[y]:
        for x in list1:
            if x == z:
                d.loc[y, z] = 0
Maybe it's possible to do it with only 2 loops.
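For instance, the membership test makes the innermost loop unnecessary; a sketch of the same idea with two loops:
d = pd.DataFrame(1, index=measurements.keys(), columns=list1)
for y in measurements.keys():
    for z in measurements[y]:
        if z in list1:  # replaces the loop over list1
            d.loc[y, z] = 0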
Use a nested list comprehension, filtering for membership in list1, and last create the DataFrame by constructor:
list1 = ['A', 'B', 'C', 'D']
measurement1 = {'A':1, 'B':1}
measurement2 = {'C':3, 'D':4}
L = [measurement1, measurement2]
d = [dict.fromkeys([y for y in x.keys() if y in list1], 1) for x in L]
df = pd.DataFrame(d).fillna(0).astype(int)
print (df)
A B C D
0 1 1 0 0
1 0 0 1 1
This should work, using only standard Python:
list1 = ['A', 'B', 'C', 'D']
measurement1 = {'A':1, 'B':1}
measurement2 = {'C':3, 'D':4}
measurements = [measurement1, measurement2]
headers = {h: i for i, h in enumerate(list1)}
matrix = []
for measurement in measurements:
    row = [0] * len(headers)
    for header in measurement.keys():
        row[headers[header]] = 1
    matrix.append(row)
For your example, the output will be:
matrix
=> [[1, 1, 0, 0], [0, 0, 1, 1]]
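To get from the matrix to the desired Excel output, one more step is needed (a sketch; the filename and the row labels are assumptions):
import pandas as pd

out = pd.DataFrame(matrix, columns=list1, index=['measurement1', 'measurement2'])
out.to_excel('measurements.xlsx')  # requires an Excel writer such as openpyxl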
You can use a list of the dictionaries to create a dataframe, then reindex with the list and convert to 0/1 by checking notna:
pd.DataFrame([measurement1,measurement2]).reindex(columns=list1).notna().astype(int)
A B C D
0 1 1 0 0
1 0 0 1 1
How do I retrieve the values of column Z and their average if any value is > 1?
import numpy as np
import pandas as pd

data = [9,2,3,4,5,6,7,8]
df = pd.DataFrame(np.random.randn(8, 5), columns=['A', 'B', 'C', 'D','E'])
fd = pd.DataFrame(data, columns=['Z'])
df = pd.concat([df, fd], axis=1)

l = []
for x, y in df.iterrows():
    for i, s in y.iteritems():
        if s > 1:
            l.append(x)
print(df['Z'])
The expected output will most likely be a dictionary with the column name as key and the average of Z as its values.
Using a dictionary comprehension:
res = {col: df.loc[df[col] > 1, 'Z'].mean() for col in df.columns[:-1]}
# {'A': 9.0, 'B': 5.0, 'C': 8.0, 'D': 7.5, 'E': 6.666666666666667}
Setup used for above:
np.random.seed(0)
data = [9,2,3,4,5,6,7,8]
df = pd.DataFrame(np.random.randn(8, 5),columns=['A', 'B', 'C', 'D','E'])
fd = pd.DataFrame(data, columns=['Z'])
df = pd.concat([df, fd], axis=1)
Do you mean this?
df[df['Z']>1].loc[:,'Z'].mean(axis=0)
or
df[df['Z']>1]['Z'].mean()
I don't know if I understood your question correctly, but do you mean this:
import pandas as pd
import numpy as np
data=[9,2,3,4,5,6,7,8]
columns = ['A', 'B', 'C', 'D','E']
df = pd.DataFrame(np.random.randn(8, 5),columns=columns)
fd=pd.DataFrame(data,columns=['Z'])
df=pd.concat([df,fd], axis=1)
print('df = \n', str(df))
anyGreaterThanOne = (df[columns] > 1).any(axis=1)
print('anyGreaterThanOne = \n', str(anyGreaterThanOne))
filtered = df[anyGreaterThanOne]
print('filtered = \n', str(filtered))
Zmean = filtered['Z'].mean()
print('Zmean = ', str(Zmean))
Result:
df =
A B C D E Z
0 -2.170640 -2.626985 -0.817407 -0.389833 0.862373 9
1 -0.372144 -0.375271 -1.309273 -1.019846 -0.548244 2
2 0.267983 -0.680144 0.304727 0.302952 -0.597647 3
3 0.243549 1.046297 0.647842 1.188530 0.640133 4
4 -0.116007 1.090770 0.510190 -1.310732 0.546881 5
5 -1.135545 -1.738466 -1.148341 0.764914 -1.140543 6
6 -2.078396 0.057462 -0.737875 -0.817707 0.570017 7
7 0.187877 0.363962 0.637949 -0.875372 -1.105744 8
anyGreaterThanOne =
0 False
1 False
2 False
3 True
4 True
5 False
6 False
7 False
dtype: bool
filtered =
A B C D E Z
3 0.243549 1.046297 0.647842 1.188530 0.640133 4
4 -0.116007 1.090770 0.510190 -1.310732 0.546881 5
Zmean = 4.5
Let's say I have three lists, and I want a dataframe containing every combination of their values:
listA = ['a','b','c', 'd']
listP = ['p', 'q', 'r']
listX = ['x', 'z']
So the dataframe will have 4*3*2 = 24 rows.
Now, the simplest way to solve this problem is to do this:
df = pd.DataFrame(columns=['A','P','X'])
for val1 in listA:
    for val2 in listP:
        for val3 in listX:
            df.loc[<indexvalue>] = [val1, val2, val3]
Now, in the real scenario, I will have about 800k rows and 12 columns (so 12 nested loops). Is there any way I can create this dataframe much faster?
Similar discussion here. Apparently np.meshgrid is more efficient for large data (as an alternative to itertools.product).
Application:
import numpy as np
import pandas as pd

v = np.stack([i.ravel() for i in np.meshgrid(listA, listP, listX)]).T  # np.stack needs a sequence, not a generator
df = pd.DataFrame(v, columns=['A', 'P', 'X'])
>> A P X
0 a p x
1 a p z
2 b p x
3 b p z
4 c p x
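Note that np.meshgrid defaults to indexing='xy', which swaps the first two axes; that is why the order above differs from itertools.product. Passing indexing='ij' keeps the lists in argument order (a sketch):
import numpy as np
import pandas as pd

# indexing='ij' keeps the axes in argument order, matching itertools.product
v = np.stack([i.ravel() for i in np.meshgrid(listA, listP, listX, indexing='ij')]).T
df = pd.DataFrame(v, columns=['A', 'P', 'X'])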
You could use itertools.product:
import pandas as pd
from itertools import product
listA = ['a', 'b', 'c', 'd']
listP = ['p', 'q', 'r']
listX = ['x', 'z']
df = pd.DataFrame(data=list(product(listA, listP, listX)), columns=['A','P','X'])
print(df.head(10))
Output
A P X
0 a p x
1 a p z
2 a q x
3 a q z
4 a r x
5 a r z
6 b p x
7 b p z
8 b q x
9 b q z
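For reference, pandas can also build the cartesian product directly; a minimal sketch using MultiIndex.from_product (assumed equivalent for these string lists):
import pandas as pd

mi = pd.MultiIndex.from_product([listA, listP, listX], names=['A', 'P', 'X'])
df = mi.to_frame(index=False)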