Replacing values from a pandas dataframe - python

I have a data frame which has columns with strings and integers.
df = pd.DataFrame([ ['Manila', 5,12,0], ['NY',9,0,14], ['Berlin',8,10,6] ], columns = ['a','b','c','d'])
I want to change every value greater than 1 to 1, while the zeros remain the same.
So I tried apply(lambda x: 1 if x > 1 else 0), but it raises an error saying the truth value is ambiguous.
Then I tried to write a function separately as follow:
def find_value(x):
    try:
        x = int(x)
        print(x)
        if x > 1:
            x = 1
        else:
            x = 0
    except:
        return x
    return x
and then apply it
df = df.apply(find_value, axis=1)
But the output does not change and the df remains as it was.
I think there should be some apply function which can be applied to all of the eligible columns (those columns which have numerical values), but I am missing the point somehow. Can anyone please enlighten me how to solve it (with or without the "map" function)?

Use DataFrame.select_dtypes to get the numeric columns, compare them with greater than 1, and then map True/False to 1/0 by casting to integers. To change the data in the original DataFrame, use DataFrame.update:
df.update(df.select_dtypes(np.number).gt(1).astype(int))
print (df)
a b c d
0 Manila 1 1 0
1 NY 1 0 1
2 Berlin 1 1 1
Or use DataFrame.clip, if all values are integers and there are no negative numbers:
df.update(df.select_dtypes(np.number).clip(upper=1))
print (df)
a b c d
0 Manila 1 1 0
1 NY 1 0 1
2 Berlin 1 1 1
EDIT:
Your solution works with DataFrame.applymap, which applies the function element-wise:
df = df.applymap(find_value)
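For reference, here is a minimal runnable sketch of that fix, with the bare except narrowed to the exceptions int() actually raises; the hasattr guard is only there because applymap was renamed to DataFrame.map in pandas 2.1:

```python
import pandas as pd

df = pd.DataFrame([['Manila', 5, 12, 0], ['NY', 9, 0, 14], ['Berlin', 8, 10, 6]],
                  columns=['a', 'b', 'c', 'd'])

def find_value(x):
    try:
        x = int(x)
    except (TypeError, ValueError):
        return x  # non-numeric values (the city names) pass through unchanged
    return 1 if x > 1 else 0

# DataFrame.applymap was renamed to DataFrame.map in pandas 2.1
mapper = df.map if hasattr(df, 'map') else df.applymap
df = mapper(find_value)
print(df)
```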

Related

Python pandas create new column with string code based on boolean rows

I have a dataframe with multiple columns containing booleans/ints (1/0). I need a new result column with code strings that encode the following: how many runs of consecutive Trues there are, whether the chain is interrupted or not, and from which column to which column the Trues run.
For example this is the following dataframe:
column_1 column_2 column_3 column_4 column_5 column_6 column_7 column_8 column_9 column_10
0 0 1 0 1 1 1 1 0 0 1
1 0 1 1 0 1 1 1 0 0 1
2 1 1 0 0 0 1 1 0 0 1
3 1 1 1 0 0 0 0 1 1 1
4 1 1 1 0 0 1 0 0 1 1
5 1 1 1 0 0 0 1 1 0 1
6 0 1 1 1 1 1 1 0 1 0
For example, row 1: [0 1 1 0 1 1 1 0 0 1] would result in the code string i2/2-3/c2-c3_c5-c7/6 in column_result, which is built in four segments I can read somewhere in my code later.
Segment 1:
'i' stands for interrupted; if not interrupted it would be 'c' for consecutive.
The 2 stands for how many times it found 2 or more consecutive Trues.
Segment 2:
The lengths of the consecutive groups; in this case the first count is 2 and the second is 3.
Segment 3:
The number/id of the column where the first True of a consecutive group was found, and the column number where its last True was found.
Segment 4:
Just the total count of Trues in the row.
Another example would be row 6: [0 1 1 1 1 1 1 0 1 0], which would result in the code string c1/6/c2-c7/7 in column_result.
The code below is the starting code I used to create the above dataframe, which has random ints for bools:
def create_custom_result(df: pd.DataFrame) -> pd.Series:
    return df

def create_dataframe() -> pd.DataFrame:
    df = pd.DataFrame()  # empty df
    for i in [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]:  # create random bool/int values
        df[f'column_{i}'] = np.random.randint(2, size=50)
    df["column_result"] = ''  # add result column
    return df

if __name__ == "__main__":
    df = create_dataframe()
    custom_results = create_custom_result(df=df)
Would someone have any idea of how to tackle this? To be honest I have no idea where to start. I found the following that probably came closest: count sets of consecutive true values in a column; however, it works on a column and not horizontally across the rows. Maybe someone can tell me if I should try np.array stuff, or maybe pandas has some function that can help me? I found some groupby functions that work horizontally, but I wouldn't know how to convert that into the string code to be used in the result column. Or should I loop through the DataFrame by rows and then build the column_result code in segments?
Thanks in advance!
I tried some things already, looping through the dataframe row by row, but had no idea how to build a new column with the code strings.
I also found this article: pandas groupby, but wouldn't know how to create new column string data from the groups I found. Also, almost everything I find groups by a single column and not across the rows of all columns.
Maybe this code works?
df = pd.DataFrame(np.random.randint(0, 2, size=(12, 8)))
df.columns = ["col1", "col2", "col3", "col4", "col5", "col6", "col7", "col8"]

def func(df: pd.DataFrame) -> pd.DataFrame:
    result_list = []
    copy = df.copy()
    cumsum = copy.cumsum(axis=1)
    for r, s in cumsum.iterrows():
        count = 0
        last = -1
        interrupted = 0
        consecutive = 0
        consecutives = []
        ranges = []
        for x in s.values:
            count += 1
            if x != 0:
                if x != last:
                    consecutive += 1
                    last = x
                    if consecutive == 2:
                        ranges.append(count - 1)
                elif x == last:
                    if consecutive > 1:
                        interrupted += 1
                        ranges.append(count - 1)
                        consecutives.append(str(consecutive))
                    consecutive = 0
            else:
                if consecutive > 1:
                    consecutives.append(str(consecutive))
                    ranges.append(count)
        result = f'{interrupted}i/{len(consecutives)}c/{"-".join(consecutives)}/{"_".join([f"c{ranges[i]}-c{ranges[i+1]}" for i in range(0, len(ranges), 2)])}/{last}'
        result_list.append(result.split("/"))
    copy["results"] = pd.Series(["/".join(i) for i in result_list])
    copy[["interrupts_count", "consecutives_count", "consecutives lengths", "consecutives columns ranges", "total"]] = pd.DataFrame(np.array(result_list))
    return copy

result_df = func(df)
Maybe go with a simple class for each column that receives a Series from the original DataFrame (i.e. sliced vertically) and a new value. Using the original DataFrame's vertical slice, calculate all starting values as fields (start of consecutive true values, length of consecutive true values, last value, ...). Finally, using the starting values and each new next value, update the fields and prepare the string output.
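For comparison, a minimal sketch of the same segment logic using NumPy run detection instead of cumsum bookkeeping; row_code is a hypothetical helper name, and it assumes 'i' simply means more than one qualifying run (the question leaves edge cases such as rows with no runs of two or more unspecified):

```python
import numpy as np

def row_code(values):
    # Hypothetical helper; assumes 1-based column numbering as in the question.
    a = np.asarray(values)
    total = int(a.sum())
    # Detect runs of consecutive 1s via the transitions in a zero-padded copy
    diff = np.diff(np.concatenate(([0], a, [0])))
    starts = np.where(diff == 1)[0]   # 0-based index where each run begins
    ends = np.where(diff == -1)[0]    # 0-based index one past each run's end
    lengths = ends - starts
    keep = lengths >= 2               # only runs of two or more Trues count
    starts, ends, lengths = starts[keep], ends[keep], lengths[keep]
    seg1 = f"{'i' if len(lengths) > 1 else 'c'}{len(lengths)}"
    seg2 = '-'.join(str(n) for n in lengths)
    seg3 = '_'.join(f"c{s + 1}-c{e}" for s, e in zip(starts, ends))
    return f"{seg1}/{seg2}/{seg3}/{total}"

print(row_code([0, 1, 1, 0, 1, 1, 1, 0, 0, 1]))  # i2/2-3/c2-c3_c5-c7/6
print(row_code([0, 1, 1, 1, 1, 1, 1, 0, 1, 0]))  # c1/6/c2-c7/7
```

Applied per row with df.apply(..., axis=1) over the ten value columns, this fills column_result in one pass.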

Column header equals column value pandas [duplicate]

In my dataframe, I have a categorical variable that I'd like to convert into dummy variables. This column however has multiple values separated by commas:
0 'a'
1 'a,b,c'
2 'a,b,d'
3 'd'
4 'c,d'
Ultimately, I'd want to have binary columns for each possible discrete value; in other words, final column count equals number of unique values in the original column. I imagine I'd have to use split() to get each separate value but not sure what to do afterwards. Any hint much appreciated!
Edit: Additional twist. Column has null values. And in response to comment, the following is the desired output. Thanks!
a b c d
0 1 0 0 0
1 1 1 1 0
2 1 1 0 1
3 0 0 0 1
4 0 0 1 1
Use str.get_dummies
df['col'].str.get_dummies(sep=',')
a b c d
0 1 0 0 0
1 1 1 1 0
2 1 1 0 1
3 0 0 0 1
4 0 0 1 1
Edit: Updating the answer to address some questions.
Qn 1: Why is it that the series method get_dummies does not accept the argument prefix=... while pandas.get_dummies() does accept it
Series.str.get_dummies is a Series-level method (as the name suggests!). We are one-hot encoding values in one Series (or a DataFrame column), hence there is no need for a prefix. pandas.get_dummies, on the other hand, can one-hot encode multiple columns, in which case the prefix parameter works as an identifier of the original column.
If you want to apply prefix to str.get_dummies, you can always use DataFrame.add_prefix
df['col'].str.get_dummies(sep=',').add_prefix('col_')
Qn 2: If you have more than one column to begin with, how do you merge the dummies back into the original frame?
You can use pandas.concat to merge the one-hot encoded columns with the rest of the columns in the dataframe.
df = pd.DataFrame({'other':['x','y','x','x','q'],'col':['a','a,b,c','a,b,d','d','c,d']})
df = pd.concat([df, df['col'].str.get_dummies(sep=',')], axis=1).drop(columns='col')
other a b c d
0 x 1 0 0 0
1 y 1 1 1 0
2 x 1 1 0 1
3 x 0 0 0 1
4 q 0 0 1 1
The str.get_dummies function does not accept a prefix parameter, but you can rename the columns of the returned dummy DataFrame:
data['col'].str.get_dummies(sep=',').rename(lambda x: 'col_' + x, axis='columns')
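Regarding the "additional twist" about null values: str.get_dummies already handles them, producing a row of all zeros for NaN, so no extra handling should be needed. A quick sketch:

```python
import numpy as np
import pandas as pd

# Same data as in the question, plus a null row at the end
df = pd.DataFrame({'col': ['a', 'a,b,c', 'a,b,d', 'd', 'c,d', np.nan]})
dummies = df['col'].str.get_dummies(sep=',')
print(dummies.iloc[-1])  # the NaN row: a, b, c, d are all 0
```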

How to create a new column in a Pandas DataFrame based on a column in another DataFrame?

I am new to Python and Pandas.
I need to do the following
I have 2 DataFrames, lets call them df1 and df2
df1
Index Req ID City_Atlanta City_Seattle City_Boston Result
0 X 1 0 0 0
1 Y 0 1 0 0
2 Z 0 0 1 1
df2
Index Req_ID City
0 X Atlanta
1 Y Seattle
2 Z Boston
I want to add a column in df2 called result, such that df2.result is False where df1.Result is 0 and True where df1.Result is 1.
The final result should look like
df2
Index Req_ID City result
0 X Atlanta False
1 Y Seattle False
2 Z Boston True
I am new to asking question on Stack Overflow as well so pardon any common mistakes.
Considering Req ID is the matching key and the lengths of the dfs are not the same, you can use:
df2['Result'] = df2.Req_ID.map(dict(zip(df1['Req ID'], df1.Result))).astype(bool)
0 False
1 False
2 True
If the lengths are equal, you can use the solution by aws_apprentice.
You can apply bool to the 0/1 values.
df2['Result'] = df1['Result'].apply(bool)
You can also map a dictionary of values.
df2['Result'] = df1['Result'].map({0: False, 1: True})
Assuming they're the same lengths you can do:
df2['Result'] = df1['Result']==1
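If you'd rather not build the dictionary by hand, a merge-based sketch (assuming, as above, that 'Req ID'/'Req_ID' are the matching keys) does the same lookup and also works when the frames differ in length or order:

```python
import pandas as pd

df1 = pd.DataFrame({'Req ID': ['X', 'Y', 'Z'], 'Result': [0, 0, 1]})
df2 = pd.DataFrame({'Req_ID': ['X', 'Y', 'Z'], 'City': ['Atlanta', 'Seattle', 'Boston']})

# Left-join df1's Result onto df2 by request id, then cast 0/1 to True/False
merged = df2.merge(df1[['Req ID', 'Result']], left_on='Req_ID', right_on='Req ID', how='left')
df2['Result'] = merged['Result'].astype(bool)
print(df2)
```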

Pandas: Trying to drop rows based on for loop?

I have a dataframe consisting of multiple columns, two of which, x and y, are filled with numbers ranging from 1 to 3. I want to drop all rows where the number in x is less than the number in y. For example, if in one row x = 1 and y = 3, I want to drop that entire row. This is the code I've written so far:
for num1 in df.x:
    for num2 in df.y:
        if (num1 < num2):
            df.drop(df.iloc[num1], inplace=True)
but I keep getting the error:
labels ['new' 'active' 1 '1'] not contained in axis
Any help is greatly appreciated. Thanks!
I would avoid loops in your scenario, and just use .drop:
df.drop(df[df['x'] < df['y']].index, inplace=True)
Example:
df = pd.DataFrame({'x':np.random.randint(0,4,5), 'y':np.random.randint(0,4,5)})
>>> df
x y
0 1 2
1 2 1
2 3 1
3 2 1
4 1 3
df.drop(df[df['x'] < df['y']].index, inplace = True)
>>> df
x y
1 2 1
2 3 1
3 2 1
[EDIT]: Or, more simply, without using drop:
df=df[~(df['x'] < df['y'])]
Writing two for loops is very inefficient; instead you can just compare the two columns:
df['x'] >= df['y']
This returns a boolean Series which you can use to filter the dataframe:
df[df['x'] >= df['y']]
I think it is better to use boolean indexing or query, changing the condition to >=:
df[df['x'] >= df['y']]
Or:
df = df.query('x >= y')
Sample:
df = pd.DataFrame({'x':[1,2,3,2], 'y':[0,4,5,1]})
print (df)
x y
0 1 0
1 2 4
2 3 5
3 2 1
df = df[df['x'] >= df['y']]
print (df)
x y
0 1 0
3 2 1

Add columns to pandas dataframe containing max of each row, AND corresponding column name

My system
Windows 7, 64 bit
python 3.5.1
The challenge
I've got a pandas dataframe, and I would like to know the maximum value for each row and append that info as a new column. I would also like to add another column to the existing dataframe containing the name of the column where the max value can be found.
A similar question has been asked and answered for R in this post.
Reproducible example
In[1]:
# Make pandas dataframe
df = pd.DataFrame({'a':[1,0,0,1,3], 'b':[0,0,1,0,1], 'c':[0,0,0,0,0]})
# Calculate max
my_series = df.max(numeric_only=True, axis = 1)
my_series.name = "maxval"
# Include maxval in df
df = df.join(my_series)
df
Out[1]:
a b c maxval
0 1 0 0 1
1 0 0 0 0
2 0 1 0 1
3 1 0 0 1
4 3 1 0 3
So far so good. Now for the "add another column to the existing dataframe containing the name of the column" part:
In[2]:
?
?
?
# This is what I'd like to accomplish:
Out[2]:
a b c maxval maxcol
0 1 0 0 1 a
1 0 0 0 0 a,b,c
2 0 1 0 1 b
3 1 0 0 1 a
4 3 1 0 3 a
Notice that I'd like to return all column names if multiple columns contain the same maximum value. Also please notice that the column maxval is not included in maxcol since that would not make much sense. Thanks in advance if anyone out there finds this interesting.
You can compare the df against maxval using eq with axis=0, then use apply with a lambda to produce a boolean mask to mask the columns and join them:
In [183]:
df['maxcol'] = df.loc[:, :'c'].eq(df['maxval'], axis=0).apply(lambda x: ','.join(df.columns[:3][x == x.max()]), axis=1)
df
Out[183]:
a b c maxval maxcol
0 1 0 0 1 a
1 0 0 0 0 a,b,c
2 0 1 0 1 b
3 1 0 0 1 a
4 3 1 0 3 a
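For modern pandas (where .ix is gone), a sketch of the same idea with an explicit column list: compare each cell against its row maximum and join the names of the matching columns, so ties yield e.g. 'a,b,c':

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 0, 0, 1, 3], 'b': [0, 0, 1, 0, 1], 'c': [0, 0, 0, 0, 0]})
value_cols = ['a', 'b', 'c']
df['maxval'] = df[value_cols].max(axis=1)

# Boolean frame: True wherever a cell equals its row's maximum
is_max = df[value_cols].eq(df['maxval'], axis=0)
# Per row, join the names of all columns holding the maximum
df['maxcol'] = is_max.apply(lambda row: ','.join(row.index[row]), axis=1)
print(df)
```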