How to fix np.where function in pandas - Python

My data frame looks like -

id  marital_status  age  city1  city2
 1  Married          32      7     64
 2  Married          34      8     39
 3  Single           53      0     72
 4  Divorce          37      2     83
 5  Divorce          42     10     52
 6  Single           29      3     82
 7  Married          37      8     64
The data frame has 22.4 million records.
My objective is to add a 'present' column based on a conditional statement, so that the final data frame looks like -
id  marital_status  age  city1  city2  present
 1  Married          32     12     64        1
 2  Married          34      8     39        0
 3  Single           53      0     72        0
 4  Divorce          37      2     83        0
 5  Divorce          42     10     52        0
 6  Single           29      3     82        0
 7  Married          37      8     64        1
What I have done so far -
test_df = pd.read_csv('city.csv')
condition = ((test_df['city1'] >= 5) &
             (test_df['marital_status'] == 'Married') &
             (test_df['age'] >= 32))
test_df.loc[:, 'present'] = test_df.where(condition, 1)
But I got NaN values in the 'present' column.
Can anybody help me?

It is not the np.where function but DataFrame.where in your solution, and they behave differently.
I think you need to set values by condition:
test_df['present'] = np.where(condition, 1, 0)
Or cast the True/False mask to 1/0 by Series.astype:
test_df['present'] = condition.astype(int)
print (test_df)
   id marital_status  age  city1  city2  present
0   1        Married   32     12     64        1
1   2        Married   34      8     39        1
2   3         Single   53      0     72        0
3   4        Divorce   37      2     83        0
4   5        Divorce   42     10     52        0
5   6         Single   29      3     82        0
6   7        Married   37      8     64        1
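A note on why the original attempt produced NaN: test_df.where(condition, 1) returns a whole DataFrame (the original values where the condition holds, 1 elsewhere), and assigning a DataFrame to a single column aligns on column names; since the right-hand side has no 'present' column, every value comes out NaN. A minimal sketch of the distinction, assuming the same test_df and condition:

# DataFrame.where keeps values where the condition is True
# and substitutes 1 elsewhere -- still a full DataFrame:
replaced = test_df.where(condition, 1)

# Assigning that DataFrame to one column aligns on column names,
# finds no 'present' column on the right-hand side, and fills with NaN:
# test_df.loc[:, 'present'] = replaced

# What should be assigned is the boolean mask itself:
test_df['present'] = condition.astype(int)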


How to swap many columns into rows while keeping rows grouped in pandas?

Let's say that these are my data
day  region  cars  motorcycles  bikes  buses
  1       A     0            1      1      2
  2       A     4            0      6      8
  3       A     2            9      8      0
  1       B     6           12     34     82
  2       B    13           92     76      1
  3       B    23           87     98      9
  1       C    29          200     31     45
  2       C    54           80     23     89
  3       C   129           90    231     56
How do I make the regions columns, and the existing columns (except for the day column) rows?
Basically, I want it to look like this :
day  vehicle_type    A   B    C
  1  cars            0   6   29
  2  cars            4  13   54
  3  cars            2  23  129
  1  motorcycles     1  12  200
  2  motorcycles     0  92   80
  3  motorcycles     9  87   90
  1  bikes           1  34   31
  2  bikes           6  76   23
  3  bikes           8  98  231
  1  buses           2  82   45
  2  buses           8   1   89
  3  buses           0   9   56
Use stack and unstack:
(
    df.set_index(["day", "region"])
      .rename_axis(columns="vehicle_type")
      .stack()
      .unstack(level=1)
      .rename_axis(columns=None)
      .reset_index()
)
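An alternative sketch of the same reshape, written as a melt to long format followed by a pivot (assuming the frame is named df; the row order may differ):

import pandas as pd

# Long format first: one row per (day, region, vehicle_type)
long_df = df.melt(id_vars=["day", "region"], var_name="vehicle_type")

# Then spread the regions back out into columns
out = (long_df.pivot_table(index=["day", "vehicle_type"],
                           columns="region", values="value")
              .rename_axis(columns=None)
              .reset_index())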

Sum row values of all columns where column names meet string match condition

I have the following dataset:
   ID  Length  Width  Range_CAP  Capacity_CAP
0   1      33     25         16            50
1   2      34     22         11            66
2   3      22     12         15            42
3   4      46     45         66            54
4   5      16      6         23            75
5   6      21     42        433            50
I basically want to sum the row values of only those columns whose names match a string (in this case, all columns ending in _CAP) and store the result in a new column, so that I end up with a dataframe that looks something like this:
   ID  Length  Width  Range_CAP  Capacity_CAP  CAP_SUM
0   1      33     25         16            50       66
1   2      34     22         11            66       77
2   3      22     12         15            42       57
3   4      46     45         66            54      120
4   5      16      6         23            75       98
5   6      21     42        433            50      483
I first tried the solution recommended in this question:
Summing columns in Dataframe that have matching column headers
However, that solution doesn't work for me: it sums columns that share the exact same name, so a simple groupby accomplishes the result there, whereas I am trying to sum columns that match a specific string only.
Code to recreate the above sample dataset:
data1 = [['1', 33, 25, 16, 50], ['2', 34, 22, 11, 66],
         ['3', 22, 12, 15, 42], ['4', 46, 45, 66, 54],
         ['5', 16, 6, 23, 75], ['6', 21, 42, 433, 50]]
df = pd.DataFrame(data1, columns=['ID', 'Length', 'Width', 'Range_CAP', 'Capacity_CAP'])
Let us do filter:
df['CAP_SUM'] = df.filter(like='CAP').sum(axis=1)
Out[86]:
0     66
1     77
2     57
3    120
4     98
5    483
dtype: int64
If other column names have CAP somewhere other than the suffix, anchor the match with a regex:
df.filter(regex='_CAP$').sum(axis=1)
Out[92]:
0     66
1     77
2     57
3    120
4     98
5    483
dtype: int64
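The difference between the two only matters if a column name contains CAP somewhere other than the suffix; for example (a sketch with a hypothetical CAP_Rating column):

demo = pd.DataFrame({'CAP_Rating': [1, 2], 'Range_CAP': [3, 4]})
print(demo.filter(like='CAP').columns.tolist())     # ['CAP_Rating', 'Range_CAP']
print(demo.filter(regex='_CAP$').columns.tolist())  # ['Range_CAP']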
One approach is:
df['CAP_SUM'] = df.loc[:, df.columns.str.endswith('_CAP')].sum(axis=1)
print(df)
Output
   ID  Length  Width  Range_CAP  Capacity_CAP  CAP_SUM
0   1      33     25         16            50       66
1   2      34     22         11            66       77
2   3      22     12         15            42       57
3   4      46     45         66            54      120
4   5      16      6         23            75       98
5   6      21     42        433            50      483
The expression:
df.columns.str.endswith('_CAP')
creates a boolean mask whose values are True if and only if the column name ends with '_CAP'. As an alternative, use filter with the following regex:
df['CAP_SUM'] = df.filter(regex='_CAP$').sum(axis=1)
print(df)
Output (of filter)
   ID  Length  Width  Range_CAP  Capacity_CAP  CAP_SUM
0   1      33     25         16            50       66
1   2      34     22         11            66       77
2   3      22     12         15            42       57
3   4      46     45         66            54      120
4   5      16      6         23            75       98
5   6      21     42        433            50      483
You may try this:
columnstxt = df.columns
df['sum'] = 0
for i in columnstxt:
    if i.find('_CAP') != -1:
        df['sum'] = df['sum'] + df[i]
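The loop works, but it iterates over the columns in Python; a vectorized one-liner equivalent to the same substring test would be (a sketch, assuming the df built above):

# Same test as i.find('_CAP') != -1, vectorized over the column names
df['sum'] = df.loc[:, df.columns.str.contains('_CAP')].sum(axis=1)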

Pandas code to get the count of each value

Here I'm sharing sample data (I'm dealing with big data); the counts value varies from 1 to 3000+, sometimes more than that.
The sample data looks like:
ID                 counts
41 44 17 16 19 52       6
17 30 16 19             4
52 41 44 30 17 16       6
41 44 52 41 41 41       6
17 17 17 17 41          5
I was trying to split the ID column into multiple columns and get the counts from those:
data = pd.read_csv(...)  # read the csv file
split_data = data.ID.apply(lambda x: pd.Series(str(x).split(" ")))  # one column per token
As I mentioned, I'm dealing with big data, so this method is not very effective, and I'm still having trouble getting the ID counts.
I want to collect the total counts of each ID and map them to the corresponding ID column.
Expected output:
ID                 counts  16  17  19  30  41  44  52
41 41 17 16 19 52       6   1   1   1   0   2   0   1
17 30 16 19             4   1   1   1   1   0   0   0
52 41 44 30 17 16       6   1   1   0   1   1   1   1
41 44 52 41 41 41       6   0   0   0   0   4   1   1
17 17 17 17 41          5   0   4   0   0   1   0   0
If you have any idea, please let me know. Thank you!
Use Counter to get the counts of the values split by space, in a list comprehension:
from collections import Counter

# One {ID value: count} dict per row
L = [{int(k): v for k, v in Counter(x.split()).items()} for x in df['ID']]
# Frame of counts, missing IDs filled with 0, columns sorted
df1 = pd.DataFrame(L, index=df.index).fillna(0).astype(int).sort_index(axis=1)
df = df.join(df1)
print (df)
                   ID  counts  16  17  19  30  41  44  52
0   41 44 17 16 19 52       6   1   1   1   0   1   1   1
1         17 30 16 19       4   1   1   1   1   0   0   0
2   52 41 44 30 17 16       6   1   1   0   1   1   1   1
3   41 44 52 41 41 41       6   0   0   0   0   4   1   1
4      17 17 17 17 41       5   0   4   0   0   1   0   0
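To see what the list comprehension builds per row, here is Counter applied to the first row's ID string (a sketch):

from collections import Counter

print(Counter("41 44 17 16 19 52".split()))
# Counter({'41': 1, '44': 1, '17': 1, '16': 1, '19': 1, '52': 1})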
Another idea, but I guess slower:
# Explode the split IDs to one row per token, then count with crosstab
df1 = df.assign(a = df['ID'].str.split()).explode('a')
df1 = df.join(pd.crosstab(df1['ID'], df1['a']), on='ID')
print (df1)
                   ID  counts  16  17  19  30  41  44  52
0   41 44 17 16 19 52       6   1   1   1   0   1   1   1
1         17 30 16 19       4   1   1   1   1   0   0   0
2   52 41 44 30 17 16       6   1   1   0   1   1   1   1
3   41 44 52 41 41 41       6   0   0   0   0   4   1   1
4      17 17 17 17 41       5   0   4   0   0   1   0   0

Comparing two consecutive rows and creating a new column based on a specific logical operation

I have a data frame with two columns, 'xPos' and 'lineNum':
import pandas as pd
from io import StringIO

data = '''\
xPos lineNum
40 1
50 1
75 1
90 1
42 2
75 2
110 2
45 3
70 3
95 3
125 3
38 4
56 4
74 4'''
df = pd.read_csv(StringIO(data), sep=r'\s+')
I have created the aggregate data frame for this by using the command:
aggrDF = df.describe(include='all')
and I am interested in the minimum of the xPos values. So, I get it by using:
minxPos = aggrDF.loc['min', 'xPos']
Desired output
data = '''\
xPos lineNum xDiff
40 1 2
50 1 10
75 1 25
90 1 15
42 2 4
75 2 33
110 2 35
45 3 7
70 3 25
95 3 25
125 3 30
38 4 0
56 4 18
74 4 18'''
The logic
I want to compare two consecutive rows of the data frame and calculate a new column based on this logic:
if df['lineNum'] != df['lineNum'].shift(1):
    df['xDiff'] = df['xPos'] - minxPos
else:
    df['xDiff'] = df['xPos'].shift(1)
Essentially, I want the new column to hold the difference between two consecutive xPos rows, as long as the line number is the same.
If the line number changes, then the xDiff column should hold the difference from the minimum xPos value that I have from the aggregate data frame.
Can you please help? Thanks!
These two lines should do it:
# Consecutive difference within each line; NaN at each line's first row
df['xDiff'] = df.groupby('lineNum').diff()['xPos']
# Fill each line's first row with the distance from the global minimum
df.loc[df['xDiff'].isnull(), 'xDiff'] = df['xPos'] - minxPos
>>> df
    xPos  lineNum  xDiff
0     40        1    2.0
1     50        1   10.0
2     75        1   25.0
3     90        1   15.0
4     42        2    4.0
5     75        2   33.0
6    110        2   35.0
7     45        3    7.0
8     70        3   25.0
9     95        3   25.0
10   125        3   30.0
11    38        4    0.0
12    56        4   18.0
13    74        4   18.0
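If the integer dtype shown in the desired output matters, the filled column can safely be cast back at the end (a sketch):

# No NaN remains after the fill step, so the cast is safe
df['xDiff'] = df['xDiff'].astype(int)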
You just need to groupby lineNum and apply the condition you already wrote down:
df['xDiff'] = np.concatenate(
    df.groupby('lineNum')
      .apply(lambda x: np.where(x['lineNum'] != x['lineNum'].shift(1),
                                x['xPos'] - x['xPos'].min(),
                                x['xPos'].shift(1)).astype(int))
      .values
)
df
Out[76]:
    xPos  lineNum  xDiff
0     40        1      0
1     50        1     40
2     75        1     50
3     90        1     75
4     42        2      0
5     75        2     42
6    110        2     75
7     45        3      0
8     70        3     45
9     95        3     70
10   125        3     95
11    38        4      0
12    56        4     38
13    74        4     56
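Note that this implements the question's pseudocode literally (the shifted xPos itself in the else branch, and each group's own minimum), so its numbers differ from the desired output above. To match the desired output (consecutive differences within a line, and the distance from the global minxPos at each line's first row), a sketch with the two branches adjusted:

df['xDiff'] = np.concatenate(
    df.groupby('lineNum')
      .apply(lambda x: np.where(x['lineNum'] != x['lineNum'].shift(1),
                                x['xPos'] - minxPos,
                                x['xPos'] - x['xPos'].shift(1)).astype(int))
      .values
)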

Appending a dataframe to the right of another one with the same columns

I have two different dataframes with the same column names:
e.g.
    0   1   2
0  10  13  17
1  14  21  34
2  68  32  12

    0   1   2
0  45  56  32
1   9  22  86
2  55  64  19
I would like to append the second frame to the right of the first one while continuing the column names from the first frame. The output would look like this:
    0   1   2   3   4   5
0  10  13  17  45  56  32
1  14  21  34   9  22  86
2  68  32  12  55  64  19
What is the most efficient way of doing this?
Thanks.
Use pd.concat first and then reset the columns.
In [1108]: df_out = pd.concat([df1, df2], axis=1)

In [1109]: df_out.columns = list(range(len(df_out.columns)))

In [1110]: df_out
Out[1110]:
    0   1   2   3   4   5
0  10  13  17  45  56  32
1  14  21  34   9  22  86
2  68  32  12  55  64  19
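As a side note, pd.concat can renumber the concatenation axis directly; with axis=1 that axis is the columns, so this one-liner gives the same result:

df_out = pd.concat([df1, df2], axis=1, ignore_index=True)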
Why not join:
>>> df = df.join(df_, lsuffix='_')
>>> df.columns = range(len(df.columns))
>>> df
    0   1   2   3   4   5
0  10  13  17  45  56  32
1  14  21  34   9  22  86
2  68  32  12  55  64  19
join is your friend; I use lsuffix (it could be rsuffix too) to avoid the error about overlapping column names.
