Split a pandas dataframe into rows based on an integer column - python

Not an ideal title, but I don't know how to describe it better.
I have a dataframe (df1) and want to split it on the column "chicken" so that:
each chicken that laid an egg becomes a distinct row
the chickens that didn't lay an egg are aggregated in a unique row.
The output I need is df2. For example:
In farm "A" there are 5 chickens, of which 2 laid an egg, so there are 2 rows with egg = "True" and weight = 1 each, and 1 row with egg = "False" and weight = 3 (the 3 chickens that didn't lay an egg).
The code I came up with is messy, can you guys think of a cleaner way of doing it? Thanks!!
import pandas as pd

# code to create df1:
df1 = pd.DataFrame({'farm':["A","B","C"],"chicken":[5,10,5],"eggs":[2,3,0]})
df1 = df1[["farm","chicken","eggs"]]

# code to transform df1 to df2:
pieces = []
for i in df1.index:
    number_of_trues = df1.iloc[i]["eggs"]
    number_of_falses = df1.iloc[i]["chicken"] - number_of_trues
    # one row per laying chicken, plus one aggregated row for the rest
    col_farm = [df1.iloc[i]["farm"]]*(number_of_trues+1)
    col_egg = ["True"]*number_of_trues + ["False"]*1
    col_weight = [1]*number_of_trues + [number_of_falses]
    mini_df = pd.DataFrame({"farm":col_farm,"egg":col_egg,"weight":col_weight})
    pieces.append(mini_df)
# DataFrame.append was removed in pandas 2.0, so the pieces are concatenated instead
df2 = pd.concat(pieces)
df2 = df2[["farm","egg","weight"]]
df2

This is a custom solution: create two sub-dataframes, then concat them back together to get the expected output. Key method: repeat
s=pd.DataFrame({'farm':df1.farm.repeat(df1.eggs),'egg':[True]*df1.eggs.sum(),'weight':[1]*df1.eggs.sum()})
t=pd.DataFrame({'farm':df1.farm,'egg':[False]*len(df1.farm),'weight':df1.chicken-df1.eggs})
pd.concat([t,s]).sort_values(['farm','egg'],ascending=[True,False])
Out[847]:
egg farm weight
0 True A 1
0 True A 1
0 False A 3
1 True B 1
1 True B 1
1 True B 1
1 False B 7
2 False C 5
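For reference, the repeat-and-concat idea above can be wrapped in a small function and sanity-checked. This is only a sketch: the helper name split_by_eggs is hypothetical, and it uses boolean True/False rather than the strings used in the question. The check relies on the fact that the weights of each farm should add back up to its chicken count.

```python
import pandas as pd

df1 = pd.DataFrame({"farm": ["A", "B", "C"],
                    "chicken": [5, 10, 5],
                    "eggs": [2, 3, 0]})

def split_by_eggs(df):
    """Hypothetical helper: one row per laying chicken, one aggregated row for the rest."""
    # repeat each farm label once per egg; to_numpy() drops the duplicated index
    layers = pd.DataFrame({"farm": df["farm"].repeat(df["eggs"]).to_numpy(),
                           "egg": True,
                           "weight": 1})
    others = pd.DataFrame({"farm": df["farm"],
                           "egg": False,
                           "weight": df["chicken"] - df["eggs"]})
    out = pd.concat([layers, others])
    # True rows first within each farm
    return out.sort_values(["farm", "egg"], ascending=[True, False]).reset_index(drop=True)

df2 = split_by_eggs(df1)

# the weights of each farm add back up to its chicken count
print(df2.groupby("farm")["weight"].sum())
```

Like the loop in the question, this keeps a False row even when its weight is 0; filter on weight afterwards if that is not wanted.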

Related

Get a count of occurrence of string in each row and column of pandas dataframe

import pandas as pd
# list of paragraphs from judicial opinions
# rows are opinions
# columns are paragraphs from the opinion
opinion1 = ['sentenced to life','sentenced to death. The sentence ...','', 'sentencing Appellant for a term of life imprisonment']
opinion2 = ['Justice Smith','This concerns a sentencing hearing.', 'The third sentence read ...', 'Defendant rested.']
opinion3 = ['sentence sentencing sentenced','New matters ...', 'The clear weight of the evidence', 'A death sentence']
data = [opinion1, opinion2, opinion3]
df = pd.DataFrame(data, columns = ['p1','p2','p3','p4'])
# This works for one column. I have 300+ in the real data set.
df['p2'].str.contains('sentenc')
How do I determine whether 'sentenc' is in columns 'p1' through 'p4'?
Desired output would be something like:
True True False True
False True True False
True False False True
How do I retrieve a count of the number of times that 'sentenc' appears in each cell?
Desired output would be a count for each cell of the number of times 'sentenc' appears:
1 2 0 1
0 1 1 0
3 0 0 1
Thank you!
Use pd.Series.str.count:
counts = df.apply(lambda col: col.str.count('sentenc'))
Output:
>>> counts
p1 p2 p3 p4
0 1 2 0 1
1 0 1 1 0
2 3 0 0 1
To get it in boolean form, use .str.contains, or call .astype(bool) with the code above:
bools = df.apply(lambda col: col.str.contains('sentenc'))
or
bools = df.apply(lambda col: col.str.count('sentenc')).astype(bool)
Both will work just fine.
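One thing to keep in mind: str.count treats its argument as a regular expression, so a search term containing characters such as "." or "(" may match more than intended. A minimal sketch, using made-up sample text and a hypothetical search term, of escaping the term with re.escape:

```python
import re
import pandas as pd

# made-up sample data, not from the question
df = pd.DataFrame({"p1": ["U.S. v. Smith; U.S. Code", "UXSX"],
                   "p2": ["no match here", "the U.S. appeal"]})

term = "U.S."  # hypothetical search term containing regex metacharacters

# unescaped: each "." matches any character, so "UXSX" is counted too
raw = df.apply(lambda col: col.str.count(term))

# escaped: only the literal text "U.S." is counted
literal = df.apply(lambda col: col.str.count(re.escape(term)))

print(raw)
print(literal)
```

For a plain term like 'sentenc' the two give the same result, so escaping only matters when the term comes from user input or contains punctuation.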

How to extract data based on 2 columns in a data frame and make a new column using Python?

I have 2 columns in my data frame. “adult” represents the number of adults in a hotel room and “children” represents the number of children in a room.
I want to create a new column based on these two.
For example, if df['adult'] == 2 and df['children'] == 0, the value of the new column would be "couple with no children".
And if df['adult'] == 2 and df['children'] == 1, the value of the new column would be "couple with 1 child".
I have a big amount of data and I want the code to run fast.
Any advice? This is a sample of the inputs and the output that I need.
adult children family_status
2 0 "Couple without children"
2 0 "Couple without children"
2 1 "Couple with one child"
Use np.select
df
adult children
0 2 0
1 2 0
2 2 1
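import numpy as np
# the sample column is named 'adult', not 'adults'
condlist = [(df['adult']==2) & (df['children']==0),(df['adult']==2) & (df['children']==1)]
choicelist = ['couple with no children','couple with 1 child']
# default=None rather than np.nan: recent NumPy versions refuse to mix a float default with string choices
df['family_status'] = np.select(condlist,choicelist,default=None)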
df
adult children family_status
0 2 0 couple with no children
1 2 0 couple with no children
2 2 1 couple with 1 child
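A self-contained version of the np.select approach, as a sketch. The extra fourth row and the fallback label "other" are assumptions added here to show what happens to rows that match none of the conditions.

```python
import numpy as np
import pandas as pd

# the three sample rows plus one assumed row that matches no condition
df = pd.DataFrame({"adult": [2, 2, 2, 1], "children": [0, 0, 1, 0]})

condlist = [
    (df["adult"] == 2) & (df["children"] == 0),
    (df["adult"] == 2) & (df["children"] == 1),
]
choicelist = ["couple with no children", "couple with 1 child"]

# "other" is an assumed fallback label; conditions are evaluated on whole
# columns at once, which is what makes this faster than a row-wise apply
df["family_status"] = np.select(condlist, choicelist, default="other")
print(df)
```

Using a string default keeps every value in the new column the same type, which tends to be easier to work with downstream than a mix of strings and NaN.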
You can try:
df['family_status'] = df.apply(
    lambda x: 'adult with no child' if (x['adult']==2 and x['children']==0)
    else ('adult with 1 child' if (x['adult']==2 and x['children']==1) else ''),
    axis=1)
Hope this will help you!!

Pandas DataFrame : selection of multiple elements in several columns

I have this Python Pandas DataFrame DF :
DICT = { 'letter': ['A','B','C','A','B','C','A','B','C'],
'number': [1,1,1,2,2,2,3,3,3],
'word' : ['one','two','three','three','two','one','two','one','three']}
DF = pd.DataFrame(DICT)
Which looks like :
letter number word
0 A 1 one
1 B 1 two
2 C 1 three
3 A 2 three
4 B 2 two
5 C 2 one
6 A 3 two
7 B 3 one
8 C 3 three
And I want to extract the lines
letter number word
A 1 one
B 2 two
C 3 three
First I tried:
DF[(DF['letter'].isin(("A","B","C"))) &
   DF['number'].isin((1,2,3)) &
   DF['word'].isin(('one','two','three'))]
Of course it didn't work: every row was selected.
Then I tested :
Bool = DF[['letter','number','word']].isin(("A",1,"one"))
DF[np.all(Bool,axis=1)]
Good, it works! But only for one line...
If we take the next step and give an iterable of tuples to .isin():
Bool = DF[['letter','number','word']].isin((("A",1,"one"),
                                            ("B",2,"two"),
                                            ("C",3,"three")))
Then it fails: the Boolean array is full of False...
What am I doing wrong? Is there a more elegant way to do this selection based on several columns?
(Anyway, I want to avoid a for loop, because the real DataFrames I'm using are really big, so I'm looking for the fastest way to do the job.)
The idea is to create a new DataFrame with all the triples and then merge it with the original DataFrame:
L = [("A",1,"one"),
     ("B",2,"two"),
     ("C",3,"three")]
df1 = pd.DataFrame(L, columns=['letter','number','word'])
print (df1)
letter number word
0 A 1 one
1 B 2 two
2 C 3 three
df = DF.merge(df1)
print (df)
letter number word
0 A 1 one
1 B 2 two
2 C 3 three
Another idea is to create a list of tuples, convert it to a Series and then compare with isin:
s = pd.Series(list(map(tuple, DF[['letter','number','word']].values.tolist())),index=DF.index)
df1 = DF[s.isin(L)]
print (df1)
letter number word
0 A 1 one
4 B 2 two
8 C 3 three
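A related variant, sketched here as an alternative rather than as part of the answer above, builds a MultiIndex from the three columns and uses MultiIndex.isin, which accepts a list of tuples directly. The names wanted, cols and selected are chosen for this example.

```python
import pandas as pd

DF = pd.DataFrame({"letter": list("ABCABCABC"),
                   "number": [1, 1, 1, 2, 2, 2, 3, 3, 3],
                   "word": ["one", "two", "three", "three", "two",
                            "one", "two", "one", "three"]})
wanted = [("A", 1, "one"), ("B", 2, "two"), ("C", 3, "three")]

cols = ["letter", "number", "word"]
# each row becomes one key of the MultiIndex; isin compares whole tuples
mask = pd.MultiIndex.from_frame(DF[cols]).isin(wanted)
selected = DF[mask]
print(selected)
```

Like the Series-of-tuples version, this keeps the original row labels (0, 4, 8), whereas merge returns a fresh 0, 1, 2 index.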

Compare 2 columns in different dataframes with a primary key condition without merge

I have 2 different dataframes, e.g.:
Df1:
User_id User_name User_phn
1 Alex 1234123
2 Danny 4234123
3 Bryan 5234123
Df2:
User_id User_name User_phn
1 Alex 3234123
2 Chris 4234123
3 Bryan 5234123
4 Bexy 6234123
User_id is the primary key in both tables. I need to compare the two dataframes using User_id as the condition and get a result that shows which values match and which do not, without merging the dataframes into a new dataframe. We will be processing more than 100 million records, which is why I don't want to merge into a new dataframe; I think that would consume even more memory.
Result:
User_id User_name User_phn
1 Alex Mismatch
2 Mismatch 4234123
3 Bryan 5234123
4 Mismatch Mismatch
Not easy, but possible by comparing Series of tuples created by combinations of columns and comparing by isin:
s11 = pd.Series(list(map(tuple, Df1[['User_id','User_name']].values.tolist())))
s12 = pd.Series(list(map(tuple, Df2[['User_id','User_name']].values.tolist())))
s21 = pd.Series(list(map(tuple, Df1[['User_id','User_phn']].values.tolist())))
s22 = pd.Series(list(map(tuple, Df2[['User_id','User_phn']].values.tolist())))
Df2.loc[~s12.isin(s11), 'User_name'] = 'Mismatch'
Df2.loc[~s22.isin(s21), 'User_phn'] = 'Mismatch'
print (Df2)
User_id User_name User_phn
0 1 Alex Mismatch
1 2 Mismatch 4234123
2 3 Bryan 5234123
3 4 Mismatch Mismatch
A solution with merge, testing for unmatched pairs (missing values) with isna:
s1 = Df2.merge(Df1, how='left', on=['User_id','User_name'], suffixes=('_',''))['User_phn']
print (s1)
0 1234123.0
1 NaN
2 5234123.0
3 NaN
Name: User_phn, dtype: float64
s2 = Df2.merge(Df1, how='left', on=['User_id','User_phn'], suffixes=('_',''))['User_name']
print (s2)
0 NaN
1 Danny
2 Bryan
3 NaN
Name: User_name, dtype: object
Df2.loc[s1.isna(), 'User_name'] = 'Mismatch'
Df2.loc[s2.isna(), 'User_phn'] = 'Mismatch'
print (Df2)
User_id User_name User_phn
0 1 Alex Mismatch
1 2 Mismatch 4234123
2 3 Bryan 5234123
3 4 Mismatch Mismatch
You could also try this approach by using series map:
Df_new = Df2.copy()
cond1 = Df_new['User_phn'].isin(Df1['User_phn'])
cond2 = Df_new['User_name'].isin(Df1['User_name'])
Df_new.loc[~cond1, 'User_phn'] = Df_new.loc[~cond1, 'User_phn'].map(Df1['User_phn']).fillna('Mismatch')
Df_new.loc[~cond2, 'User_name'] = Df_new.loc[~cond2, 'User_name'].map(Df1['User_name']).fillna('Mismatch')
I wrote the code and tackled the problem in a rather simple manner. I just compared every row of the two dataframes and appended the resulting row to a result dataframe. Let me know if this works.
import pandas as pd
data = [[1,'Alex','1234123'],[2,'Danny','4234123'],[3,'Bryan','5234123']]
df = pd.DataFrame(data,columns=['User_id','User_name','User_phn'])
print (df)
data = [[1,'Alex','3234123'],[2,'Chris','4234123'],[3,'Bryan','5234123'],[4,'Bexy','6234123']]
df_2 = pd.DataFrame(data,columns=['User_id','User_name','User_phn'])
print (df_2)
l=max(len(df.index),len(df_2.index))
df_res = pd.DataFrame(columns=['User_id','User_name','User_phn'])
# as_matrix() was removed in pandas 1.0; to_numpy() is the replacement
df_mat = df.to_numpy()
df_2_mat = df_2.to_numpy()
for i in range(0,l):
    try:
        arr=[]
        arr.append(df_mat[i][0])
        for k in range(1,3):
            if df_mat[i][k] == df_2_mat[i][k]:
                arr.append(df_mat[i][k])
            else:
                arr.append("Mismatch")
        df_res.loc[i] = arr
    except IndexError:
        # row i exists in only one of the two frames
        df_res.loc[i] = [i+1,"Mismatch","Mismatch"]
print(df_res)
Hi Narayana Kandukuri,
I guess my code may be simple, have a look.
import pandas as pd
df1 = pd.DataFrame([[1,'Alex',1234123],[2,'Danny',4234123],[3,'Bryan',5234123]],columns=['User_id','User_name','User_phn'])
df2 = pd.DataFrame([[1,'Alex',3234123],[2,'Chris',4234123],[3,'Bryan',5234123],[4,'Bexy',6234123]],columns=['User_id','User_name','User_phn'])
temp = df2[['User_id']] #Saving this for later use.
Bool_Data = (df1==df2[:df1.shape[0]]) #This will give you a boolean frame
df2 = df2[Bool_Data].fillna('mismatch') #Keep this boolean frame to df2
df2['User_id'] = temp['User_id'] #Assign the before temp.
df2 =
User_id User_name User_phn
0 1 Alex mismatch
1 2 mismatch 4234123
2 3 Bryan 5234123
3 4 mismatch mismatch
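Another way to express the comparison keyed on User_id, sketched under the assumption that User_id is unique in both frames: index both frames by User_id, align Df1 to Df2's keys with reindex, and compare element-wise. Note that set_index and reindex still allocate intermediate frames, so this sketch does not show that it is lighter on memory than a merge at 100 million rows.

```python
import pandas as pd

Df1 = pd.DataFrame({"User_id": [1, 2, 3],
                    "User_name": ["Alex", "Danny", "Bryan"],
                    "User_phn": [1234123, 4234123, 5234123]})
Df2 = pd.DataFrame({"User_id": [1, 2, 3, 4],
                    "User_name": ["Alex", "Chris", "Bryan", "Bexy"],
                    "User_phn": [3234123, 4234123, 5234123, 6234123]})

left = Df1.set_index("User_id")
right = Df2.set_index("User_id")

# align Df1 to Df2's keys; ids missing from Df1 become all-NaN rows
aligned = left.reindex(right.index)

# NaN never compares equal, so ids missing from Df1 count as mismatches
same = right.eq(aligned)

# keep matching cells, overwrite the rest; object dtype allows mixed values
result = right.astype(object).where(same, "Mismatch").reset_index()
print(result)
```

Unlike the positional row-by-row comparisons, this does not depend on the two frames being sorted the same way, since rows are matched by User_id.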

How can I use `pivot` to track wins and loses?

Suppose I have some team data as a dataframe df.
home_team home_score away_team away_score
A 3 C 1
B 1 A 0
C 3 B 2
I'd like to build a dataframe indicating how many times one team has beaten another. So for instance the entry in [1,3] would be the number of times team 1 has beaten team 3, while the entry in [3,1] would be the number of times team 3 has beaten team 1.
This sounds like something df.pivot should be able to do, but I can't seem to get it to do what I would like.
How can I accomplish this using pandas?
Here is a desired output
A B C
A 0 0 1
B 1 0 0
C 0 1 0
This will create a new dataframe with just the winners and losers. It can be pivoted to create what you are looking for.
I made some additional data to fill in some of the pivot table values.
import pandas as pd
data = {'home_team':['A','B','C','A','B','C','A','B','C'],
'home_score':[3,1,3,0,1,2,0,4,0],
'away_team':['C','A','B','B','C','B','C','A','A'],
'away_score':[1,0,2,2,0,3,1,7,1]}
df = pd.DataFrame(data)
# create new dataframe
WL = pd.DataFrame()
WL['winner'] = pd.concat([df.home_team[df.home_score>df.away_score],
                          df.away_team[df.home_score<df.away_score]], axis=0)
WL['loser'] = pd.concat([df.home_team[df.home_score<df.away_score],
                         df.away_team[df.home_score>df.away_score]], axis=0)
WL['game'] = 1
# groupby to count the number of win/lose pairs
WL_gb = WL.groupby(['winner','loser']).count().reset_index()
# pivot the data
WL_piv = WL_gb.pivot(index='winner', columns='loser', values='game')
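As an alternative to groupby plus pivot, the same table can probably be built with pd.crosstab, which counts winner/loser pairs directly. This is a sketch using the three games from the question; skipping tied games and filling teams without wins or losses with 0 are choices made for this example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"home_team": ["A", "B", "C"],
                   "home_score": [3, 1, 3],
                   "away_team": ["C", "A", "B"],
                   "away_score": [1, 0, 2]})

# ignore ties, which have neither a winner nor a loser
decided = df[df["home_score"] != df["away_score"]]
home_won = decided["home_score"] > decided["away_score"]

winner = np.where(home_won, decided["home_team"], decided["away_team"])
loser = np.where(home_won, decided["away_team"], decided["home_team"])

# make sure every team appears on both axes, even with no wins or no losses
teams = sorted(set(df["home_team"]) | set(df["away_team"]))
wins = (pd.crosstab(pd.Series(winner, name="winner"),
                    pd.Series(loser, name="loser"))
          .reindex(index=teams, columns=teams, fill_value=0))
print(wins)
```

Here wins.loc["A", "C"] is the number of times A has beaten C, matching the row/column convention of the desired output.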
