pandas adding grouped data frame to another data frame as row - python

I have the following dataframe:
category_name amount
Blades & Razors & Foam 158
Diaper 486
Empty 193
Fem Care 2755
HairCare 3490
Irrelevant 1458
Laundry 889
Oral Care 2921
Others 69
Personal Cleaning Care 1543
Skin Care 645
I want to add it as a row to the following dataframe, which has an additional retailer column that is absent from the first dataframe.
categories_columns = ['retailer'] + self.product_list.category_name.unique().tolist()
categories_df = pd.DataFrame(columns=categories_columns)
And if some category is missing, I just want a zero value.
Any ideas?

Use set_index to move the category_name column into the index. Then taking the transpose (.T) will move the category_names into the column index:
In [35]: df1
Out[35]:
amount cat
0 0 A
1 1 B
2 2 C
In [36]: df1.set_index('cat').T
Out[36]:
cat A B C
amount 0 1 2
Once the category names (cat, above) are in the column index, you can concatenate
the reshaped DataFrame with the second DataFrame using append or pd.concat.
pd.concat fills missing values with NaN. Use fillna(0) to replace the NaNs with 0.
import numpy as np
import pandas as pd
df1 = pd.DataFrame({'amount': range(3), 'cat': list('ABC')})
df2 = pd.DataFrame(np.arange(2*4).reshape(2, 4), columns=list('ABCD'))
result = df2.append(df1.set_index('cat').T).fillna(0)
print(result)
yields
A B C D
0 0 1 2 3.0
1 4 5 6 7.0
amount 0 1 2 0.0

Just append and replace the NaNs:
pd.DataFrame(columns=products).append(df.T).fillna(0)
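Note that DataFrame.append was removed in pandas 2.0. Here is a minimal pd.concat-based sketch of the same idea; the small frames are only stand-ins for the question's data, not the real product list:
import pandas as pd
# stand-in for the grouped category/amount frame from the question
amounts = pd.DataFrame({'category_name': ['Diaper', 'Laundry'], 'amount': [486, 889]})
# stand-in for categories_df, which already has the full column list including 'retailer'
categories_df = pd.DataFrame(columns=['retailer', 'Diaper', 'Laundry', 'Oral Care'])
# reshape so category names become columns, then concatenate and fill the gaps with 0
row = amounts.set_index('category_name').T
result = pd.concat([categories_df, row], ignore_index=True).fillna(0)
print(result)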

Related

Pandas: Search and match based on two conditions

I am using the code below to search a .csv file, match a column in both files, and grab a different column that I want to add as a new column. However, I am trying to make the match based on two columns instead of one. Is there a way to do this?
import pandas as pd
df1 = pd.read_csv("matchone.csv")
df2 = pd.read_csv("comingfrom.csv")
def lookup_prod(ip):
    for row in df2.itertuples():
        if ip in row[1]:
            return row[3]
    else:
        return '0'
df1['want'] = df1['name'].apply(lookup_prod)
df1[df1.want != '0']
print(df1)
#df1.to_csv('file_name.csv')
The code above makes a search on the 'name' column in both files and gets the column I request ([3]) from df2. I want the code to match on both the 'name' column and another column, 'price', and only take the value in ([3]) when both columns match between df1 and df2.
df 1 :
name price value
a 10 35
b 10 21
c 10 33
d 10 20
e 10 88
df 2 :
name price want
a 10 123
b 5 222
c 10 944
d 10 104
e 5 213
When the code is run (asking for the want column from df2, based only on whether df1 name = df2 name), the produced result is:
name price value want
a 10 35 123
b 10 21 222
c 10 33 944
d 10 20 104
e 10 88 213
However, what I want is if both df1 name = df2 name and df1 price = df2 price, then take the column df2 want, so the desired result is:
name price value want
a 10 35 123
b 10 21 0
c 10 33 944
d 10 20 104
e 10 88 0
You need to use pandas.DataFrame.merge() method with multiple keys:
df1.merge(df2, on=['name','price'], how='left').fillna(0)
The method represents missing values as NaN, so the column's dtype changes to float64, but you can change it back after filling the missing values with 0.
Also please be aware that duplicated combinations of name and price in df2 will appear several times in the result.
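For example, a minimal sketch using the sample data from the question, filling the NaNs from the left join and casting the want column back to an integer dtype:
import pandas as pd
df1 = pd.DataFrame({'name': list('abcde'), 'price': [10, 10, 10, 10, 10], 'value': [35, 21, 33, 20, 88]})
df2 = pd.DataFrame({'name': list('abcde'), 'price': [10, 5, 10, 10, 5], 'want': [123, 222, 944, 104, 213]})
merged = df1.merge(df2, on=['name', 'price'], how='left')
# fillna leaves the column as float64, so cast back once the gaps are filled with 0
merged['want'] = merged['want'].fillna(0).astype(int)
print(merged)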
If you are matching the two dataframes based on the name and the price, you can use df.where and df.isin
df1['want'] = df2['want'].where(df1[['name','price']].isin(df2).all(axis=1)).fillna('0')
df1
name price value want
0 a 10 35 123.0
1 b 10 21 0
2 c 10 33 944.0
3 d 10 20 104.0
4 e 10 88 0
Expanding on https://stackoverflow.com/a/73830294/20110802:
You can add the validate option to the merge in order to avoid duplication on one side (or both):
pd.merge(df1, df2, on=['name','price'], how='left', validate='1:1').fillna(0)
Also, if the float conversion is a problem for you, one option is to do an inner join first and then pd.concat the result with the "leftover" df1 where you already added a constant valued column. Would look something like:
df_inner = pd.merge(df1, df2, on=['name', 'price'], how='inner', validate='1:1')
merged_pairs = set(zip(df_inner.name, df_inner.price))
df_anti = df1.loc[~pd.Series(zip(df1.name, df1.price)).isin(merged_pairs)]
df_anti['want'] = 0
df_result = pd.concat([df_inner, df_anti]) # perhaps ignore_index=True ?
Looks complicated, but it should be quite performant because it filters by set. I think there might be a possibility to set name and price as the index, merge on the index, and then filter by index to avoid the zip-set shenanigans, but I'm no expert on MultiIndex handling.
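A rough sketch of that index-based variant (my own guess at what it could look like, reusing df1 and df2 from above; untested against the answer's performance claims):
left = df1.set_index(['name', 'price'])
right = df2.set_index(['name', 'price'])
inner = left.join(right, how='inner')
# rows of df1 whose (name, price) pair found no match get a constant want of 0
anti = left.loc[~left.index.isin(inner.index)].assign(want=0)
df_result = pd.concat([inner, anti]).reset_index()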
# Try this code; it will give you the expected results
import pandas as pd
df1 = pd.DataFrame({'name': ['a', 'b', 'c', 'd', 'e'],
                    'price': [10, 10, 10, 10, 10],
                    'value': [35, 21, 33, 20, 88]})
df2 = pd.DataFrame({'name': ['a', 'b', 'c', 'd', 'e'],
                    'price': [10, 5, 10, 10, 5],
                    'want': [123, 222, 944, 104, 213]})
new = pd.merge(df1,df2, how='left', left_on=['name','price'], right_on=['name','price'])
print(new.fillna(0))

assign one column value to another column based on condition in pandas

I want to know how to assign one column's value to another column if it has a null or 0 value.
I have a dataframe like this:
id column1 column2
5263 5400 5400
4354 6567 Null
5656 5456 5456
5565 6768 3489
4500 3490 Null
The Expected Output is
id column1 column2
5263 5400 5400
4354 6567 6567
5656 5456 5456
5565 6768 3489
4500 3490 3490
that is,
if df['column2'] is Null/0, then it should take the value of df['column1'].
Can someone explain how I can achieve my desired output?
Based on the answers to this similar question, you can do the following:
Using np.where:
df['column2'] = np.where((df['column2'] == 'Null') | (df['column2'] == 0), df['column1'], df['column2'])
Instead, using only pandas (with .loc to avoid chained assignment):
df.loc[(df['column2'] == 0) | (df['column2'] == 'Null'), 'column2'] = df['column1']
Here's my suggestion. Not sure whether it is the fastest, but it should work here ;)
# we start by creating an empty list
column2 = []
# for each row in the dataframe
for i in df.index:
    # if the value of column2 is null or 0, then it takes the value of column1
    if df.loc[i, 'column2'] in ['null', 0]:
        column2.append(df.loc[i, 'column1'])
    # else it takes the value of column2
    else:
        column2.append(df.loc[i, 'column2'])
# we replace the current column2 by the new one
df['column2'] = column2
Update using only Native Pandas Functionality
#Creates boolean array conditionCheck, checking conditions for each row in df
#Where() will only update when conditionCheck == False, so inverted boolean values using "~"
conditionCheck = ~((df['column2'].isna()) | (df['column2']==0))
df["column2"].where(conditionCheck,df["column1"],inplace=True)
print(df)
Code to Generate Sample DataFrame
Changed row 3 of column2 to 0 to test all scenarios
import numpy as np
import pandas as pd
data = [
    [5263, 5400, 5400],
    [4354, 6567, None],
    [5656, 5456, 0],
    [5565, 6768, 3489],
    [4500, 3490, None],
]
df = pd.DataFrame(data,columns=["id","column1","column2"],dtype=pd.Int64Dtype())
Similar question was already solved here.
"Null" is not a keyword in Python; empty cells in pandas are np.nan. So, assuming you mean np.nan, one good way to achieve your desired output would be:
Create a boolean mask to select rows with np.nan or 0 value and then copy when mask is True.
mask = (df['column2'].isna()) | (df['column2']==0)
df.loc[mask, "column2"] = df.loc[mask, "column1"]
Just use ffill(). Go through the example.
from pandas import DataFrame as df
import numpy as np
import pandas as pd
items = [1,2,3,4,5]
place = [6,7,8,9,10]
quality = [11,np.nan,12,13,np.nan]
df = pd.DataFrame({"A":items, "B":place, "C":quality})
print(df)
"""
A B C
0 1 6 11.0
1 2 7 NaN
2 3 8 12.0
3 4 9 13.0
4 5 10 NaN
"""
aa = df.ffill(axis=1).astype(int)
print(aa)
"""
A B C
0 1 6 11
1 2 7 7
2 3 8 12
3 4 9 13
4 5 10 10
"""

How to fill a column under certain conditions?

I have two data frames, df (with 15000 rows) and df1 (with 20000 rows),
where df looks like:
Number Color Code Quantity
1 Red 12380 2
2 Bleu 14440 3
3 Red 15601 1
and df1 has two columns, Code and Quantity, where I want to fill the Quantity column under certain conditions using Python in order to obtain this:
Code Quantity
12380 2
15601 1
15640 1
14400 0
The conditions that I want to take in considerations are:
If the last two characters of the Code column of df1 are both equal to zero, I want to have 0 in the Quantity column of df1
If I don't find the Code in df, I put 1 in the Quantity column of df1
Otherwise I take the Quantity value from df
Let us try:
mask = df1['Code'].astype(str).str[-2:].eq('00')
mapped = df1['Code'].map(df.set_index('Code')['Quantity'])
df1['Quantity'] = mapped.mask(mask, 0).fillna(1)
Details:
Create a boolean mask specifying the condition where the last two characters of Code are both 0:
>>> mask
0 False
1 False
2 False
3 True
Name: Code, dtype: bool
Using Series.map, map the values in the Code column of df1 to the Quantity values in df based on the matching Code:
>>> mapped
0 2.0
1 1.0
2 NaN
3 NaN
Name: Code, dtype: float64
mask the values in the above mapped column where the boolean mask is True, and lastly fill the NaN values with 1:
>>> df1
Code Quantity
0 12380 2.0
1 15601 1.0
2 15640 1.0
3 14400 0.0
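If the float dtype is unwanted, a small optional follow-up (assuming no NaN remains after the fill) is to cast the column back to integers:
df1['Quantity'] = df1['Quantity'].astype(int)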

Populating a column based on values in another column - pandas

After merging two data frames I have some gaps in my data frame that can be filled in based on neighboring columns (I have many more columns and rows in the DF, but I'm focusing on these three columns):
Example DF:
Unique ID | Type | Location
A 1 Land
A NaN NaN
B 2 sub
B NaN NaN
C 3 Land
C 3 Land
Ultimately I want the three columns to be filled in:
Unique ID | Type | Location
A 1 Land
A 1 Land
B 2 sub
B 2 sub
C 3 Land
C 3 Land
I've tried:
df.loc[df.Type.isnull(), 'Type'] = df.loc[df.Type.isnull(), 'Unique ID'].map(df.loc[df.Type.notnull()].set_index('Unique ID')['Type'])
but it throws:
InvalidIndexError: Reindexing only valid with uniquely valued Index objects
What am I missing here? - Thanks
Your example indicates that you want to forward-fill. You can do it like this (complete code):
import pandas as pd
from io import StringIO
clientdata = '''ID N T
A 1 Land
A NaN NaN
B 2 sub
B NaN NaN
C 3 Land
C 3 Land'''
df = pd.read_csv(StringIO(clientdata), sep='\s+')
df["N"] = df["N"].fillna(method="ffill")
df["T"] = df["T"].fillna(method="ffill")
print(df)
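As a side note (my own addition, not part of the answer above): if rows belonging to the same ID are not guaranteed to be adjacent, forward-filling within each ID group is the safer variant:
df[["N", "T"]] = df.groupby("ID")[["N", "T"]].ffill()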
The best solution is to probably just get rid of the NaN rows instead of overwriting them. Pandas has a simple command for that:
df.dropna()
Here's the documentation for it: pandas.DataFrame.dropna

Sort pandas dataframe both on values of a column and index?

Is it feasible to sort pandas dataframe by values of a column, but also by index?
If you sort a pandas dataframe by the values of a column, you get the resultant dataframe sorted by that column, but unfortunately the order of the dataframe's index becomes messy within the same value of the sorted column.
So, can I sort a dataframe by a column, such as the column named count, but also sort it by the index? And is it also feasible to sort the column in descending order while sorting the index in ascending order?
I know how to sort multiple columns in a dataframe, and I also know I can achieve what I'm asking here by calling reset_index(), sorting, and then recreating the index. But is there a more intuitive and efficient way to do it?
Pandas 0.23 finally gets you there :-D
You can now pass index names (and not only column names) as parameters to sort_values. So, this one-liner works:
df = df.sort_values(by = ['MyCol', 'MyIdx'], ascending = [False, True])
And if your index is currently unnamed:
df = df.rename_axis('MyIdx').sort_values(by = ['MyCol', 'MyIdx'], ascending = [False, True])
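A small self-contained example of that one-liner, with data invented purely for illustration:
import pandas as pd
df = pd.DataFrame({'count': [2, 1, 2, 1]}, index=[3, 1, 0, 2])
result = df.rename_axis('MyIdx').sort_values(by=['count', 'MyIdx'], ascending=[False, True])
print(result)
# count is sorted descending; within ties, MyIdx is ascending (0, 3, then 1, 2)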
In pandas 0.23+ you can do it directly - see OmerB's answer. If you don't yet have 0.23+, read on.
I'd venture that the simplest way is to just copy your index over to a column, and then sort by both.
df['colFromIndex'] = df.index
df = df.sort(['count', 'colFromIndex'])
I'd also prefer to be able to just do something like df.sort(['count', 'index']), but of course that doesn't work.
As of pandas version 0.22.
You can temporarily set the column as an index, sort the index on that column and then reset. By default it will maintain the order of the existing index:
df = df.set_index('column_name', append=True).sort_index(level=1).reset_index(level=1)
I think the above could be done with 'inplace' options but I think it's easier to read as above.
You can use the ascending parameter in sort_index, but you must pass it as a list for it to work correctly as of pandas 0.22.0.
import pandas as pd
import numpy as np
df = pd.DataFrame({'idx_0': [2]*6 + [1]*5,
                   'idx_1': [6, 4, 2, 10, 18, 5, 11, 1, 7, 9, 3],
                   'value_1': np.arange(11, 0, -1),
                   'MyName': list('SORTEDFRAME')})
df = df.set_index(['idx_0','idx_1'])
df
Output:
MyName value_1
idx_0 idx_1
2 6 S 11
4 O 10
2 R 9
10 T 8
18 E 7
5 D 6
1 11 F 5
1 R 4
7 A 3
9 M 2
3 E 1
Sorting by values and index should get "FRAMESORTED" instead of "SORTEDFRAME"
df.sort_values('value_1', ascending=False) \
  .sort_index(level=0, ascending=[True])
Output:
MyName value_1
idx_0 idx_1
1 11 F 5
1 R 4
7 A 3
9 M 2
3 E 1
2 6 S 11
4 O 10
2 R 9
10 T 8
18 E 7
5 D 6
Note that you must pass the ascending parameter to sort_index as a list and not as a scalar; otherwise it will not work.
To sort a column descending, while maintaining the index ascending:
import pandas as pd
df = pd.DataFrame(index=range(5), data={'c': [4,2,2,4,2]})
df.index = df.index[::-1]
print df.sort(column='c', ascending=False)
Output:
c
1 4
4 4
0 2
2 2
3 2
You can use a combination of groupby and apply:
In [2]: df = pd.DataFrame({
'transID': range(8),
'Location': ['New York','Chicago','New York','New York','Atlanta','Los Angeles',
'Chicago','Atlanta'],
'Sales': np.random.randint(0,10000,8)}).set_index('transID')
In [3]: df
Out[3]:
Location Sales
transID
0 New York 1082
1 Chicago 1664
2 New York 692
3 New York 5669
4 Atlanta 7715
5 Los Angeles 987
6 Chicago 4085
7 Atlanta 2927
In [4]: df.groupby('Location').apply(lambda d: d.sort()).reset_index('Location',drop=True)
Out[4]:
Location Sales
transID
4 Atlanta 7715
7 Atlanta 2927
1 Chicago 1664
6 Chicago 4085
5 Los Angeles 987
0 New York 1082
2 New York 692
3 New York 5669
I drop 'Location' in the last line because groupby inserts the grouped levels into the first positions in the index. Sorting and then dropping them preserves the sorted order.
