I created a df from the data in my Excel sheet. In one specific column most of the values are the same, but some of them are different. What I want to do is find which rows contain these different values and associate each one with another value from the same row. Here is an example:
ColA ColB
'Ship' 5
'Ship' 5
'Car' 3
'Ship' 5
'Plane' 2
Following the example, is there a way to find where the values different from 5 are, with the code giving me the respective value from ColA? In this case, that would mean finding 3 and 2 and returning 'Car' and 'Plane', respectively.
Any help is welcome! :)
It depends on exactly what you want to do, but you could use:
a filter - to filter for the value you seek.
.where - to keep the values where a condition is True and replace the rest with NaN.
Given the above dataframe the following would work:
df['different'] = df['ColB']==5
df['type'] = df['ColA'].where(df['different']==False)
print(df)
Which returns this:
ColA ColB different type
0 Ship 5 True NaN
1 Ship 5 True NaN
2 Car 3 False Car
3 Ship 5 True NaN
4 Plane 2 False Plane
The 4th column has what you seek...
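If only the matching ColA values are needed, without adding helper columns, a boolean mask with .loc returns them directly. This is a minimal sketch using the frame from the question. Taking the most frequent ColB value as the "usual" one is an assumption added here for the case where 5 is not known in advance.

```python
import pandas as pd

df = pd.DataFrame({
    "ColA": ["Ship", "Ship", "Car", "Ship", "Plane"],
    "ColB": [5, 5, 3, 5, 2],
})

# Assumption: the "usual" value is the most frequent one in ColB.
common = df["ColB"].mode().iloc[0]

# Keep only the rows whose ColB differs from the common value.
odd_rows = df.loc[df["ColB"] != common, ["ColA", "ColB"]]
print(odd_rows)
```

The original row labels are preserved in odd_rows, so the index also tells you where in the sheet the unusual values sit.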
I have a huge df with missing entries in the brand column that need to be filled according to other rows. If all 3 other columns match, fill the blank with the existing brand; otherwise fill it with 'Other'.
if this is my starting df:
df_start = pd.DataFrame({'device_id':[1,1,1,1,2,2,3,3,3,3,4,4,4,4],
'head':['a','a','b','b','a','b','a','b','b','b','a','b','c','d'],
'supplement':['Salt','Salt','Pepper','Pepper','Pepper','Pepper','Salt','Pepper','Salt','Pepper','Pepper','Salt','Pepper','Salt'],
'brand':['white',np.nan,np.nan,'white','white','black',np.nan,np.nan,'white','black',np.nan,'white','black',np.nan]})
how to get this result:
df_end = pd.DataFrame({'device_id':[1,1,1,1,2,2,3,3,3,3,4,4,4,4],
'head':['a','a','b','b','a','b','a','b','b','b','a','b','c','d'],
'supplement':['Salt','Salt','Pepper','Pepper','Pepper','Pepper','Salt','Pepper','Salt','Pepper','Pepper','Salt','Pepper','Salt'],
'brand':['white','white','white','white','white','black','Other','black','white','black','Other','white','black','Other']})
You could try with a groupby on the columns that need to be the same, in your case 'device_id', 'head', 'supplement', and use forward fill ffill(), backward fill bfill(), and at the very end you fillna() with 'Other', as the leftovers will be the ones with no identical rows in those 3 columns:
result = df_start.groupby(['device_id','head','supplement'])\
.apply(lambda x: x.ffill().bfill().fillna('Other'))
prints:
>>> result
device_id head supplement brand
0 1 a Salt white
1 1 a Salt white
2 1 b Pepper white
3 1 b Pepper white
4 2 a Pepper white
5 2 b Pepper black
6 3 a Salt Other
7 3 b Pepper black
8 3 b Salt white
9 3 b Pepper black
10 4 a Pepper Other
11 4 b Salt white
12 4 c Pepper black
13 4 d Salt Other
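Depending on the pandas version, groupby(...).apply may attach the group keys to the result index or warn about operating on the grouping columns. A variant that only touches the brand column through transform sidesteps both. This is a sketch that assumes the df_start from the question. The name keys is introduced here for illustration.

```python
import numpy as np
import pandas as pd

df_start = pd.DataFrame({
    'device_id': [1, 1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4],
    'head': ['a', 'a', 'b', 'b', 'a', 'b', 'a', 'b', 'b', 'b', 'a', 'b', 'c', 'd'],
    'supplement': ['Salt', 'Salt', 'Pepper', 'Pepper', 'Pepper', 'Pepper', 'Salt',
                   'Pepper', 'Salt', 'Pepper', 'Pepper', 'Salt', 'Pepper', 'Salt'],
    'brand': ['white', np.nan, np.nan, 'white', 'white', 'black', np.nan,
              np.nan, 'white', 'black', np.nan, 'white', 'black', np.nan],
})

keys = ['device_id', 'head', 'supplement']  # columns that must match

result = df_start.copy()
# Fill gaps from other rows of the same group, in both directions;
# groups with no known brand stay NaN and are set to 'Other' afterwards.
result['brand'] = (
    df_start.groupby(keys)['brand']
    .transform(lambda s: s.ffill().bfill())
    .fillna('Other')
)
print(result)
```

Because transform returns a Series aligned with the original index, the row order of df_start is kept.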
A solution not requiring a group by (costly), based on a simple mapping.
from collections import defaultdict
# create a mapping: a defaultdict keyed by ('device_id', 'head', 'supplement')
# that returns 'Other' when the key is missing
mapping = defaultdict(lambda: 'Other')
mapping.update(df_start.dropna()\
.set_index(['device_id', 'head', 'supplement'])['brand']\
.to_dict())
# apply function using the mapping to get the brand
brand = df_start.iloc[:, :-1].apply(lambda row: mapping[tuple(row)], axis=1)
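You can replace the NaN values in the brand column after creating the dataframe. This may not be the most efficient way, but it is the simplest one. Note that it fills every missing brand with 'Other', including those that could be taken from a matching row, and that the result has to be assigned back:
df['brand'] = df['brand'].replace(np.nan, "Other")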
I typically filter a pandas DataFrame using the following syntax:
FDF = DF[DF['Color'] == 'Blue']
I expect to see a result where FDF, which is my filtered DataFrame, contains just the rows where the Color column is set to Blue. Instead, I get something like the output below. The funny thing is that the program used to work as expected, but it stopped working after I upgraded my operating system and re-installed Python and all of the libraries. Also, it does not do this on all of my DataFrames. Any ideas?
0 Color Shape Data
1 NaN NaN NaN
2 NaN NaN NaN
3 NaN NaN NaN
4 Blue NaN NaN
5 NaN NaN NaN
6 NaN NaN NaN
Edit: I think the first 3 responses misunderstood the question. I am showing the result, not the original DF. My original DF looks like this:
Color Shape Data
0 Green square Y
1 Red triangle N
2 Red circle Y
3 Blue circle N
4 Green square N
5 Red triangle N
The result I am expecting is:
Color Shape Data
3 Blue circle N
('Color', 'Shape', 'Data') are NOT your column names, but your first row of data; otherwise there wouldn't be an index 0 assigned to this row. Since DF has no column named 'Color', DF['Color'] == 'Blue' doesn't filter out anything, and therefore it would return all records.
If you imported your data from a csv or Excel sheet, I'd suggest you specify using the first row of your file as column names.
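As a rough illustration of that suggestion, the sketch below reads a small CSV with the first line used as the header and then applies the filter from the question. The CSV text is a hypothetical stand-in for the real file, built from the data shown in the question.

```python
import io
import pandas as pd

# Hypothetical file contents standing in for the real CSV.
csv_text = """Color,Shape,Data
Green,square,Y
Red,triangle,N
Red,circle,Y
Blue,circle,N
Green,square,N
Red,triangle,N
"""

# header=0 tells pandas to take the column names from the first line
# (this is also what happens by default when no names are passed).
DF = pd.read_csv(io.StringIO(csv_text), header=0)

FDF = DF[DF['Color'] == 'Blue']
print(FDF)
```

pd.read_excel accepts a header argument as well, so the same idea should carry over to an Excel sheet.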
You are missing DF.loc, so you are getting unnecessary rows.
Make the first row the column header:
DF = DF.rename(columns=DF.iloc[0]).drop(DF.index[0])
Then use the below to get just the rows where color is blue :
FDF=DF.loc[DF['Color'] =='Blue']
You're trying to filter by a column name which does not exist. First make your first row the column header:
DF.columns = DF.iloc[0]
DF = DF.reindex(DF.index.drop(0))
Now filter using
FDF = DF[DF['Color'] == 'Blue']
Not sure I understand the negative ratings on this question. However, I was able to work around the issue by assigning a new index and renaming the columns.
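The workaround of assigning a new index and renaming the columns could look roughly like this. The frame below is an assumed reconstruction of a DataFrame whose real header ended up as the first data row. The name fixed is introduced here for illustration.

```python
import pandas as pd

# Assumed shape of the problem: the header is stored as row 0 of the data.
DF = pd.DataFrame([
    ['Color', 'Shape', 'Data'],
    ['Green', 'square', 'Y'],
    ['Red', 'triangle', 'N'],
    ['Red', 'circle', 'Y'],
    ['Blue', 'circle', 'N'],
    ['Green', 'square', 'N'],
    ['Red', 'triangle', 'N'],
])

# Drop the header row from the data and renumber the index from 0.
fixed = DF.iloc[1:].reset_index(drop=True)
# Rename the columns using the values of the former first row.
fixed.columns = DF.iloc[0].tolist()

FDF = fixed[fixed['Color'] == 'Blue']
print(FDF)
```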
I have a rather basic question for pandas, but I've tried merge and join to no success
-edit: these are in the same dataframe, and that wasn't clear. We are indeed condensing the data.
print(df)
product_code_shipped quantity product_code
0 A12395 1 A12395
1 H53456 4 D78997
2 A13456 3 E78997
3 A12372 8 A13456
4 E28997 1 D83126
5 B78997 2 C64516
6 C78117 9 B78497
7 B78227 1 H53456
8 B78497 2 J12372
So I want just one product code column containing the unique product codes together with their other data, such as quantity and color. I only want the product codes of the shipped products (color is in another column). How do I do this inside the same dataframe?
So I should get
print(df2)
product_code_shipped quantity product_code color
0 A12395 1 A12395 red
1 H53456 4 H53456 blue
2 B78497 2 B78497 yellow
I'm a little confused by your question, specifically where "unique product codes" enter in...are we condensing the data? The example does not make that clear. Nonetheless I'll give it a shot:
Many DataFrame methods rely on the indexes to automatically align data. In your case, it seems convenient to set the index of these DataFrames to the product code. So you'd have this:
In [132]: shipped
Out[132]:
quantity
product_code_shipped
A 1
B 4
C 2
In [133]: info
Out[133]:
color
product_code
A red
B blue
C yellow
Now, join requires no extra parameters; it gives you exactly what (I think) you want.
In [134]: info.join(shipped)
Out[134]:
color quantity
product_code
A red 1
B blue 4
C yellow 2
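For reference, the two frames above can be built and joined as follows. This is only a sketch with the toy product codes A, B and C from this answer, not the real data.

```python
import pandas as pd

# Quantities, indexed by the shipped product code.
shipped = pd.DataFrame(
    {'quantity': [1, 4, 2]},
    index=pd.Index(['A', 'B', 'C'], name='product_code_shipped'),
)

# Product information, indexed by product code.
info = pd.DataFrame(
    {'color': ['red', 'blue', 'yellow']},
    index=pd.Index(['A', 'B', 'C'], name='product_code'),
)

# join aligns the rows on the index labels; by default it is a left join,
# so every row of info is kept.
joined = info.join(shipped)
print(joined)
```

If the codes start out as ordinary columns, calling set_index on the relevant column first gives frames of this shape.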
If this doesn't answer your question, please clarify it by giving example input including where color comes from and the exact output that would come from that input.