How do I subset a DataFrame using the contents of another DataFrame's column?
df1 = pd.DataFrame({"0": ['one', 'two', 'three', 'four'], "Index": [1, 2, 3, 4]})
df2 = pd.DataFrame({"0": ['two', 'two', 'three', 'three']})
a = [i for i in df1['0'] if i in df2['0']]
Results
print(a)
[]
Desired output:
print(a)
0 index
0 two 2
1 three 3
If you really want to keep this strange, unnecessary list comprehension, you need to test against s.values, because the in operator applied to a Series checks its index labels, not its values:
>>> [i for i in df1['0'].values if i in df2['0'].values]
['two', 'three']
But to get directly what your desired output implies, just select the entries in df1 where the 0 column value is among df2's 0 column values:
>>> df1[df1['0'].isin(df2['0'])]
0 Index
1 two 2
2 three 3
To reset the index as well, call reset_index(drop=True) on the result.
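To make the difference concrete, here is a minimal sketch that reuses df1 and df2 from the question. The name `result` is only illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({"0": ["one", "two", "three", "four"], "Index": [1, 2, 3, 4]})
df2 = pd.DataFrame({"0": ["two", "two", "three", "three"]})

# `in` on a Series checks the index labels (0, 1, 2, 3), not the values.
print("two" in df2["0"])         # False
print(1 in df2["0"])             # True: 1 is an index label
print("two" in df2["0"].values)  # True: membership in the underlying array

# Build a boolean mask with isin, filter, then renumber the rows from 0.
result = df1[df1["0"].isin(df2["0"])].reset_index(drop=True)
print(result)
```

After the reset, the rows for 'two' and 'three' are labelled 0 and 1, which matches the desired output in the question.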
Related
I have a DataFrame that has integers for column names that looks like this:
1 2 3 4
Red 7 3 2 9
Blue 3 1 6 4
I'd like to rename the columns. I tried using the following
df = df.rename(columns={'1': 'One', '2': 'Two', '3': 'Three', '4': 'Four'})
However that doesn't change the column names. Do I need to do something else to change column names when they are numbers?
You need to remove the quotes:
df = df.rename(columns={1: 'One', 2: 'Two', 3: 'Three', 4: 'Four'})
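The reason the quoted keys do nothing is that rename ignores keys that do not match an existing label by default, and the string '1' is not the integer 1. A small sketch, assuming a frame rebuilt from the values shown in the question (the names `unchanged` and `renamed` are illustrative):

```python
import pandas as pd

# Reconstruction of the question's frame, with integer column labels.
df = pd.DataFrame(
    [[7, 3, 2, 9], [3, 1, 6, 4]],
    index=["Red", "Blue"],
    columns=[1, 2, 3, 4],
)

# String keys do not match the integer labels, so nothing is renamed.
unchanged = df.rename(columns={"1": "One", "2": "Two"})
print(list(unchanged.columns))  # [1, 2, 3, 4]

# Integer keys match the labels.
renamed = df.rename(columns={1: "One", 2: "Two", 3: "Three", 4: "Four"})
print(list(renamed.columns))  # ['One', 'Two', 'Three', 'Four']

# errors="raise" turns the silent no-op into a KeyError.
try:
    df.rename(columns={"1": "One"}, errors="raise")
except KeyError as exc:
    print("KeyError:", exc)
```

Passing errors="raise" is a convenient way to catch this kind of type mismatch early.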
What if you use the following:
>>> df.columns = ['One', 'Two', 'Three', 'Four']
>>> df
One Two Three Four
Red 7 3 2 9
Blue 3 1 6 4
There are two ways to change column names in a pandas DataFrame.
Assigning a new list of names to the df.columns attribute:
df.columns = ['One', 'Two', 'Three', 'Four']
Using the rename() method:
df = df.rename(columns={1: 'One', 2: 'Two', 3: 'Three', 4: 'Four'})
In all the examples and answers I've seen on here, whenever there is a need to add an empty row to a pandas DataFrame, they all use:
ignore_index=True
What should I do if I want to keep the current index and append an empty row to the DataFrame with a given index label?
You can use reindex:
df.reindex(df.index.values.tolist()+['Yourindex'])
Out[1479]:
A B
0 one Aa
1 one Bb
2 two Cc
Yourindex NaN NaN
Data input
df = pd.DataFrame({'A' : ['one', 'one', 'two'] ,
'B' : ['Aa', 'Bb', 'Cc'] })
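Put together as a runnable sketch: reindex returns a new frame, and any label that is not in the original index gets a row of NaN. The name `extended` is illustrative, and df.index.tolist() is used here as a shorter spelling of df.index.values.tolist().

```python
import pandas as pd

df = pd.DataFrame({"A": ["one", "one", "two"], "B": ["Aa", "Bb", "Cc"]})

# Existing labels keep their rows; the new label gets an all-NaN row.
extended = df.reindex(df.index.tolist() + ["Yourindex"])
print(extended)

# The original frame is left untouched.
print(len(df), len(extended))  # 3 4
```

Because reindex does not modify df in place, assign the result back to df if you want to keep the extra row.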
I was trying to understand the pandas.DataFrame.groupby() function and came across this example in the documentation:
In [1]: df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',
...: 'foo', 'bar', 'foo', 'foo'],
...: 'B' : ['one', 'one', 'two', 'three',
...: 'two', 'two', 'one', 'three'],
...: 'C' : np.random.randn(8),
...: 'D' : np.random.randn(8)})
...:
In [2]: df
Out[2]:
A B C D
0 foo one 0.469112 -0.861849
1 bar one -0.282863 -2.104569
2 foo two -1.509059 -0.494929
3 bar three -1.135632 1.071804
4 foo two 1.212112 0.721555
5 bar two -0.173215 -0.706771
6 foo one 0.119209 -1.039575
7 foo three -1.044236 0.271860
Now, to explore further, I did this:
print(df.groupby('B').head())
It outputs the same DataFrame, but when I do this:
print(df.groupby('B'))
it gives me this:
<pandas.core.groupby.DataFrameGroupBy object at 0x7f65a585b390>
What does this mean? On a normal DataFrame, printing .head() simply outputs the first 5 rows. What's happening here?
Also, why does printing .head() give the same output as the DataFrame? Shouldn't it be grouped by the elements of column 'B'?
When you use just
df.groupby('A')
You get a GroupBy object. You haven't applied any function to it at that point. Under the hood, while this definition might not be perfect, you can think of a groupby object as:
An iterator of (group, DataFrame) pairs, for DataFrames, or
An iterator of (group, Series) pairs, for Series.
To illustrate:
import pandas as pd
df = pd.DataFrame({'A' : [1, 1, 2, 2], 'B' : [1, 2, 3, 4]})
grouped = df.groupby('A')
# each `i` is a tuple of (group, DataFrame)
# so your output here will be a little messy
for i in grouped:
    print(i)
(1, A B
0 1 1
1 1 2)
(2, A B
2 2 3
3 2 4)
# this version unpacks each tuple into two loop variables
# in a single loop. each `group` is a group key, each
# `group_df` is its corresponding DataFrame (named so that
# it does not overwrite `df`)
for group, group_df in grouped:
    print('group of A:', group, '\n')
    print(group_df, '\n')
group of A: 1
A B
0 1 1
1 1 2
group of A: 2
A B
2 2 3
3 2 4
# and if you just wanted to visualize the groups,
# the second loop variable is a "throwaway"
for group, _ in grouped:
    print('group of A:', group, '\n')
group of A: 1
group of A: 2
Now as for .head. Just have a look at the docs for that method:
Essentially equivalent to .apply(lambda x: x.head(n))
So here you're actually applying a function to each group of the groupby object. Keep in mind that .head() defaults to n=5 and is applied to each group (each DataFrame), so because every group in your data has five or fewer rows, you get your original DataFrame back.
Consider this with the example above. If you use .head(1), you get only the first row of each group:
print(df.groupby('A').head(1))
A B
0 1 1
2 2 3
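If you only want to inspect the groups rather than loop over them, a few GroupBy methods are handy. This is a small sketch on the same four-row frame; the name `first_rows` is illustrative.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 1, 2, 2], "B": [1, 2, 3, 4]})
grouped = df.groupby("A")

# Number of rows in each group.
print(grouped.size())

# The sub-frame belonging to a single group key.
print(grouped.get_group(1))

# head(n) keeps the first n rows of every group, in original row order.
first_rows = grouped.head(1)
print(first_rows)

# When n is at least as large as every group, nothing is dropped.
print(grouped.head(5).equals(df))  # True
```

The last line is the situation from the question: every group fits within the default of 5 rows, so head() returns the whole frame.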
Say we have the following dataframe:
df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',
'foo', 'bar', 'foo', 'foo'],
'B' : ['one', 'one', 'two', 'three',
'two', 'two', 'one', 'three'],
'C' : randn(8), 'D' : randn(8)})
shown below:
> df
A B C D
0 foo one 0.846192 0.478651
1 bar one 2.352421 0.141416
2 foo two -1.413699 -0.577435
3 bar three 0.569572 -0.508984
4 foo two -1.384092 0.659098
5 bar two 0.845167 -0.381740
6 foo one 3.355336 -0.791471
7 foo three 0.303303 0.452966
And then I do the following:
df2 = df
df = df[df['C']>0]
If you now look at df and df2 you will see that df2 holds the original data, whereas df was updated to only keep the values where C was greater than 0.
I thought Pandas wasn't supposed to make a copy in an assignment like df2 = df and that it would only make copies with either:
df2 = df.copy(deep=True)
df2 = copy.deepcopy(df)
What happened above then? Did df2 = df make a copy? I presume that the answer is no, so it must have been df = df[df['C']>0] that made a copy, and I presume that, if I didn't have df2=df above, there would have been a copy without any reference to it floating in memory. Is that correct?
Note: I read through Returning a view versus a copy and I wonder if the following:
Whenever an array of labels or a boolean vector are involved in the indexing operation, the result will be a copy.
explains this behavior.
It's not that df2 = df makes a copy; it's that df = df[df['C'] > 0] returns a copy and rebinds the name df to it, while df2 still refers to the original DataFrame.
Just print out the ids and you'll see:
print(id(df))
df2 = df
print(id(df2))  # same id as above: df2 is the same object
df = df[df['C'] > 0]
print(id(df))   # different id: df now names a new object
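The same check can be written with the `is` operator, which compares object identity directly. A minimal sketch with a small made-up frame (the values and the name `original` are illustrative, not the question's data):

```python
import pandas as pd

df = pd.DataFrame({"C": [0.5, -1.0, 2.0]})

df2 = df                  # binds a second name to the same object; no copy
print(df2 is df)          # True

original = df
df = df[df["C"] > 0]      # boolean indexing builds a new DataFrame and rebinds `df`
print(df is original)     # False
print(len(df2), len(df))  # 3 2
```

Plain assignment never copies in Python; only the filtering expression on the right-hand side creates a new object.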
I want to do some analysis on my data. So far I am able to group by the columns that I want; now I need to add two columns together. Here is my logic:
import pandas as pd
df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',
'foo', 'bar', 'foo', 'bar'],
'B' : ['one', 'one', 'two', 'two',
'two', 'two', 'one', 'two'],
'C' : [-1,2,3,4,5,6,0,2],
'D' : [-1,2,3,4,5,6,0,2]})
grouped = df.groupby(['A','B']).sum()
print(grouped)
The output looks like this:
C D
A B
bar one 2 2
two 12 12
foo one -1 -1
two 8 8
[4 rows x 2 columns]
What I need now is to add columns C and D together and generate output like this:
A B Sum
bar one 4
two 24
foo one -2
two 16
Any ideas would really help, as I am new to Python.
You could define a new column Sum:
In [107]: grouped['Sum'] = grouped['C']+grouped['D']
Now grouped would look like this:
In [108]: grouped
Out[108]:
C D Sum
A B
bar one 2 2 4
two 12 12 24
foo one -1 -1 -2
two 8 8 16
[4 rows x 3 columns]
To select just the Sum column as a DataFrame rather than a Series, use double brackets:
In [109]: grouped[['Sum']]
Out[109]:
Sum
A B
bar one 4
two 24
foo one -2
two 16
[4 rows x 1 columns]
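As a possible variation, the row-wise total can also be computed with sum(axis=1), and reset_index() turns the A and B index levels back into ordinary columns, which is close to the layout asked for. A sketch using the question's data (the name `flat` is illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "A": ["foo", "bar", "foo", "bar", "foo", "bar", "foo", "bar"],
    "B": ["one", "one", "two", "two", "two", "two", "one", "two"],
    "C": [-1, 2, 3, 4, 5, 6, 0, 2],
    "D": [-1, 2, 3, 4, 5, 6, 0, 2],
})

grouped = df.groupby(["A", "B"]).sum()

# Row-wise sum over the listed columns; same result as grouped["C"] + grouped["D"].
grouped["Sum"] = grouped[["C", "D"]].sum(axis=1)

# Turn the A/B index levels back into ordinary columns.
flat = grouped[["Sum"]].reset_index()
print(flat)
```

The sum(axis=1) form scales better when more than two columns need to be added, since only the column list changes.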