Efficiently integrate a series into a pandas dataframe - python

I have a pandas dataframe with index [0, 1, 2...], and a list something like this: [1, 2, 2, 0, 1...].
I'd like to add a 'count' column to the dataframe, that reflects the number of times the digit in the index is referenced in the list.
Given the example lists above, the 'count' column would have the value 2 at index 2, because 2 occurred twice (so far). Is there a more efficient way to do this than iterating over the list?

Well, here is a way of doing it: first load the list into a DataFrame, then add an 'occurrence' column using value_counts, and then merge this onto your original df:
In [61]:
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': np.arange(10)})
l = [1, 2, 2, 0, 1]
df1 = pd.DataFrame(l, columns=['data'])
df1['occurrence'] = df1['data'].map(df1['data'].value_counts())
df1
Out[61]:
   data  occurrence
0     1           2
1     2           2
2     2           2
3     0           1
4     1           2
In [65]:
df.merge(df1, left_index=True, right_on='data', how='left').fillna(0).drop_duplicates().reset_index(drop=True)
Out[65]:
   a  data  occurrence
0  0   0.0         1.0
1  1   1.0         2.0
2  2   2.0         2.0
3  3   0.0         0.0
4  4   0.0         0.0
5  5   0.0         0.0
6  6   0.0         0.0
7  7   0.0         0.0
8  8   0.0         0.0
9  9   0.0         0.0
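If you don't need the intermediate frame, a more direct sketch of the same idea (assuming df and l as above) maps the counts straight onto the dataframe's index:
counts = pd.Series(l).value_counts()                                  # how often each index value appears in the list
df['count'] = df.index.to_series().map(counts).fillna(0).astype(int)  # 0 for index values never referenced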

Counting occurrences of numbers in a dataframe is easy in pandas:
you just use the Series.value_counts method.
Then you join the resulting counts with the original dataframe using the pandas.merge function.
Setting up a DataFrame like the one you have:
df = pd.DataFrame({'nomnom':np.random.choice(['cookies', 'biscuits', 'cake', 'lie'], 10)})
df is now a DataFrame with some arbitrary data in it (since you said you had more data in there).
nomnom
0 biscuits
1 lie
2 biscuits
3 cake
4 lie
5 cookies
6 cake
7 cake
8 cake
9 cake
Setting up a list like the one you have:
yourlist = np.random.choice(10, 10)
yourlist is now:
array([2, 9, 2, 3, 4, 8, 5, 8, 6, 8])
The actual code you need (TL;DR):
counts = pd.DataFrame(pd.Series(yourlist).value_counts())
pd.merge(left=df, left_index=True,
         right=counts, right_index=True,
         how='left').fillna(0)
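A variant sketch of the same join, naming the counts column explicitly and joining on the index instead of calling merge (df and yourlist as above):
counts = pd.Series(yourlist).value_counts().rename('count')   # counts indexed by the list's values
df.join(counts).fillna(0)                                     # left join on df's index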

Related

Get the index of n maximum values in a column in dataframe

I have a data frame and I want to get the index and value of the 4 maximum values in a column. For example, in the following df, in column a, 10, 6, 7 and 8 are the four maximum values.
import pandas as pd
df = pd.DataFrame()
df['a'] = [10, 2, 3, -1,4,5,6,7,8]
df['id'] = [100, 2, 3, -1,4,5,0,1,2]
df
The output which I want is:
Try nlargest:
df.nlargest(4, 'a').reset_index()
Output:
index a id
0 0 10 100
1 8 8 2
2 7 7 1
3 6 6 0
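If you only need the row labels rather than the whole rows, take the index of that result:
idx = df.nlargest(4, 'a').index.tolist()   # [0, 8, 7, 6]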
You can sort by the a column, take the first 4 rows, then restore the original row order:
out = (df.sort_values('a', ascending=False).iloc[:4]
.sort_index(ascending=True)
.reset_index())
print(out)
index a id
0 0 10 100
1 6 6 0
2 7 7 1
3 8 8 2
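If you want the same thing for every numeric column at once (the question mentions more than one column), a minimal sketch using Series.nlargest per column:
# the 4 largest values (with their row labels) for each column, as a dict of Series
top4 = {col: df[col].nlargest(4) for col in df.columns}
top4['a'].index.tolist()   # [0, 8, 7, 6]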

Delete the first n rows of each id in a dataframe

I have a DataFrame with two columns. I want to delete the first 3 rows of each id. If an id has three or fewer rows, delete all of its rows. In the following, the ids 3 and 1 have 3 and 2 rows respectively, so they should be deleted entirely; for ids 4 and 2, only their 4th and 5th rows are preserved.
import pandas as pd
df = pd.DataFrame()
df['id'] = [4, 4, 4, 4, 4, 2, 2, 2, 2, 2, 3, 3, 3, 1, 1]
df['value'] = [2, 1, 1, 2, 3, 4, 6, -1, -2, 2, -3, 5, 7, -2, 5]
Here is the DataFrame which I want.
Number the rows within each "id" using groupby + cumcount and keep the rows where that number is more than 2:
out = df[df.groupby('id').cumcount() > 2]
Output:
id value
3 4 2
4 4 3
8 2 -2
9 2 2
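The same pattern with n as a variable, if the 3 is not fixed (a sketch):
n = 3
out = df[df.groupby('id').cumcount() >= n]   # keep everything after the first n rows of each id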
Use Series.value_counts and Series.map to build a boolean mask of the ids that have more than 3 rows, and combine it with GroupBy.cumcount so the first 3 rows of those ids are dropped as well:
new_df = df[df['id'].map(df['id'].value_counts().gt(3)) & df.groupby('id').cumcount().gt(2)]
id value
3 4 2
4 4 3
8 2 -2
9 2 2
Using cumcount is the way to go, but drop works as well:
out = df.groupby('id', sort=False).apply(lambda x: x.drop(x.index[:3])).reset_index(drop=True)
Out[12]:
id value
0 4 2
1 4 3
2 2 -2
3 2 2
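A sketch of the same idea with iloc instead of drop; group_keys=False keeps the group label out of the index:
out = (df.groupby('id', sort=False, group_keys=False)
         .apply(lambda g: g.iloc[3:])        # slice off the first 3 rows of each group
         .reset_index(drop=True))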

Append Data to Pandas Dataframe

I have the following pandas dataframe:
df = pd.DataFrame({'A': [1, 2, 3, 7], 'B': [4, 5, 6, 29]})
I'm working on a for loop that grabs an index and then appends data to the end of that row.
How do I append columns C, D, E for a given index of the table? Let's say on iteration one, the index is 2:
A B C D E
0 1 4 0 0 0
1 2 5 0 0 0
2 3 6 34 12 23
3 7 29 0 0 0
On the next iteration of the for loop, the index might be 1. Then the dataframe would be:
A B C D E
0 1 4 0 0 0
1 2 5 8 11 4
2 3 6 34 12 23
3 7 29 0 0 0
How do I do this?
You can target specific rows by using loc and providing the index.
For example:
df.loc[5, 'D'] = 10
This will set the value 10 in column D of the row with index 5.
Your question states that you want to add new columns depending on the row index. This doesn't make sense, because a dataframe is not like a NoSQL document where you can just add columns independent of other rows.
What you should do is have all your columns already added to your dataframe, then add values as you go.
To add multiple values:
df.loc[5, ['D', 'B']] = 10
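A minimal sketch of that pattern with the frame and columns from the question (the values are the ones from the expected output):
df = pd.DataFrame({'A': [1, 2, 3, 7], 'B': [4, 5, 6, 29]})
for col in ['C', 'D', 'E']:                  # create the columns up front, filled with 0
    df[col] = 0
df.loc[2, ['C', 'D', 'E']] = [34, 12, 23]    # first iteration: index 2
df.loc[1, ['C', 'D', 'E']] = [8, 11, 4]      # next iteration: index 1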

How to manually arrange rows in pandas dataframe

I have a small dataframe produced from value_counts() that I want to plot with a categorical x axis. It's a bit bigger than this, but:
Age Income
25-30 10
65-70 5
35-40 2
I want to be able to manually reorder the rows. How do I do this?
You can reorder rows with .reindex:
>>> df
a b
0 1 4
1 2 5
2 3 6
>>> df.reindex([1, 2, 0])
a b
1 2 5
2 3 6
0 1 4
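Applied to the Age frame from the question (a sketch, assuming the default 0/1/2 index), you just pass the row labels in the order you want them:
df.reindex([0, 2, 1])   # 25-30, 35-40, 65-70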
You can also create a sorting criterion (a mapping from category to position) and use that to index with .loc:
df = pd.DataFrame({'Age':['25-30','65-70','35-40'],'Income':[10,5,2]})
sort_criteria = {'25-30': 0, '35-40': 1, '65-70': 2}
df = df.loc[df['Age'].map(sort_criteria).sort_values(ascending = True).index]
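Since the goal is a categorical x axis, another option is to make Age an ordered categorical; sort_values then respects that order. A sketch:
order = ['25-30', '35-40', '65-70']
df['Age'] = pd.Categorical(df['Age'], categories=order, ordered=True)
df = df.sort_values('Age')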

After creating a NaN category, aggregation on the groupby object goes wrong

I first create some data:
import math
import numpy as np
import pandas as pd

df = pd.DataFrame(data={"A": np.random.randint(1, 11, 10), "B": np.arange(1, 11, 1)})
df.loc[[3, 4], 'A'] = np.nan
Then I get a DataFrame with NaNs:
A B
0 7 1
1 1 2
2 3 3
3 NaN 4
4 NaN 5
5 9 6
6 2 7
7 10 8
8 6 9
9 6 10
I group column A using the pd.cut function and apply aggregation functions on each group:
bin_S = pd.cut(df.A, [-math.inf, 3,5,8,9, math.inf],right= False)
df.groupby(bin_S).agg("count")
But the NaN values are not grouped (there is no NaN category):
A B
A
[-inf, 3) 2 2
[3, 5) 1 1
[5, 8) 3 3
[8, 9) 0 0
[9, inf) 2 2
Then I tried to add a new category called "Missing" by:
bin_S = bin_S.cat.add_categories("Missing")
bin_S = bin_S.fillna("Missing")
The binning series looks fine. However, the groupby aggregation is not what I expected.
df.groupby(bin_S).agg("count")
Result is,
A B
A
[-inf, 3) 2 2
[3, 5) 1 1
[5, 8) 3 3
[8, 9) 0 0
[9, inf) 2 2
Missing 0 2
I was expecting columns A and B to be exactly the same. Why are they different in the "Missing" row? My real problem involves more complicated operations on each group, and this bothers me because it suggests grouping NaN values might be unreliable.
'count' is going to skip NaN, which is why column A shows 0 in the "Missing" row (those rows are NaN in A) while column B still shows 2. You can use 'size', which counts rows regardless of NaN:
df.groupby(bin_S).agg(["size"])
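To see the difference side by side, you can ask for both in one agg call (count skips NaN per column, size counts rows per group):
df.groupby(bin_S).agg(['count', 'size'])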
