I have a Pandas dataframe with below columns:
id start end
1 101 101
2 102 104
3 108 109
I want to fill the gaps between start and end with additional rows, so the output may look like this:
id number
1 101
2 102
2 103
2 104
3 108
3 109
Is there anyway to do it in Pandas? Thanks.
Use nested list comprehension with range and flattening for list of tuples, last use DataFrame constructor:
zipped = zip(df['id'], df['start'], df['end'])
df = pd.DataFrame([(i, y) for i, s, e in zipped for y in range(s, e+1)],
columns=['id','number'])
print (df)
id number
0 1 101
1 2 102
2 2 103
3 2 104
4 3 108
5 3 109
Here is a pure pandas solution but performance-wise, #jaezrael's solution would be better,
df.set_index('id').apply(lambda x: pd.Series(np.arange(x.start, x.end + 1)), axis = 1)\
.stack().astype(int).reset_index()\
.drop('level_1', 1)\
.rename(columns = {0:'Number'})
id Number
0 1 101
1 2 102
2 2 103
3 2 104
4 3 108
5 3 109
Related
I have a column with numbers and one of these characters between them -,/,*,~,_. I need to check if values contain any of the characters, then split the value in another column. Is there a different solution than shown below? In the end, columns subnumber1, subnumber2 ...subnumber5 will be merged in one column and column "number5" will be without characters. Those two columns I need to use in further process. I'm a newbie in Python so any advice is welcome.
if gdf['column_name'].str.contains('~').any():
gdf[['number1', 'subnumber1']] = gdf['column_name'].str.split('~', expand=True)
gdf
if gdf['column_name'].str.contains('^').any():
gdf[['number2', 'subnumber2']] = gdf['column_name'].str.split('^', expand=True)
gdf
Input column:
column_name
152/6*3
163/1-6
145/1
163/6^3
output:
number5 |subnumber1 |subnumber2
152 | 6 | 3
163 | 1 | 6
145 | 1 |
163 | 6 | 3
Use Series.str.split with list of possible separators and create new DataFrame:
import re
L = ['-','/','*','~','_','^', '.']
#some values like `^.` are escape
pat = '|'.join(re.escape(x) for x in L)
df = df['column_name'].str.split(pat, expand=True).add_prefix('num')
print (df)
num0 num1 num2
0 152 6 3
1 163 1 6
2 145 1 None
3 163 6 3
EDIT: If need match values before value use:
L = ["\-_",'\^|\*','~','/']
for val in L:
df[f'before {val}'] = df['column_name'].str.extract(rf'(\d+){[val]}')
#for last value not exist separator, so match $ for end of string
df['last'] = df['column_name'].str.extract(rf'(\d+)$')
print (df)
column_name before \-_ before \^|\* before ~ before / last
0 152/2~3_4*5 3 4 2 152 5
1 152/2~3-4^5 4 4 2 152 5
2 152/6*3 NaN 6 NaN 152 3
3 163/1-6 NaN NaN NaN 163 6
4 145/1 NaN NaN NaN 145 1
5 163/6^3 6 6 NaN 163 3
Use str.split:
df['column_name'].str.split(r'[*,-/^_]', expand=True)
output:
0 1 2
0 152 6 3
1 163 1 6
2 145 1 None
3 163 6 3
Or, if you know in advance that you have 3 numbers, use str.extract and named capturing groups:
regex = '(?P<number5>\d+)\D*(?P<subnumber1>\d*)\D*(?P<subnumber2>\d*)'
df['column_name'].str.extract(regex)
output:
number5 subnumber1 subnumber2
0 152 6 3
1 163 1 6
2 145 1
3 163 6 3
Suppose to have two dataframes, df1 and df2, with equal number of columns, but different number of rows, e.g:
df1 = pd.DataFrame([(1,2),(3,4),(5,6),(7,8),(9,10),(11,12)], columns=['a','b'])
a b
1 1 2
2 3 4
3 5 6
4 7 8
5 9 10
6 11 12
df2 = pd.DataFrame([(100,200),(300,400),(500,600)], columns=['a','b'])
a b
1 100 200
2 300 400
3 500 600
I would like to add df2 to the df1 tail (df1.loc[df2.shape[0]:]), thus obtaining:
a b
1 1 2
2 3 4
3 5 6
4 107 208
5 309 410
6 511 612
Any idea?
Thanks!
If there is more rows in df1 like in df2 rows is possible use DataFrame.iloc with convert values to numpy array for avoid alignment (different indices create NaNs):
df1.iloc[-df2.shape[0]:] += df2.to_numpy()
print (df1)
a b
0 1 2
1 3 4
2 5 6
3 107 208
4 309 410
5 511 612
For general solution working with any number of rows with unique indices in both Dataframe with rename and DataFrame.add:
df = df1.add(df2.rename(dict(zip(df2.index[::-1], df1.index[::-1]))), fill_value=0)
print (df)
a b
0 1.0 2.0
1 3.0 4.0
2 5.0 6.0
3 107.0 208.0
4 309.0 410.0
5 511.0 612.0
I have posted two sample dataframes. I would like to map one column of a dataframe with respect to the index of a column in another dataframe and place the values back to the first dataframe shown as below
A = np.array([0,1,1,3,5,2,5,4,2,0])
B = np.array([55,75,86,98,100,111])
df1 = pd.Series(A, name='data').to_frame()
df2 = pd.Series(B, name='values_for_replacement').to_frame()
The below is the first dataframe df1
data
0 0
1 1
2 1
3 3
4 5
5 2
6 5
7 4
8 2
9 0
And the below is the second dataframe df2
values_for_replacement
0 55
1 75
2 86
3 98
4 100
5 111
The below is the output needed (Mapped with respect to the index of the df2)
data new_data
0 0 55
1 1 75
2 1 75
3 3 98
4 5 111
5 2 86
6 5 111
7 4 100
8 2 86
9 0 55
I would kindly like to know how one can achieve this using some pandas functions like map.
Looking forward for some answers. Many thanks in advance
I want to count index elements by userID.My dataframe
gender adID rating
userID
1 m 100 50
1 m 101 100
1 m 102 0
2 f 100 50
2 f 101 100
3 m 102 62
3 m 107 28
3 m 101 36
2 f 102 74
2 f 107 100
4 m 101 62
4 m 102 28
5 f 109 50
5 f 110 100
6 m 103 50
6 m 104 100
6 m 105 0
I tried
df.count('userID')
But got output
File "/home/mm/anaconda3/lib/python3.6/site-packages/pandas/core/frame.py", line 5630, in count
axis = self._get_axis_number(axis)
File "/home/mm/anaconda3/lib/python3.6/site-packages/pandas/core/generic.py", line 357, in _get_axis_number
.format(axis, type(self)))
ValueError: No axis named adID for object type <class 'pandas.core.frame.DataFrame'>
How to fix this?Do the index operations follow the same principles as the column ones?
You have to use df.index, but some pandas function are not implemented, so is possible use to_series or Series contructor:
a = df.index.value_counts()
print (a)
2 4
6 3
3 3
1 3
5 2
4 2
Name: userID, dtype: int64
b = len(df.index)
print (b)
17
c = df.index.to_series().mode()
print (c)
0 2
dtype: int64
I have a table like this:
In [2]: df = pd.DataFrame({
...: 'donorID':[101,101,101,102,103,101,101,102,103],
...: 'recipientID':[11,11,21,21,31,11,21,31,31],
...: 'amount':[100,200,500,200,200,300,200,200,100],
...: 'year':[2014,2014,2014,2014,2014,2015,2015,2015,2015]
...: })
In [3]: df
Out[3]:
amount donorID recipientID year
0 100 101 11 2014
1 200 101 11 2014
2 500 101 21 2014
3 200 102 21 2014
4 200 103 31 2014
5 300 101 11 2015
6 200 101 21 2015
7 200 102 31 2015
8 100 103 31 2015
I'd like to count the number of donor-recipient pairs by donor (donations made by the same donor to the same recipient in n years, where n could be any number and it doesn't have to be consecutive, but I use 2 here to keep things simple). In this case, donor 101 donated to recipient 11 and 21 in 2014 as well as in 2015, the count for 101 is 2. The number for 102 is 0, and for 103 is 1. The result table would look like this:
donorID num_donation_2_years
0 101 2
1 102 0
2 103 1
I've tried to use groupby and pivot_table but didn't manage to get the right answer. Any suggestion in pandas would be appreciated? Thanks!
Something like
df1=df.groupby('donorID').apply(lambda x : x.groupby(x.recipientID).year.nunique().gt(1).sum())
df1
Out[102]:
donorID
101 2
102 0
103 1
dtype: int64
To get the dataframe
df1.to_frame('num_donation_2_years').reset_index()
Out[104]:
donorID num_donation_2_years
0 101 2
1 102 0
2 103 1
As Dark mention do not using apply
This is the update
df1=df.groupby(['donorID','recipientID']).year.nunique().gt(1).sum(level=0)
df1
Out[109]:
donorID
101 2.0
102 0.0
103 1.0
Name: year, dtype: float64
df1.to_frame('num_donation_2_years').reset_index()
Out[104]:
donorID num_donation_2_years
0 101 2
1 102 0
2 103 1
An improvement to #Wen's solution, avoiding apply for more speed i.e
one = df.groupby(['donorID','recipientID'])['year'].nunique().gt(1)
two = one.groupby(level=0).sum().to_frame('no_of_donations_2_years').reset_index()
donorID no_of_donations_2_years
0 101 2.0
1 102 0.0
2 103 1.0
df_new = df.groupby(["donorID", "recipientID"])["year"].nunique().reset_index(name="year_count")
df_for_query = df_new.groupby(["donorID", "year_count"]).size().reset_index(name='numb_recipient')
donorID year_count numb_recipient
0 101 2 2
1 102 1 2
2 103 2 1
The third column is how many patients that fit the year condition. The line 0 says donor 101 has 2 patients that he/she donates in exactly two years. This is not exactly your output, but you can query it easily from this df.
If you want to find that the number of patients a donor donates for some number of year, say 2, run
df_for_query.query("year_count == 2")
donorID year_count numb_recipient
0 101 2 2
2 103 2 1
Thanks for Wen's inspiration for using nunique!
Following code works (explanation as comments) (ol for outlist):
# count frequency of donor-recipient combination
ol = pd.value_counts(df.apply(lambda x: str(x.donorID)+str(x.recipientID), axis=1))
ol = ol[ol>=2] # choose only those >= 2
ol.index = list(map(lambda x: x[:3], ol.index)) # get donorID name again
print(pd.value_counts(ol.index)) # print desired frequency
Output:
101 2
103 1
dtype: int64