Create chronology column in pandas DataFrame [duplicate] - python

I have a dataframe characterized by two essential columns: name and timestamp.
import pandas as pd

df = pd.DataFrame({'name': ['tom', 'tom', 'tom', 'bert', 'bert', 'sam'],
                   'timestamp': [15, 13, 14, 23, 22, 14]})
I would like to create a third column, chronology, that ranks the timestamps within each name, such that the final product looks like this:
df_final = pd.DataFrame({'name': ['tom', 'tom', 'tom', 'bert', 'bert', 'sam'],
                         'timestamp': [15, 13, 14, 23, 22, 14],
                         'chronology': [3, 2, 1, 2, 1, 1]})
I understand that I can do df = df.sort_values(['name', 'timestamp']), but how do I create the chronology column?

You can do this with groupby().cumcount(), provided the timestamps within each name are unique:
df['chronology'] = df.sort_values('timestamp').groupby('name').cumcount().add(1)
or groupby().rank():
df['chronology'] = df.groupby('name')['timestamp'].rank().astype(int)
Output:
name timestamp chronology
0 tom 15 3
1 tom 13 1
2 tom 14 2
3 bert 23 2
4 bert 22 1
5 sam 14 1
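If duplicate timestamps are possible, the cumcount() approach still produces distinct ranks, but ties are broken in whatever order the sort happens to leave them. A stable sort makes that order deterministic; here is a minimal sketch, assuming the same df as above:
# kind='stable' keeps rows with equal timestamps in their original order,
# so cumcount() breaks ties by row position
df['chronology'] = (df.sort_values('timestamp', kind='stable')
                      .groupby('name')
                      .cumcount()
                      .add(1))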

The function GroupBy.rank() does exactly what you need. From the documentation:
GroupBy.rank(method='average', ascending=True, na_option='keep', pct=False, axis=0)
Provide the rank of values within each group.
Try this code:
df['chronology'] = df.groupby(by=['name']).timestamp.rank().astype(int)
Result:
name timestamp chronology
tom 15 3
tom 13 1
tom 14 2
bert 23 2
bert 22 1
sam 14 1
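One caveat not covered by the answers above: if timestamps can repeat within a name, the default rank(method='average') yields float ranks (e.g. 1.5) and astype(int) silently truncates them. As a sketch, method='first' breaks ties by order of appearance and keeps the ranks integral:
# distinct integer ranks even when timestamps tie within a name
df['chronology'] = df.groupby('name')['timestamp'].rank(method='first').astype(int)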


Is there a case sensitive method to filter columns in a dataframe by header? [duplicate]

I have a dataframe with multiple columns and different headers.
I want to filter the dataframe to keep only the columns that start with the letter I. Some of my column headers have the letter i but start with a different letter.
Is there a way to do this?
I tried using df.filter, but for some reason it's not case sensitive.
You can use df.filter with the regex parameter:
df.filter(regex=r'(?i)^i')
This returns the columns starting with i, ignoring case.
Example below. Let's consider this input dataframe:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 20, (5, 4)),
                  columns=['itest', 'Itest', 'another', 'anothericol'])
print(df)
itest Itest another anothericol
0 1 4 14 17
1 17 10 14 1
2 16 18 10 7
3 10 12 17 14
4 6 15 17 19
With df.filter:
print(df.filter(regex=r'(?i)^i'))
itest Itest
0 1 4
1 17 10
2 16 18
3 10 12
4 6 15
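Since the question actually asks for a case-sensitive filter, note that the (?i) flag is what disables case sensitivity; dropping it gives a strict match. A sketch against the same df:
# case sensitive: keeps 'Itest' but drops 'itest'
print(df.filter(regex=r'^I'))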

How to groupby and aggregate joining values as a string [duplicate]

I have a dataframe structured like this:
import pandas as pd

df_have = pd.DataFrame({'id': [1, 1, 2, 3, 4, 4],
                        'desc': ['yes', 'no', 'chair', 'bird', 'person', 'car']})
How can I get something like this:
df_want = pd.DataFrame({'id': [1, 2, 3, 4],
                        'desc': ['yes no', 'chair', 'bird', 'person car']})
Use groupby().apply:
df_have.groupby('id', as_index=False)['desc'].apply(' '.join)
Output:
id desc
0 1 yes no
1 2 chair
2 3 bird
3 4 person car
You can also use agg with groupby:
df = df_have.groupby('id', as_index=False)[['desc']].agg(' '.join)
id desc
0 1 yes no
1 2 chair
2 3 bird
3 4 person car
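The join string is arbitrary, so the same pattern handles any separator. A comma-separated variant, as a sketch using the same df_have:
# any separator works in place of ' '
df_want = df_have.groupby('id', as_index=False)['desc'].agg(', '.join)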

How to average a DataFrame row with another row only if the first row's USER is a substring of the next row's

I have a dataframe called 'data':
USER VALUE
XOXO 21
ABC-1 2
ABC-1B 4
ABC-2 4
ABC-2B 6
PEPE 12
I want to combine 'ABC-1' with 'ABC-1B' into a single row, keeping the first USER name and averaging the two values, to arrive here:
USER VALUE
XOXO 21
ABC-1 3
ABC-2 5
PEPE 12
The dataframe may not be in order, and there are other unrelated values in there that don't need averaging. I only want to average the two rows where the USER 'XXX-X' is contained in 'XXX-XB'.
import pandas as pd

data = pd.DataFrame({'USER': ['XOXO', 'ABC-1', 'ABC-1B', 'ABC-2', 'ABC-2B', 'PEPE'],
                     'VALUE': [21, 2, 4, 4, 6, 12]})
Let's try:
# strip the trailing 'B' that follows '-<digit>', so 'ABC-1B' becomes 'ABC-1',
# then average VALUE per normalized USER
data['USER'] = data['USER'].str.replace(r'(-\d)B', r'\1', regex=True)
data = data.groupby('USER', as_index=False, sort=False)['VALUE'].mean()
print(data)
USER VALUE
0 XOXO 21
1 ABC-1 3
2 ABC-2 5
3 PEPE 12
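A non-mutating variant of the same idea, as a sketch, if you prefer to leave data untouched; the anchored pattern r'(-\d)B$' is an addition here (not in the original answer) so that only a trailing B after '-<digit>' is stripped:
result = (data.assign(USER=data['USER'].str.replace(r'(-\d)B$', r'\1', regex=True))
              .groupby('USER', as_index=False, sort=False)['VALUE']
              .mean())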

Import and combine 2 CSVs and align on a common id column

I have 2 data sets with different numbers of rows and columns, but they share a common id column.
Question: I want to combine both data frames into a new dataframe that has the same rows as df1 plus an extra Age column, with the Age values filled in according to the id.
Example:
import pandas as pd

data = [[1, 'Alex', 10], [2, 'Bob', 12], [3, 'Clarke', 13],
        [1, 'Alex', 13], [4, 'Jim', 13], [3, 'Clarke', 13]]
df1 = pd.DataFrame(data, columns=['id', 'Name', 'Score'], dtype=int)
data2 = [[1, 20], [2, 22], [3, 19], [4, 21]]
df2 = pd.DataFrame(data2, columns=['id', 'Age'], dtype=int)
No clue where to start. New to Python, please help!
Expected Output:
id Name Score Age
0 1 Alex 10 20
1 2 Bob 12 22
2 3 Clarke 13 19
3 1 Alex 13 20
4 4 Jim 13 21
5 3 Clarke 13 19
Try this one:
>>> pd.merge(df1, df2, on="id")
id Name Score Age
0 1 Alex 10 20
1 1 Alex 13 20
2 2 Bob 12 22
3 3 Clarke 13 19
4 3 Clarke 13 19
5 4 Jim 13 21
Try "merge".
You should be able to join both csv's by writing:
combined_data = df1.merge(df2, on="id")
The merge function combines the tables, and the "on" parameter determined on what condition to merge them on.
You can use the merge function to merge two dataframes if they have at least one column in common. In your case it's the id, so we merge it 'on' id like so:
merged_df = df1.merge(df2, on="id")
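One design note, as an aside: the default how='inner' drops df1 rows whose id has no match in df2. If you want to keep every df1 row regardless, a left join does that (a sketch):
# unmatched ids keep their row and get NaN in Age
combined = df1.merge(df2, on='id', how='left')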

Python: drop value=0 row in specific columns [duplicate]

I want to drop rows with zero value in specific columns
>>> df
salary age gender
0 10000 23 1
1 15000 34 0
2 23000 21 1
3 0 20 0
4 28500 0 1
5 35000 37 1
Some data in the salary and age columns is missing, recorded as 0.
The third column, gender, is a binary variable where 1 means male and 0 means female. Here 0 is not missing data.
I want to drop the rows where either salary or age is missing,
so I can get
>>> df
salary age gender
0 10000 23 1
1 15000 34 0
2 23000 21 1
3 35000 37 1
Option 1
You can filter your dataframe using pd.DataFrame.loc:
df = df.loc[~((df['salary'] == 0) | (df['age'] == 0))]
Option 2
Or a smarter way to implement your logic:
df = df.loc[df['salary'] * df['age'] != 0]
This works because if either salary or age are 0, their product will also be 0.
Option 3
The following method extends easily to several columns:
df.loc[(df[['salary', 'age']] != 0).all(axis=1)]
Explanation
In all three cases, Boolean arrays are generated and used to index your dataframe.
All these methods can be further optimised by using numpy representation, e.g. df['salary'].values.
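An equivalent route, not covered above, is to treat the zeros as proper missing values first and then drop them. A sketch, assuming numpy is imported as np:
import numpy as np

# mark 0 as missing only in the columns where 0 means 'missing'
# (note this converts the integer columns to float),
# then drop rows that have NaN in either of those columns
df[['salary', 'age']] = df[['salary', 'age']].replace(0, np.nan)
df = df.dropna(subset=['salary', 'age'])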
