Splitting a dataframe on a specific string value in pandas? [duplicate] - python

This question already has answers here:
Splitting a DataFrame in chunks by value appearance
(2 answers)
Split pandas dataframe based on groupby
(4 answers)
Closed 17 days ago.
I have a dataframe which I have to split as soon as a specific string value occurs in a column.
Ex. df =
txn_details amt
0 opening_balance 13000
1 opening_balance 15000
2 upi2873 12879
3 upi182y31 12301
4 opening_balance 85050
5 upi79279831 8400
The desired output is 3 dataframes (the number may vary depending on the no. of occurrences of 'opening_balance'):
df_1 =
txn_details amt
0 opening_balance 13000
df_2 =
txn_details amt
0 opening_balance 15000
1 upi2873 12879
2 upi182y31 12301
df_3 =
txn_details amt
0 opening_balance 85050
1 upi79279831 8400
I've tried using the cumsum() function in pandas but am not getting the desired output.

Compare txn_details with 'opening_balance', take the cumulative sum with Series.cumsum, and build a dict of DataFrames in a dictionary comprehension:
d = {f'df_{i}': g.reset_index(drop=True)
     for i, g in df.groupby(df['txn_details'].eq('opening_balance').cumsum())}
print (d['df_1'])
txn_details amt
0 opening_balance 13000
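
To see all three chunks, the same dict can be built from a reproducible frame and iterated over; a minimal sketch using the question's data:
import pandas as pd

df = pd.DataFrame({
    'txn_details': ['opening_balance', 'opening_balance', 'upi2873',
                    'upi182y31', 'opening_balance', 'upi79279831'],
    'amt': [13000, 15000, 12879, 12301, 85050, 8400],
})

# Each 'opening_balance' row bumps the cumulative sum, so the group
# labels here are 1, 2, 2, 2, 3, 3 - one label per desired chunk.
group_id = df['txn_details'].eq('opening_balance').cumsum()
d = {f'df_{i}': g.reset_index(drop=True) for i, g in df.groupby(group_id)}

for name, chunk in d.items():
    print(name)
    print(chunk)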

Related

How to count no. of rows between time intervals (hourly) in pandas?

My data has various columns, including a date and a time column. The data stretches across three months. I need to count the no. of rows in a particular hour irrespective of the date, so that would mean getting the count of rows in the 00:00 to 01:00 window and similarly for the remaining 23 hours. How do I do that? Overall I will have 24 rows with their counts.
Here is my data:
>>>df[["date","time"]]
date time
0 2006-11-10 00:01:21
1 2006-11-10 00:02:26
2 2006-11-10 00:02:38
3 2006-11-10 00:05:38
4 2006-11-10 00:05:38
Output should be like:
00:00-00:59 SomeCount
Both are object types
I think the simplest is to convert both columns to datetimes and count hours by Series.dt.hour with Series.value_counts:
out = pd.to_datetime(df["date"] + ' ' + df["time"]).dt.hour.value_counts().sort_index()
Or, if you need your format, use Series.dt.strftime with GroupBy.size:
s = pd.to_datetime(df["date"] + ' ' + df["time"]).dt.strftime('%H:00-%H:59')
print (s)
0 00:00-00:59
1 00:00-00:59
2 00:00-00:59
3 00:00-00:59
4 00:00-00:59
dtype: object
out = s.groupby(s, sort=False).size()
print (out)
00:00-00:59 5
dtype: int64
Last, for a DataFrame use:
df = out.rename_axis('times').reset_index(name='count')
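Putting the pieces together on the sample data, a minimal end-to-end sketch:
import pandas as pd

df = pd.DataFrame({
    'date': ['2006-11-10'] * 5,
    'time': ['00:01:21', '00:02:26', '00:02:38', '00:05:38', '00:05:38'],
})

s = pd.to_datetime(df['date'] + ' ' + df['time']).dt.strftime('%H:00-%H:59')
out = s.groupby(s, sort=False).size()
print(out.rename_axis('times').reset_index(name='count'))
         times  count
0  00:00-00:59      5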
You can split the time string with the delimiter :. Then create another column, hour, for the hour. Then use groupby() to group the rows on the basis of the new hour column. You can now store the data in a new series or dataframe to get the desired output.
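A minimal sketch of that approach, assuming the time column holds 'HH:MM:SS' strings:
import pandas as pd

df = pd.DataFrame({
    'time': ['00:01:21', '00:02:26', '00:02:38', '00:05:38', '00:05:38'],
})

# The hour is the piece before the first ':'.
df['hour'] = df['time'].str.split(':').str[0]
print(df.groupby('hour').size())
hour
00    5
dtype: int64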
groupby() the hour,
then build a DataFrame that has the values you want,
and clean up the indexes and column names:
import io
import pandas as pd

df = pd.read_csv(io.StringIO("""date  time
0  2006-11-10  00:01:21
1  2006-11-10  00:02:26
2  2006-11-10  00:02:38
3  2006-11-10  00:05:38
4  2006-11-10  02:05:38"""), sep=r"\s\s+", engine="python")

dfc = (df.groupby(pd.to_datetime(df.time).dt.hour)
         # one-row frame per hour: the count, indexed by the min-max time range
         .apply(lambda d: pd.DataFrame(
             {"count": [len(d)]},
             index=[pd.to_datetime(d["time"]).min().strftime("%H:%M")
                    + "-" + pd.to_datetime(d["time"]).max().strftime("%H:%M")]))
         .reset_index()
         .drop(columns=["time"])
         .rename(columns={"level_1": "time"})
      )
          time  count
0  00:01-00:05      4
1  02:05-02:05      1
My solution generates row counts for all 24 hours, with 0 for hours "absent"
in the source DataFrame.
To show a more instructive example, I defined the source DataFrame containing
rows from several hours:
date time
0 2006-11-10 01:21:00
1 2006-11-10 02:26:00
2 2006-11-10 02:38:00
3 2006-11-10 05:38:00
4 2006-11-10 05:38:00
5 2006-11-11 05:43:00
6 2006-11-11 05:51:00
Note that the last 2 rows are from a different date, but since you want
grouping by hour only, they are counted in the same group as the previous
2 rows (hour 5).
The first step is to create a Series containing almost what you want:
wrk = df.groupby(pd.to_datetime(df.time).dt.hour).apply(
    lambda grp: grp.index.size).reindex(range(24), fill_value=0)
The initial part of wrk is:
time
0 0
1 1
2 2
3 0
4 0
5 4
6 0
7 0
The left column (the index) contains the hour as an integer and the
right column is the count of rows in that hour.
The only thing to do is to reformat the index to your desired format:
wrk.index = wrk.index.map(lambda h: f'{h:02}:00-{h:02}:59')
The result (initial part only) is:
time
00:00-00:59 0
01:00-01:59 1
02:00-02:59 2
03:00-03:59 0
04:00-04:59 0
05:00-05:59 4
06:00-06:59 0
07:00-07:59 0
But if you want to get counts only for hours present in your source
data, then drop .reindex(…) from the above code.
Then your (full) result, for the above DataFrame will be:
time
01:00-01:59 1
02:00-02:59 2
05:00-05:59 4
dtype: int64
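If you need the result as a two-column DataFrame rather than a Series, the same rename_axis/reset_index step from the first answer applies here as well:
out = wrk.rename_axis('time').reset_index(name='count')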

Pandas groupby cumulative sum ignore current row

I know there are some questions about this topic (like Pandas: Cumulative sum of one column based on value of another); however, none of them fulfills my requirements.
Let's say I have a dataframe like the one below (shown as an image in the question).
I want to compute the cumulative sum of Cost, grouping by Month, without taking the current value into account, in order to get the Desired column. By using groupby and cumsum I obtain the CumSum column instead.
The code to generate the dataframe is:
df = pd.DataFrame({'Month': [1,1,1,2,2,1,3],
'Cost': [5,8,10,1,3,4,1]})
IIUC, you can use groupby.cumsum and then just subtract Cost:
df['cumsum_'] = df.groupby('Month').Cost.cumsum().sub(df.Cost)
print(df)
Month Cost cumsum_
0 1 5 0
1 1 8 5
2 1 10 13
3 2 1 0
4 2 3 1
5 1 4 23
6 3 1 0
You can also do it by shifting the costs within each group and then taking the grouped cumulative sum:
df['agg'] = df.groupby('Month')['Cost'].shift().fillna(0)
df['Cumsum'] = df.groupby('Month')['agg'].cumsum()
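On the sample frame this matches the first answer's cumsum_ values (a quick check; the column is float because shift introduces NaN before fillna):
print(df[['Month', 'Cost', 'Cumsum']])
   Month  Cost  Cumsum
0      1     5     0.0
1      1     8     5.0
2      1    10    13.0
3      2     1     0.0
4      2     3     1.0
5      1     4    23.0
6      3     1     0.0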

Sort data frame in ascending order by mean of other column [duplicate]

This question already has answers here:
How to sort a dataFrame in python pandas by two or more columns?
(3 answers)
Closed 3 years ago.
I have a data frame:
df =
ID Num
a 3
b 4
b 2
a 1
I want to sort in ascending order while taking into account the unique values of the ID column.
My Try:
df.sort_values(by=['Num'])
But it gave me ascending order while neglecting the ID column.
Desired output:
df =
ID Num
a 1
a 3
b 2
b 4
Just do:
df.sort_values(['ID', 'Num'])
Output
ID Num
3 a 1
0 a 3
2 b 2
1 b 4
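If you also want the clean 0..n-1 index from the desired output, sort_values accepts ignore_index (available since pandas 1.0); a short sketch:
df.sort_values(['ID', 'Num'], ignore_index=True)
  ID  Num
0  a    1
1  a    3
2  b    2
3  b    4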

Combine pandas DataFrames to give unique element counts

I have a few pandas DataFrames and I am trying to find a good way to calculate and plot the number of times each unique entry occurs across DataFrames. As an example if I had the 2 following DataFrames:
year month
0 1900 1
1 1950 2
2 2000 3
year month
0 1900 1
1 1975 2
2 2000 3
I was thinking maybe there is a way to combine them into a single DataFrame while using a new column counts to keep track of the number of times a unique combination of year + month occurred in any of the DataFrames. From there I figured I could just scatter plot the year + month combinations with their corresponding counts.
year month counts
0 1900 1 2
1 1950 2 1
2 2000 3 2
3 1975 2 1
Is there a good way to achieve this?
concat, then use groupby with agg:
(pd.concat([df1, df2])
   .groupby('year').month
   .agg(['count', 'first'])
   .reset_index()
   .rename(columns={'first': 'month'}))
Out[467]:
year count month
0 1900 2 1
1 1950 1 2
2 1975 1 2
3 2000 2 3
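The answer above groups on year alone, which works here because each year appears with a single month; if the same year can occur with different months, grouping on both columns is safer. A sketch:
import pandas as pd

df1 = pd.DataFrame({'year': [1900, 1950, 2000], 'month': [1, 2, 3]})
df2 = pd.DataFrame({'year': [1900, 1975, 2000], 'month': [1, 2, 3]})

counts = (pd.concat([df1, df2])
            .groupby(['year', 'month'])
            .size()
            .reset_index(name='counts'))
print(counts)
   year  month  counts
0  1900      1       2
1  1950      2       1
2  1975      2       1
3  2000      3       2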

Python: drop value=0 row in specific columns [duplicate]

This question already has answers here:
How to delete rows from a pandas DataFrame based on a conditional expression [duplicate]
(6 answers)
How do you filter pandas dataframes by multiple columns?
(10 answers)
Closed 4 years ago.
I want to drop rows with zero value in specific columns
>>> df
salary age gender
0 10000 23 1
1 15000 34 0
2 23000 21 1
3 0 20 0
4 28500 0 1
5 35000 37 1
Some data in the salary and age columns are missing (recorded as 0),
while the third column, gender, is a binary variable where 1 means male and 0 means female; 0 there is not missing data.
I want to drop the rows where either salary or age is missing
so I can get
>>> df
salary age gender
0 10000 23 1
1 15000 34 0
2 23000 21 1
3 35000 37 1
Option 1
You can filter your dataframe using pd.DataFrame.loc:
df = df.loc[~((df['salary'] == 0) | (df['age'] == 0))]
Option 2
Or a smarter way to implement your logic:
df = df.loc[df['salary'] * df['age'] != 0]
This works because if either salary or age are 0, their product will also be 0.
Option 3
The following method can be easily extended to several columns:
df.loc[(df[['salary', 'age']] != 0).all(axis=1)]
Explanation
In all 3 cases, Boolean arrays are generated which are used to index your dataframe.
All these methods can be further optimised by using numpy representation, e.g. df['salary'].values.
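
A quick self-contained check that all three options agree on the sample frame:
import pandas as pd

df = pd.DataFrame({'salary': [10000, 15000, 23000, 0, 28500, 35000],
                   'age': [23, 34, 21, 20, 0, 37],
                   'gender': [1, 0, 1, 0, 1, 1]})

opt1 = df.loc[~((df['salary'] == 0) | (df['age'] == 0))]
opt2 = df.loc[df['salary'] * df['age'] != 0]
opt3 = df.loc[(df[['salary', 'age']] != 0).all(axis=1)]

assert opt1.equals(opt2) and opt2.equals(opt3)
print(opt1)
   salary  age  gender
0   10000   23       1
1   15000   34       0
2   23000   21       1
5   35000   37       1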
