How to count changes in a categorical column in pandas - python

I have the below dataframe:
OUTLET_UNQ_CODE Category_Code month
0 2018020000065 SSSI January 21
1 2018020000066 SSSI January 21
2 2018020000067 SSSI January 21
...
512762 2021031641195 CH March 21
512763 2021031642445 CH March 21
512764 2021031643357 GM March 21
512765 2021031643863 GM March 21
There are a few OUTLET_UNQ_CODEs that have changed their Category_Code within a month, and across months as well. I need to count the number of hops each outlet has made. For example: if 2021031643863 had Category_Code GM in Jan 21 and CH in Jan 21 again, then CH in Feb and Kirana in March, this is counted as 2 hops.
This is what I have tried:
s = pd.to_numeric(new_df.Category_Code, errors='coerce')
df = new_df.assign(New=s.bfill())[s.isnull()].groupby('OUTLET_UNQ_CODE').agg({'Category_Code': list})
df.reset_index(inplace=True)
The output is:
OUTLET_UNQ_CODE Category_Code
0 2021031643863 [GM,CH,CH,Kirana]

Regardless of whether there is a better way to achieve the goal starting from the beginning, here is a piece of code, based on your output, to get the number of changes in the list:
cat_lst = ['GM', 'CH', 'CH', 'Kirana']
a = sum(1 for i, x in enumerate(cat_lst[:-1]) if x != cat_lst[i + 1])
# in this case the result of a is 2
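If it helps, the per-outlet count can also be computed directly with groupby and shift, without building the intermediate lists first. This is a minimal sketch that assumes the rows are already in chronological order within each outlet; the sample values below are made up to mirror the question:

```python
import pandas as pd

# Illustrative stand-in for new_df (rows assumed chronological per outlet).
new_df = pd.DataFrame({
    'OUTLET_UNQ_CODE': ['2021031643863'] * 4 + ['2018020000065'] * 2,
    'Category_Code':   ['GM', 'CH', 'CH', 'Kirana', 'SSSI', 'SSSI'],
})

# Within each outlet, a "hop" is a row whose Category_Code differs from the
# previous row's. shift() is evaluated per group, so the first row of each
# outlet (which compares against NaN) is excluded via iloc[1:].
hops = (
    new_df.groupby('OUTLET_UNQ_CODE')['Category_Code']
          .apply(lambda s: (s != s.shift()).iloc[1:].sum())
          .rename('hops')
          .reset_index()
)
print(hops)
```

For the sample above this reports 2 hops for 2021031643863 (GM→CH, then CH→Kirana) and 0 for the outlet that never changed.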

Related

How to identify people with strictly increasing values?

I'm trying to identify the IDs in a dataframe that have increased in the past XYZ cycles.
My data look like below, I would like to identify IDs like A, because the values for A have been increasing in the past 4 month, from Jan to Apr. My data is updated monthly with 5 years total and it always look at the most recent cycles.
ID Month Value
A Jan 1
A Feb 2
A Mar 3
A Apr 4
B Jan 1
B Feb 1
B Mar 3
B Apr 2
I have tried to define a function over a list and apply it to the dataframe, but it doesn't work.
import numpy as np
def Consecutive(values):
    n = len(values) - 1
    return (sum(np.diff(sorted(values)) == 1) >= n)
Any suggestion is appreciated! Thank you!
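For what it's worth, here is one possible pandas sketch of the check described above, assuming the rows are already in month order within each ID (sample values copied from the question):

```python
import pandas as pd

df = pd.DataFrame({
    'ID':    ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'],
    'Month': ['Jan', 'Feb', 'Mar', 'Apr'] * 2,
    'Value': [1, 2, 3, 4, 1, 1, 3, 2],
})

# An ID qualifies if every month-over-month difference is positive,
# i.e. the values are strictly increasing over the window.
increasing = df.groupby('ID')['Value'].apply(lambda s: s.diff().dropna().gt(0).all())
print(increasing[increasing].index.tolist())  # ['A']
```

Using diff() rather than comparing to a range means the check only requires the values to increase, not to increase by exactly 1 each month; drop the `.gt(0)` for a different tolerance if needed.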

Using numpy, how do you calculate snowfall per month?

I have a data set with snowfall records per day for one year. Date variable is in YYYYMMDD form.
Date Snow
20010101 0
20010102 10
20010103 5
20010104 3
20010105 0
...
20011231 0
The actual data is here
https://github.com/emily737373/emily737373/blob/master/COX_SNOW-1.csv
I want to calculate the number of days it snowed each month. I know how to do this with pandas, but for a school project, I need to do it only using numpy. I can not import datetime either, it must be done only using numpy.
The output should be in this form
Month # days snowed
January 13
February 19
March 20
...
December 15
My question is how do I only count the number of days it snowed (basically when snow variable is not 0) without having to do it separately for each month?
I hope you can use some built-in packages, such as datetime, since it is useful when working with datetime objects.
import numpy as np
import datetime as dt

df = np.genfromtxt('test_files/COX_SNOW-1.csv', delimiter=',', skip_header=1, dtype=str)
date = np.array([dt.datetime.strptime(d, "%Y%m%d").month for d in df[:, 0]])
snow = df[:, 1].copy().astype(np.int32)
has_snowed = snow > 0
for month in range(1, 13):
    month_str = dt.datetime(year=1, month=month, day=1).strftime('%B')
    how_much_snow = len(snow[has_snowed & (date == month)])
    print(month_str, ':', how_much_snow)
I loaded the data as str to guarantee that the Date column can be parsed as dates later on. That is also why the Snow column needs to be explicitly converted to int32; otherwise the > comparison won't work.
The output is as follows:
January : 13
February : 19
March : 20
April : 13
May : 8
June : 9
July : 2
August : 7
September : 9
October : 19
November : 16
December : 15
Let me know if this worked for you or if you have any further questions.
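Since the assignment actually rules out importing datetime, a numpy-only variant is also possible: because the YYYYMMDD dates are plain integers, the month can be recovered with integer arithmetic. This is a minimal sketch with made-up values standing in for the CSV columns:

```python
import numpy as np

# Illustrative stand-ins for the CSV columns: dates as YYYYMMDD integers,
# snowfall amounts per day.
dates = np.array([20010101, 20010102, 20010103, 20010201, 20010202])
snow  = np.array([0, 10, 5, 3, 0])

# The month is the middle two digits of YYYYMMDD: divide out the day,
# then keep the remainder mod 100.
months = (dates // 100) % 100

# Count snowy days per month without datetime: filter to snow > 0, then
# tally how often each month number appears.
month_ids, counts = np.unique(months[snow > 0], return_counts=True)
for m, c in zip(month_ids, counts):
    print(m, ':', c)
```

Mapping the month number to a name can then be done with a hard-coded 12-element list instead of strftime.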

How to extract the day of the month using a regular expression in Pandas?

I have strings inside of a dataframe like this
140 "14 Feb 1995 Primary Care Doctor:
"
141 "30 May 2016 SOS-10 Total Score:
"
142 "22 January 1996 # 11 AMCommunication with referring physician?: Done
"
And I want to extract days and months separately. So I made a list
list=['Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec']
for i in range(500):
    for month in list:
        a = 'r(\d\d) ' + month + '[a-z]{,8}'
        b = df[0].str.findall(a)[i]
        df['day'][i] = b
When I look at df['day'] I get only [], and I would like to get [14], [30], [22].
Try using this regex:
...
a = r"(\d{1,2}) \w+ \d{4}"
b = df[0].str.findall(a)[i]
df['day'][i] = b
Try this pattern:
pattern = re.compile(r"(?P<day>\d{1,2}) (?P<month>[A-Z][a-z]{2,}) (?P<year>\d{2,4})")
The named capturing groups, like (?P<day>\d{1,2}), mean you can access the 3-tuple that is returned and extract just that field.
Then you can do something like this:
>>> for match in re.finditer(pattern, s):
...     print(match.group("day"))
I would also use apply rather than a for loop to access your DataFrame:
>>> data = {"string": ["14 Feb 1995 Primary Care Doctor:",
...                    "30 May 2016 SOS-10 Total Score:",
...                    "22 January 1996 # 11 AMCommunication with referring physician?: Done"]}
>>> df = pd.DataFrame.from_dict(data)
>>> df.string.apply(lambda x: re.search(pattern, x).group("day"))
0 14
1 30
2 22
Name: string, dtype: object
Then you can conveniently save these values separately if you want to:
>>> df["day"] = df.string.apply(lambda x: re.search(pattern, x).group("day"))
>>> df["month"] = df.string.apply(lambda x: re.search(pattern, x).group("month"))
>>> df
string day month
0 14 Feb 1995 Primary Care Doctor: 14 Feb
1 30 May 2016 SOS-10 Total Score: 30 May
2 22 January 1996 # 11 AMCommunication with refe... 22 January
ETA: If you want to tweak it to only extract the abbreviated month, regardless of whether it's fully spelled out, try replacing the regex pattern above with this:
pattern = re.compile(r"(?P<day>\d{1,2}) (?P<month>[A-Z][a-z]{2})[a-z]*? (?P<year>\d{2,4})")
This will capture only the first 3 characters of the month's name, but will find dates even if they have the longer version.
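As a side note, the same named-group pattern can also be applied in a vectorized way with pandas' str.extract, which returns one column per named group, so no explicit apply or loop is needed. A sketch using the sample strings from the question:

```python
import pandas as pd

df = pd.DataFrame({'string': [
    '14 Feb 1995 Primary Care Doctor:',
    '30 May 2016 SOS-10 Total Score:',
    '22 January 1996 # 11 AMCommunication with referring physician?: Done',
]})

# str.extract returns a DataFrame with one column per named group in the
# pattern, NaN where the pattern does not match.
pattern = r'(?P<day>\d{1,2}) (?P<month>[A-Z][a-z]{2,}) (?P<year>\d{2,4})'
parts = df['string'].str.extract(pattern)
print(parts)
```

The resulting day, month, and year columns can be assigned back onto the original frame in one step.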

How to calculate number of events per day using python?

I am having problems calculating/counting the number of events per day using python. I have a .txt file of earthquake data that I am using to do this. Here is what the file looks like:
2000 Jan 19 00 21 45 -118.815670 37.533170 3.870000 2.180000 383.270000
2000 Jan 11 16 16 46 -118.804500 37.551330 5.150000 2.430000 380.930000
2000 Jan 11 19 55 54 -118.821830 37.508830 0.600000 2.360000 378.080000
2000 Jan 11 05 33 02 -118.802000 37.554670 4.820000 2.530000 375.480000
2000 Jan 08 19 37 04 -118.815500 37.534670 3.900000 2.740000 373.650000
2000 Jan 09 19 34 27 -118.817670 37.529670 3.990000 3.170000 373.07000
Where column 0 is the year, 1 is the month, 2 is the day. There are no headers.
I want to calculate/count the number of events per day. Each line in the file (example: 2000 Jan 11) is an event. So, On January 11th, I would like to know how many times there was an event. In this case, on January 11th, there were 3 events.
I've tried looking on Stack Overflow for some guidance and have found code that works for arrays such as:
a = [1, 1, 1, 0, 0, 0, 1]
which counts the occurrence of certain items in the array using code like:
unique, counts = numpy.unique(a, return_counts=True)
dict(zip(unique, counts))
I have not been able to find anything that helps me. Any help/advice would be appreciated.
groupby() is going to be your friend here. However, I would concatenate the Year, Month and Day so that you can use dataframe.groupby(["full_date"]).count()
Full solution
Setup DF
df = pd.DataFrame([[2000, "Jan", 19],[2000, "Jan", 20],[2000, "Jan", 19],[2000, "Jan", 19]], columns = ["Year", "Month", "Day"])
Convert datatypes to str for concatenation
df["Year"] = df["Year"].astype(str)
df["Day"] = df["Day"].astype(str)
Create 'full_date' column
df["full_date"] = df["Year"] + "-" + df["Month"] + "-" + df["Day"]
Count the # of events per day
df.groupby(["full_date"])["Day"].count()
Hope this helps/provides value :)
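For completeness, here is a sketch of going from the raw .txt layout (no headers, whitespace-separated) to the per-day counts; the column names are made up, and io.StringIO stands in for the real file:

```python
import io
import pandas as pd

# A few lines in the same shape as the earthquake file (values illustrative).
raw = """\
2000 Jan 19 00 21 45 -118.815670 37.533170 3.870000 2.180000 383.270000
2000 Jan 11 16 16 46 -118.804500 37.551330 5.150000 2.430000 380.930000
2000 Jan 11 19 55 54 -118.821830 37.508830 0.600000 2.360000 378.080000
2000 Jan 11 05 33 02 -118.802000 37.554670 4.820000 2.530000 375.480000
"""

# sep=r'\s+' splits on any run of whitespace; header=None labels columns 0..10.
df = pd.read_csv(io.StringIO(raw), sep=r'\s+', header=None)
df = df.rename(columns={0: 'Year', 1: 'Month', 2: 'Day'})

# One row per event, so the size of each (Year, Month, Day) group is the
# number of events on that day.
events_per_day = df.groupby(['Year', 'Month', 'Day']).size()
print(events_per_day)
```

Grouping on the three columns directly avoids having to build a concatenated full_date string, though either approach gives the same counts.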

Adding column to pandas python dataframe after groupby, maintaining order

I have a data frame with info like:
month year date well_number depth_to_water
April 2007 4/1/07 1 48.60
August 2007 8/1/07 2 80.20
December 2007 12/1/07 EM3 37.50
February 2007 2/1/07 27 32.00
February 2008 2/1/08 27 40.00
I'm trying to create a new column with the year-to-year differences in each month's depth to water, so for well 27: 32 - 40 = -8.
I've grouped the data frame, i.e.
grouped_dw = davis_wells.groupby(['well_number', 'month','year'], sort=True)
which gives me exactly the sorting I need to theoretically just iterate through:
well_number month year date depth_to_water
1 April 2007 4/1/07 48.60
2008 4/1/08 62.30
2009 4/1/09 55.90
2010 4/1/10 36.20
2011 4/1/11 33.90
Out of which I'm trying to get:
well_number month year date depth_to_water change
1 April 2007 4/1/07 50 NaN
2008 4/1/08 60 -10
2009 4/1/09 55 5
2010 4/1/10 70 -15
2011 4/1/11 30 40
So I tried
grouped_dw['change'] = grouped_dw.depth_to_water(-1) - grouped_dw.depth_to_water
which throws an error. Any ideas? Pretty sure I'm just not understanding how hierarchical grouped DataFrames work.
Thanks!
EDIT:
I used sort, which gives me almost everything I need, except that I need it to give a null value when skipping to the next month.
davis_wells = davis_wells.sort(['well_number', 'month'])
davis_wells['change'] = davis_wells.depth_to_water.shift(1) - davis_wells.depth_to_water
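A follow-up on that edit: doing the shift per group, rather than over the whole sorted frame, yields the null value at each group boundary automatically. A sketch with made-up depths:

```python
import pandas as pd

davis_wells = pd.DataFrame({
    'well_number':    [1, 1, 1, 27, 27],
    'month':          ['April', 'April', 'April', 'February', 'February'],
    'year':           [2007, 2008, 2009, 2007, 2008],
    'depth_to_water': [48.6, 62.3, 55.9, 32.0, 40.0],
})

davis_wells = davis_wells.sort_values(['well_number', 'month', 'year'])

# Shifting within each (well, month) group means the first year of every
# group compares against NaN, which gives the null the edit above asks for.
grp = davis_wells.groupby(['well_number', 'month'])['depth_to_water']
davis_wells['change'] = grp.shift(1) - davis_wells['depth_to_water']
print(davis_wells)
```

This keeps the same previous-minus-current convention as the edit, so well 27's 2008 row gets 32 - 40 = -8 and its 2007 row gets NaN.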
