Can't display text column when splitting text into an associated table - python

Here's my dataset (only one column):
Apr 1 09:14:55 i have apple
Apr 2 08:10:10 i have mango
Here's the result I need:
month date time message
Apr 1 09:14:55 i have apple
Apr 2 08:10:10 i have mango
This is what I've done
import pandas as pd
month = []
date = []
time = []
message = []
for line in dns_data:
    month.append(line.split()[0])
    date.append(line.split()[1])
    time.append(line.split()[2])
df = pd.DataFrame(data={'month': month, 'date': date, 'time': time})
This is the output I get
month date time
0 Apr 1 09:14:55
1 Apr 2 08:10:10
How can I display the message column?

Use the n parameter of Series.str.split to split only on the first 3 whitespaces; expand=True returns a DataFrame:
print (df)
col
0 Apr 1 09:14:55 i have apple
1 Apr 2 08:10:10 i have mango
df1 = df['col'].str.split(n=3, expand=True)
df1.columns=['month','date','time','message']
print (df1)
month date time message
0 Apr 1 09:14:55 i have apple
1 Apr 2 08:10:10 i have mango
Another solution with list comprehension:
c = ['month','date','time','message']
df1 = pd.DataFrame([x.split(maxsplit=3) for x in df['col']], columns=c)
print (df1)
month date time message
0 Apr 1 09:14:55 i have apple
1 Apr 2 08:10:10 i have mango

You could use Series.str.extractall with a regex pattern:
df = pd.DataFrame({'text': {0: 'Apr 1 09:14:55 i have apple', 1: 'Apr 2 08:10:10 i have mango'}})
df_new = (df.text.str
            .extractall(r'^(?P<month>\w{3})\s?(?P<date>\d{1,2})\s?(?P<time>\d{2}:\d{2}:\d{2})\s?(?P<message>.*)$')
            .reset_index(drop=True))
print(df_new)
month date time message
0 Apr 1 09:14:55 i have apple
1 Apr 2 08:10:10 i have mango

This regex may help you (Rubular uses Ruby's (?<name>...) named-group syntax; Python's equivalent is (?P<name>...)):
(?<Month>\w+)\s(?<Date>\d+)\s(?<Time>[\w:]+)\s(?<Message>.*)
Match 1
Month Apr
Date 1
Time 09:14:55
Message i have apple
Match 2
Month Apr
Date 2
Time 08:10:10
Message i have mango
https://rubular.com/r/1S4BcbDxPtlVxE
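In pandas, the same pattern can be applied with Series.str.extract once the named groups use Python's (?P<name>...) syntax. A minimal sketch, assuming the single-column frame from the question:
import pandas as pd

df = pd.DataFrame({'col': ['Apr 1 09:14:55 i have apple',
                           'Apr 2 08:10:10 i have mango']})
# the named groups become the column names of the extracted frame
df1 = df['col'].str.extract(r'(?P<month>\w+)\s(?P<date>\d+)\s(?P<time>[\w:]+)\s(?P<message>.*)')
print(df1)
month date time message
0 Apr 1 09:14:55 i have apple
1 Apr 2 08:10:10 i have mango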

Related

Split columns by space or dash - python

I have a pandas df with mixed formatting in a specific column that contains the quarter and year. I'm hoping to split this column into separate columns, but the separator between quarter and year is either a space or a second dash.
I'm hoping to write a function that splits the column on either a blank space or a second dash.
df = pd.DataFrame({
    'Qtr': ['APR-JUN 2019', 'JAN-MAR 2019', 'JAN-MAR 2015', 'JUL-SEP-2020', 'OCT-DEC 2014', 'JUL-SEP-2015'],
})
out:
Qtr
0 APR-JUN 2019 # blank
1 JAN-MAR 2019 # blank
2 JAN-MAR 2015 # blank
3 JUL-SEP-2020 # second dash
4 OCT-DEC 2014 # blank
5 JUL-SEP-2015 # second dash
split by blank
df[['Qtr', 'Year']] = df['Qtr'].str.split(' ', 1, expand=True)
split by second dash
df[['Qtr', 'Year']] = df['Qtr'].str.split('-', 1, expand=True)
intended output:
Qtr Year
0 APR-JUN 2019
1 JAN-MAR 2019
2 JAN-MAR 2015
3 JUL-SEP 2020
4 OCT-DEC 2014
5 JUL-SEP 2015
You can use a regular expression with the extract function of the string accessor.
df[['Qtr', 'Year']] = df['Qtr'].str.extract(r'(\w{3}-\w{3}).(\d{4})')
print(df)
Result
Qtr Year
0 APR-JUN 2019
1 JAN-MAR 2019
2 JAN-MAR 2015
3 JUL-SEP 2020
4 OCT-DEC 2014
5 JUL-SEP 2015
You can split with a regex using a positive lookbehind and a non-capturing group (?:..), then filter out the empty values and apply a pandas Series to the values:
>>> (df.Qtr.str.split(r'\s|(.+(?<=-).+)(?:-)')
       .apply(lambda x: [i for i in x if i])
       .apply(lambda x: pd.Series(x, index=['Qtr', 'Year']))
    )
Qtr Year
0 APR-JUN 2019
1 JAN-MAR 2019
2 JAN-MAR 2015
3 JUL-SEP 2020
4 OCT-DEC 2014
5 JUL-SEP 2015
If, and only if, the data is in the posted format, you could use string slicing.
import pandas as pd

df = pd.DataFrame(
    {
        "Qtr": [
            "APR-JUN 2019",
            "JAN-MAR 2019",
            "JAN-MAR 2015",
            "JUL-SEP-2020",
            "OCT-DEC 2014",
            "JUL-SEP-2015",
        ],
    }
)
df[['Qtr', 'Year']] = [(x[:7], x[8:12]) for x in df['Qtr']]
print(df)
Qtr Year
0 APR-JUN 2019
1 JAN-MAR 2019
2 JAN-MAR 2015
3 JUL-SEP 2020
4 OCT-DEC 2014
5 JUL-SEP 2015
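Another possible sketch: split on the single space or dash that directly precedes the trailing four-digit year (this assumes pandas >= 1.4, where Series.str.split accepts regex=True):
# split once, on a space or dash that is followed by a 4-digit year at the end
df[['Qtr', 'Year']] = df['Qtr'].str.split(r'[\s-](?=\d{4}$)', n=1, expand=True, regex=True)
print(df)
Qtr Year
0 APR-JUN 2019
1 JAN-MAR 2019
2 JAN-MAR 2015
3 JUL-SEP 2020
4 OCT-DEC 2014
5 JUL-SEP 2015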

Count a certain value for each country

I am attempting to do an Excel COUNTIF-style calculation with pandas but am hitting a roadblock.
I have this dataframe, and I need to count the Yes values for each country quarter-wise. I have posted the expected results below.
result.head(3)
Country Jan 1 Feb 1 Mar 1 Apr 1 May 1 Jun 1 Quarter_1 Quarter_2
FRANCE Yes Yes No No No No 2 0
BELGIUM Yes Yes No Yes No No 2 1
CANADA Yes No No Yes No No 1 1
I tried the following, but pandas spits out a total value instead, showing a 5 for all the values under Quarter_1. How do I calculate this per Country? Any assistance with this please!
result['Quarter_1'] = (len(result[result['Jan 1'] == 'Yes']) + len(result[result['Feb 1'] == 'Yes'])
                       + len(result[result['Mar 1'] == 'Yes']))
We can take the column positions (np.arange over the number of columns) and floor-divide by 3 to create quarter labels. Then we group the columns on these labels and sum the Yes matches.
Finally we add the prefix Quarter_:
import numpy as np

df = df.set_index('Country')
grps = np.arange(len(df.columns)) // 3 + 1   # 1-based quarter labels
dfn = (
    df.join(df.eq('Yes')
              .groupby(grps, axis=1)
              .sum()
              .astype(int)
              .add_prefix('Quarter_'))
      .reset_index()
)
Or use a list comprehension to rename the columns:
df = df.set_index('Country')
grps = np.arange(len(df.columns)) // 3
dfn = df.eq('Yes').groupby(grps, axis=1).sum().astype(int)
dfn.columns = [f'Quarter_{col+1}' for col in dfn.columns]
df = df.join(dfn).reset_index()
Country Jan 1 Feb 1 Mar 1 Apr 1 May 1 Jun 1 Quarter_1 Quarter_2
0 FRANCE Yes Yes No No No No 2 0
1 BELGIUM Yes Yes No Yes No No 2 1
2 CANADA Yes No No Yes No No 1 1
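If the month columns are known up front, a row-wise sum of the boolean comparison gives the same counts without any grouping; a minimal sketch using the column names from the example:
# count 'Yes' per row (axis=1), which is what the len() approach above misses
result['Quarter_1'] = result[['Jan 1', 'Feb 1', 'Mar 1']].eq('Yes').sum(axis=1)
result['Quarter_2'] = result[['Apr 1', 'May 1', 'Jun 1']].eq('Yes').sum(axis=1)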

How to filter certain values in consecutive months?

I have a dataframe structured as follows:
Name Month Grade
Sue Jan D
Sue Feb D
Jason Mar B
Sue Mar D
Jason Jan B
Sue Apr A
Jason Feb C
I want to get the list of students who got a D for 3 consecutive months in the past 6 months. In the example above, Sue will be on the list since she got a D in Jan, Feb and Mar. How can I do that using Python, Pandas or Numpy?
I tried to solve your problem. I have a solution for you, but it may not be the fastest in terms of efficiency. Please see below:
newdf = df.pivot(index='Name', columns='Month', values='Grade')
newdf = newdf[['Jan', 'Feb', 'Mar', 'Apr']].fillna(-1)
newdf['concatenated'] = newdf['Jan'].astype('str') + newdf['Feb'].astype('str') + newdf['Mar'].astype('str') + newdf['Apr'].astype('str')
newdf[newdf['concatenated'].str.contains('DDD', regex=False, na=False)]
Output will be like:
Month Jan Feb Mar Apr concatenated
Name
Sue D D D A DDDA
If you just want the names, use the following command instead:
newdf[newdf['concatenated'].str.contains('DDD', regex=False, na=False)].index.to_list()
I came up with this:
df['Month_Nr'] = pd.to_datetime(df.Month, format='%b').dt.month
names = df.Name.unique()
students = np.array([])
for name in names:
    # rows for this student with grade D, in month order
    subset = df[(df.Name == name) & (df.Grade == 'D')].sort_values('Month_Nr')
    if subset['Month_Nr'].diff().cumsum().max() >= 2:
        students = np.append(students, name)
print(students)
Output:
['Sue']
You have a few ways to deal with this. The first is to use my previous solution, but that requires mapping months to academic-year numbers (i.e. September = 1, August = 12) so that you can apply math to work out consecutive values.
The following instead converts Month into a datetime and works out the difference in months; we can then apply a cumulative sum and filter for counts of 3 or more.
from io import StringIO

import numpy as np
import pandas as pd

d = StringIO("""Name Month Grade
Sue Jan D
Sue Feb D
Jason Mar B
Sue Dec D
Jason Jan B
Sue Apr A
Jason Feb C""")
df = pd.read_csv(d, sep=r'\s+')
df['date'] = pd.to_datetime(df['Month'], format='%b').dt.normalize()
# set any values greater than June to the previous year.
df['date'] = np.where(df['date'].dt.month > 6,
                      (df['date'] - pd.DateOffset(years=1)), df['date'])
df.sort_values(['Name', 'date'], inplace=True)

def month_diff(date):
    # running count of months that follow the previous row by exactly one month
    cumulative_months = (
        np.round(date.sub(date.shift(1)) / np.timedelta64(1, "M")).eq(1).cumsum()
    ) + 1
    return cumulative_months

df['count'] = df.groupby(["Name", "Grade"])["date"].apply(month_diff)
print(df.drop('date', axis=1))
Name Month Grade count
4 Jason Jan B 1
6 Jason Feb C 1
2 Jason Mar B 1
3 Sue Dec D 1
0 Sue Jan D 2
1 Sue Feb D 3
5 Sue Apr A 1
print(df.loc[df['Name'] == 'Sue'])
Name Month Grade date count
3 Sue Dec D 1899-12-01 1
0 Sue Jan D 1900-01-01 2
1 Sue Feb D 1900-02-01 3
5 Sue Apr A 1900-04-01 1
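Yet another way to sketch this is to count the longest run of D grades over adjacent calendar months per student; longest_d_streak is a hypothetical helper, and the sketch assumes all months fall within one calendar year, as in the original example:
import pandas as pd

df = pd.DataFrame({'Name':  ['Sue', 'Sue', 'Jason', 'Sue', 'Jason', 'Sue', 'Jason'],
                   'Month': ['Jan', 'Feb', 'Mar', 'Mar', 'Jan', 'Apr', 'Feb'],
                   'Grade': ['D', 'D', 'B', 'D', 'B', 'A', 'C']})
df['month_num'] = pd.to_datetime(df['Month'], format='%b').dt.month
df = df.sort_values(['Name', 'month_num'])

def longest_d_streak(g):
    # walk one student's rows and track the longest run of D's in strictly adjacent months
    streak = best = 0
    prev = None
    for m, grade in zip(g['month_num'], g['Grade']):
        if grade == 'D' and (streak == 0 or m == prev + 1):
            streak += 1
        else:
            streak = 1 if grade == 'D' else 0
        best = max(best, streak)
        prev = m
    return best

streaks = df.groupby('Name').apply(longest_d_streak)
print(streaks[streaks >= 3].index.tolist())   # ['Sue']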

Sort groupby pandas output by Month name and year

df
order_date Month Name Year Days Data
2015-12-20 Dec 2014 1 3
2016-1-21 Jan 2014 2 3
2015-08-20 Aug 2015 1 1
2016-04-12 Apr 2016 4 1
and so on
Code:
df = df.groupby(["Year", "Month Name"], as_index=False)["days"].agg(['min', 'mean'])
df3 = (df.groupby(["Year", "Month Name"], as_index=False)["Data"].agg(['count']))
merged_df = pd.merge(df3, df, on=['Year', 'Month Name'])
I have a groupby output as below
Min Mean Count
Year Month Name
2015 Aug 2 11 200
Dec 5 13 130
Feb 3 15 100
Jan 4 20 123
May 1 21 342
Nov 2 12 234
2016 Apr 1 10 200
Dec 2 12 120
Feb 2 13 200
Jan 2 24 200
Sep 1 25 220
Issue:
Basically the output of groupby is sorted by Month Name alphabetically, so I am getting Apr, Aug, Dec, Feb, etc. rather than Jan, Feb, ... through Dec. How do I get the output sorted by month number?
I need output like 2016 Jan, Feb, ..., Dec, then 2017 Jan, Feb, Mar through Dec.
Please keep the merging of the 2 dfs. I have only presented simplified code here (the real code is different, and I need to merge both before I can continue).
EDIT: Your solution should be changed:
df1 = df.groupby(["Year", "Month Name"], as_index=False)["Days"].agg(['min', 'mean'])
df3 = df.groupby(["Year", "Month Name"], as_index=False)["Data"].agg(['count'])
merged_df=pd.merge(df3, df1, on=['Year','Month Name']).reset_index()
cats = ['Jan', 'Feb', 'Mar', 'Apr','May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
merged_df['Month Name'] = pd.Categorical(merged_df['Month Name'],categories=cats, ordered=True)
merged_df = merged_df.sort_values(["Year", "Month Name"])
print (merged_df)
Year Month Name count min mean
1 2014 Jan 1 2 2
0 2014 Dec 1 1 1
2 2015 Aug 1 1 1
3 2016 Apr 1 4 4
Or:
df1 = (df.groupby(["Year", "Month Name"])
         .agg(min_days=("Days", 'min'),
              avg_days=("Days", 'mean'),
              count=('Data', 'count'))
         .reset_index())
cats = ['Jan', 'Feb', 'Mar', 'Apr','May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
df1['Month Name'] = pd.Categorical(df1['Month Name'], categories=cats, ordered=True)
df1 = df1.sort_values(["Year", "Month Name"])
print (df1)
Year Month Name min_days avg_days count
1 2014 Jan 2 2 1
0 2014 Dec 1 1 1
2 2015 Aug 1 1 1
3 2016 Apr 4 4 1
The last solution keeps the MultiIndex and uses no categoricals; it creates a helper dates column and sorts by it:
df1 = (df.groupby(["Year", "Month Name"])
         .agg(min_days=("Days", 'min'),
              avg_days=("Days", 'mean'),
              count=('Data', 'count'))
       )
df1['dates'] = pd.to_datetime([f'{y}{m}' for y, m in df1.index], format='%Y%b')
df1 = df1.sort_values('dates')
print (df1)
min_days avg_days count dates
Year Month Name
2014 Jan 2 2 1 2014-01-01
Dec 1 1 1 2014-12-01
2015 Aug 1 1 1 2015-08-01
2016 Apr 4 4 1 2016-04-01
Simply tell groupby you don't want it to sort group keys (by default, that's what it does - see the docs)
df.groupby(["Year", "Month Name"], as_index=False, sort=False)["Days"].agg(
["min", "mean"]
)
NOTE: you should make sure your df is sorted before applying groupby
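A small sketch of that pre-sort, assuming the order_date / Month Name / Year columns from the sample frame:
# sort by the real date column if you have one ...
df = df.sort_values('order_date')
# ... or build a chronological key from Month Name and Year
key = pd.to_datetime(df['Month Name'] + ' ' + df['Year'].astype(str), format='%b %Y')
df = df.iloc[key.argsort()]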
Here is my solution: sort by month number while keeping the sorted month names in level 1 of the MultiIndex, taking merged_df as the input:
import calendar

d = {m: n for n, m in enumerate(calendar.month_abbr)}
# for full month names use: d = {m: n for n, m in enumerate(calendar.month_name)}
# reorder the rows by (Year, month number) so the labels stay attached to their data
merged_df = merged_df.loc[sorted(merged_df.index, key=lambda x: (x[0], d.get(x[1])))]
print(merged_df)
count min mean
Year Month Name
2014 Jan 1 2 2
Dec 1 1 1
2015 Aug 1 1 1
2016 Apr 1 4 4

Sum of a Column in a Dictionary of Dataframes

How can I work with a dictionary of dataframes please? Or, is there a better way to get an overview of my data? If I have for example:
Fruit Qty Year
Apple 2 2016
Orange 1 2017
Mango 2 2016
Apple 9 2016
Orange 8 2015
Mango 7 2016
Apple 6 2016
Orange 5 2017
Mango 4 2015
Then I am trying to find out how many in total I get per year, for example:
2015 2016 2017
Apple 0 17 0
Orange 8 0 6
Mango 4 9 0
I have written some code but it might not be useful:
import pandas as pd

# Fruit Data
df_1 = pd.DataFrame({'Fruit': ['Apple', 'Orange', 'Mango', 'Apple', 'Orange', 'Mango', 'Apple', 'Orange', 'Mango'],
                     'Qty': [2, 1, 2, 9, 8, 7, 6, 5, 4],
                     'Year': [2016, 2017, 2016, 2016, 2015, 2016, 2016, 2017, 2015]})
# Create a list of Fruits
Fruits = df_1.Fruit.unique()
# Break down the dataframe by Year
df_2015 = df_1[df_1['Year'] == 2015]
df_2016 = df_1[df_1['Year'] == 2016]
df_2017 = df_1[df_1['Year'] == 2017]
# Create a dataframe dictionary of Fruits
Dict_2015 = {elem: pd.DataFrame for elem in Fruits}
Dict_2016 = {elem: pd.DataFrame for elem in Fruits}
Dict_2017 = {elem: pd.DataFrame for elem in Fruits}
# Store the Qty for each Fruit x each Year
for Fruit in Dict_2015.keys():
    Dict_2015[Fruit] = df_2015[:][df_2015.Fruit == Fruit]
for Fruit in Dict_2016.keys():
    Dict_2016[Fruit] = df_2016[:][df_2016.Fruit == Fruit]
for Fruit in Dict_2017.keys():
    Dict_2017[Fruit] = df_2017[:][df_2017.Fruit == Fruit]
You can use pandas.pivot_table.
res = df.pivot_table(index='Fruit', columns=['Year'], values='Qty',
                     aggfunc='sum', fill_value=0)
print(res)
Year 2015 2016 2017
Fruit
Apple 0 17 0
Mango 4 9 0
Orange 8 0 6
For guidance on usage, see How to pivot a dataframe.
jpp has already posted an answer in the format you wanted. However, since your question seemed open to other views, I thought of sharing another way. It's not exactly in the format you posted, but this is how I usually do it.
df = df.groupby(['Fruit', 'Year']).agg({'Qty': 'sum'}).reset_index()
This will look something like:
Fruit Year Qty
0 Apple 2016 17
1 Mango 2015 4
2 Mango 2016 9
3 Orange 2015 8
4 Orange 2017 6
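If you want this grouped result back in the wide Year-per-column layout from the question, it can be unstacked; a small sketch reusing df_1 from the question:
# pivot the (Fruit, Year) groups out to columns, filling missing combinations with 0
wide = df_1.groupby(['Fruit', 'Year'])['Qty'].sum().unstack(fill_value=0)
print(wide)
Year 2015 2016 2017
Fruit
Apple 0 17 0
Mango 4 9 0
Orange 8 0 6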
