How to read a CSV and aggregate data by groups? - Python

We have a CSV file and wrote the code below to group by name, get the max value, and create an output file. But when reading the final output file with pandas read_csv, the DataFrame shows as empty.
Input file:
Manoj,2020-01-01 01:00:00
Manoj,2020-02-01 01:00:00
Manoj,2020-03-01 01:00:00
Rajesh,2020-01-01 01:00:00
Rajesh,2020-05-01 01:00:00
Suresh,2020-04-01 01:00:00
Final output file:
Manoj,2020-03-01 01:00:00
Rajesh,2020-05-01 01:00:00
Suresh,2020-04-01 01:00:00
Then, when I try to read the final output file above using pd.read_csv, it shows an empty DataFrame.
import os
import re
import pandas as pd

z = open('outfile.csv', 'w')
fin = []
k = open('j.csv', 'r')
for m in k:
    d = m.split(',')[0]
    if d not in fin:
        fin.append(d.strip())
for p in fin:
    gg = []
    g = re.compile(r'{0}'.format(p))
    y = open('j.csv', 'r')
    for b in y:
        if re.search(g, b):
            gg.append(b)
    z.write(gg[-1].strip())
    z.write('\n')
df = pd.read_csv("outfile.csv", delimiter=',', names=['Col1', 'Col2'], header=0)
print(df)
Final output: Empty DataFrame, Index: []
Is there anything I missed? Can anyone please suggest a fix?

It's not necessary to use the for-loop to process the file. The data aggregation is more easily completed in pandas.
Your csv is shown without headers, so read the file in with pandas.read_csv, header=None, and use parse_dates to correctly format the datetime column.
The column with datetimes is shown at column index 1, therefore parse_dates=[1].
This assumes the data begins on row 0 in the file and has no headers, as shown in the OP.
Create headers for the columns
As per a comment, the date component of 'datetime' can be accessed with the .dt accessor.
.groupby on name and aggregate .max()
import pandas as pd
# read the file j.csv
df = pd.read_csv('j.csv', header=None, parse_dates=[1])
# add headers
df.columns = ['name', 'datetime']
# select only the date component of datetime
df.datetime = df.datetime.dt.date
# display(df)
name datetime
0 Manoj 2020-01-01
1 Manoj 2020-02-01
2 Manoj 2020-03-01
3 Rajesh 2020-01-01
4 Rajesh 2020-05-01
5 Suresh 2020-04-01
# groupby
dfg = df.groupby('name')['datetime'].max().reset_index()
# display(dfg)
name datetime
0 Manoj 2020-03-01
1 Rajesh 2020-05-01
2 Suresh 2020-04-01
# save the file. If the headers aren't wanted, use `header=False`
dfg.to_csv('outfile.csv', index=False)
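For completeness: the most likely reason your original outfile.csv read back empty is that the file handle z is never closed, so the buffered writes are never flushed to disk before read_csv runs (and header=0 would then eat the first data row, since the file has no header). A minimal sketch of that fix, using the aggregated rows shown in the question:

import pandas as pd

# the with-block closes the file, flushing the buffered writes to disk
with open('outfile.csv', 'w') as z:
    z.write('Manoj,2020-03-01 01:00:00\n')
    z.write('Rajesh,2020-05-01 01:00:00\n')
    z.write('Suresh,2020-04-01 01:00:00\n')

# the file has no header row, so use header=None, not header=0
df = pd.read_csv('outfile.csv', names=['Col1', 'Col2'], header=None)
print(df)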

Create dataframe
import pandas as pd
df = pd.DataFrame(zip(
    ['Manoj', 'Manoj', 'Manoj', 'Rajesh', 'Rajesh', 'Suresh'],
    ['2020-01-01', '2020-02-01', '2020-03-01', '2020-01-01', '2020-05-01', '2020-04-01'],
    ['01:00:00', '01:00:00', '01:00:00', '01:00:00', '01:00:00', '01:00:00']),
    columns=['name', 'date', 'time'])
Convert date and time from string to date and time object
df['date']=pd.to_datetime(df['date'], infer_datetime_format=True).dt.date
df['time']=pd.to_datetime(df['time'],format='%H:%M:%S').dt.time
Take groupby
out=df.groupby(by=['name','time']).max().reset_index()
You can save and load it again
out.to_csv('out.csv',index=False)
df1=pd.read_csv('out.csv')
result
name time date
0 Manoj 01:00:00 2020-03-01
1 Rajesh 01:00:00 2020-05-01
2 Suresh 01:00:00 2020-04-01
Sorry, I created two separate columns for date and time, but I hope you can understand it.

Related

Pandas GroupBy on consecutive blocks of rows

How do you groupby on consecutive blocks of rows where each block is separated by a threshold value?
I have the following sample pandas dataframe, and I'm having difficulty getting blocks of rows whose difference in their dates is greater than 365 days.
Date        Data
2019-01-01  A
2019-05-01  B
2020-04-01  C
2021-07-01  D
2022-02-01  E
2024-05-01  F
The output I'm looking for is the following:
Min Date    Max Date    Data
2019-01-01  2020-04-01  ABC
2021-07-01  2022-02-01  DE
2024-05-01  2024-05-01  F
I was looking at pandas .diff() and .cumsum() for getting the number of days between two rows and filtering for rows with difference > 365 days, however, it doesn't work when the dataframe has multiple blocks of rows.
I would also suggest .diff() and .cumsum():
import pandas as pd
df = pd.read_clipboard()
df["Date"] = pd.to_datetime(df["Date"])
blocks = df["Date"].diff().gt("365D").cumsum()
out = df.groupby(blocks).agg({"Date": ["min", "max"], "Data": "sum"})
out:
Date Data
min max sum
Date
0 2019-01-01 2019-05-01 AB
1 2020-06-01 2020-06-01 C
2 2021-07-01 2022-02-01 DE
3 2024-05-01 2024-05-01 F
after which you can replace the column labels (now a 2 level MultiIndex) as appropriate.
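For example, a minimal sketch of replacing the labels, continuing from out above (the single-level names here are just one choice):

# collapse the 2-level MultiIndex columns to flat labels
out.columns = ['Min Date', 'Max Date', 'Data']
out = out.reset_index(drop=True)
print(out)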
The date belonging to data "C" is more than 365 days apart from both "B" and "D", so it got its own group. Or am I misunderstanding your expected output?

Extracting date time from a mixed letter and numeric column pandas

I have a column in a pandas dataframe that contains two types of information: 1. date and time, 2. company name. I have to split the column into two (date_time, full_company_name). First I tried to split the column based on character count (the first 19 characters to one column, the rest to the other), but then I realized that sometimes the date is missing, so the split might not work. Then I tried using regex, but I can't seem to extract it correctly.
(The column and the desired output were shown as images in the original post.)
If the dates are all properly formatted, maybe you don't have to use regex
df = pd.DataFrame({"A": ["2021-01-01 05:00:00Acme Industries",
                         "2021-01-01 06:00:00Acme LLC"]})
df["date"] = pd.to_datetime(df.A.str[:19])
df["company"] = df.A.str[19:]
df
# A date company
# 0 2021-01-01 05:00:00Acme Industries 2021-01-01 05:00:00 Acme Industries
# 1 2021-01-01 06:00:00Acme LLC 2021-01-01 06:00:00 Acme LLC
OR
df.A.str.extract(r"(\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2})(.*)")
Note:
If you have an option to avoid concatenating those strings to begin with, please do so. This is not a healthy habit.
Solution (not that pretty, but it gets the job done):
import pandas as pd
from datetime import datetime
import re
df = pd.DataFrame()
# creating a list of companies
companies = ['Google', 'Apple', 'Microsoft', 'Facebook', 'Amazon', 'IBM',
             'Oracle', 'Intel', 'Yahoo', 'Alphabet']
# creating a list of random datetime objects
dates = [datetime(year=2000 + i, month=1, day=1) for i in range(10)]
# creating the column named 'date_time/full_company_name'
df['date_time/full_company_name'] = [f'{str(dates[i])}{companies[i]}' for i in range(len(companies))]
# Before:
# date_time/full_company_name
# 2000-01-01 00:00:00Google
# 2001-01-01 00:00:00Apple
# 2002-01-01 00:00:00Microsoft
# 2003-01-01 00:00:00Facebook
# 2004-01-01 00:00:00Amazon
# 2005-01-01 00:00:00IBM
# 2006-01-01 00:00:00Oracle
# 2007-01-01 00:00:00Intel
# 2008-01-01 00:00:00Yahoo
# 2009-01-01 00:00:00Alphabet
new_rows = []
for row in df['date_time/full_company_name']:
    # extract the date_time from the row using regex
    date_time = re.search(r'\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}', row)
    # handle case of empty date_time
    date_time = date_time.group() if date_time else ''
    # extract the company name from the row from where the date_time ends
    company_name = row[len(date_time):]
    # create a new row with the extracted date_time and company_name
    new_rows.append([date_time, company_name])
# drop the column 'date_time/full_company_name'
df = df.drop(columns=['date_time/full_company_name'])
# add the new columns to the dataframe: 'date_time' and 'company_name'
df['date_time'] = [row[0] for row in new_rows]
df['company_name'] = [row[1] for row in new_rows]
# After:
# date_time company_name
# 2000-01-01 00:00:00 Google
# 2001-01-01 00:00:00 Apple
# 2002-01-01 00:00:00 Microsoft
# 2003-01-01 00:00:00 Facebook
# 2004-01-01 00:00:00 Amazon
# 2005-01-01 00:00:00 IBM
# 2006-01-01 00:00:00 Oracle
# 2007-01-01 00:00:00 Intel
# 2008-01-01 00:00:00 Yahoo
# 2009-01-01 00:00:00 Alphabet
Make the datetime group optional with ? and leave the rest of the string uncaptured (.* instead of (.*)), so rows with a missing date simply yield NaN:
df = pd.DataFrame({"A": ["2021-01-01 05:00:00Acme Industries",
                         "2021-01-01 06:00:00Acme LLC"]})
df.A.str.extract(r"(\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2})?.*")

Python - Pandas, issue with datetime format

I have a two-columns data frame, with departure and arrival times (see example below).
In order to make operations on those times, I want to convert the strings into datetime format, keeping only hour/minute/second information.
Example of input data (file name = table):
departure_time,arrival_time
07:00:00,07:30:00
07:00:00,07:15:00
07:05:00,07:22:00
07:10:00,07:45:00
07:15:00,07:50:00
07:10:00,07:26:00
07:40:00,08:10:00
I ran this code to import the table file and then convert the type into datetime format:
import pandas as pd
from datetime import datetime
df= pd.read_excel("table.xlsx")
df['arrival_time']= pd.to_datetime(df['arrival_time'], format= '%H:%M:%S')
but get this error:
ValueError: time data ' 07:30:00' does not match format '%H:%M:%S' (match)
What mistake am I making?
Seems like an import issue: ' 07:30:00' has a space in front. If it's a CSV you're importing, you can use skipinitialspace=True.
If I import your CSV file, and use your code, it works fine:
CSV:
departure_time,arrival_time
07:00:00,07:30:00
07:00:00,07:15:00
07:05:00,07:22:00
07:10:00,07:45:00
07:15:00,07:50:00
07:10:00,07:26:00
07:40:00,08:10:00
df = pd.read_csv('test.csv', skipinitialspace=True)
df['arrival_time']= pd.to_datetime(df['arrival_time'], format='%H:%M:%S').dt.time
print(df)
departure_time arrival_time
0 07:00:00 07:30:00
1 07:00:00 07:15:00
2 07:05:00 07:22:00
3 07:10:00 07:45:00
4 07:15:00 07:50:00
5 07:10:00 07:26:00
6 07:40:00 08:10:00

How to assign a unique ID to an entire dataframe?

I have a number of dataframes consisting of date time and precipitation data that I would like to later merge and plot by station ID. The station ID is present in the "header" of the df but I am not able to reason out how to assign the entire df a unique ID based on it. The .csv file name that the df is constructed from also has the station ID in the name.
For clarification, I am reading in each .csv file from path using an if statement within a loop to differentiate between the different file names and adjust the column headers as needed.
Reading in each file, where 8085 is part of the filename, looks like:
def function(file):
    if '8085' in file:
        df = pd.read_csv(path + file, usecols=['Variable', 'Date', 'Time', 'FPR-D Oil'],
                         names=['Variable', 'Date', 'Time', 'FPR-D Oil'],
                         header=None, parse_dates=[[1, 2]], skiprows=[0, 1])
        df_fprd_oil = df[df['Variable'].str.contains('Precip')]
Example of the .csv file before reading in:
Station ID,Sensor Serial Num,
12345678,sn123456789,
Precip,02/01/2020,09:45:00,-2.19,
Batt Voltage,02/01/2020,09:45:00,13.4,
Temp In Box,02/01/2020,09:45:00,-2.58,
Precip,02/01/2020,10:00:00,-2.19,
Batt Voltage,02/01/2020,10:00:00,13.6,
Temp In Box,02/01/2020,10:00:00,-2.17,
Example of the df after reading in:
Date_Time Variable FPR-D Oil
0 2020-02-01 09:45:00 Precip -2.19
3 2020-02-01 10:00:00 Precip -2.19
6 2020-02-01 10:15:00 Precip -2.19
What (I think) is desired
Date_Time Station ID Variable FPR-D Oil
0 2020-02-01 09:45:00 12345678 Precip -2.19
3 2020-02-01 10:00:00 12345678 Precip -2.19
6 2020-02-01 10:15:00 12345678 Precip -2.19
Or maybe even
Date_Time Variable FPR-D Oil Unique ID
0 2020-02-01 09:45:00 Precip -2.19 1
3 2020-02-01 10:00:00 Precip -2.19 1
6 2020-02-01 10:15:00 Precip -2.19 1
If you want to add a UniqueID to your DataFrame and want it to be constant for one dataframe, then you can simply do this:
df["UniqueID"] = 1
It will add a column named UniqueID in your existing df with the assigned value.
Hope it helps.
I am assuming, based on your comments above, that the station id is always located at the second row of the first column in all the csv files.
import pandas as pd
from io import StringIO
# sample data
s = """Station ID,Sensor, Serial Num,
12345678,123456789,
Precip,02/01/2020,09:45:00,-2.19,
Batt Voltage,02/01/2020,09:45:00,13.4,
Temp In Box,02/01/2020,09:45:00,-2.58,
Precip,02/01/2020,10:00:00,-2.19,
Batt Voltage,02/01/2020,10:00:00,13.6,
Temp In Box,02/01/2020,10:00:00,-2.17,"""
# read your file
df = pd.read_csv(StringIO(s), usecols=['Variable', 'Date', 'Time', 'FPR-D Oil'],
                 skiprows=[0, 1], names=['Variable', 'Date', 'Time', 'FPR-D Oil'])
# read it again but only get the first value of the second row
sid = pd.read_csv(StringIO(s), skiprows=1, nrows=1, header=None)[0].iloc[0]
# filter and copy so you are not assign to a slice of a frame
new_df = df[df['Variable'] == 'Precip'].copy()
# assign sid to a new column
new_df.loc[:, 'id'] = sid
print(new_df)
Variable Date Time FPR-D Oil id
0 Precip 02/01/2020 09:45:00 -2.19 12345678
3 Precip 02/01/2020 10:00:00 -2.19 12345678
Try this:
df['Station ID'] = 12345678
Where df is your dataframe
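Since the station ID also appears in the .csv filename, another option is to pull it from there; a minimal sketch, assuming the ID is the first run of digits in the name (the filename below is hypothetical):

import re

file = '12345678_precip.csv'                   # hypothetical filename
station_id = re.search(r'\d+', file).group(0)  # first run of digits
df['Station ID'] = station_id                  # constant for the whole frame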

Setting a dataframe columns to type datetime when there are blanks in the columns

I have a dataframe (df) with two columns where the head looks like
name start end
0 John 2018-11-09 00:00:00 2012-03-01 00:00:00
1 Steve 1990-09-03 00:00:00
2 Debs 1977-09-07 00:00:00 2012-07-02 00:00:00
3 Mandy 2009-01-09 00:00:00
4 Colin 1993-08-22 00:00:00 2002-06-03 00:00:00
The start and end columns have the type object. I want to change the type to datetime so I can use the following:
referenceError = DeptTemplate['start'] > DeptTemplate['end']
I am trying to change the type using:
df['start'].dt.strftime('%d/%m/%Y')
df['end'].dt.strftime('%d/%m/%Y')
but I think the rows where there is no date in the columns are causing a problem. How can I handle the blank values so I can change the type to datetime and run my analysis?
As shown in the .to_datetime docs, you can set the behavior for unparseable values using the errors kwarg. You can also pass the expected format with the format kwarg; note it has to match the strings as stored, which here look like '%Y-%m-%d %H:%M:%S':
# Bad values will be NaT
df["start"] = pd.to_datetime(df.start, errors='coerce', format='%Y-%m-%d %H:%M:%S')
As mentioned in the comments, you can prepare the column with replace if you absolutely must use strftime.
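A minimal sketch of the comparison from the question after coercing, assuming the blanks come in as empty strings; NaT compares as False, so the blank rows simply drop out of the mask:

import pandas as pd

df = pd.DataFrame({'start': ['2018-11-09 00:00:00', '1990-09-03 00:00:00'],
                   'end':   ['2012-03-01 00:00:00', '']})
df['start'] = pd.to_datetime(df['start'], errors='coerce')  # blanks become NaT
df['end'] = pd.to_datetime(df['end'], errors='coerce')

referenceError = df['start'] > df['end']
# 0     True
# 1    False
# dtype: bool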
