I'm doing a project that involves analyzing WhatsApp log data.
After preprocessing the log file I have a table that looks like this:
DD/MM/YY | hh:mm | name | text |
Using a chat with a friend of mine, I was able to plot the number of texts per month and the mean number of words per month, but I have some problems:
If we didn't exchange any texts in a given month, the algorithm skips that month entirely; in the graph I want that month to appear with 0 messages.
Is there a better way to handle dates and times in Python? Using them as strings isn't very intuitive, but I haven't found anything useful online.
This is the GitLab page of my project.
def wapp_split(line):
    splitted = line.split(',')
    Data['date'].append(splitted[0])
    splitted = splitted[1].split(' - ')
    Data['time'].append(splitted[0])
    splitted = splitted[1].split(':')
    Data['name'].append(splitted[0])
    Data['msg'].append(splitted[1][0:-1])

def wapp_parsing(file):
    with open(file) as f:
        data = f.readlines()
        for line in data:
            if line[17:].find(':') != -1:
                if (line[0] in numbers) and (line[1] in numbers):
                    prev = line[0:35]
                    wapp_split(line)
                else:
                    line = prev + line
                    wapp_split(line)
Those are the main functions of the script. The WhatsApp log is formatted like so:
DD/MM/YY, hh:mm - Name Surname: This is a text sent using WhatsApp
The parsing function just takes the file and sends each line to the split function. The if statements in the parsing function simply prevent messages coming from WhatsApp itself, rather than from the people in the chat, from being parsed.
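To make the question concrete, this is the kind of conversion I have in mind (a rough sketch using the standard datetime module, which I'm not sure is the best approach; to_datetime here is just an illustrative helper name), turning the parsed date and time strings into a single object I could sort and group by:

from datetime import datetime

# rough sketch: combine one parsed date string and time string
# (e.g. '22/10/18' and '11:30') into a single datetime object
def to_datetime(date_str, time_str):
    return datetime.strptime(date_str + ' ' + time_str, '%d/%m/%y %H:%M')

print(to_datetime('22/10/18', '11:30'))  # -> 2018-10-22 11:30:00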
Suppose that the table you have is a .csv file that looks like this (call it msgs.csv):
date;time;name;text
22/10/2018;11:30;Maria;Hello how are you
23/10/2018;11:30;Justin;Check this
23/10/2018;11:31;Justin;link
22/11/2018;11:30;Maria;Hello how are you
23/11/2018;11:30;Justin;Check this
23/12/2018;11:31;Justin;link
22/12/2018;11:30;Maria;Hello how are you
23/12/2018;11:30;Justin;Check this
23/01/2019;11:31;Justin;link
23/04/2019;11:30;Justin;Check this
23/07/2019;11:31;Justin;link
Now you can use pandas to import this CSV into a DataFrame that recognises the combined date and time as a single timestamp, and then group the data by month for your calculations:
import pandas as pd
from datetime import datetime

dateparse = lambda x: datetime.strptime(x, '%d/%m/%Y %H:%M')
df = pd.read_csv('msgs.csv', delimiter=';', parse_dates=[['date', 'time']], date_parser=dateparse)

per = df.date_time.dt.to_period("M")
g = df.groupby(per)

for i in g:
    print('#######')
    print('year: {year} ; month: {month} ; number of messages: {n_msgs}'
          .format(year=i[0].year, month=i[0].month, n_msgs=len(i[1])))
EDIT - months with no messages should show 0:
To get a 0 for the months in which no messages were sent, you can do it like this (it looks better than the above, too):
import pandas as pd
from datetime import datetime

dateparse = lambda x: datetime.strptime(x, '%d/%m/%Y %H:%M')
df = pd.read_csv('msgs.csv', delimiter=';', parse_dates=[['date', 'time']], date_parser=dateparse)

# create date range from oldest message to newest message
dates = pd.date_range(*(pd.to_datetime([df.date_time.min(), df.date_time.max()]) + pd.offsets.MonthEnd()), freq='M')

for i in dates:
    df_aux = df[(df.date_time.dt.month == i.month) & (df.date_time.dt.year == i.year)]
    print('year: {year} ; month: {month} ; number of messages: {n_msgs}'
          .format(year=i.year, month=i.month, n_msgs=len(df_aux)))
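As an alternative sketch (reusing the same df as above), setting date_time as the index and resampling by month also produces a 0 count for the empty months, without an explicit loop:

# alternative sketch: resample by month; months with no messages get a count of 0
monthly_counts = df.set_index('date_time').resample('M').size()
print(monthly_counts)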
EDIT 2: parse logs into a pandas dataframe:
import re
import pandas as pd

df = pd.DataFrame({'logs': ['DD/MM/YY, hh:mm - Name Surname: This is a text sent using WhatsApp',
                            'DD/MM/YY, hh:mm - Name Surname: This is a text sent using WhatsApp']})

pat = re.compile("(?P<date>.*?), (?P<time>.*?) - (?P<name>.*?): (?P<message>.*)")
df_parsed = df.logs.str.extractall(pat)
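From there, a sketch of the next step (assuming real log lines rather than the DD/MM/YY placeholders above): the extracted date and time strings can be combined into a proper datetime column and fed into the monthly grouping shown earlier.

# sketch: combine the extracted date and time strings into a datetime column
# (only works once the placeholders are replaced by real DD/MM/YY, hh:mm values)
df_parsed['date_time'] = pd.to_datetime(
    df_parsed['date'] + ' ' + df_parsed['time'], format='%d/%m/%y %H:%M')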
It's best to convert the strings into datetime objects
from datetime import datetime
datetime_object = datetime.strptime('22/10/18', '%d/%m/%y')
When converting from a string, remember to use the correct separators (i.e. "-" or "/") so that the format template on the right-hand side of the function matches the date string you are parsing. Full details on the meaning of the letters can be found at Python strptime() Method.
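For instance (two illustrative calls, assuming day-first dates):

from datetime import datetime

# the separators in the format string must mirror those in the date string
datetime.strptime('22-10-18', '%d-%m-%y')     # dashes in the string, dashes in the format
datetime.strptime('22/10/2018', '%d/%m/%Y')   # %Y for a four-digit year, %y for a two-digit year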
A simple solution for adding missing dates and plotting the mean value of msg_len is to create a date range you're interested in and then reindex:
df.set_index('date', inplace=True)
df1 = df[['msg_len','year']]
df1.index = df1.index.to_period('m')
msg_len year
date
2016-08 11 2016
2016-08 4 2016
2016-08 3 2016
2016-08 4 2016
2016-08 15 2016
2016-10 10 2016
# look for date range between 7/2016 and 11/2016
idx = pd.date_range('7-01-2016','12-01-2016',freq='M').to_period('m')
new_df = pd.DataFrame(df1.groupby(df1.index)['msg_len'].mean()).reindex(idx, fill_value=0)
new_df.plot()
msg_len
2016-07 0.0
2016-08 7.4
2016-09 0.0
2016-10 10.0
2016-11 0.0
You can change mean to count etc. if you want the number of messages for a given month.
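For example (a small sketch reusing the same df1 and idx as above), counting the messages per month instead of averaging their length:

# same pattern, but counting messages per month instead of averaging msg_len
counts = pd.DataFrame(df1.groupby(df1.index)['msg_len'].count()).reindex(idx, fill_value=0)
counts.plot()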
Related
I have a csv file that has a column of dates. The dates are in order of month - so January comes first, then Feb, and so on. The problem is some of the dates are in mm/dd/yyyy format and others in dd/mm/yyyy format. Here's what it looks like.
Date
01/08/2005
01/12/2005
15/01/2005
19/01/2005
22/01/2005
26/01/2005
29/01/2005
03/02/2005
05/02/2005
...
I would like to bring all of them to the same format (dd/mm/yyyy)
I am using Python and pandas to read and edit the csv file. I tried using Excel to manually change the date formats using the built-in formatting tools but it seems impossible with the large number of rows. I'm thinking of using regex but I'm not quite sure how to distinguish between month-first and day-first.
# here's what i have so far
import re

for i in df.index:
    date = df.loc[i, 'Date']
    pattern = r'\d\d/\d\d/\d\d'
    match = re.search(pattern, date)
    if match:
        date_items = date.split('/')
        day = date_items[1]
        month = date_items[0]
        year = date_items[2]
        new_date = f'{day}/{month}/{year}'
        df.loc[i, 'Date'] = new_date
I want the csv to have a uniform date format in the end.
In short: you can't!
There's no way for you to know if 01/02/2019 is Jan 2nd or Feb 1st!
Same goes for other dates in your examples such as:
01/08/2005
01/12/2005
03/02/2005
05/02/2005
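A quick illustration of the ambiguity (assuming pandas is available):

import pandas as pd

# the very same string parses to two different, equally valid dates
print(pd.to_datetime('01/02/2019', dayfirst=False))  # 2019-01-02 (Jan 2nd)
print(pd.to_datetime('01/02/2019', dayfirst=True))   # 2019-02-01 (Feb 1st)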
I have a csv with a date column with dates listed as MM/DD/YY, but I want to change the years from 00, 02, 03 to 1900, 1902, 1903 so that the dates are instead listed as MM/DD/YYYY.
This is what works for me:
df2['Date'] = df2['Date'].str.replace(r'00', '1900')
but I'd have to do this for every year up until 68 (i.e. repeat it 68 times). I'm not sure how to create a loop that runs the code above for every year in that range. I tried this:
ogyear = 00
newyear = 1900
while ogyear <= 68:
    df2['date'] = df2['Date'].str.replace(r'ogyear', 'newyear')
    ogyear += 1
    newyear += 1
but this returns an empty data set. Is there another way to do this?
I can't use datetime because it assumes that 02 refers to 2002 instead of 1902, and when I try to edit it as a date I get an error from Python saying that dates are immutable and must be changed in the original data set. For this reason I need to keep the dates as strings. I also attached the csv here in case that's helpful.
I would do it like this:
import pandas as pd

# create a data frame
d = pd.DataFrame({'date': ['20/01/00', '20/01/20', '20/01/50']})

# create year column
d['year'] = d['date'].str.split('/').str[2].astype(int) + 1900

# add the new year into the old date by replacing the old year
d['new_data'] = d['date'].str.replace('[0-9]*.$', '', regex=True) + d['year'].astype(str)
date year new_data
0 20/01/00 1900 20/01/1900
1 20/01/20 1920 20/01/1920
2 20/01/50 1950 20/01/1950
I'd do it the following way:
from datetime import datetime
import pandas as pd

# create a data frame with dates in month/day/shortened-year format
d = pd.DataFrame({'dates': ['2/01/10', '5/01/20', '6/01/30']})

# loop through the dates in the dates column and add them
# to a list in the desired form using the datetime library,
# then substitute the dataframe dates column with the new ordered list
new_dates = []
for date in list(d['dates']):
    dat = datetime.date(datetime.strptime(date, '%m/%d/%y'))
    dat = dat.strftime("%m/%d/%Y")
    new_dates.append(dat)

d['dates'] = pd.Series(new_dates)
d
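The loop could also be replaced by a single vectorized line (a sketch on a fresh illustrative frame d2; note that %y still maps two-digit years 00-68 to 2000-2068, so it doesn't solve the 1900s requirement from the question either):

# vectorized alternative to the loop above, applied to the original short-year strings
d2 = pd.DataFrame({'dates': ['2/01/10', '5/01/20', '6/01/30']})
d2['dates'] = pd.to_datetime(d2['dates'], format='%m/%d/%y').dt.strftime('%m/%d/%Y')
print(d2)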
I have a date column in a pandas.DataFrame containing dates in various datetime formats, each stored as a list object, like the following:
date
1 [May 23rd, 2011]
2 [January 1st, 2010]
...
99 [Apr. 15, 2008]
100 [07-11-2013]
...
256 [9/01/1995]
257 [04/15/2000]
258 [11/22/68]
...
360 [12/1997]
361 [08/2002]
...
463 [2014]
464 [2016]
For the sake of convenience, I want to convert them all to MM/DD/YYYY format. It doesn't seem possible to use a regex replace() to do this, since that operation can't be executed over list objects. Also, using strptime() for each cell would be too time-consuming.
What would be the easiest way to convert them all to the desired MM/DD/YYYY format? I found it very hard to do this on list objects within a dataframe.
Note: for cell values of the form [YYYY] (e.g., [2014] and [2016]), I will assume they mean the first day of that year (i.e., January 1), and for cell values such as [08/2002] (or [8/2002]), I will assume they mean the first day of that month of that year (i.e., August 1, 2002).
Given your sample data, with the addition of a NaT, this works:
Code:
df.date.apply(lambda x: pd.to_datetime(x).strftime('%m/%d/%Y')[0])
Test Code:
import pandas as pd

df = pd.DataFrame([
    [['']],
    [['May 23rd, 2011']],
    [['January 1st, 2010']],
    [['Apr. 15, 2008']],
    [['07-11-2013']],
    [['9/01/1995']],
    [['04/15/2000']],
    [['11/22/68']],
    [['12/1997']],
    [['08/2002']],
    [['2014']],
    [['2016']],
], columns=['date'])

df['clean_date'] = df.date.apply(
    lambda x: pd.to_datetime(x).strftime('%m/%d/%Y')[0])

print(df)
Results:
date clean_date
0 [] NaT
1 [May 23rd, 2011] 05/23/2011
2 [January 1st, 2010] 01/01/2010
3 [Apr. 15, 2008] 04/15/2008
4 [07-11-2013] 07/11/2013
5 [9/01/1995] 09/01/1995
6 [04/15/2000] 04/15/2000
7 [11/22/68] 11/22/1968
8 [12/1997] 12/01/1997
9 [08/2002] 08/01/2002
10 [2014] 01/01/2014
11 [2016] 01/01/2016
It would be better if you use this; it will give you the dates as datetimes (in MM-DD-YYYY order), and then you can apply strftime:
df['Date_ColumnName'] = pd.to_datetime(df['Date_ColumnName'], dayfirst = False, yearfirst = False)
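And then, formatting the parsed datetimes back to MM/DD/YYYY strings could look like this (a sketch, keeping the placeholder column name from above):

# format the parsed datetimes back to MM/DD/YYYY strings
df['Date_ColumnName'] = df['Date_ColumnName'].dt.strftime('%m/%d/%Y')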
The provided code will work for the following scenarios:
Change the date format from M/D/YYYY to MM/DD/YYYY (5/2/2009 to 05/02/2009)
Change from ANY format to MM/DD/YYYY
import pandas as pd

'''
* check whether the date format in the provided input file is correct
* if the format is correct, change the date format from M/D/YYYY to MM/DD/YYYY
* else the date format in the input file is not correct:
  change the date format from ANY format to MM/DD/YYYY
'''
input_file_name = 'C:/Users/Admin/Desktop/SarenderReddy/predictions.csv'
dest_file_name = 'C:/Users/Admin/Desktop/SarenderReddy/Enrich.csv'
#input_file_name = 'C:/Users/Admin/Desktop/SarenderReddy/enrichment.csv'

read_data = pd.read_csv(input_file_name)
print(pd.to_datetime(read_data['Date'], format='%m/%d/%Y', errors='coerce').notnull().all())

if pd.to_datetime(read_data['Date'], format='%m/%d/%Y', errors='coerce').notnull().all():
    print("Provided correct input date format in input file....!")
    read_data['Date'] = pd.to_datetime(read_data['Date'], format='%m/%d/%Y')
    read_data['Date'] = read_data['Date'].dt.strftime('%m/%d/%Y')
    read_data.to_csv(dest_file_name, index=False)
    print(read_data['Date'])
else:
    print("NOT... Provided correct input date format in input file....!")
    data_format = pd.read_csv(input_file_name, parse_dates=['Date'], dayfirst=True)
    #print(df['Date'])
    data_format['Date'] = pd.to_datetime(data_format['Date'], format='%m/%d/%Y')
    data_format['Date'] = data_format['Date'].dt.strftime('%m/%d/%Y')
    data_format.to_csv(dest_file_name, index=False)
    print(data_format['Date'])
I have a dataset that looks like this:
import numpy as np
import pandas as pd
raw_data = {'Series_Date':['2017-03-10','2017-04-13','2017-05-14','2017-05-15','2017-06-01']}
df = pd.DataFrame(raw_data,columns=['Series_Date'])
print(df)
I would like to pass in a date parameter as a string as follows:
date = '2017-03-22'
I would now like to know if there are any dates in my DataFrame 'df' for which the month is 3 months after the month in the date parameter.
That is, if the month in the date parameter is March, then it should check whether there are any dates in df from June. If there are any, I would like to see those dates. If not, it should just output 'No date found'.
In this example, the output should be '2017-06-01' as it is a date from June as my date parameter is from March.
Could anyone help me get started with this?
Convert your column to Timestamp:
df.Series_Date = pd.to_datetime(df.Series_Date)
date = pd.to_datetime('2017-03-01')
Then
df[
    (df.Series_Date.dt.year - date.year) * 12 +
    df.Series_Date.dt.month - date.month == 3
]
Series_Date
4 2017-06-01
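To also cover the 'No date found' output asked for in the question, one option (a small sketch reusing the same filter) is:

# same filter, but falling back to 'No date found' when nothing matches
result = df[
    (df.Series_Date.dt.year - date.year) * 12 +
    df.Series_Date.dt.month - date.month == 3
]
print(result if not result.empty else 'No date found')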
I've researched this question heavily for the past few days and I still cannot find a solution to my problem.
Below is an example of my dataframe titled 'dfs'. There are around 80 columns, only 4 shown in the below example.
dfs is a large dataframe consisting of rows of data reported every 15 minutes for over 12 months (i.e. 2015-08-01 00:00:00 to 2016-09-30 23:45:00). The Datetime column is of datetime type.
...
...
I want to export (or write) multiple monthly csv files, which are snippets of monthly data taken from the original large csv file (dfs). For each month, I want files to be written that contain the raw data, the day data (6am-6pm) and the night data (6pm-6am). I also want the name of each monthly file to be automated so it knows whether to call itself dfs_%Y%m, dfs_day_%Y%m, or dfs_night_%Y%m depending on the data it contains.
At the moment I am writing out over 180 lines of code to export each csv file.
For example:
I create monthly raw, day and night files by grabbing the data between the datetimes listed below from the index Datetime column
dfs201508 = dfs.ix['2015-08-01 00:00:00':'2015-08-31 23:45:00']
dfs201508Day = dfsDay.ix['2015-08-01 00:00:00':'2015-08-31 23:45:00']
dfs201508Night = dfsNight.ix['2015-08-01 00:00:00':'2015-08-31 23:45:00']
Then I export these files to their respective outputpaths and give them a filename
dfs201508 = dfs201508.to_csv(outputpath+"dfs201508.csv")
dfs201508Day = dfs201508Day.to_csv(outputpathDay+"dfs_day_201508.csv")
dfs201508Night = dfs201508Night.to_csv(outputpathNight+"dfs_night_201508.csv")
What I want to write is something like this
dfs_%Y%m = dfs.ix["%Y%m"]
dfs_day_%Y%m = dfs.ix["%Y%m(between 6am-6pm)"]
dfs_night_%Y%m = dfs.ix["%Y%m(between 6pm-6am)"]
dfs_%Y%m = dfs_%Y%m.to_csv(outputpath +"dfs_%Y%m.csv")
dfs_day_%Y%m = dfs_day_%Y%m.to_csv(outputpath%day +"dfs_day_%Y%m.csv")
dfs_night_%Y%m = dfs_night_%Y%m.to_csv(outputpath%night +"dfs_night_%Y%m.csv")
Any suggestions on the code to automate this process would be greatly appreciated.
Here are some links to pages I researched:
https://www.youtube.com/watch?v=aeZKJGEfD7U
Writing multiple Python dictionaries to csv file
Open a file name +date as csv in Python
You can use a for loop to iterate over the years and months contained within dfs. I created a dummy dataframe called DF in the below example, which contains just three sample columns:
dates Egen1_kwh Egen2_kwh
2016-01-01 00:00:00 15895880 15877364
2016-01-01 00:15:00 15895880 15877364
2016-01-01 00:30:00 15895880 15877364
2016-01-01 00:45:00 15895880 15877364
2016-01-01 01:00:00 15895880 15877364
The below code filters the main dataframe DF into smaller dataframes (NIGHT and DAY) for each month within each year and saves them as .csv files with names corresponding to their date (e.g. 2016_1_DAY and 2016_1_NIGHT for Jan 2016 Day and Jan 2016 Night).
import pandas as pd
import datetime
from dateutil.relativedelta import relativedelta
from random import randint

# I defined a sample dataframe with dummy data
start = datetime.datetime(2016,1,1,0,0)
dates = [start + relativedelta(minutes=15*i) for i in range(0,10000)]
Egen1_kwh = randint(15860938,15898938)
Egen2_kwh = randint(15860938,15898938)

DF = pd.DataFrame({
    'dates': dates,
    'Egen1_kwh': Egen1_kwh,
    'Egen2_kwh': Egen2_kwh,
})

# define when day starts and ends (MUST USE 24-HOUR CLOCK)
day = {
    'start': datetime.time(6,0),  # starts at 6am (6:00)
    'end': datetime.time(18,0)    # ends at 6pm (18:00)
}

# capture years that appear in dataframe
min_year = DF.dates.min().year
max_year = DF.dates.max().year

if min_year == max_year:
    yearRange = [min_year]
else:
    yearRange = range(min_year, max_year+1)

# iterate over each year and each month within each year
for year in yearRange:
    for month in range(1,13):
        # filter to show NIGHT and DAY dataframes for the given month within the given year
        NIGHT = DF[(DF.dates >= datetime.datetime(year, month, 1)) &
                   (DF.dates <= datetime.datetime(year, month, 1) + relativedelta(months=1) - relativedelta(days=1)) &
                   ((DF.dates.apply(lambda x: x.time()) <= day['start']) | (DF.dates.apply(lambda x: x.time()) >= day['end']))]
        DAY = DF[(DF.dates >= datetime.datetime(year, month, 1)) &
                 (DF.dates <= datetime.datetime(year, month, 1) + relativedelta(months=1) - relativedelta(days=1)) &
                 ((DF.dates.apply(lambda x: x.time()) > day['start']) & (DF.dates.apply(lambda x: x.time()) < day['end']))]

        # save to .csv with date and time in file name
        # specify the save path of your choice
        path_night = 'C:\\Users\\nickb\\Desktop\\stackoverflow\\{0}_{1}_NIGHT.csv'.format(year, month)
        path_day = 'C:\\Users\\nickb\\Desktop\\stackoverflow\\{0}_{1}_DAY.csv'.format(year, month)

        # some of the above NIGHT / DAY filtering will return no rows.
        # check for this, and only save if the dataframe contains rows
        if NIGHT.shape[0] > 0:
            NIGHT.to_csv(path_night, index=False)
        if DAY.shape[0] > 0:
            DAY.to_csv(path_day, index=False)
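As a possible simplification (a sketch, not tested against the full 80-column data, and with slightly different boundary inclusiveness than the masks above): once dates is set as the index, pandas' between_time can express the 6am-6pm split directly:

# alternative day/night split using a DatetimeIndex and between_time
indexed = DF.set_index('dates')
day_rows = indexed.between_time('06:00', '18:00')    # 6am to 6pm
night_rows = indexed.between_time('18:00', '06:00')  # 6pm to 6am (wraps past midnight)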