Splitting, merging, sorting CSV - python

I have several CSV files containing measurements from several sensors
s1.CSV :
date;hour;source;values
01/25/12;10:20:00;a; 88 -84 27
01/25/12;10:30:00;a; -80
01/25/12;10:50:00;b; -96 3 -88
01/25/12;09:00:00;b; -97 101
01/25/12;09:10:00;c; 28
s2.CSV :
date;hour;source;values
01/25/12;10:20:00;a; 133
01/25/12;10:25:00;a; -8 -5
I'd like to create one CSV per source (a/b/c), with each sensor's measurements in a separate column, sorted by date and hour
a.CSV :
date;hour;source;s1;s2
01/25/12;10:20:00;a; 88 -84 27; 133
01/25/12;10:25:00;a; ; -8 -5
01/25/12;10:30:00;a; -80;
...
I'm stuck here:
import glob
import csv
import os

os.system('cls')

sources = dict()
sensor = 0
filelist = glob.glob("*.csv")
for f in filelist:
    reader = csv.DictReader(open(f), delimiter=";")
    for row in reader:
        # date = row['date'] # date later
        hour = row['hour']
        val = row['values']
        source = row['source']
        if not sources.has_key(source): # new source
            sources[source] = list()
        #
        sources[source].append({'hour': hour, 'sensor' + `sensor`: val})
    sensor += 1
I'm not sure this data structure is well suited for sorting. I also feel like I'm repeating the column names.

Using the data you provided, I cooked up something using Pandas. Please see the code below.
The output, granted, is non-ideal, as the hour and source get repeated within a column. As I am learning too, I'd also welcome any expert input on whether Pandas can do what the OP is asking for!
In [1]: import pandas as pd
In [2]: s1 = pd.read_csv('s1.csv', delimiter=';', parse_dates=True)
In [3]: s1
Out[3]:
date hour source values
0 01/25/12 10:20:00 a 88 -84 27
1 01/25/12 10:30:00 a -80
2 01/25/12 10:50:00 b -96 3 -88
3 01/25/12 09:00:00 b -97 101
4 01/25/12 09:10:00 c 28
In [4]: s2 = pd.read_csv('s2.csv', delimiter=';', parse_dates=True)
In [5]: s2
Out[5]:
date hour source values
0 01/25/12 10:20:00 a 133
1 01/25/12 10:25:00 a -8 -5
In [6]: joined = s1.append(s2)
In [7]: joined
Out[7]:
date hour source values
0 01/25/12 10:20:00 a 88 -84 27
1 01/25/12 10:30:00 a -80
2 01/25/12 10:50:00 b -96 3 -88
3 01/25/12 09:00:00 b -97 101
4 01/25/12 09:10:00 c 28
0 01/25/12 10:20:00 a 133
1 01/25/12 10:25:00 a -8 -5
In [8]: grouped = joined.groupby('hour').sum()
In [9]: grouped.to_csv('a.csv')
In [10]: grouped
Out[10]:
date source values
hour
09:00:00 01/25/12 b -97 101
09:10:00 01/25/12 c 28
10:20:00 01/25/1201/25/12 aa 88 -84 27 133
10:25:00 01/25/12 a -8 -5
10:30:00 01/25/12 a -80
10:50:00 01/25/12 b -96 3 -88
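For what it's worth, here is a rough sketch of my own (not part of the original answer, and only lightly tested in spirit) of how pandas could get closer to the per-source layout the question asks for, assuming the input files are named s1.csv, s2.csv, and so on: tag each frame with its sensor name, concatenate, reshape so each sensor becomes a column, then write one file per source.
import glob
import os
import pandas as pd

frames = []
for fn in glob.glob('s*.csv'):  # assumption: sensor files are named s1.csv, s2.csv, ...
    sensor = os.path.splitext(os.path.basename(fn))[0]
    frame = pd.read_csv(fn, delimiter=';')
    frame['sensor'] = sensor    # remember which file each row came from
    frames.append(frame)

combined = pd.concat(frames)
# one row per (date, hour, source), one column per sensor
wide = (combined.set_index(['date', 'hour', 'source', 'sensor'])['values']
                .unstack('sensor')
                .fillna('')
                .reset_index())

for source, group in wide.groupby('source'):
    group.sort_values(['date', 'hour']).to_csv('%s.csv' % source, sep=';', index=False)
Note that sorting on mm/dd/yy date strings is only reliable within a single year; converting the date column to real dates first (as the answer below does) is safer.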

If I understand correctly, you have multiple files, each corresponding to a given "sensor", with the identity of the sensor in the filename. You want to read the files, then write them out into separate files again, this time divided by "source", with the data from the different sensors combined onto the same rows.
Here's what I think you want to do:
Read the data in, and build a nested dictionary data structure, as follows:
The top level key would be the source (e.g. 'a').
The second level will be keyed by a (date, time) tuple.
The inner most level will be keyed by sensor, taken from the filename, and have the actual sensor readings as values.
You'd also want to keep track of all the sensors that have been seen.
To write the data out, you'd loop over the items of the outermost dictionary, creating a new output file for each one.
The rows of each file would be determined by sorting the keys of the next dictionary.
The last value of each row would be formed by concatenating the values of the innermost dict, filling in an empty string for any missing values.
Here's some code:
from collections import defaultdict
from datetime import datetime
import csv
import glob
import os

# data structure is data[source][date, time][sensor] = value, with "" as default value
data = defaultdict(lambda: defaultdict(lambda: defaultdict(str)))
sensors = []
filelist = glob.glob("*.csv")

# read old files
for fn in filelist:
    sensor = os.path.splitext(fn)[0]
    sensors.append(sensor)
    with open(fn, 'rb') as f:
        reader = csv.DictReader(f, delimiter=";")
        for row in reader:
            date = datetime.strptime(row['date'], '%m/%d/%y')
            data[row['source']][date, row['hour']][sensor] = row['values']

sensors.sort() # note, this may not give the best sort order
header = ['date', 'hour', 'source'] + sensors

for source, source_data in data.iteritems():
    fn = "{}.csv".format(source)
    with open(fn, 'wb') as f:
        writer = csv.writer(f, delimiter=";")
        writer.writerow(header)
        for (date, time), hour_data in sorted(source_data.items()):
            values = [hour_data[sensor] for sensor in sensors]
            writer.writerow([date.strftime('%m/%d/%y'), time, source] + values)
I only convert the date field to an internal type because otherwise sorting based on dates won't work correctly (dates in January 2013 would appear before those in February 2012). In the future, consider using ISO 8601 style date formatting, YYYY-MM-DD, which can be safely sorted as a string. The rest of the values are handled only as strings with no interpretation.
The code assumes that the sensor names can be ordered lexicographically. This is likely if you only have a few of them, e.g. s1 and s2. However, if you have an s10, it will be sorted ahead of s2. To solve this you'll need a "natural" sort, which is more complicated than I can solve here (but see this recent question for more info).
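For example, a minimal natural-sort key (my own sketch, not taken from the linked question) could look like this:
import re

def natural_key(name):
    # split "s10" into ['s', 10, ''] so numeric parts compare as numbers
    return [int(part) if part.isdigit() else part
            for part in re.split(r'(\d+)', name)]

sensors.sort(key=natural_key)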
One final warning: This solution may do bad things if you run it multiple times in the same folder. That's because the output files, e.g. a.csv, will be seen by glob.glob('*.csv') as input files when you run it again.
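One simple guard (a sketch, assuming the sensor files really are named s1.csv, s2.csv, ...) is to make the glob pattern more specific, so the generated single-letter files are never picked up:
# only match files like s1.csv, s2.csv, ..., not the generated a.csv / b.csv
filelist = glob.glob("s[0-9]*.csv")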

Related

Convert a pandas column of int to timestamp datatype when column name is not known

Hi, my use case is that I have dynamic names for the datetime field, and the date will be something like a Unix timestamp. Every time there could be a different column name, and there can be multiple date fields. How can I do this? For now, if I hardcode the column name, this works for me:
df['date'] = pandas.to_datetime(df['date'], unit='s')
but I am not sure how to handle dynamic names and multiple fields with pandas.
As suggested in my comment, you can try to convert all columns to datetimes and then just keep the one where the conversion succeeded:
import pandas as pd

# Inspired from #mozway, https://stackoverflow.com/a/75106101/15239951
def find_datetime_col(df):
    mask = df.astype(str).apply(pd.to_datetime, errors='coerce').notna()
    return mask.mean().idxmax()

col = find_datetime_col(df)
print(col)
# Output
Heading 1
Input dataframe:
>>> df
Heading 1 Heading 2 Heading 3 Heading 4
0 2023-01-01 34 12 34
1 2023-01-02 42 99 42
2 2023-01-03 42 99 42
3 2023-01-04 42 99 42
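If there can be several such fields, the same idea extends to returning every column where most values parse as datetimes. This is a sketch of mine (the 90% threshold is an arbitrary assumption), not part of the original answer:
import pandas as pd

def find_datetime_cols(df, threshold=0.9):
    # fraction of values per column that parse as a datetime
    score = df.astype(str).apply(pd.to_datetime, errors='coerce').notna().mean()
    return score[score >= threshold].index.tolist()

for col in find_datetime_cols(df):
    # for Unix-timestamp columns, pass unit='s' here, as in the question's hardcoded version
    df[col] = pd.to_datetime(df[col], errors='coerce')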

How to sum a column in python without calling it a dataframe

I have data that outputs into a csv file as:
url date id hits
a 2017-01-01 123 2
a 2017-01-01 123 2
b 2017-01-01 45 25
c 2017-01-01 123 5
d 2017-01-03 678 1
d 2017-01-03 678 7
and so on, where hits is the number of times the id value appears on a given day per url (i.e. the id number 123 appears 2 times on 2017-01-01 for url "a").
I need to create another column after hits, called "total_hits", that captures the total number of hits per day for a given url, date and id value. So the output would look like this:
url date id hits total_hits
a 2017-01-01 123 2 4
a 2017-01-01 123 2 4
b 2017-01-01 45 25 25
c 2017-01-01 123 5 5
d 2017-01-03 678 1 8
d 2017-01-03 678 7 8
If there are solutions to this without using pandas or numpy, that would be amazing.
Please help! Thanks in advance.
This is simple with a standard Python installation:
read & parse the file using line-by-line read & split
create a collections.defaultdict(int) to count the occurrences of each url/date/id triplet
add the info in an extra column
write back (I chose csv)
like this:
import collections, csv

d = collections.defaultdict(int)
rows = []
with open("input.csv") as f:
    title = next(f).split()  # skip title
    for line in f:
        toks = line.split()
        d[toks[0], toks[1], toks[2]] += int(toks[3])
        rows.append(toks)

# complete data
for row in rows:
    row.append(d[row[0], row[1], row[2]])
title.append("total_hits")

with open("out.csv", "w", newline="") as f:
    cw = csv.writer(f)
    cw.writerow(title)
    cw.writerows(rows)
here's the output file:
url,date,id,hits,total_hits
a,2017-01-01,123,2,4
a,2017-01-01,123,2,4
b,2017-01-01,45,25,25
c,2017-01-01,123,5,5
d,2017-01-03,678,1,8
d,2017-01-03,678,7,8

how to speed up dataframe analysis

I'm looping through a DataFrame of 200k rows. It's doing what I want but it takes hours. I'm not very sophisticated when it comes to all the ways you can join and manipulate DataFrames so I wonder if I'm doing this in a very inefficient way. It's quite simple, here's the code:
three_yr_gaps = []
for index, row in df.iterrows():
    three_yr_gaps.append(df[(df['GROUP_ID'] == row['GROUP_ID']) &
                            (df['BEG_DATE'] >= row['THREE_YEAR_AGO']) &
                            (df['END_DATE'] <= row['BEG_DATE'])]['GAP'].sum() + row['GAP'])
df['GAP_THREE'] = three_yr_gaps
The DF has a column called GAP that holds an integer value. The logic I'm employing to sum this number up is:
for each row get these columns from the dataframe:
those that match on the group id, and...
those that have a beginning date within the last 3 years of this rows start date, and...
those that have an ending date before this row's beginning date.
sum up those rows' GAP numbers, add this row's GAP number, then append the result to a list (one entry per row).
So is there a faster way to introduce this logic into some kind of automatic merge or join that could speed up this process?
PS.
I was asked for some clarification on input and output, so here's a constructed dataset to play with:
import pandas as pd
from dateutil import parser

df = pd.DataFrame(columns=['ID_NBR', 'GROUP_ID', 'BEG_DATE', 'END_DATE', 'THREE_YEAR_AGO', 'GAP'],
                  data=[['09', '185', parser.parse('2008-08-13'), parser.parse('2009-07-01'), parser.parse('2005-08-13'), 44],
                        ['10', '185', parser.parse('2009-08-04'), parser.parse('2010-01-18'), parser.parse('2006-08-04'), 35],
                        ['11', '185', parser.parse('2010-01-18'), parser.parse('2011-01-18'), parser.parse('2007-01-18'), 0],
                        ['12', '185', parser.parse('2014-09-04'), parser.parse('2015-09-04'), parser.parse('2011-09-04'), 0]])
and here's what I wrote at the top of the script, may help:
The purpose of this script is to extract gaps counts over the
last 3 year period. It uses gaps.sql as its source extract. this query
returns a DataFrame that looks like this:
ID_NBR GROUP_ID BEG_DATE END_DATE THREE_YEAR_AGO GAP
09 185 2008-08-13 2009-07-01 2005-08-13 44
10 185 2009-08-04 2010-01-18 2006-08-04 35
11 185 2010-01-18 2011-01-18 2007-01-18 0
12 185 2014-09-04 2015-09-04 2011-09-04 0
The python code then looks back at the previous 3 years (those
previous rows that have the same GROUP_ID, whose beginning dates
come after the current row's THREE_YEAR_AGO, and whose end dates come before
the current row's beginning date). Those rows' GAP values are added up and a new column is
made called GAP_THREE. What remains is this:
ID_NBR GROUP_ID BEG_DATE END_DATE THREE_YEAR_AGO GAP GAP_THREE
09 185 2008-08-13 2009-07-01 2005-08-13 44 44
10 185 2009-08-04 2010-01-18 2006-08-04 35 79
11 185 2010-01-18 2011-01-18 2007-01-18 0 79
12 185 2014-09-04 2015-09-04 2011-09-04 0 0
You'll notice that row id_nbr 11 has a value of 79 for the last 3 years, but id_nbr 12 has 0, because the last gap was 35 in 2009, which is more than 3 years before 12's beginning date in 2014.
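For what it's worth, here is one possible way to speed this up (my own sketch, not a tested answer from the thread, assuming the three date columns hold datetimes): instead of scanning the whole DataFrame once per row, group by GROUP_ID and broadcast the date comparisons within each group with NumPy, so every group is handled in one vectorised step.
import numpy as np
import pandas as pd

def gap_three_for_group(g):
    beg = g['BEG_DATE'].values
    end = g['END_DATE'].values
    ago = g['THREE_YEAR_AGO'].values
    gap = g['GAP'].values.astype(float)
    # in_window[i, j] is True when row j falls inside row i's three-year window
    in_window = (beg[None, :] >= ago[:, None]) & (end[None, :] <= beg[:, None])
    return pd.Series((in_window * gap).sum(axis=1) + gap, index=g.index)

pieces = [gap_three_for_group(g) for _, g in df.groupby('GROUP_ID')]
df['GAP_THREE'] = pd.concat(pieces)
This builds an n-by-n mask per group, so it assumes the individual groups are not huge; on the sample data above it reproduces the expected 44, 79, 79, 0.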

How to append multiple pandas DataFrames in a loop?

I've been banging my head on this Python problem for a while and am stuck. I am looping through several csv files and want one data frame that combines them so that the value column from each csv file becomes its own column (named after its pod), indexed by a common time_stamp.
There are 11 csv files that look like this data frame except for different value and pod number, but the time_stamp is the same for all the csvs.
data
pod time_stamp value
0 97 2016-02-22 3.048000
1 97 2016-02-29 23.622001
2 97 2016-03-07 13.970001
3 97 2016-03-14 6.604000
4 97 2016-03-21 NaN
And this is the for-loop that I have so far:
import glob
import pandas as pd

filenames = sorted(glob.glob('*.csv'))
new = []
for f in filenames:
    data = pd.read_csv(f)
    time_stamp = [pd.to_datetime(d) for d in time_stamp]
    new.append(data)
my_df = pd.DataFrame(new, columns=['pod','time_stamp','value'])
What I want is a data frame that looks like this where each column is the result of value from each of the csv files.
time_stamp 97 98 99 ...
2016-02-22 3.04800 4.20002 3.5500
2016-02-29 23.62201 24.7392 21.1110
2016-03-07 13.97001 11.0284 12.0000
But right now the output of my_df is very wrong and looks like this. Any ideas of where I went wrong?
0
0 pod time_stamp value 0 22 2016-...
1 pod time_stamp value 0 72 2016-...
2 pod time_stamp value 0 79 2016-0...
3 pod time_stamp value 0 86 2016-...
4 pod time_stamp value 0 87 2016-...
5 pod time_stamp value 0 88 2016-...
6 pod time_stamp value 0 90 2016-0...
7 pod time_stamp value 0 93 2016-0...
8 pod time_stamp value 0 95 2016-...
I'd recommend first concatenating all your dataframes together with pd.concat, and then doing one final pivot operation.
filenames = sorted(glob.glob('*.csv'))
new = [pd.read_csv(f, parse_dates=['time_stamp']) for f in filenames]
df = pd.concat(new) # omit axis argument since it is 0 by default
df = df.pivot(index='time_stamp', columns='pod')
Note that I'm forcing read_csv to parse time_stamp when loading the dataframe, so parsing after loading is no longer required.
MCVE
df
pod time_stamp value
0 97 2016-02-22 3.048000
1 97 2016-02-29 23.622001
2 97 2016-03-07 13.970001
3 97 2016-03-14 6.604000
4 97 2016-03-21 NaN
df.pivot(index='time_stamp', columns='pod')
value
pod 97
time_stamp
2016-02-22 3.048000
2016-02-29 23.622001
2016-03-07 13.970001
2016-03-14 6.604000
2016-03-21 NaN
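Since the desired output has the pod numbers themselves as the column headers, one small follow-up (my addition, not part of the answer above): passing values='value' to pivot keeps a single column level.
df = df.pivot(index='time_stamp', columns='pod', values='value')
This leaves time_stamp as the index and one column per pod (97, 98, 99, ...).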

Create empty csv file with pandas

I am iterating through a number of csv files and want to append the mean temperatures to a blank csv file. How do you create an empty csv file with pandas?
for EachMonth in MonthsInAnalysis:
    TheCurrentMonth = pd.read_csv('MonthlyDataSplit/Day/Day%s.csv' % EachMonth)
    MeanDailyTemperaturesForCurrentMonth = TheCurrentMonth.groupby('Day')['AirTemperature'].mean().reset_index(name='MeanDailyAirTemperature')
    with open('my_csv.csv', 'a') as f:
        df.to_csv(f, header=False)
So in the above code how do I create the my_csv.csv prior to the for loop?
Just a note: I know you can create a data frame and then save the data frame to csv, but I am interested in whether you can skip this step.
In terms of context, I have a set of monthly csv files, each with the structure shown under "Input as text" below. The Day column runs up to 30 days in each file. I would like to output a single csv file of mean daily temperatures that includes all the days for all the months.
My issue is that I don't know in advance which months are included in each analysis, hence I wanted a for loop driven by a list of those months to open the relevant csvs, calculate the mean temperature, and then save it all into one csv.
Input as text:
Unnamed: 0 AirTemperature AirHumidity SoilTemperature SoilMoisture LightIntensity WindSpeed Year Month Day Hour Minute Second TimeStamp MonthCategorical TimeOfDay
6 6 18 84 17 41 40 4 2016 1 1 6 1 1 10106 January Day
7 7 20 88 22 92 31 0 2016 1 1 7 1 1 10107 January Day
8 8 23 1 22 59 3 0 2016 1 1 8 1 1 10108 January Day
9 9 23 3 22 72 41 4 2016 1 1 9 1 1 10109 January Day
10 10 24 63 23 83 85 0 2016 1 1 10 1 1 10110 January Day
11 11 29 73 27 50 1 4 2016 1 1 11 1 1 10111 January Day
Just open the file in write mode to create it.
with open('my_csv.csv', 'w'):
    pass
Anyway, I do not think you should be opening and closing the file so many times. You'd better open the file once and write to it several times.
with open('my_csv.csv', 'w') as f:
    for EachMonth in MonthsInAnalysis:
        TheCurrentMonth = pd.read_csv('MonthlyDataSplit/Day/Day%s.csv' % EachMonth)
        MeanDailyTemperaturesForCurrentMonth = TheCurrentMonth.groupby('Day')['AirTemperature'].mean().reset_index(name='MeanDailyAirTemperature')
        MeanDailyTemperaturesForCurrentMonth.to_csv(f, header=False)
Creating a blank csv file is as simple as this:
import pandas as pd
pd.DataFrame({}).to_csv("filename.csv")
I would do it this way: first read all your CSV files (but only the columns that you really need) into one DF, then do groupby(['Year','Month','Day']).mean() and save the resulting DF into a CSV file:
import glob
import pandas as pd
fmask = 'MonthlyDataSplit/Day/Day*.csv'
df = pd.concat((pd.read_csv(f, sep=',', usecols=['Year','Month','Day','AirTemperature']) for f in glob.glob(fmask)))
df.groupby(['Year','Month','Day']).mean().to_csv('my_csv.csv')
and if you want to ignore the year:
import glob
import pandas as pd
fmask = 'MonthlyDataSplit/Day/Day*.csv'
df = pd.concat((pd.read_csv(f, sep=',', usecols=['Month','Day','AirTemperature']) for f in glob.glob(fmask)))
df.groupby(['Month','Day']).mean().to_csv('my_csv.csv')
Some details:
(pd.read_csv(f, sep=',', usecols=['Month','Day','AirTemperature']) for f in glob.glob('*.csv'))
will lazily generate one data frame per CSV file (it is a generator expression)
pd.concat(...)
will concatenate them into resulting single DF
df.groupby(['Year','Month','Day']).mean()
will produce the wanted report as a data frame, which can then be saved into a new CSV file:
.to_csv('my_csv.csv')
The problem is a little unclear, but assuming you have to iterate month by month and apply the groupby as stated, just use:
#Before loops
dflist=[]
Then in each loop do something like:
dflist.append(MeanDailyTemperaturesForCurrentMonth)
Then at the end:
final_df = pd.concat(dflist, axis=1)
and this will join everything into one dataframe.
Look at:
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.concat.html
http://pandas.pydata.org/pandas-docs/stable/merging.html
You could do this to create an empty CSV and add columns without an index column as well.
import pandas as pd
pd.DataFrame(columns=["Col1", "Col2", "Col3"]).to_csv("filename.csv", index=False)
