I am importing an Excel worksheet using pandas and trying to remove any instance where there is a duplicate area measurement for a given frame. The sheet I'm working with looks vaguely like the table below, in which there are n files, an area measured from each frame of an individual file, and the frame number that corresponds to each area measurement.
Filename.0          Area.0  Frame.0  Filename.1          Area.1  Frame.1  ...  Filename.n          Area.n  Frame.n
Exp327_Date_File_0  600     1        Exp327_Date_File_1  830     1        ...  Exp327_Date_File_n  700     1
Exp327_Date_File_0  270     2        Exp327_Date_File_1  730     1        ...  Exp327_Date_File_n  600     2
Exp327_Date_File_0  230     3        Exp327_Date_File_1  630     2        ...  Exp327_Date_File_n  500     3
Exp327_Date_File_0  200     4        Exp327_Date_File_1  530     3        ...  Exp327_Date_File_n  400     4
NaN                 NaN     NaN      Exp327_Date_File_1  430     4        ...  NaN                 NaN     NaN
If I manually go through the excel worksheet and concatenate the filenames into just 3 unique columns containing my entire dataset like so:
Filename            Area    Frame
Exp327_Date_File_0  600     1
Exp327_Date_File_0  270     2
...                 ...     ...
Exp327_Date_File_n  530     4
I have been able to successfully use pandas to remove the duplicates using the following:
df_1 = df.groupby(['Filename', 'Frame Number']).agg({'Area': 'sum'})
However, manually concatenating everything into this format isn't feasible when I have hundreds of File replicates and I will then have to separate everything back out into multiple column-sets (similar to how the data is presented in Table 1). How do I either (1) use pandas to create a new Dataframe with every 3 columns stacked on top of each other which I can then group and aggregate before breaking back up into individual sets of columns based on Filename or (2) loop through the multiple filenames and aggregate any Frames with multiple Areas? I have tried option 2:
(row, col) = df.shape  # shape of the data frame the excel file was read into
for count in range(0, round(col/3)):  # iterate through the data
    aggregation_functions = {'Area.' + str(count): 'sum'}  # add Areas together
    df_2.groupby(['Filename.' + str(count), 'Frame Number.' + str(count)]).agg(aggregation_functions)
However, this just returns the same DataFrame without any of the Areas summed together. Any help would be appreciated, and please let me know if my question is unclear.
Here is a way to achieve option (1):
import numpy as np
import pandas as pd
# sample data
df = pd.DataFrame({'Filename.0': ['Exp327_Date_File_0', 'Exp327_Date_File_0',
                                  'Exp327_Date_File_0', 'Exp327_Date_File_0',
                                  np.nan],
                   'Area.0': [600, 270, 230, 200, np.nan],
                   'Frame.0': [1, 2, 3, 4, np.nan],
                   'Filename.1': ['Exp327_Date_File_1', 'Exp327_Date_File_1',
                                  'Exp327_Date_File_1', 'Exp327_Date_File_1',
                                  'Exp327_Date_File_1'],
                   'Area.1': [830, 730, 630, 530, 430],
                   'Frame.1': [1, 1, 2, 3, 4],
                   'Filename.2': ['Exp327_Date_File_2', 'Exp327_Date_File_2',
                                  'Exp327_Date_File_2', 'Exp327_Date_File_2',
                                  'Exp327_Date_File_2'],
                   'Area.2': [700, 600, 500, 400, np.nan],
                   'Frame.2': [1, 2, 3, 4, np.nan]})
# create list of sub-dataframes, each with 3 columns, partitioning the original dataframe
subframes = [df.iloc[:, j:(j + 3)] for j in np.arange(len(df.columns), step=3)]
# set column names to the same values for each subframe
for subframe in subframes:
    subframe.columns = ['Filename', 'Area', 'Frame']
# concatenate the subframes
df_long = pd.concat(subframes)
df_long
Filename Area Frame
0 Exp327_Date_File_0 600.0 1.0
1 Exp327_Date_File_0 270.0 2.0
2 Exp327_Date_File_0 230.0 3.0
3 Exp327_Date_File_0 200.0 4.0
4 NaN NaN NaN
0 Exp327_Date_File_1 830.0 1.0
1 Exp327_Date_File_1 730.0 1.0
2 Exp327_Date_File_1 630.0 2.0
3 Exp327_Date_File_1 530.0 3.0
4 Exp327_Date_File_1 430.0 4.0
0 Exp327_Date_File_2 700.0 1.0
1 Exp327_Date_File_2 600.0 2.0
2 Exp327_Date_File_2 500.0 3.0
3 Exp327_Date_File_2 400.0 4.0
4 Exp327_Date_File_2 NaN NaN
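From here, the grouping and aggregation from the question can run on the long frame, and `pivot` can break the data back out into one column per file. A minimal sketch with a small stand-in sample of the long data (column names as above):

```python
import pandas as pd

# small stand-in for the long-format frame produced above
df_long = pd.DataFrame({
    'Filename': ['Exp327_Date_File_1'] * 5,
    'Area': [830, 730, 630, 530, 430],
    'Frame': [1, 1, 2, 3, 4],
})

# drop rows that were all-NaN padding, then sum Areas for duplicate
# (Filename, Frame) pairs
df_agg = (df_long.dropna(subset=['Filename'])
                 .groupby(['Filename', 'Frame'], as_index=False)['Area']
                 .sum())

# optionally break the data back out into one Area column per file
df_wide = df_agg.pivot(index='Frame', columns='Filename', values='Area')
```

Here the two measurements of frame 1 for Exp327_Date_File_1 collapse into a single row with Area 830 + 730 = 1560, and `pivot` then gives one Area column per filename, indexed by frame.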
I have 2 dataframes. rdf is the reference dataframe that defines the interval (Top and Bottom) to average over, and ldf holds the actual values, so the calculation runs on ldf over all of the depths inside each interval. rdf defines the top and bottom of each interval for a given ID number, and there are multiple intervals for each ID.
rdf is formatted as such:
ID  Top   Bottom
1   2010  3000
1   4300  4500
1   4550  5000
1   7100  7700
2   3200  4100
2   4120  4180
2   4300  5300
2   5500  5520
3   2300  2380
3   3200  4500
ldf is formatted as such:
ID  Depth(ft)  Value1  Value2  Value3
1   2000       45      .32     423
1   2000.5     43      .33     500
1   2001       40      .12     643
1   2001.5     28      .10     20
1   2002       40      .10     34
1   2002.5     23      .11     60
1   2003       34      .08     900
1   2003.5     54      .04     1002
2   2000       40      .28     560
2   2000       38      .25     654
...
3   2000       43      .30     343
I want to use rdf to define the top and bottom of each interval and calculate the average of Value1, Value2, and Value3 within it. I would also like a count documented as well (not all of the depths within an interval necessarily exist, so the count can be less than Bottom - Top). This will then modify rdf to make a new file:
new_rdf is formatted as such:
ID  Top   Bottom  avgValue1  avgValue2  avgValue3  ThicknessCount(ft)
1   2010  3000    54         .14        456        74
1   4300  4500    23         .18        632        124
1   4550  5000    34         .24        780        111
1   7100  7700    54         .19        932        322
2   3200  4100    52         .32        134        532
2   4120  4180    16         .11        111        32
2   4300  5300    63         .29        872        873
2   5500  5520    33         .27        1111       9
3   2300  2380    63         .13        1442       32
3   3200  4500    37         .14        1839       87
I've been going back and forth on the best way to do this. I tried mimicking this time series example: Sum set of values from pandas dataframe within certain time frame
but it doesn't seem translatable:
import pandas as pd

Top = rdf['Top']
Bottom = rdf['Bottom']
Depths = ldf['DEPTH']

def get_depths(x):
    n = ldf[(ldf['DEPTH'] > x['top']) & (ldf['DEPTH'] < x['bottom'])]
    return n['ID'].values[0], n['DEPTH'].sum()

test = pd.DataFrame({'top': Top, 'bottom': Bottom})
test[['ID', 'Value1']] = test.apply(lambda x: get_depths(x), 1).apply(pd.Series)
I get "TypeError: Invalid comparison between dtype=float64 and str"
And it works if I use the samples they made in the post, but it doesn't work with my data. I'm also hoping there's a simpler way to do this.
Edit # 2A:
Note:
The sample DataFrame below is not exactly the same as the one posted in the question.
Posting new code here that uses Top and Bottom from rdf to check DEPTH in ldf and calculate .mean() for each group using a for-loop. A range_key is created in rdf that is unique to each row, assuming that the DataFrame rdf does not have any duplicates.
# Import libraries
import numpy as np
import pandas as pd

# Create DataFrame
rdf = pd.DataFrame({
    'ID': [1,1,1,1,2,2,2,2,3,3],
    'Top': [2000,4300,4500,7100,3200,4120,4300,5500,2300,3200],
    'Bottom':[2500,4500,5000,7700,4100,4180,5300,5520,2380,4500]
})
ldf = pd.DataFrame({
    'ID': [1,1,1,1,1,1,1,1,2,2,3],
    'DEPTH': [2000,2000.5,2001,2001.5,4002,4002.5,5003,5003.5,2000,2000,2000],
    'Value1':[45,43,40,28,40,23,34,54,40,38,43],
    'Value2':[.32,.33,.12,.10,.10,.11,.08,.04,.28,.25,.30],
    'Value3':[423,500,643,20,34,60,900,1002,560,654,343]
})

# Create a key for merge later; '-1' marks depths outside every range
ldf['range_key'] = '-1'
rdf['range_key'] = np.linspace(1, rdf.shape[0], rdf.shape[0]).astype(int).astype(str)

# Flag each row for a range
for i in range(ldf.shape[0]):
    for j in range(rdf.shape[0]):
        d = ldf['DEPTH'][i]
        if (d >= rdf['Top'][j]) & (d <= rdf['Bottom'][j]):
            ldf.loc[i, 'range_key'] = rdf['range_key'][j]
            break

# Calculate mean for groups
ldf_mean = ldf.groupby(['ID','range_key']).mean().reset_index()
ldf_mean = ldf_mean.drop(['DEPTH'], axis=1)

# Merge into 'rdf'
new_rdf = rdf.merge(ldf_mean, on=['ID','range_key'], how='left')
new_rdf = new_rdf.drop(['range_key'], axis=1)
new_rdf
Output:
ID Top Bottom Value1 Value2 Value3
0 1 2000 2500 39.0 0.2175 396.5
1 1 4300 4500 NaN NaN NaN
2 1 4500 5000 NaN NaN NaN
3 1 7100 7700 NaN NaN NaN
4 2 3200 4100 NaN NaN NaN
5 2 4120 4180 NaN NaN NaN
6 2 4300 5300 NaN NaN NaN
7 2 5500 5520 NaN NaN NaN
8 3 2300 2380 NaN NaN NaN
9 3 3200 4500 NaN NaN NaN
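As an alternative to the nested for-loop above, the interval lookup can also be sketched as a merge-then-filter: join ldf to rdf on ID, keep rows whose DEPTH falls inside [Top, Bottom], and aggregate. Tiny stand-in frames are used below; note the intermediate `merged` frame grows with rows × intervals per ID, so this suits moderately sized data:

```python
import pandas as pd

# tiny stand-in data
rdf = pd.DataFrame({'ID': [1, 1], 'Top': [2000, 4300], 'Bottom': [2500, 4500]})
ldf = pd.DataFrame({'ID': [1, 1, 1],
                    'DEPTH': [2000, 2001, 4400],
                    'Value1': [45, 40, 30]})

# pair every depth reading with every interval of the same ID,
# then keep the pairs where DEPTH lies inside [Top, Bottom]
merged = ldf.merge(rdf, on='ID')
inside = merged[(merged['DEPTH'] >= merged['Top']) &
                (merged['DEPTH'] <= merged['Bottom'])]

# mean and count per interval, then attach the stats back onto rdf
stats = (inside.groupby(['ID', 'Top', 'Bottom'])
               .agg(avgValue1=('Value1', 'mean'), count=('Value1', 'count'))
               .reset_index())
new_rdf = rdf.merge(stats, on=['ID', 'Top', 'Bottom'], how='left')
```

Intervals that match no depths simply come back as NaN after the left merge, mirroring the NaN rows in the output above.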
Edit # 1:
Code below seems to work. An if-statement was added to the return in the code posted in the question above. Not sure if this is what you were looking to get; it calculates the .sum(). The first interval in rdf was lowered to match the data in ldf.
# Import libraries
import pandas as pd

# Create DataFrame
rdf = pd.DataFrame({
    'ID': [1,1,1,1,2,2,2,2,3,3],
    'Top': [2000,4300,4500,7100,3200,4120,4300,5500,2300,3200],
    'Bottom':[2500,4500,5000,7700,4100,4180,5300,5520,2380,4500]
})
ldf = pd.DataFrame({
    'ID': [1,1,1,1,1,1,1,1,2,2,3],
    'DEPTH': [2000,2000.5,2001,2001.5,2002,2002.5,2003,2003.5,2000,2000,2000],
    'Value1':[45,43,40,28,40,23,34,54,40,38,43],
    'Value2':[.32,.33,.12,.10,.10,.11,.08,.04,.28,.25,.30],
    'Value3':[423,500,643,20,34,60,900,1002,560,654,343]
})

##### Code from the question (copy-pasted here)
Top = rdf['Top']
Bottom = rdf['Bottom']
Depths = ldf['DEPTH']

def get_depths(x):
    n = ldf[(ldf['DEPTH'] > x['top']) & (ldf['DEPTH'] < x['bottom'])]
    if n.shape[0] > 0:
        return n['ID'].values[0], n['DEPTH'].sum()

test = pd.DataFrame({'top': Top, 'bottom': Bottom})
test[['ID','Value1']] = test.apply(lambda x: get_depths(x), 1).apply(pd.Series)
Output:
test
top bottom ID Value1
0 2000 2500 1.0 14014.0
1 4300 4500 NaN NaN
2 4500 5000 NaN NaN
3 7100 7700 NaN NaN
4 3200 4100 NaN NaN
5 4120 4180 NaN NaN
6 4300 5300 NaN NaN
7 5500 5520 NaN NaN
8 2300 2380 NaN NaN
9 3200 4500 NaN NaN
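If the intent was the mean of the value columns rather than the sum of DEPTH, the same apply-based pattern can be adjusted. A sketch with a small stand-in ldf (note it assumes inclusive interval bounds, unlike the strict < / > comparisons in the original):

```python
import pandas as pd

# small stand-in data
ldf = pd.DataFrame({'ID': [1, 1, 1, 1],
                    'DEPTH': [2000, 2000.5, 2001, 2001.5],
                    'Value1': [45, 43, 40, 28]})
rdf = pd.DataFrame({'top': [2000, 4300], 'bottom': [2500, 4500]})

def get_stats(x):
    # inclusive bounds; returns (mean of Value1, number of readings)
    n = ldf[(ldf['DEPTH'] >= x['top']) & (ldf['DEPTH'] <= x['bottom'])]
    if n.shape[0] > 0:
        return n['Value1'].mean(), n.shape[0]
    return float('nan'), 0

rdf[['avgValue1', 'count']] = rdf.apply(get_stats, axis=1).apply(pd.Series)
```

Returning an explicit (NaN, 0) for empty intervals keeps the expanded columns aligned, instead of the all-NaN rows produced when the function falls through without a return.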
Sample data and imports
import pandas as pd
import numpy as np
import random
# dfr
rdata = {'ID': [1, 1, 1, 1, 2, 2, 2, 2, 3, 3],
'Top': [2010, 4300, 4550, 7100, 3200, 4120, 4300, 5500, 2300, 3200],
'Bottom': [3000, 4500, 5000, 7700, 4100, 4180, 5300, 5520, 2380, 4500]}
dfr = pd.DataFrame(rdata)
# display(dfr.head())
ID Top Bottom
0 1 2010 3000
1 1 4300 4500
2 1 4550 5000
3 1 7100 7700
4 2 3200 4100
# df
np.random.seed(365)
random.seed(365)
rows = 10000
data = {'id': [random.choice([1, 2, 3]) for _ in range(rows)],
'depth': [np.random.randint(2000, 8000) for _ in range(rows)],
'v1': [np.random.randint(40, 50) for _ in range(rows)],
'v2': np.random.rand(rows),
'v3': [np.random.randint(20, 1000) for _ in range(rows)]}
df = pd.DataFrame(data)
df.sort_values(['id', 'depth'], inplace=True)
df.reset_index(drop=True, inplace=True)
# display(df.head())
id depth v1 v2 v3
0 1 2004 48 0.517014 292
1 1 2004 41 0.997347 859
2 1 2006 42 0.278217 851
3 1 2006 49 0.570363 32
4 1 2009 43 0.462985 409
Use each row of dfr to filter and extract stats from df
There are plenty of answers on SO dealing with "TypeError: Invalid comparison between dtype=float64 and str". The numeric columns need to be cleaned of any value that can't be converted to a numeric type.
This code deals with using one dataframe to filter and return metrics for another dataframe.
For each row in dfr:
- Filter df
- Aggregate the mean and count for v1, v2 and v3
- .T to transpose the mean and count rows to columns
- Convert to a numpy array
- Slice the array for the 3 means and append the array to v_mean
- Slice the array for the max count and append the value to counts (they could all be the same, if there are no NaNs in the data)
- Convert the list of arrays, v_mean, to a dataframe, and join it to dfr_new
- Add counts as a column in dfr_new
v_mean = list()
counts = list()
for idx, (i, t, b) in dfr.iterrows():  # iterate through each row of dfr
    # apply filters and get stats
    data = (df[['v1', 'v2', 'v3']][(df.id == i) & (df.depth >= t) & (df.depth <= b)]
            .agg(['mean', 'count']).T.to_numpy())
    v_mean.append(data[:, 0])  # get the 3 means
    # get the max of the 3 counts; each column has a count, and the counts
    # could be different if there are NaNs in the data
    counts.append(data[:, 1].max())

# copy dfr to dfr_new
dfr_new = dfr.copy()
# add stats values
dfr_new = dfr_new.join(pd.DataFrame(v_mean, columns=['v1_m', 'v2_m', 'v3_m']))
dfr_new['counts'] = counts
# display(dfr_new)
# display(dfr_new)
   ID   Top  Bottom       v1_m      v2_m        v3_m  counts
0 1 2010 3000 44.577491 0.496768 502.068266 542.0
1 1 4300 4500 44.555556 0.518066 530.968254 126.0
2 1 4550 5000 44.446281 0.538855 482.818182 242.0
3 1 7100 7700 44.348083 0.489983 506.681416 339.0
4 2 3200 4100 44.804040 0.487011 528.707071 495.0
5 2 4120 4180 45.096774 0.526687 520.967742 31.0
6 2 4300 5300 44.476980 0.529476 523.095764 543.0
7 2 5500 5520 46.000000 0.608876 430.500000 12.0
8 3 2300 2380 44.512195 0.456632 443.195122 41.0
9 3 3200 4500 44.554755 0.516616 501.841499 694.0
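An alternative sketch that avoids the per-row boolean filter is to build a pd.IntervalIndex per ID and let pd.cut bin the depths. This assumes the intervals within each ID do not overlap, as in the sample dfr; tiny stand-in frames in the same shape as dfr and df are used:

```python
import pandas as pd

# tiny stand-in data in the same shape as dfr / df above
dfr = pd.DataFrame({'ID': [1, 1], 'Top': [2000, 3000], 'Bottom': [2500, 3500]})
df = pd.DataFrame({'id': [1, 1, 1], 'depth': [2100, 3100, 4000],
                   'v1': [40, 50, 60]})

parts = []
for i, grp in dfr.groupby('ID'):
    # one interval per dfr row; both endpoints included
    bins = pd.IntervalIndex.from_arrays(grp['Top'], grp['Bottom'], closed='both')
    sub = df[df['id'] == i]
    # bin each depth into its interval and aggregate; observed=False keeps
    # empty intervals so the result rows line up with grp
    stats = (sub.groupby(pd.cut(sub['depth'], bins), observed=False)['v1']
                .agg(['mean', 'count']))
    out = grp.reset_index(drop=True)
    out[['v1_m', 'counts']] = stats.to_numpy()
    parts.append(out)

dfr_new = pd.concat(parts, ignore_index=True)
```

Depths that fall outside every interval (4000 here) bin to NaN and drop out of the groupby, matching the filter-based version above.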
I am trying to build a dataframe where the data is grabbed from multiple files. I have created an empty dataframe with the desired shape, but I am having trouble grabbing the data. I found this but when I concat, I am still getting NaN values.
Edit2: I changed the order of df creation and put concat inside the for loop, with the same result (for obvious reasons).
import pandas as pd
import os
import glob

def daily_country_framer():
    # create assignments
    country_source = r"C:\Users\USER\PycharmProjects\Corona Stats\Country Series"
    list_of_files = glob.glob(country_source + r"\*.csv")
    latest_file = max(list_of_files, key=os.path.getctime)
    last_frame = pd.read_csv(latest_file)
    date_list = []
    label_list = []
    # build date_list values
    for file in os.listdir(country_source):
        file = file.replace('.csv', '')
        date_list.append(file)
    # build country_list values
    for country in last_frame['Country']:
        label_list.append(country)
    # create dataframe for each file in folder
    for filename in os.listdir(country_source):
        filepath = os.path.join(country_source, filename)
        if not os.path.isfile(filepath):
            continue
        df1 = pd.read_csv(filepath)
        df = pd.DataFrame(index=label_list, columns=date_list)
        df1 = pd.concat([df])
        print(df1)
daily_country_framer()
Two sample dataframes: (notice the different shapes)
Country Confirmed Deaths Recovered
0 World 1595350 95455 353975
1 Afghanistan 484 15 32
2 Albania 409 23 165
3 Algeria 1666 235 347
4 Andorra 583 25 58
.. ... ... ... ...
180 Vietnam 255 0 128
181 West Bank and Gaza 263 1 44
182 Western Sahara 4 0 0
183 Zambia 39 1 24
184 Zimbabwe 11 3 0
[185 rows x 4 columns]
Country Confirmed Deaths Recovered
0 World 1691719 102525 376096
1 Afghanistan 521 15 32
2 Albania 416 23 182
3 Algeria 1761 256 405
4 Andorra 601 26 71
.. ... ... ... ...
181 West Bank and Gaza 267 2 45
182 Western Sahara 4 0 0
183 Yemen 1 0 0
184 Zambia 40 2 25
185 Zimbabwe 13 3 0
[186 rows x 4 columns]
Current output:
01-22-2020 01-23-2020 ... 04-09-2020 04-10-2020
World NaN NaN ... NaN NaN
Afghanistan NaN NaN ... NaN NaN
Albania NaN NaN ... NaN NaN
Algeria NaN NaN ... NaN NaN
Andorra NaN NaN ... NaN NaN
... ... ... ... ... ...
West Bank and Gaza NaN NaN ... NaN NaN
Western Sahara NaN NaN ... NaN NaN
Yemen NaN NaN ... NaN NaN
Zambia NaN NaN ... NaN NaN
Zimbabwe NaN NaN ... NaN NaN
[186 rows x 80 columns]
Desired output: the same countries × dates grid, but with each NaN replaced by the corresponding value from a target column, or by a list of all columns. For example, if ['Confirmed'] then 0, 1, 2, 3, 4; if all columns then [0,0,0], [1,0,0], [2,0,0].
Your code (with comments inline):
import pandas as pd
import os
import glob

def daily_country_framer():
    # create assignments
    country_source = r"C:\Users\USER\PycharmProjects\Corona Stats\Country Series"
    list_of_files = glob.glob(country_source + r"\*.csv")
    latest_file = max(list_of_files, key=os.path.getctime)
    last_frame = pd.read_csv(latest_file)
    date_list = []
    label_list = []
    # build date_list values
    for file in os.listdir(country_source):
        file = file.replace('.csv', '')
        date_list.append(file)
    # build country_list values
    for country in last_frame['Country']:  # == last_frame['Country'].tolist()
        label_list.append(country)
    # create dataframe for each file in folder
    for filename in os.listdir(country_source):
        filepath = os.path.join(country_source, filename)
        if not os.path.isfile(filepath):
            continue
        df1 = pd.read_csv(filepath)
        # you redefine df1 for every file in the loop. So if there
        # are 10 files, only the last one is actually used anywhere
        # outside this loop.
        df = pd.DataFrame(index=label_list, columns=date_list)
        df1 = pd.concat([df])
        # here you just redefined df1 again as the concatenation of the
        # empty dataframe you just created in the line above.
        print(df1)
daily_country_framer()
So hopefully that illuminates why you were getting the results you were getting. It was doing exactly what you asked it to do.
What you want to do is get a dictionary with dates as keys and the associated dataframe as values, then concatenate that. This can be quite expensive because of some quirks with how pandas does concatenation, but if you concatenate along axis=0, you should be fine.
A better way might be the following:
import pandas as pd
import os

def daily_country_framer(country_source):
    accumulator = {}
    # build date_list values
    for filename in os.listdir(country_source):
        date = filename.replace('.csv', '')
        filepath = os.path.join(country_source, filename)
        accumulator[date] = pd.read_csv(filepath)
    # now we have a dictionary of {date : data} -- perfect!
    df = pd.concat(accumulator)
    return df
daily_country_framer(r"C:\Users\USER\PycharmProjects\Corona Stats\Country Series")
Does that work?
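If the goal is the countries × dates grid from the question, the concatenated frame can then be pivoted on one metric. A sketch with a tiny stand-in for the accumulator dictionary (the real one comes from the CSV loop above):

```python
import pandas as pd

# stand-in for the {date: dataframe} accumulator built from the CSV files
accumulator = {
    '04-09-2020': pd.DataFrame({'Country': ['World', 'Albania'],
                                'Confirmed': [1595350, 409]}),
    '04-10-2020': pd.DataFrame({'Country': ['World', 'Albania'],
                                'Confirmed': [1691719, 416]}),
}

# concatenate with the dates as an index level, then pivot one metric
# into a countries x dates table
df = pd.concat(accumulator, names=['Date'])
wide = (df.reset_index(level='Date')
          .pivot(index='Country', columns='Date', values='Confirmed'))
```

Countries missing from a given day's file (like Yemen in the earlier sample) simply show up as NaN in that day's column after the pivot.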