Copying existing columns as moving averages to a dataframe - python

I think I am overthinking this. I am trying to copy existing pandas DataFrame columns and compute rolling averages without overwriting the original data. I am iterating over the columns, taking each column's values, and creating a rolling 7-day moving average as a new column with the suffix _ma, kept alongside the original. I then want to compare the existing data to the 7-day MA and see how many standard deviations the data is from it (which I can figure out); I am just trying to save the MA data as a new data frame.
I have
for column in original_data[ma_columns]:
    ma_df = pd.DataFrame(original_data[ma_columns].rolling(window=7).mean(),
                         columns=str(column) + '_ma')
and I get the error: Index(...) must be called with a collection of some kind, 'Carrier_AcctPswd_ma' was passed
But if I am iterating with
for column in original_data[ma_columns]:
    print('Column Name : ', str(column) + '_ma')
    print('Contents : ', original_data[ma_columns].rolling(window=7).mean())
I get the data I need:
My issue is just saving this as a new data frame, which I can concatenate to the old, and then do my analysis.
EDIT
I have now been able to make a bunch of data frames, but I want to concatenate them together and this is where the issue is:
for column in original_data[ma_columns]:
    MA_data = pd.DataFrame(original_data[column].rolling(window=7).mean())
    for i in MA_data:
        new = pd.concat(i)
        print(i)
<ipython-input-75-7c5e5fa775b3> in <module>
17 # print(type(MA_data))
18 for i in MA_data:
---> 19 new = pd.concat(i)
20 print(i)
21
~\Anaconda3\lib\site-packages\pandas\core\reshape\concat.py in concat(objs, axis, join, ignore_index, keys, levels, names, verify_integrity, sort, copy)
279 verify_integrity=verify_integrity,
280 copy=copy,
--> 281 sort=sort,
282 )
283
~\Anaconda3\lib\site-packages\pandas\core\reshape\concat.py in __init__(self, objs, axis, join, keys, levels, names, ignore_index, verify_integrity, copy, sort)
307 "first argument must be an iterable of pandas "
308 "objects, you passed an object of type "
--> 309 '"{name}"'.format(name=type(objs).__name__)
310 )
311
TypeError: first argument must be an iterable of pandas objects, you passed an object of type "str"

You should iterate over column names and assign the resulting pandas series as a new named column, for example:
import pandas as pd

original_data = pd.DataFrame({'A': range(100), 'B': range(100, 200)})
ma_columns = ['A', 'B']

for column in ma_columns:
    new_column = column + '_ma'
    original_data[new_column] = original_data[column].rolling(window=7).mean()

print(original_data)
Output dataframe:
A B A_ma B_ma
0 0 100 NaN NaN
1 1 101 NaN NaN
2 2 102 NaN NaN
3 3 103 NaN NaN
4 4 104 NaN NaN
.. .. ... ... ...
95 95 195 92.0 192.0
96 96 196 93.0 193.0
97 97 197 94.0 194.0
98 98 198 95.0 195.0
99 99 199 96.0 196.0
[100 rows x 4 columns]
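From there, the comparison described in the question (how many standard deviations each value sits from its 7-day MA) can be done in the same loop. A minimal sketch, assuming a rolling standard deviation over the same 7-day window is the intended yardstick:
for column in ma_columns:
    rolling = original_data[column].rolling(window=7)
    # z-score of each observation against its own 7-day window
    original_data[column + '_zscore'] = (
        (original_data[column] - rolling.mean()) / rolling.std()
    )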

Related

Iterate over specific rows, sum results and store in new row

I have a DataFrame in which I have already defined rows to be summed up and store the results in a new row.
For example in Year 1990:
Category    A    B    C    D  Year
E         147   78  476  531  1990
F         914  356  337  781  1990
G         117  874   15   69  1990
H          45  682  247   65  1990
I          20  255  465   19  1990
Here, the rows G and H should be summed up and the results stored in a new row. The same categories repeat every year from 1990 to 2019.
I have already tried it with .iloc, e.g. [4:8], [50:54], [96:100] and so on, but with iloc I cannot specify multiple ranges at once, and I can't manage to write a loop over the individual years.
Is there a way to sum the values in categories G and H for each year (1990-2019)?
I'm not sure what you mean by multiple index; a MultiIndex usually appears after a group-and-aggregate operation, and your table just looks like it has multiple columns.
So, if I understand correctly, here is complete code showing how to combine multiple conditions on a DataFrame:
import io
import pandas as pd

data = """Category A B C D Year
E 147 78 476 531 1990
F 914 356 337 781 1990
G 117 874 15 69 1990
H 45 682 247 65 1990
I 20 255 465 19 1990"""

table = pd.read_csv(io.StringIO(data), sep=r"\s+")
years = table["Year"].unique()
for year in years:
    row = table[((table["Category"] == "G") | (table["Category"] == "H")) & (table["Year"] == year)]
    row = row[["A", "B", "C", "D"]].sum()
    row["Category"], row["Year"] = "sum", year
    # DataFrame.append was removed in pandas 2.0; concatenate a one-row frame instead
    table = pd.concat([table, row.to_frame().T], ignore_index=True)
If you are only interested in G/H, you can slice with isin combined with boolean indexing, then sum:
df[df['Category'].isin(['G', 'H'])].sum()
output:
Category GH
A 162
B 1556
C 262
D 134
Year 3980
dtype: object
NB. note here the side effect of sum that combines the two "G"/"H" strings into one "GH".
Or, better, set Category as index and slice with loc:
df.set_index('Category').loc[['G', 'H']].sum()
output:
A 162
B 1556
C 262
D 134
Year 3980
dtype: int64
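To get one G+H subtotal per year across 1990-2019 rather than a single grand total, a short groupby sketch (continuing with the same df, and assuming the full table carries a Year column as shown):
subtotals = (df[df['Category'].isin(['G', 'H'])]
             .groupby('Year')[['A', 'B', 'C', 'D']]
             .sum())
print(subtotals)  # one row per year with the G+H sums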

Can't extract first column Pandas series

I have a series object with the following shape:
H AB HBP SF G 2B 3B HR BAVG
playerID
ruthba01 2873 8398 43.0 0.0 2503 506 136 714 0.342105
willite01 2654 7706 39.0 20.0 2292 525 71 521 0.344407
gehrilo01 2721 8001 45.0 0.0 2164 534 163 493 0.340082
hornsro01 2930 8173 48.0 0.0 2259 541 169 301 0.358497
I am trying to extract the first column (playerID) and convert it to a list. However, list = df['playerID'] gives a KeyError, and iloc[0] returns the H column. FYI it's a series object.
Thanks
You have playerID in the index:
df = df.reset_index()
Then you can call your .iloc and df['playerID'].
Or, without resetting the index:
l = df.index.tolist()
If that's the index, you just need df.index, I believe.
Looks like the first column is set as the index. To turn it into a list, use df.index.tolist().
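A minimal sketch tying these suggestions together (a toy frame mimicking the printout, where playerID is the index rather than a column):
import pandas as pd

df = pd.DataFrame({'H': [2873, 2654], 'AB': [8398, 7706]},
                  index=pd.Index(['ruthba01', 'willite01'], name='playerID'))

ids = df.index.tolist()        # read the IDs straight from the index
df = df.reset_index()          # or promote the index to a regular column
ids = df['playerID'].tolist()
print(ids)                     # ['ruthba01', 'willite01']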

Interpolate and match missing values between two dataframes of different dimensions

I'm new to pandas and python in general.
Currently I'm trying to interpolate and make the coordinates of two different dataframes match. The data comes from two different GEOTIFF files from the same source, one being temperature and the other being radiation. The file was converted to pandas with georasters.
The radiation dataframe has more points and data; I want to upscale the temperature dataframe so it has the same coordinates as the former.
Radiation Dataframe:
   row   col  value        x        y
0  197  2427  5.755 -83.9325  17.5075
1  197  2428  5.755 -83.9300  17.5075
2  197  2429  5.755 -83.9275  17.5075
3  197  2430  5.755 -83.9250  17.5075
4  197  2431  5.755 -83.9225  17.5075
1850011 rows × 5 columns
Temperature Dataframe:
   row  col  value        x        y
0   59  725   26.8 -83.9583  17.5083
1   59  726   26.8 -83.9500  17.5083
2   59  727   26.8 -83.9417  17.5083
3   59  728   26.8 -83.9333  17.5083
4   59  729   26.8 -83.9250  17.5083
167791 rows × 5 columns
Source of data
"Gis data - LTAym_AvgDailyTotals (GeoTIFF)"
Temperature Map
Radiation (GHI) Map
To change the values in a column, you can use iloc. Below I take the fourth column from the left (index 3, which is column x), assign your values to it, and then print the result.
import pandas as pd

Radiation = {'row': ["197", "197", "197", "197", "197"],
             'col': ["2427", "2428", "2429", "2430", "2431"],
             'value': ['5.755', '5.755', '5.755', '5.755', '5.755'],
             'x': ['-83.9325', '-83.93', '-83.9275', '-83.925', '-83.9225'],
             'y': ['17.5075', '17.5075', '17.5075', '17.5075', '17.5075']}

Temperature = {'row': ["59", "59", "59", "59", "59"],
               'col': ["725", "726", "727", "728", "729"],
               'value': ["26.8", "26.8", "26.8", "26.8", "26.8"],
               'x': ["-83.9583", "-83.95", "-83.9417", "-83.9333", "-83.925"],
               'y': ["17.5083", "17.5083", "17.5083", "17.5083", "17.5083"]}

df1 = pd.DataFrame(Radiation)
df2 = pd.DataFrame(Temperature)

df1.iloc[4:, 3] = '1850011'
df2.iloc[4:, 3] = '167791'

Comparison = df1.compare(df2, keep_shape=True, keep_equal=True)
print(df1)
print(df2)
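If the actual goal is to resample the coarser temperature grid onto the radiation coordinates, one possible approach is scipy.interpolate.griddata. A sketch, assuming both frames hold numeric x, y and value columns as in the question (the tiny frames below are hypothetical stand-ins for the GeoTIFF-derived data):
import pandas as pd
from scipy.interpolate import griddata

temperature = pd.DataFrame({'x': [-83.9583, -83.95, -83.9417, -83.9333, -83.925],
                            'y': [17.5083] * 5,
                            'value': [26.8] * 5})
radiation = pd.DataFrame({'x': [-83.9325, -83.93, -83.9275, -83.925, -83.9225],
                          'y': [17.5075] * 5,
                          'value': [5.755] * 5})

# Interpolate temperature values onto the denser radiation coordinates.
# method='nearest' is used here because this toy sample is a single grid row;
# 'linear' would be the usual choice on the full 2-D grids.
radiation['temperature'] = griddata(temperature[['x', 'y']].to_numpy(),
                                    temperature['value'].to_numpy(),
                                    radiation[['x', 'y']].to_numpy(),
                                    method='nearest')
print(radiation)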

Reading a pandas data frame having unequal columns in observations

I am trying to read this small data file,
Link - https://drive.google.com/open?id=1nAS5mpxQLVQn9s_aAKvJt8tWPrP_DUiJ
I am using the code -
df = pd.read_table('/Data/123451_date.csv', sep=';', index_col=0, engine='python', error_bad_lines=False)
It has ';' as a separator, and values are missing for some columns in some observations (or rows).
How can I read it properly? The current dataframe I see is not loaded properly.
It looks like the data you use has some garbage in it. Precisely, rows 1-33 (inclusive) have additional, unnecessary (non-GPS) information included. You can either fix the file by manually removing the unneeded information, or use the following code snippet to skip the rows that include it:
from pandas import read_table

data = read_table('34_2017-02-06.gpx.csv', sep=';',
                  skiprows=list(range(1, 34))).drop("Unnamed: 28", axis=1)
The drop("Unnamed: 28", axis=1) is simply there to remove an extra column that is created, probably because each row in your datasheet ends with a ';' (the empty field at the end of each line is read as data).
The result of print(data.head()) is then as follows:
index cumdist ele ... esttotalpower lat lon
0 49 340 -34.8 ... 9 52.077362 5.114530
1 51 350 -34.8 ... 17 52.077468 5.114543
2 52 360 -35.0 ... -54 52.077521 5.114551
3 53 370 -35.0 ... -173 52.077603 5.114505
4 54 380 -34.8 ... 335 52.077677 5.114387
[5 rows x 28 columns]
To explain the role of the drop command even more, here is what would happen without it (notice the last, weird column)
index cumdist ele ... lat lon Unnamed: 28
0 49 340 -34.8 ... 52.077362 5.114530 NaN
1 51 350 -34.8 ... 52.077468 5.114543 NaN
2 52 360 -35.0 ... 52.077521 5.114551 NaN
3 53 370 -35.0 ... 52.077603 5.114505 NaN
4 54 380 -34.8 ... 52.077677 5.114387 NaN
[5 rows x 29 columns]

Join dataframe with matrix output using pandas

I am trying to translate the input dataframe (inp_df) into the output dataframe (out_df) using data from the cell-based intermediate dataframe (matrix_df), as shown below.
There are several cell-number-based files with distance values, shown as matrix_df.
The program iterates by cell and fetches data from the appropriate file, so each time matrix_df holds the data for all rows of the current cell# being iterated over in inp_df.
inp_df
A B cell
100 200 1
115 270 1
145 255 2
115 266 1
matrix_df (cell_1.csv)
B 100 115 199 avg_distance
200 7.5 80.7 67.8 52
270 6.8 53 92 50
266 58 84 31 57
matrix_df (cell_2.csv)
B 145 121 166 avg_distance
255 74.9 77.53 8 53.47
out_df dataframe
A B cell distance avg_distance
100 200 1 7.5 52
115 270 1 53 50
145 255 2 74.9 53.47
115 266 1 84 57
My current thought process for each cell#-based chunk of data is:
use an apply function to go row by row,
then use a join on column B between inp_df and matrix_df, where the matrix df is somehow translated into a tuple of column name, distance & average distance.
But I am looking for an idiomatic pandas way of doing this, since my approach will slow down when there are millions of rows in the input. I am specifically looking for the core logic inside an iteration to fetch the matches, since the number of columns in matrix_df varies per cell.
If it's any help, the matrix files are the distance outputs from sklearn.metrics.pairwise.pairwise_distances.
NB: In inp_df the values of column B are unique, and the values of column A may or may not be unique.
Also, matrix_df's first column was empty, and I renamed it with the following code for ease of understanding, since it was a header-less matrix output file:
dist_df = pd.read_csv(mypath, index_col=False)
dist_df.rename(columns={'Unnamed: 0': 'B'}, inplace=True)
Step 1: Concatenate your inputs with pd.concat and merge with inp_df using df.merge
In [641]: out_df = pd.concat([matrix_df1, matrix_df2]).merge(inp_df)
Step 2: Create the distance column with df.apply by using A's values to index into the correct column
In [642]: out_df.assign(distance=out_df.apply(lambda x: x[str(int(x['A']))], axis=1))\
[['A', 'B', 'cell', 'distance', 'avg_distance']]
Out[642]:
A B cell distance avg_distance
0 100 200 1 7.5 52.00
1 115 270 1 53.0 50.00
2 115 266 1 84.0 57.00
3 145 255 2 74.9 53.47
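Putting the two steps together, a self-contained sketch (the matrix_df1/matrix_df2 frames below are hypothetical reconstructions of cell_1.csv and cell_2.csv from the question):
import pandas as pd

inp_df = pd.DataFrame({'A': [100, 115, 145, 115],
                       'B': [200, 270, 255, 266],
                       'cell': [1, 1, 2, 1]})
matrix_df1 = pd.DataFrame({'B': [200, 270, 266],
                           '100': [7.5, 6.8, 58.0],
                           '115': [80.7, 53.0, 84.0],
                           '199': [67.8, 92.0, 31.0],
                           'avg_distance': [52.0, 50.0, 57.0]})
matrix_df2 = pd.DataFrame({'B': [255],
                           '145': [74.9], '121': [77.53], '166': [8.0],
                           'avg_distance': [53.47]})

# Step 1: stack the per-cell matrices and merge with inp_df on B.
out_df = pd.concat([matrix_df1, matrix_df2]).merge(inp_df)

# Step 2: per row, pick the distance column whose name matches A.
out_df['distance'] = out_df.apply(lambda x: x[str(int(x['A']))], axis=1)
print(out_df[['A', 'B', 'cell', 'distance', 'avg_distance']])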
