Concat two arrays under a specific condition? - python

I need to concatenate two arrays of unequal size:
Array-1:
import pandas as pd

A = ["year", "month", "day", "hour", "minute", "second", "a", "b", "c", "d"]
data1 = pd.read_csv('event_5.txt', sep='\t', names=A)
array1 = data1[['year', 'month', 'day']]
Array-2:
B = ["station", "phase", "hour", "minute", "second"]
arr_data = pd.read_csv('arrival_5.txt', sep='\t', names=B)
ar_t = arr_data[['hour', 'minute', 'second']]
array2 = pd.DataFrame(ar_t)
The required output is shown below. Here, [2019 11 9] is array-1, reshaped to match the row count of the second array and then concatenated. However, reshaping means I have to check the dimensions of the second array every time, so I need an automated script that can concatenate the unequal arrays.
Array-1: The first array always has the same dimensions:
year month day
0 2019 11 9
Array-2: Variable dimensions; the columns are fixed but the rows change on each iteration:
hour minute second
0 14 57 41.80
1 14 58 3.47
2 14 57 25.99
3 14 57 37.00
4 14 57 29.86
5 14 57 40.24
6 14 57 32.61
7 14 57 42.26
8 14 57 29.74
9 14 57 42.36
10 14 57 46.00
11 14 58 8.69
12 14 57 34.50
13 14 57 48.97
14 14 57 30.30
15 14 57 39.78
16 14 57 32.45
17 14 57 47.83
18 14 57 25.86
19 14 57 36.30
20 14 57 17.90
21 14 57 23.40
22 14 57 34.64
23 14 57 50.95
24 14 57 35.90
25 14 57 50.64
Required output:
year month day hour minute second
0 2019 11 9 14 57 41.80
1 2019 11 9 14 58 3.47
2 2019 11 9 14 57 25.99
3 2019 11 9 14 57 37.00
4 2019 11 9 14 57 29.86
5 2019 11 9 14 57 40.24
6 2019 11 9 14 57 32.61
7 2019 11 9 14 57 42.26
8 2019 11 9 14 57 29.74
9 2019 11 9 14 57 42.36
10 2019 11 9 14 57 46.00
11 2019 11 9 14 58 8.69
12 2019 11 9 14 57 34.50
13 2019 11 9 14 57 48.97
14 2019 11 9 14 57 30.30
15 2019 11 9 14 57 39.78
16 2019 11 9 14 57 32.45
17 2019 11 9 14 57 47.83
18 2019 11 9 14 57 25.86
19 2019 11 9 14 57 36.30
20 2019 11 9 14 57 17.90
21 2019 11 9 14 57 23.40
22 2019 11 9 14 57 34.64
23 2019 11 9 14 57 50.95
24 2019 11 9 14 57 35.90
25 2019 11 9 14 57 50.64

Assigning a constant value to a DataFrame column
If your first array is always a single-row DataFrame, or a one-dimensional array, then you can just use pandas to assign a constant value to a column.
The syntax is my_dataframe["new_column"] = constant_value.
Because arr1 is a DataFrame, accessing a column gives us a Series. To get its constant value, we take the value in the cell indexed by 0, i.e., the first row.
In your case this becomes:
>>> type(arr1), type(arr2)
(pandas.core.frame.DataFrame, pandas.core.frame.DataFrame)
>>> arr2["year"] = arr1["year"].loc[0]
>>> arr2["month"] = arr1["month"].loc[0]
>>> arr2["day"] = arr1["day"].loc[0]
>>> arr2
hours minutes seconds year month day
0 9 6 22.001464 2019 11 9
1 8 21 28.412044 2019 11 9
2 10 7 22.433552 2019 11 9
3 18 37 19.551359 2019 11 9
4 19 1 40.722019 2019 11 9
.. ... ... ... ... ... ...
95 2 16 48.368643 2019 11 9
96 19 22 25.034936 2019 11 9
97 10 0 20.163870 2019 11 9
98 16 35 27.251357 2019 11 9
99 8 26 54.200897 2019 11 9
Remember that this works in-place, modifying the arr2 object.
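If you'd rather not repeat the assignment for each column, a small loop does the same thing (a sketch, not part of the original answer):
for col in ["year", "month", "day"]:
    arr2[col] = arr1[col].iloc[0]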
Accessing the numpy array behind the DataFrame
If you need the multidimensional array, you can just call:
>>> arr2_np = arr2.to_numpy()
Reordering columns based on your use-case
If you need to reorder the columns, you can just take a different view of them, like this:
>>> cols = arr2.columns.to_list()
>>> cols2 = cols[3:] + cols[:3]
>>> arr2[cols2]
year month day hours minutes seconds
0 2019 11 9 9 6 22.001464
1 2019 11 9 8 21 28.412044
2 2019 11 9 10 7 22.433552
3 2019 11 9 18 37 19.551359
4 2019 11 9 19 1 40.722019
.. ... ... ... ... ... ...
95 2019 11 9 2 16 48.368643
96 2019 11 9 19 22 25.034936
97 2019 11 9 10 0 20.163870
98 2019 11 9 16 35 27.251357
99 2019 11 9 8 26 54.200897

This worked for me:
import numpy as np

arr1 = [2019, 12, 17]
arr2 = [12, 34, 17,
        18, 17, 36,
        15, 23, 40]
print(arr1, arr2)
output:
[2019, 12, 17] [12, 34, 17, 18, 17, 36, 15, 23, 40]
arr2 = np.array(arr2).reshape((3, 3))
arr1 = np.array([arr1, ] * 3)
newArray = np.hstack((arr1, arr2))
output:
array([[2019, 12, 17, 12, 34, 17],
       [2019, 12, 17, 18, 17, 36],
       [2019, 12, 17, 15, 23, 40]])
Update: to increase performance for large datasets, simply stack only the new values once the array has been reshaped:
arr1 = [2019, 12, 17]
newEntry = [1, 2, 3]
nE = np.hstack((arr1, newEntry))
np.vstack((newArray, nE))
output:
array([[2019, 12, 17, 12, 34, 17],
       [2019, 12, 17, 18, 17, 36],
       [2019, 12, 17, 15, 23, 40],
       [2019, 12, 17,  1,  2,  3]])
Update: without knowing the exact dimensions, you can simply use:
np.array(arr2).reshape(-1, 3)
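Putting the pieces together for the case where the row count of arr2 is unknown, np.tile can repeat the date row to match (a sketch; np.tile is a standard alternative to the [arr1, ] * 3 trick above):
arr2 = np.array(arr2).reshape(-1, 3)          # row count inferred automatically
arr1_rep = np.tile(arr1, (arr2.shape[0], 1))  # repeat the date row once per row of arr2
newArray = np.hstack((arr1_rep, arr2))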

You can use numpy.column_stack:
np.column_stack((array_1, array_2))
a
#array([[0, 1, 2],
#       [3, 4, 5]])
b
#array([0, 1])
np.column_stack((a, b))
#array([[0, 1, 2, 0],
#       [3, 4, 5, 1]])

Related

Create bi-weekly and monthly labels with week numbers in pandas

I have a dataframe with profit values, IDs, and week values. It looks a little like this:
ID  Week  Profit
A   1     2
A   2     2
A   3     0
A   4     0
I want to create two new columns called "Bi-Weekly" and "Monthly": week 1 would be labeled 2, week 2 would also be labeled 2, week 3 would be labeled 4, week 4 would be labeled 4, and they would all be labeled month 1, so I could group by weekly, bi-weekly, or monthly profit as needed. Right now I've created two functions that work, but the weeks will go up to a year (52 weeks), so I was wondering if there's a more efficient way. My bi-weekly function is below.
def biweek(prof_calc):
    if (prof_calc['week']==2):
        return 2
    elif (prof_calc['week']==3):
        return 2
    elif (prof_calc['week']==4):
        return 4
    elif (prof_calc['week']==5):
        return 4
    elif (prof_calc['week']==6):
        return 6
    elif (prof_calc['week']==7):
        return 6
    elif (prof_calc['week']==8):
        return 8
    elif (prof_calc['week']==9):
        return 8
    elif (prof_calc['week']==10):
        return 10
    elif (prof_calc['week']==11):
        return 10

prof_calc['BiWeek'] = prof_calc.apply(biweek, axis=1)
IIUC, you could try:
df["Biweekly"] = (df["Week"]-1)//2+1
df["Monthly"] = (df["Week"]-1)//4+1
>>> df
ID Week Profit Biweekly Monthly
0 A 1 42 1 1
1 A 2 69 1 1
2 A 3 53 2 1
3 A 4 63 2 1
4 A 5 56 3 2
5 A 6 57 3 2
6 A 7 86 4 2
7 A 8 23 4 2
8 A 9 35 5 3
9 A 10 10 5 3
10 A 11 25 6 3
11 A 12 21 6 3
12 A 13 39 7 4
13 A 14 82 7 4
14 A 15 76 8 4
15 A 16 20 8 4
16 A 17 97 9 5
17 A 18 67 9 5
18 A 19 21 10 5
19 A 20 22 10 5
20 A 21 88 11 6
21 A 22 67 11 6
22 A 23 33 12 6
23 A 24 38 12 6
24 A 25 8 13 7
25 A 26 67 13 7
26 A 27 16 14 7
27 A 28 49 14 7
28 A 29 3 15 8
29 A 30 17 15 8
30 A 31 79 16 8
31 A 32 19 16 8
32 A 33 21 17 9
33 A 34 9 17 9
34 A 35 56 18 9
35 A 36 83 18 9
36 A 37 1 19 10
37 A 38 53 19 10
38 A 39 66 20 10
39 A 40 55 20 10
40 A 41 85 21 11
41 A 42 90 21 11
42 A 43 34 22 11
43 A 44 3 22 11
44 A 45 9 23 12
45 A 46 28 23 12
46 A 47 58 24 12
47 A 48 14 24 12
48 A 49 42 25 13
49 A 50 69 25 13
50 A 51 76 26 13
51 A 52 49 26 13
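Note that these formulas label the periods 1, 2, 3, and so on. If you want the exact labels described in the question (weeks 1 and 2 both labeled 2, weeks 3 and 4 labeled 4), a small variation of the same integer arithmetic works (a sketch, not part of the original answer):
df["BiWeek"] = (df["Week"] + 1) // 2 * 2  # weeks 1,2 -> 2; weeks 3,4 -> 4; ...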

Print list of lists in matrix format

I have a list of lists:
[[15, 16, 18, 19, 12, 11], [13, 19, 23, 21, 16, 12], [12, 15, 17, 19, 20, 10], [10, 14, 16, 13, 9, 6]]
The length of each list in the list is the same.
I want to print out as rows and columns such as:
15 16 18 19 12 11
13 19 23 21 16 12
12 15 17 19 20 10
10 14 16 13  9  6
I know I can do it by using
' '.join(map(str, lst))
but I want every integer to line up in the same column, e.g., the 9 should sit below the 0 of 20, and the 6 below the 0 of 10.
Given an input (list of lists) ll:
'\n'.join(' '.join('%2d' % x for x in l) for l in ll)
Result:
15 16 18 19 12 11
13 19 23 21 16 12
12 15 17 19 20 10
10 14 16 13  9  6
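The same idea reads a little more clearly with an f-string (a modern equivalent, not part of the original answer):
print('\n'.join(' '.join(f'{x:2d}' for x in row) for row in ll))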

Pandas: compute numerous columns of percentage values

I'm failing to loop through the values of select dataframe columns in order to create new columns representing percentage values. Reproducible example:
import pandas as pd

data = {'Respondents': [90, 43, 89, '89', '67', '88', '73', '78', '62', '101'],
        'answer_1': [51, 15, 15, 61, 16, 14, 15, 1, 0, 16],
        'answer_2': [11, 12, 14, 40, 36, 78, 12, 0, 26, 78],
        'answer_3': [3, 8, 4, 0, 2, 7, 10, 11, 6, 7]}
df = pd.DataFrame(data)
df
Respondents answer_1 answer_2 answer_3
0 90 51 11 3
1 43 15 12 8
2 89 15 14 4
3 89 61 40 0
4 67 16 36 2
5 88 14 78 7
6 73 15 12 10
7 78 1 0 11
8 62 0 26 6
9 101 16 78 7
The aim is to compute the percentage for each of the answer columns against the total respondents. For example, for the new answer_1 column – let's name it answer_1_perc – the first value would be about 57 (because 51 is 56.7% of 90), the next value would be about 35 (15 is 34.9% of 43). Then there would be answer_2_perc and answer_3_perc columns.
I've written so many iterations of the following code my head's spinning.
for columns in df.iloc[:, 1:4]:
    for i in columns:
        i_name = 'percentage_' + str(columns)
        i_group = ([i] / df['Respondents'] * 100)
        df[i_name] = i_group
What is the best way to do this? I need to use an iterative method as my actual data has 25 answer columns rather than the 3 shown in this example.
You almost had it; note that you have string values in the Respondents column, which I've corrected prior to running the following:
In [172]:
for col in df.columns[1:4]:
    i_name = 'percentage_' + col
    i_group = (df[col] / df['Respondents']) * 100
    df[i_name] = i_group
df
Out[172]:
Respondents answer_1 answer_2 answer_3 percentage_answer_1 \
0 90 51 11 3 56.666667
1 43 15 12 8 34.883721
2 89 15 14 4 16.853933
3 89 61 40 0 68.539326
4 67 16 36 2 23.880597
5 88 14 78 7 15.909091
6 73 15 12 10 20.547945
7 78 1 0 11 1.282051
8 62 0 26 6 0.000000
9 101 16 78 7 15.841584
percentage_answer_2 percentage_answer_3
0 12.222222 3.333333
1 27.906977 18.604651
2 15.730337 4.494382
3 44.943820 0.000000
4 53.731343 2.985075
5 88.636364 7.954545
6 16.438356 13.698630
7 0.000000 14.102564
8 41.935484 9.677419
9 77.227723 6.930693
I recommend using div and concat:
df['Respondents'] = df['Respondents'].astype(float)
df_pct = (df.drop('Respondents', axis=1)
            .div(df['Respondents'], axis=0)
            .mul(100)
            .rename(columns=lambda col: 'percentage_' + col)
          )
pd.concat([df, df_pct], axis=1)
Respondents answer_1 answer_2 answer_3 percentage_answer_1 \
0 90.0 51 11 3 56.666667
1 43.0 15 12 8 34.883721
2 89.0 15 14 4 16.853933
3 89.0 61 40 0 68.539326
4 67.0 16 36 2 23.880597
5 88.0 14 78 7 15.909091
6 73.0 15 12 10 20.547945
7 78.0 1 0 11 1.282051
8 62.0 0 26 6 0.000000
9 101.0 16 78 7 15.841584
percentage_answer_2 percentage_answer_3
0 12.222222 3.333333
1 27.906977 18.604651
2 15.730337 4.494382
3 44.943820 0.000000
4 53.731343 2.985075
5 88.636364 7.954545
6 16.438356 13.698630
7 0.000000 14.102564
8 41.935484 9.677419
9 77.227723 6.930693
Another solution: div the desired columns by the Respondents column, then assign to the new column names:
print ('percentage_' + df.columns[1:4])
Index(['percentage_answer_1', 'percentage_answer_2', 'percentage_answer_3'], dtype='object')
df['percentage_' + df.columns[1:4]] = df.iloc[:, 1:4].div(df.Respondents, axis=0) * 100
print (df)
Respondents answer_1 answer_2 answer_3 percentage_answer_1 \
0 90 51 11 3 56.666667
1 43 15 12 8 34.883721
2 89 15 14 4 16.853933
3 89 61 40 0 68.539326
4 67 16 36 2 23.880597
5 88 14 78 7 15.909091
6 73 15 12 10 20.547945
7 78 1 0 11 1.282051
8 62 0 26 6 0.000000
9 101 16 78 7 15.841584
percentage_answer_2 percentage_answer_3
0 12.222222 3.333333
1 27.906977 18.604651
2 15.730337 4.494382
3 44.943820 0.000000
4 53.731343 2.985075
5 88.636364 7.954545
6 16.438356 13.698630
7 0.000000 14.102564
8 41.935484 9.677419
9 77.227723 6.930693
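Since the real data has 25 answer columns, the loop generalizes by selecting the columns by prefix instead of by position (a sketch; assumes every answer column name starts with 'answer_'):
answer_cols = [c for c in df.columns if c.startswith('answer_')]
for c in answer_cols:
    df['percentage_' + c] = df[c] / df['Respondents'].astype(float) * 100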

Iterator produced by itertools.groupby() is consumed unexpectedly

I have written a small program based on iterators to display a multicolumn calendar.
In that code I am using itertools.groupby to group the dates by month by the function group_by_months(). There I yield the month name and the grouped dates as a list for every month. However, when I let that function directly return the grouped dates as an iterator (instead of a list) the program leaves the days of all but the last column blank.
I can't figure out why that might be. Am I using groupby wrong? Can anyone help me to spot the place where the iterator is consumed or its output is ignored? Why is it especially the last column that "survives"?
Here's the code:
import datetime
from itertools import zip_longest, groupby

def grouper(iterable, n, fillvalue=None):
    """\
    copied from the docs:
    https://docs.python.org/3.4/library/itertools.html#itertools-recipes
    """
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)

def generate_dates(start_date, end_date, step=datetime.timedelta(days=1)):
    while start_date < end_date:
        yield start_date
        start_date += step

def group_by_months(seq):
    for k, v in groupby(seq, key=lambda x: x.strftime("%B")):
        yield k, v  # Why does it only work when list(v) is yielded here?

def group_by_weeks(seq):
    yield from groupby(seq, key=lambda x: x.strftime("%2U"))

def format_month(month, dates_of_month):
    def format_week(weeknum, dates_of_week):
        def format_day(d):
            return d.strftime("%3e")
        weekdays = {d.weekday(): format_day(d) for d in dates_of_week}
        return "{0} {7} {1} {2} {3} {4} {5} {6}".format(
            weeknum, *[weekdays.get(i, " ") for i in range(7)])
    yield "{:^30}".format(month)
    weeks = group_by_weeks(dates_of_month)
    yield from map(lambda x: format_week(*x), weeks)

start, end = datetime.date(2016, 1, 1), datetime.date(2017, 1, 1)
dates = generate_dates(start, end)
months = group_by_months(dates)
formatted_months = map(lambda x: (format_month(*x)), months)
ncolumns = 3
quarters = grouper(formatted_months, ncolumns)
interleaved = map(lambda x: zip_longest(*x, fillvalue=" " * 30), quarters)
formatted = map(lambda x: "\n".join(map(" ".join, x)), interleaved)
list(map(print, formatted))
This is the failing output:
January February March
09 1 2 3 4 5
10 6 7 8 9 10 11 12
11 13 14 15 16 17 18 19
12 20 21 22 23 24 25 26
13 27 28 29 30 31
April May June
22 1 2 3 4
23 5 6 7 8 9 10 11
24 12 13 14 15 16 17 18
25 19 20 21 22 23 24 25
26 26 27 28 29 30
July August September
35 1 2 3
36 4 5 6 7 8 9 10
37 11 12 13 14 15 16 17
38 18 19 20 21 22 23 24
39 25 26 27 28 29 30
October November December
48 1 2 3
49 4 5 6 7 8 9 10
50 11 12 13 14 15 16 17
51 18 19 20 21 22 23 24
52 25 26 27 28 29 30 31
This is the expected output:
January February March
00 1 2 05 1 2 3 4 5 6 09 1 2 3 4 5
01 3 4 5 6 7 8 9 06 7 8 9 10 11 12 13 10 6 7 8 9 10 11 12
02 10 11 12 13 14 15 16 07 14 15 16 17 18 19 20 11 13 14 15 16 17 18 19
03 17 18 19 20 21 22 23 08 21 22 23 24 25 26 27 12 20 21 22 23 24 25 26
04 24 25 26 27 28 29 30 09 28 29 13 27 28 29 30 31
05 31
April May June
13 1 2 18 1 2 3 4 5 6 7 22 1 2 3 4
14 3 4 5 6 7 8 9 19 8 9 10 11 12 13 14 23 5 6 7 8 9 10 11
15 10 11 12 13 14 15 16 20 15 16 17 18 19 20 21 24 12 13 14 15 16 17 18
16 17 18 19 20 21 22 23 21 22 23 24 25 26 27 28 25 19 20 21 22 23 24 25
17 24 25 26 27 28 29 30 22 29 30 31 26 26 27 28 29 30
July August September
26 1 2 31 1 2 3 4 5 6 35 1 2 3
27 3 4 5 6 7 8 9 32 7 8 9 10 11 12 13 36 4 5 6 7 8 9 10
28 10 11 12 13 14 15 16 33 14 15 16 17 18 19 20 37 11 12 13 14 15 16 17
29 17 18 19 20 21 22 23 34 21 22 23 24 25 26 27 38 18 19 20 21 22 23 24
30 24 25 26 27 28 29 30 35 28 29 30 31 39 25 26 27 28 29 30
31 31
October November December
39 1 44 1 2 3 4 5 48 1 2 3
40 2 3 4 5 6 7 8 45 6 7 8 9 10 11 12 49 4 5 6 7 8 9 10
41 9 10 11 12 13 14 15 46 13 14 15 16 17 18 19 50 11 12 13 14 15 16 17
42 16 17 18 19 20 21 22 47 20 21 22 23 24 25 26 51 18 19 20 21 22 23 24
43 23 24 25 26 27 28 29 48 27 28 29 30 52 25 26 27 28 29 30 31
As the docs state (cf.):
when the groupby() object is advanced, the previous group is no longer visible. So, if that data is needed later, it should be stored as a list
That means the group iterators are already consumed by the time the code accesses them out of order, i.e., after the groupby object has been advanced past them. The iteration happens out of order because of the chunking and interleaving done here.
We observe this specific pattern (i.e., only the last column is fully displayed) because of the way we iterate. That is:
The month names for the heading line are printed. In doing so, the iterators for all months up to the last column are consumed (and their content discarded), because the groupby() object produces the last column's month name only after the earlier columns' data.
Then the first week line is printed. The already-exhausted iterators for the earlier columns are padded automatically with the fill value passed to zip_longest(); only the last column still provides actual data.
The same happens for each subsequent chunk of month names.
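In this program, the minimal fix is the one hinted at in the code comment: materialize each group into a list before the groupby object advances. A sketch of the changed function:
def group_by_months(seq):
    for k, v in groupby(seq, key=lambda x: x.strftime("%B")):
        yield k, list(v)  # store the group now; its iterator dies once groupby advances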

python - replace last n columns with sum of all files

I am a novice in Python.
I have 8 CSV files with 26 columns and 600 rows each. I want to take the last 4 columns of each CSV file (column 22 to column 25), sum them element-wise across the files, and replace those 4 columns in each file with the sums. For example (showing some random data here):
new-1.csv:
a b c d e f g h i j k
1 1 1 1 1 1 1 1 1 1 1
2 2 2 2 2 2 2 2 2 2 2
3 3 3 3 3 3 3 3 3 3 3
4 4 4 4 4 4 4 4 4 4 4
5 5 5 5 5 5 5 5 5 5 5
6 6 6 6 6 6 6 6 6 6 6
7 7 7 7 7 7 7 7 7 7 7
8 8 8 8 8 8 8 8 8 8 8
9 9 9 9 9 9 9 9 9 9 9
new-2.csv:
a b c d e f g h i j k
11 11 11 11 11 11 11 11 11 11 11
12 12 12 12 12 12 12 12 12 12 12
13 13 13 13 13 13 13 13 13 13 13
14 14 14 14 14 14 14 14 14 14 14
15 15 15 15 15 15 15 15 15 15 15
16 16 16 16 16 16 16 16 16 16 16
17 17 17 17 17 17 17 17 17 17 17
18 18 18 18 18 18 18 18 18 18 18
19 19 19 19 19 19 19 19 19 19 19
Now I want to sum each element of columns h, i, j, k from these 2 files, then replace the files' last 4 columns with the new sums.
Modified new-1.csv:
a b c d e f g h i j k
1 1 1 1 1 1 1 12 12 12 12
2 2 2 2 2 2 2 14 14 14 14
3 3 3 3 3 3 3 16 16 16 16
4 4 4 4 4 4 4 18 18 18 18
5 5 5 5 5 5 5 20 20 20 20
6 6 6 6 6 6 6 22 22 22 22
7 7 7 7 7 7 7 24 24 24 24
8 8 8 8 8 8 8 26 26 26 26
9 9 9 9 9 9 9 28 28 28 28
Modified new-2.csv:
a b c d e f g h i j k
11 11 11 11 11 11 11 12 12 12 12
12 12 12 12 12 12 12 14 14 14 14
13 13 13 13 13 13 13 16 16 16 16
14 14 14 14 14 14 14 18 18 18 18
15 15 15 15 15 15 15 20 20 20 20
16 16 16 16 16 16 16 22 22 22 22
17 17 17 17 17 17 17 24 24 24 24
18 18 18 18 18 18 18 26 26 26 26
19 19 19 19 19 19 19 28 28 28 28
I am assuming I should use pandas or NumPy for this, but I'm not sure how to do it. Any suggestions/hints would be appreciated.
You can do this by just using numpy.
import numpy as np

# list of all the files
file_list = ['foo.csv', 'bar.csv', 'baz.csv']  # all 8 files
col_names = ['a', 'b', 'c', 'd', 'e', 'f']  # all the names till z if necessary as the first row, else skip this

# initializing a numpy array to hold the sum of the last 4 columns
add_cols = np.zeros((600, 4))

# iterating over all .csv files
for file in file_list:
    # skiprows will skip the first row and usecols will get values in the last 4 cols
    temp = np.loadtxt(file, skiprows=1, delimiter=',', usecols=(22, 23, 24, 25))
    add_cols = np.add(temp, add_cols)

# now overwriting all the files again, substituting the last 4 columns with the sum
for file in file_list:
    # loading the content from file into temp
    temp = np.loadtxt(file, skiprows=1, delimiter=',')
    temp[:, [22, 23, 24, 25]] = add_cols
    # writing the column names first
    with open(file, 'w') as p:
        p.write(','.join(col_names) + '\n')
    # now appending the final values in temp to the file as csv
    with open(file, 'a') as p:
        np.savetxt(p, temp, delimiter=",", fmt="%i")
If your files are space-separated rather than comma-separated, remove the delimiter option from the loadtxt calls, since whitespace is the default delimiter, and join the header row with spaces accordingly.
After loading your csvs using read_csv, you can add the last 4 columns together and then overwrite them:
In [10]:
total = df[df.columns[-4:]].values + df1[df1.columns[-4:]].values
total
Out[10]:
array([[12, 12, 12, 12],
[14, 14, 14, 14],
[16, 16, 16, 16],
[18, 18, 18, 18],
[20, 20, 20, 20],
[22, 22, 22, 22],
[24, 24, 24, 24],
[26, 26, 26, 26],
[28, 28, 28, 28]], dtype=int64)
In [12]:
df[df.columns[-4:]] = total
df1[df1.columns[-4:]] = total
df
Out[12]:
a b c d e f g h i j k
0 1 1 1 1 1 1 1 12 12 12 12
1 2 2 2 2 2 2 2 14 14 14 14
2 3 3 3 3 3 3 3 16 16 16 16
3 4 4 4 4 4 4 4 18 18 18 18
4 5 5 5 5 5 5 5 20 20 20 20
5 6 6 6 6 6 6 6 22 22 22 22
6 7 7 7 7 7 7 7 24 24 24 24
7 8 8 8 8 8 8 8 26 26 26 26
8 9 9 9 9 9 9 9 28 28 28 28
In [13]:
df1
Out[13]:
a b c d e f g h i j k
0 11 11 11 11 11 11 11 12 12 12 12
1 12 12 12 12 12 12 12 14 14 14 14
2 13 13 13 13 13 13 13 16 16 16 16
3 14 14 14 14 14 14 14 18 18 18 18
4 15 15 15 15 15 15 15 20 20 20 20
5 16 16 16 16 16 16 16 22 22 22 22
6 17 17 17 17 17 17 17 24 24 24 24
7 18 18 18 18 18 18 18 26 26 26 26
8 19 19 19 19 19 19 19 28 28 28 28
We need to use the .values attribute here to return a NumPy array, because otherwise pandas will try to align on the index, which in this case does not align.
Once you overwrite them call df.to_csv(file_path) and df1.to_csv(file_path)
In the case of your 8 dfs you can loop over them and aggregate whilst looping:
# take a copy of the first df's last 4 columns
total = df_list[0]
total = total[total.columns[-4:]].values.copy()  # copy so += doesn't modify the first df
for df in df_list[1:]:
    total += df[df.columns[-4:]].values
Then just loop over your dfs again to overwrite:
for df in df_list:
    df[df.columns[-4:]] = total
And then write out again using to_csv.
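A minimal end-to-end sketch of that workflow (the file names, tab separator, and read_csv options here are assumptions to adjust to your data):
import pandas as pd

files = ['new-1.csv', 'new-2.csv']               # extend to all 8 files
dfs = [pd.read_csv(f, sep='\t') for f in files]  # assuming tab-separated files

total = sum(df[df.columns[-4:]].values for df in dfs)
for f, df in zip(files, dfs):
    df[df.columns[-4:]] = total
    df.to_csv(f, sep='\t', index=False)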
