Making columns from a range of lines in pandas - python

I have a csv file like the below:
A, B
1,2
3,4
5,6

C,D
7,8
9,10
11,12

E,F
13,14
15,16
As you can see, when I import this data using pd.read_csv, pandas reads the whole thing as two columns (A, B) and one long run of rows, which is correct given the file's shape. However, I want to create separate columns (A, B, C, D, ...). Fortunately, there is a blank line at the end of each "column" block, and I think this could be used to separate these groups of lines somehow, but I don't know how to proceed.
The data:
https://raw.githubusercontent.com/AlessandroMDO/Dinamica_de_Voo/master/data.csv

This is normal behavior of pandas.read_csv, but data is usually not stored in csv files this way.
You can read the file, strip extra whitespace, and split it into parts on the empty lines first. Then read each part using pandas.read_csv and StringIO, and concatenate them together using pandas.concat.
import pandas as pd
from io import StringIO

with open('test.csv', 'r') as f:
    parts = f.read().strip().split('\n\n')

df = pd.concat([pd.read_csv(StringIO(part)) for part in parts], axis=1)
I have tried this with your csv:
Alpha Cd Alpha CL Alpha ... Cnp Alpha Cnr Alpha Clr
0 -14.0 0.08941 -14.0 -0.19430 -14.0 ... 0.0 -14.0 0.0 -14.0 0.0
1 -12.0 0.07646 -12.0 -0.17150 -12.0 ... 0.0 -12.0 0.0 -12.0 0.0
2 -10.0 0.06509 -10.0 -0.14710 -10.0 ... 0.0 -10.0 0.0 -10.0 0.0
3 -8.0 0.05545 -8.0 -0.12150 -8.0 ... 0.0 -8.0 0.0 -8.0 0.0
4 -6.0 0.04766 -6.0 -0.09479 -6.0 ... 0.0 -6.0 0.0 -6.0 0.0
5 -4.0 0.04181 -4.0 -0.06722 -4.0 ... 0.0 -4.0 0.0 -4.0 0.0
6 -2.0 0.03797 -2.0 -0.03905 -2.0 ... 0.0 -2.0 0.0 -2.0 0.0
7 0.0 0.03620 0.0 -0.01054 0.0 ... 0.0 0.0 0.0 0.0 0.0
8 2.0 0.03651 2.0 0.01806 2.0 ... 0.0 2.0 0.0 2.0 0.0
9 4.0 0.03960 4.0 0.05879 4.0 ... 0.0 4.0 0.0 4.0 0.0
10 6.0 0.04814 6.0 0.12650 6.0 ... 0.0 6.0 0.0 6.0 0.0
11 8.0 0.06494 8.0 0.22050 8.0 ... 0.0 8.0 0.0 8.0 0.0
12 10.0 0.09268 10.0 0.33960 10.0 ... 0.0 10.0 0.0 10.0 0.0
13 12.0 0.13390 12.0 0.48240 12.0 ... 0.0 12.0 0.0 12.0 0.0
14 14.0 0.19110 14.0 0.64710 14.0 ... 0.0 14.0 0.0 14.0 0.0
[15 rows x 36 columns]
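If the separating lines are not perfectly empty (for example they contain stray spaces or tabs), splitting on '\n\n' will miss them. A small variant of the same idea, splitting on any run of blank-looking lines with a regular expression (a sketch, assuming the same test.csv file), may be more robust:

import re
import pandas as pd
from io import StringIO

with open('test.csv', 'r') as f:
    text = f.read().strip()

# split on one or more blank lines, even if they contain whitespace
parts = re.split(r'\n\s*\n', text)
df = pd.concat([pd.read_csv(StringIO(part)) for part in parts], axis=1)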

Related

How can I merge my columns into a single one using a multiindex

I have a DataFrame looking like this:
year 2015 2016 2017 2018 2019 2015 2016 2017 2018 2019 ... 2015 2016 2017 2018 2019 2015 2016 2017 2018 2019
PATIENTS PATIENTS PATIENTS PATIENTS PATIENTS month month month month month ... diffs_24h diffs_24h diffs_24h diffs_24h diffs_24h diffs_168h diffs_168h diffs_168h diffs_168h diffs_168h
date
2016-01-01 00:00:00 0.0 2.0 1.0 7.0 3.0 1.0 1.0 1.0 1.0 1.0 ... NaN -1.0 -4.0 2.0 -2.0 NaN -3.0 -2.0 -3.0 -6.0
2016-01-01 01:00:00 6.0 6.0 7.0 6.0 7.0 1.0 1.0 1.0 1.0 1.0 ... NaN 4.0 0.0 0.0 1.0 NaN 3.0 1.0 2.0 -1.0
2016-01-01 02:00:00 2.0 7.0 6.0 2.0 3.0 1.0 1.0 1.0 1.0 1.0 ... NaN 4.0 3.0 -1.0 0.0 NaN 6.0 2.0 -3.0 0.0
2016-01-01 03:00:00 0.0 2.0 2.0 4.0 6.0 1.0 1.0 1.0 1.0 1.0 ... NaN -1.0 0.0 2.0 4.0 NaN -1.0 -2.0 3.0 3.0
2016-01-01 04:00:00 1.0 2.0 5.0 8.0 0.0 1.0 1.0 1.0 1.0 1.0 ... NaN -1.0 5.0 7.0 -1.0 NaN -2.0 3.0 5.0 -2.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2016-12-31 19:00:00 6.0 7.0 6.0 6.0 6.0 12.0 12.0 12.0 12.0 12.0 ... -9.0 -1.0 -7.0 1.0 -2.0 1.0 0.0 -6.0 -4.0 0.0
2016-12-31 20:00:00 2.0 2.0 5.0 5.0 3.0 12.0 12.0 12.0 12.0 12.0 ... -9.0 -7.0 -12.0 -1.0 -10.0 -2.0 -6.0 -2.0 -1.0 -4.0
2016-12-31 21:00:00 4.0 5.0 3.0 3.0 3.0 12.0 12.0 12.0 12.0 12.0 ... -2.0 -3.0 -10.0 -2.0 -11.0 -2.0 -2.0 -2.0 -3.0 -2.0
2016-12-31 22:00:00 5.0 2.0 6.0 6.0 3.0 12.0 12.0 12.0 12.0 12.0 ... 0.0 -6.0 -4.0 5.0 -4.0 2.0 -1.0 0.0 2.0 -3.0
2016-12-31 23:00:00 1.0 3.0 4.0 4.0 6.0 12.0 12.0 12.0 12.0 12.0 ... -6.0 -1.0 -11.0 2.0 -3.0 -4.0 -2.0 -7.0 -2.0 -2.0
and I want to end up with a DataFrame in which the first level is the year, with each year grouping all of its columns underneath it. How can I achieve that?
Example:
year 2015 2016 2017 2018 2019
PATIENTS month PATIENTS month PATIENTS month PATIENTS month PATIENTS month ...
date
2016-01-01 00:00:00 0.0 2.0 1.0 7.0 3.0 1.0 1.0 1.0 1.0 1.0 ... NaN -1.0 -4.0 2.0 -2.0 NaN -3.0 -2.0 -3.0 -6.0
2016-01-01 01:00:00 6.0 6.0 7.0 6.0 7.0 1.0 1.0 1.0 1.0 1.0 ... NaN 4.0 0.0 0.0 1.0 NaN 3.0 1.0 2.0 -1.0
2016-01-01 02:00:00 2.0 7.0 6.0 2.0 3.0 1.0 1.0 1.0 1.0 1.0 ... NaN 4.0 3.0 -1.0 0.0 NaN 6.0 2.0 -3.0 0.0
2016-01-01 03:00:00 0.0 2.0 2.0 4.0 6.0 1.0 1.0 1.0 1.0 1.0 ... NaN -1.0 0.0 2.0 4.0 NaN -1.0 -2.0 3.0 3.0
2016-01-01 04:00:00 1.0 2.0 5.0 8.0 0.0 1.0 1.0 1.0 1.0 1.0 ... NaN -1.0 5.0 7.0 -1.0 NaN -2.0 3.0 5.0 -2.0
... ... ... ... ... ... ... ... ... ... ... .
I think you only need to sort your columns:
new_df = df.sort_index(axis=1, level=0)
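As a minimal sketch of what that sort does (made-up year and column names, not your real data), interleaved MultiIndex columns get grouped by year:

import numpy as np
import pandas as pd

# two years, two sub-columns each, deliberately interleaved the "wrong" way
cols = pd.MultiIndex.from_tuples(
    [(2015, 'PATIENTS'), (2016, 'PATIENTS'), (2015, 'month'), (2016, 'month')],
    names=['year', None])
df = pd.DataFrame(np.arange(12).reshape(3, 4), columns=cols)

new_df = df.sort_index(axis=1, level=0)
print(new_df.columns.tolist())
# [(2015, 'PATIENTS'), (2015, 'month'), (2016, 'PATIENTS'), (2016, 'month')]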

Indexing dataframe doesn't return the correct values, and instead returns the cumulative value? [duplicate]

This question already has answers here:
How are iloc and loc different?
(6 answers)
Closed 2 years ago.
I have a dataframe named df_expanded that looks like this. The key column is 'A' and the important indices for this question are 29-31 (this is a modified version since the actual dataframe is huge):
>>> display(df_expanded)
A C Cl Co D E F Fa G Ga H HW
index
...
2.0 0.0 0.0 0.0 0.0 0.0 4.0 0.0 11.0 2.0 8.0 0.0 0.0
2.0 0.0 0.0 0.0 0.0 0.0 4.2 0.0 11.8 2.4 8.6 0.0 0.0
2.0 0.0 0.0 0.0 0.0 0.0 4.4 0.0 12.6 2.8 9.2 0.0 0.0
2.0 0.0 0.0 0.0 0.0 0.0 4.6 0.0 13.4 3.2 9.8 0.0 0.0
2.0 0.0 0.0 0.0 0.0 0.0 4.8 0.0 14.2 3.6 10.4 0.0 0.0
3.0 0.0 0.0 0.0 0.0 0.0 5.0 0.0 15.0 4.0 11.0 0.0 0.0
3.0 0.4 0.0 0.0 0.0 0.0 5.2 0.0 16.0 4.0 11.6 0.0 0.0
3.0 0.8 0.0 0.0 0.0 0.0 5.4 0.0 17.0 4.0 12.2 0.0 0.0
3.0 1.2 0.0 0.0 0.0 0.0 5.6 0.0 18.0 4.0 12.8 0.0 0.0
3.0 1.6 0.0 0.0 0.0 0.0 5.8 0.0 19.0 4.0 13.4 0.0 0.0
4.0 2.0 0.0 0.0 0.0 0.0 6.0 0.0 20.0 4.0 14.0 0.0 0.0
4.0 2.0 0.0 0.0 0.0 0.0 6.0 0.0 21.2 4.0 14.4 0.0 0.0
4.0 2.0 0.0 0.0 0.0 0.0 6.0 0.0 22.4 4.0 14.8 0.0 0.0
4.0 2.0 0.0 0.0 0.0 0.0 6.0 0.0 23.6 4.0 15.2 0.0 0.0
4.0 2.0 0.0 0.0 0.0 0.0 6.0 0.0 24.8 4.0 15.6 0.0 0.0
5.0 2.0 0.0 0.0 0.0 0.0 6.0 0.0 26.0 4.0 16.0 0.0 0.0
5.0 2.0 0.0 0.0 0.0 0.0 6.2 0.0 27.4 4.0 16.6 0.0 0.0
5.0 2.0 0.0 0.0 0.0 0.0 6.4 0.0 28.8 4.0 17.2 0.0 0.0
5.0 2.0 0.0 0.0 0.0 0.0 6.6 0.0 30.2 4.0 17.8 0.0 0.0
5.0 2.0 0.0 0.0 0.0 0.0 6.8 0.0 31.6 4.0 18.4 0.0 0.0
6.0 2.0 0.0 0.0 0.0 0.0 7.0 0.0 33.0 4.0 19.0 0.0 0.0
6.0 2.0 0.0 0.0 0.0 0.0 7.0 1.0 33.4 4.0 19.2 0.0 0.0
6.0 2.0 0.0 0.0 0.0 0.0 7.0 2.0 33.8 4.0 19.4 0.0 0.0
6.0 2.0 0.0 0.0 0.0 0.0 7.0 3.0 34.2 4.0 19.6 0.0 0.0
6.0 2.0 0.0 0.0 0.0 0.0 7.0 4.0 34.6 4.0 19.8 0.0 0.0
7.0 2.0 0.0 0.0 0.0 0.0 7.0 5.0 35.0 4.0 20.0 0.0 0.0
7.0 2.0 0.0 0.0 0.0 0.0 7.0 5.0 36.2 4.0 20.4 0.0 0.0
7.0 2.0 0.0 0.0 0.0 0.0 7.0 5.0 37.4 4.0 20.8 0.0 0.0
7.0 2.0 0.0 0.0 0.0 0.0 7.0 5.0 38.6 4.0 21.2 0.0 0.0
7.0 2.0 0.0 0.0 0.0 0.0 7.0 5.0 39.8 4.0 21.6 0.0 0.0
8.0 2.0 0.0 0.0 0.0 0.0 7.0 5.0 41.0 4.0 22.0 0.0 0.0
8.0 2.0 0.0 0.0 0.0 0.0 7.2 5.0 41.0 4.0 22.2 0.0 0.0
8.0 2.0 0.0 0.0 0.0 0.0 7.4 5.0 41.0 4.0 22.4 0.0 0.0
8.0 2.0 0.0 0.0 0.0 0.0 7.6 5.0 41.0 4.0 22.6 0.0 0.0
8.0 2.0 0.0 0.0 0.0 0.0 7.8 5.0 41.0 4.0 22.8 0.0 0.0
9.0 2.0 0.0 0.0 0.0 0.0 8.0 5.0 41.0 4.0 23.0 0.0 0.0
9.0 2.0 0.0 0.0 0.0 0.0 8.0 5.0 41.6 4.0 23.4 0.0 0.0
9.0 2.0 0.0 0.0 0.0 0.0 8.0 5.0 42.2 4.0 23.8 0.0 0.0
9.0 2.0 0.0 0.0 0.0 0.0 8.0 5.0 42.8 4.0 24.2 0.0 0.0
9.0 2.0 0.0 0.0 0.0 0.0 8.0 5.0 43.4 4.0 24.6 0.0 0.0
10.0 2.0 0.0 0.0 0.0 0.0 8.0 5.0 44.0 4.0 25.0 0.0 0.0
10.0 2.0 0.0 0.0 0.0 0.0 8.0 5.0 45.2 4.0 25.6 0.0 0.0
10.0 2.0 0.0 0.0 0.0 0.0 8.0 5.0 46.4 4.0 26.2 0.0 0.0
10.0 2.0 0.0 0.0 0.0 0.0 8.0 5.0 47.6 4.0 26.8 0.0 0.0
10.0 2.0 0.0 0.0 0.0 0.0 8.0 5.0 48.8 4.0 27.4 0.0 0.0
11.0 2.0 0.0 0.0 0.0 0.0 8.0 5.0 50.0 4.0 28.0 0.0 0.0
11.0 2.0 0.0 0.0 0.0 0.0 8.0 5.0 50.4 4.4 28.4 0.2 0.0
11.0 2.0 0.0 0.0 0.0 0.0 8.0 5.0 50.8 4.8 28.8 0.4 0.0
11.0 2.0 0.0 0.0 0.0 0.0 8.0 5.0 51.2 5.2 29.2 0.6 0.0
11.0 2.0 0.0 0.0 0.0 0.0 8.0 5.0 51.6 5.6 29.6 0.8 0.0
12.0 2.0 0.0 0.0 0.0 0.0 8.0 5.0 52.0 6.0 30.0 1.0 0.0
12.0 2.0 0.0 0.0 0.0 0.0 8.0 5.8 52.6 6.0 30.6 1.0 0.0
12.0 2.0 0.0 0.0 0.0 0.0 8.0 6.6 53.2 6.0 31.2 1.0 0.0
12.0 2.0 0.0 0.0 0.0 0.0 8.0 7.4 53.8 6.0 31.8 1.0 0.0
12.0 2.0 0.0 0.0 0.0 0.0 8.0 8.2 54.4 6.0 32.4 1.0 0.0
13.0 2.0 0.0 0.0 0.0 0.0 8.0 9.0 55.0 6.0 33.0 1.0 0.0
13.0 2.0 0.0 0.0 0.0 0.0 8.0 10.0 55.4 6.0 33.6 1.0 0.0
13.0 2.0 0.0 0.0 0.0 0.0 8.0 11.0 55.8 6.0 34.2 1.0 0.0
13.0 2.0 0.0 0.0 0.0 0.0 8.0 12.0 56.2 6.0 34.8 1.0 0.0
13.0 2.0 0.0 0.0 0.0 0.0 8.0 13.0 56.6 6.0 35.4 1.0 0.0
14.0 2.0 0.0 0.0 0.0 0.0 8.0 14.0 57.0 6.0 36.0 1.0 0.0
14.0 2.0 0.0 0.0 0.0 0.0 8.0 15.6 57.0 6.4 36.2 1.0 0.0
14.0 2.0 0.0 0.0 0.0 0.0 8.0 17.2 57.0 6.8 36.4 1.0 0.0
14.0 2.0 0.0 0.0 0.0 0.0 8.0 18.8 57.0 7.2 36.6 1.0 0.0
14.0 2.0 0.0 0.0 0.0 0.0 8.0 20.4 57.0 7.6 36.8 1.0 0.0
15.0 2.0 0.0 0.0 0.0 0.0 8.0 22.0 57.0 8.0 37.0 1.0 0.0
15.0 2.0 0.0 0.2 0.0 0.0 8.0 22.6 58.0 8.4 37.0 1.0 0.0
15.0 2.0 0.0 0.4 0.0 0.0 8.0 23.2 59.0 8.8 37.0 1.0 0.0
15.0 2.0 0.0 0.6 0.0 0.0 8.0 23.8 60.0 9.2 37.0 1.0 0.0
15.0 2.0 0.0 0.8 0.0 0.0 8.0 24.4 61.0 9.6 37.0 1.0 0.0
16.0 2.0 0.0 1.0 0.0 0.0 8.0 25.0 62.0 10.0 37.0 1.0 0.0
16.0 2.0 0.0 1.0 0.0 0.0 8.2 25.0 63.2 10.0 37.0 1.0 0.0
16.0 2.0 0.0 1.0 0.0 0.0 8.4 25.0 64.4 10.0 37.0 1.0 0.0
16.0 2.0 0.0 1.0 0.0 0.0 8.6 25.0 65.6 10.0 37.0 1.0 0.0
16.0 2.0 0.0 1.0 0.0 0.0 8.8 25.0 66.8 10.0 37.0 1.0 0.0
17.0 2.0 0.0 1.0 0.0 0.0 9.0 25.0 68.0 10.0 37.0 1.0 0.0
17.0 2.0 0.0 1.2 0.0 0.0 9.0 26.2 68.4 10.4 37.2 1.0 0.0
17.0 2.0 0.0 1.4 0.0 0.0 9.0 27.4 68.8 10.8 37.4 1.0 0.0
17.0 2.0 0.0 1.6 0.0 0.0 9.0 28.6 69.2 11.2 37.6 1.0 0.0
17.0 2.0 0.0 1.8 0.0 0.0 9.0 29.8 69.6 11.6 37.8 1.0 0.0
18.0 2.0 0.0 2.0 0.0 0.0 9.0 31.0 70.0 12.0 38.0 1.0 0.0
18.0 2.0 0.0 2.0 0.0 0.0 9.0 31.8 70.0 12.0 38.0 1.0 0.0
18.0 2.0 0.0 2.0 0.0 0.0 9.0 32.6 70.0 12.0 38.0 1.0 0.0
18.0 2.0 0.0 2.0 0.0 0.0 9.0 33.4 70.0 12.0 38.0 1.0 0.0
18.0 2.0 0.0 2.0 0.0 0.0 9.0 34.2 70.0 12.0 38.0 1.0 0.0
19.0 2.0 0.0 2.0 0.0 0.0 9.0 35.0 70.0 12.0 38.0 1.0 0.0
19.0 2.0 0.6 2.0 0.0 0.0 9.4 35.2 70.0 12.4 38.0 1.0 0.0
19.0 2.0 1.2 2.0 0.0 0.0 9.8 35.4 70.0 12.8 38.0 1.0 0.0
19.0 2.0 1.8 2.0 0.0 0.0 10.2 35.6 70.0 13.2 38.0 1.0 0.0
19.0 2.0 2.4 2.0 0.0 0.0 10.6 35.8 70.0 13.6 38.0 1.0 0.0
20.0 2.0 3.0 2.0 0.0 0.0 11.0 36.0 70.0 14.0 38.0 1.0 0.0
20.0 2.0 3.4 2.0 0.0 0.0 11.4 36.0 70.0 14.2 38.4 1.0 0.0
20.0 2.0 3.8 2.0 0.0 0.0 11.8 36.0 70.0 14.4 38.8 1.0 0.0
20.0 2.0 4.2 2.0 0.0 0.0 12.2 36.0 70.0 14.6 39.2 1.0 0.0
20.0 2.0 4.6 2.0 0.0 0.0 12.6 36.0 70.0 14.8 39.6 1.0 0.0
21.0 2.0 5.0 2.0 0.0 0.0 13.0 36.0 70.0 15.0 40.0 1.0 0.0
21.0 2.0 5.6 2.0 0.0 0.0 13.2 36.6 70.0 15.4 40.0 1.0 0.0
21.0 2.0 6.2 2.0 0.0 0.0 13.4 37.2 70.0 15.8 40.0 1.0 0.0
21.0 2.0 6.8 2.0 0.0 0.0 13.6 37.8 70.0 16.2 40.0 1.0 0.0
21.0 2.0 7.4 2.0 0.0 0.0 13.8 38.4 70.0 16.6 40.0 1.0 0.0
22.0 2.0 8.0 2.0 0.0 0.0 14.0 39.0 70.0 17.0 40.0 1.0 0.0
22.0 2.0 8.2 2.2 0.0 0.0 14.0 40.2 70.4 17.0 40.4 1.0 0.0
22.0 2.0 8.4 2.4 0.0 0.0 14.0 41.4 70.8 17.0 40.8 1.0 0.0
22.0 2.0 8.6 2.6 0.0 0.0 14.0 42.6 71.2 17.0 41.2 1.0 0.0
22.0 2.0 8.8 2.8 0.0 0.0 14.0 43.8 71.6 17.0 41.6 1.0 0.0
23.0 2.0 9.0 3.0 0.0 0.0 14.0 45.0 72.0 17.0 42.0 1.0 0.0
23.0 2.0 9.0 3.0 0.0 0.0 14.2 45.6 72.0 17.6 42.4 1.0 0.0
23.0 2.0 9.0 3.0 0.0 0.0 14.4 46.2 72.0 18.2 42.8 1.0 0.0
23.0 2.0 9.0 3.0 0.0 0.0 14.6 46.8 72.0 18.8 43.2 1.0 0.0
23.0 2.0 9.0 3.0 0.0 0.0 14.8 47.4 72.0 19.4 43.6 1.0 0.0
24.0 2.0 9.0 3.0 0.0 0.0 15.0 48.0 72.0 20.0 44.0 1.0 0.0
24.0 2.0 9.0 3.0 0.0 0.0 15.2 48.0 72.0 20.6 44.2 1.0 0.0
24.0 2.0 9.0 3.0 0.0 0.0 15.4 48.0 72.0 21.2 44.4 1.0 0.0
24.0 2.0 9.0 3.0 0.0 0.0 15.6 48.0 72.0 21.8 44.6 1.0 0.0
24.0 2.0 9.0 3.0 0.0 0.0 15.8 48.0 72.0 22.4 44.8 1.0 0.0
25.0 2.0 9.0 3.0 0.0 0.0 16.0 48.0 72.0 23.0 45.0 1.0 0.0
25.0 2.0 9.4 3.0 0.0 0.0 16.4 48.0 72.0 23.2 45.0 1.0 0.4
25.0 2.0 9.8 3.0 0.0 0.0 16.8 48.0 72.0 23.4 45.0 1.0 0.8
25.0 2.0 10.2 3.0 0.0 0.0 17.2 48.0 72.0 23.6 45.0 1.0 1.2
25.0 2.0 10.6 3.0 0.0 0.0 17.6 48.0 72.0 23.8 45.0 1.0 1.6
26.0 2.0 11.0 3.0 0.0 0.0 18.0 48.0 72.0 24.0 45.0 1.0 2.0
26.0 2.0 11.6 3.0 0.2 0.0 18.2 48.0 72.0 24.0 45.0 1.0 2.6
26.0 2.0 12.2 3.0 0.4 0.0 18.4 48.0 72.0 24.0 45.0 1.0 3.2
26.0 2.0 12.8 3.0 0.6 0.0 18.6 48.0 72.0 24.0 45.0 1.0 3.8
26.0 2.0 13.4 3.0 0.8 0.0 18.8 48.0 72.0 24.0 45.0 1.0 4.4
27.0 2.0 14.0 3.0 1.0 0.0 19.0 48.0 72.0 24.0 45.0 1.0 5.0
27.0 2.0 14.4 3.0 1.0 0.0 19.4 48.0 72.4 24.4 45.2 1.0 5.4
27.0 2.0 14.8 3.0 1.0 0.0 19.8 48.0 72.8 24.8 45.4 1.0 5.8
27.0 2.0 15.2 3.0 1.0 0.0 20.2 48.0 73.2 25.2 45.6 1.0 6.2
27.0 2.0 15.6 3.0 1.0 0.0 20.6 48.0 73.6 25.6 45.8 1.0 6.6
28.0 2.0 16.0 3.0 1.0 0.0 21.0 48.0 74.0 26.0 46.0 1.0 7.0
28.0 2.0 16.6 3.0 1.0 0.0 21.2 48.2 74.0 26.6 46.0 1.0 7.2
28.0 2.0 17.2 3.0 1.0 0.0 21.4 48.4 74.0 27.2 46.0 1.0 7.4
28.0 2.0 17.8 3.0 1.0 0.0 21.6 48.6 74.0 27.8 46.0 1.0 7.6
28.0 2.0 18.4 3.0 1.0 0.0 21.8 48.8 74.0 28.4 46.0 1.0 7.8
29.0 2.0 19.0 3.0 1.0 0.0 22.0 49.0 74.0 29.0 46.0 1.0 8.0
29.0 2.0 19.2 3.4 1.0 0.0 22.0 50.0 74.0 29.0 46.4 1.0 8.0
29.0 2.0 19.4 3.8 1.0 0.0 22.0 51.0 74.0 29.0 46.8 1.0 8.0
29.0 2.0 19.6 4.2 1.0 0.0 22.0 52.0 74.0 29.0 47.2 1.0 8.0
29.0 2.0 19.8 4.6 1.0 0.0 22.0 53.0 74.0 29.0 47.6 1.0 8.0
30.0 2.0 20.0 5.0 1.0 0.0 22.0 54.0 74.0 29.0 48.0 1.0 8.0
30.0 1.6 16.0 4.2 0.8 0.0 17.8 44.2 59.6 23.8 38.4 0.8 6.8
30.0 1.2 12.0 3.4 0.6 0.0 13.6 34.4 45.2 18.6 28.8 0.6 5.6
30.0 0.8 8.0 2.6 0.4 0.0 9.4 24.6 30.8 13.4 19.2 0.4 4.4
30.0 0.4 4.0 1.8 0.2 0.0 5.2 14.8 16.4 8.2 9.6 0.2 3.2
31.0 0.0 0.0 1.0 0.0 0.0 1.0 5.0 2.0 3.0 0.0 0.0 2.0
31.0 0.0 0.0 1.0 0.0 0.0 1.2 5.4 2.2 3.6 0.0 0.0 2.8
31.0 0.0 0.0 1.0 0.0 0.0 1.4 5.8 2.4 4.2 0.0 0.0 3.6
31.0 0.0 0.0 1.0 0.0 0.0 1.6 6.2 2.6 4.8 0.0 0.0 4.4
31.0 0.0 0.0 1.0 0.0 0.0 1.8 6.6 2.8 5.4 0.0 0.0 5.2
...
When indexing this dataframe, I feel like df_expanded.iloc[31] should return the following:
A C Cl Co D E F Fa G Ga H HW
index
31.0 0.0 0.0 1.0 0.0 0.0 1.0 5.0 2.0 3.0 0.0 0.0 2.0
31.0 0.0 0.0 1.0 0.0 0.0 1.2 5.4 2.2 3.6 0.0 0.0 2.8
31.0 0.0 0.0 1.0 0.0 0.0 1.4 5.8 2.4 4.2 0.0 0.0 3.6
31.0 0.0 0.0 1.0 0.0 0.0 1.6 6.2 2.6 4.8 0.0 0.0 4.4
31.0 0.0 0.0 1.0 0.0 0.0 1.8 6.6 2.8 5.4 0.0 0.0 5.2
However, the following is returned:
>>> print(df_expanded.iloc[31])
A 2.0
C 0.0
Cl 0.0
Co 0.0
D 0.0
E 7.0
F 1.0
Fa 33.4
G 4.0
Ga 19.2
H 0.0
HW 0.0
Why is it that indexing the 31st index returns 2.0 for A (and the cumulative values for other columns as well) instead of what is shown when df_expanded is displayed? I can't figure out why it's working like this, so any kind of help would be greatly appreciated!
df_expanded.iloc[31]
does not return what you expect because iloc indexes by integer position: it returns the row at position 31 in your DataFrame, regardless of the index labels. If you count the rows from the top, that is exactly the row you get.
For what you want, instead try
df_expanded.loc[31]
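The difference is easy to see on a tiny made-up frame whose index repeats labels, like df_expanded does:

import pandas as pd

df = pd.DataFrame({'A': [10, 20, 30, 40]}, index=[30, 31, 31, 31])

print(df.iloc[1])  # positional: the second row, whatever its label -> A = 20
print(df.loc[31])  # label-based: every row whose index label is 31 -> three rows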

How to read a csv with rows of NUL, ('\x00'), into pandas?

I have a set of csv files with Date and Time as the first two columns (no headers in the files). The files open up fine in Excel but when I try to read them into Python using Pandas read_csv, only the first Date is returned, whether or not I try a type conversion.
When I open a file in Notepad, it's not simply comma separated: there is a lot of space before every line after line 1. I have tried skipinitialspace = True to no avail.
I have also tried various type conversions but none work. I am currently using parse_dates = [['Date','Time']], infer_datetime_format = True, dayfirst = True
Example output (no conversion):
0 1 2 3 4 ... 12 13 14 15 16
0 02/03/20 15:13:39 5.5 5.8 42.84 ... 30.0 79.0 0.0 0.0 0.0
1 NaN 15:13:49 5.5 5.8 42.84 ... 30.0 79.0 0.0 0.0 0.0
2 NaN 15:13:59 5.5 5.7 34.26 ... 30.0 79.0 0.0 0.0 0.0
3 NaN 15:14:09 5.5 5.7 34.26 ... 30.0 79.0 0.0 0.0 0.0
4 NaN 15:14:19 5.5 5.4 17.10 ... 30.0 79.0 0.0 0.0 0.0
... ... ... ... ... ... ... ... ... ... ... ...
39451 NaN 01:14:27 5.5 8.4 60.00 ... 30.0 68.0 0.0 0.0 0.0
39452 NaN 01:14:37 5.5 8.4 60.00 ... 30.0 68.0 0.0 0.0 0.0
39453 NaN 01:14:47 5.5 8.4 60.00 ... 30.0 68.0 0.0 0.0 0.0
39454 NaN 01:14:57 5.5 8.4 60.00 ... 30.0 68.0 0.0 0.0 0.0
39455 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN
And with parse_dates etc:
Date_Time pH1 SP pH Ph1 PV pH ... 1 2 3
0 02/03/20 15:13:39 5.5 5.8 ... 0.0 0.0 0.0
1 nan 15:13:49 5.5 5.8 ... 0.0 0.0 0.0
2 nan 15:13:59 5.5 5.7 ... 0.0 0.0 0.0
3 nan 15:14:09 5.5 5.7 ... 0.0 0.0 0.0
4 nan 15:14:19 5.5 5.4 ... 0.0 0.0 0.0
... ... ... ... ... ... ... ...
39451 nan 01:14:27 5.5 8.4 ... 0.0 0.0 0.0
39452 nan 01:14:37 5.5 8.4 ... 0.0 0.0 0.0
39453 nan 01:14:47 5.5 8.4 ... 0.0 0.0 0.0
39454 nan 01:14:57 5.5 8.4 ... 0.0 0.0 0.0
39455 nan nan NaN NaN ... NaN NaN NaN
Data copied from Notepad (there is actually more whitespace in front of each line but it wouldn't work here):
Data from 67.csv
02/03/20,15:13:39,5.5,5.8,42.84,7.2,6.8,10.63,60.0,0.0,300,1,30,79,0.0,0.0, 0.0
02/03/20,15:13:49,5.5,5.8,42.84,7.2,6.8,10.63,60.0,0.0,300,1,30,79,0.0,0.0, 0.0
02/03/20,15:13:59,5.5,5.7,34.26,7.2,6.8,10.63,60.0,22.3,300,1,30,79,0.0,0.0, 0.0
02/03/20,15:14:09,5.5,5.7,34.26,7.2,6.8,10.63,60.0,15.3,300,45,30,79,0.0,0.0, 0.0
02/03/20,15:14:19,5.5,5.4,17.10,7.2,6.8,10.63,60.0,50.2,300,86,30,79,0.0,0.0, 0.0
And in Excel (so I know the information is there and readable):
Code
import sys
import numpy as np
import pandas as pd
from datetime import datetime
from tkinter import filedialog
from tkinter import *

def import_file(filename):
    print('\nOpening ' + filename + ":")
    ##Read the data in the file
    df = pd.read_csv(filename, header = None, low_memory = False)
    print(df)
    df['Date_Time'] = pd.to_datetime(df[0] + ' ' + df[1])
    df.drop(columns=[0, 1], inplace=True)
    print(df)

filenames=[]
print('Select files to read, Ctrl or Shift for Multiples')
TkWindow = Tk()
TkWindow.withdraw() # we don't want a full GUI, so keep the root window from appearing
## Show an "Open" dialog box and return the path to the selected file
filenames = filedialog.askopenfilename(title='Open data file', filetypes=(("Comma delimited", "*.csv"),), multiple=True)
TkWindow.destroy()
if len(filenames) == 0:
    print('No files selected - Exiting program.')
    sys.exit()
else:
    print('\n'.join(filenames))

##Read the data from the specified file/s
print('\nReading data file/s')
dfs=[]
for filename in filenames:
    dfs.append(import_file(filename))
if len(dfs) > 1:
    print('\nCombining data files.')
The file is filled with NUL, '\x00', which needs to be removed.
Use pandas.DataFrame to load the data from d, after the rows have been cleaned.
import pandas as pd
import string  # to make column names

# the issue is that the file is filled with NUL, not whitespace
def import_file(filename):
    # open the file and clean it
    with open(filename) as f:
        d = list(f.readlines())
    # replace NUL, strip whitespace from the end of the strings, split each string into a list
    d = [v.replace('\x00', '').strip().split(',') for v in d]
    # remove some empty rows
    d = [v for v in d if len(v) > 2]
    # load the file with pandas
    df = pd.DataFrame(d)
    # convert column 0 and 1 to a datetime
    df['datetime'] = pd.to_datetime(df[0] + ' ' + df[1])
    # drop column 0 and 1
    df.drop(columns=[0, 1], inplace=True)
    # set datetime as the index
    df.set_index('datetime', inplace=True)
    # convert data in columns to floats
    df = df.astype('float')
    # give character column names
    df.columns = list(string.ascii_uppercase)[:len(df.columns)]
    # reset the index
    df.reset_index(inplace=True)
    return df.copy()

# call the function
dfs = list()
filenames = ['67.csv']
for filename in filenames:
    dfs.append(import_file(filename))
display(dfs[0])
A B C D E F G H I J K L M N O
datetime
2020-02-03 15:13:39 5.5 5.8 42.84 7.2 6.8 10.63 60.0 0.0 300.0 1.0 30.0 79.0 0.0 0.0 0.0
2020-02-03 15:13:49 5.5 5.8 42.84 7.2 6.8 10.63 60.0 0.0 300.0 1.0 30.0 79.0 0.0 0.0 0.0
2020-02-03 15:13:59 5.5 5.7 34.26 7.2 6.8 10.63 60.0 22.3 300.0 1.0 30.0 79.0 0.0 0.0 0.0
2020-02-03 15:14:09 5.5 5.7 34.26 7.2 6.8 10.63 60.0 15.3 300.0 45.0 30.0 79.0 0.0 0.0 0.0
2020-02-03 15:14:19 5.5 5.4 17.10 7.2 6.8 10.63 60.0 50.2 300.0 86.0 30.0 79.0 0.0 0.0 0.0
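An alternative sketch of the same idea (a hypothetical helper, not tested against your actual files) is to strip the NULs and junk rows up front and then let pandas.read_csv do the parsing through StringIO:

import pandas as pd
from io import StringIO

def read_nul_csv(filename):
    with open(filename) as f:
        # drop NULs and surrounding whitespace from every line
        lines = [line.replace('\x00', '').strip() for line in f]
    # keep only real data rows (e.g. the "Data from 67.csv" header has no commas)
    rows = [line for line in lines if line.count(',') > 1]
    df = pd.read_csv(StringIO('\n'.join(rows)), header=None)
    df['Date_Time'] = pd.to_datetime(df[0] + ' ' + df[1], dayfirst=True)
    return df.drop(columns=[0, 1])

# dfs = [read_nul_csv(name) for name in filenames]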

Counting consonants and vowels in a split string

I read in a .csv file and have the following data frame that counts vowels and consonants in a string in the Description column. This works great, but I want to split Description into 8 columns and count the consonants and vowels for each of those columns. The second part of my code splits Description into 8 columns. How can I count the vowels and consonants on all 8 columns that Description is split into?
import pandas as pd
import re

def anti_vowel(s):
    result = re.sub(r'[AEIOU]', '', s, flags=re.IGNORECASE)
    return result

data = pd.read_csv('http://core.secure.ehc.com/src/util/detail-price-list/TristarDivision_SummitMedicalCenter_CM.csv')
data.dropna(inplace = True)
data['Vowels'] = data['Description'].str.count(r'[aeiou]', flags=re.I)
data['Consonant'] = data['Description'].str.count(r'[bcdfghjklmnpqrstvwxzy]', flags=re.I)
print (data)
This is the code I'm using to split the column Description into 8 columns.
import pandas as pd

data = pd.read_csv('http://core.secure.ehc.com/src/util/detail-price-list/TristarDivision_SummitMedicalCenter_CM.csv')
data.dropna(inplace = True)
data = data["Description"].str.split(" ", n = 8, expand = True)
print (data)
Now how can I put it all together?
To count the consonants in each of the 8 columns, I know I can use the following, replacing the 0 with 0-7:
testconsonant = data[0].str.count(r'[bcdfghjklmnpqrstvwxzy]', flags=re.I)
testvowel = data[0].str.count(r'[aeiou]', flags=re.I)
Desired output would be:
Description [0] vowel count consonant count Description [1] vowel count consonant count Description [2] vowel count consonant count Description [3] vowel count consonant count Description [4] vowel count consonant count all the way to description [7]
stack then unstack
stacked = data.stack()
pd.concat({
'Vowels': stacked.str.count('[aeiou]', flags=re.I),
'Consonant': stacked.str.count('[bcdfghjklmnpqrstvwxzy]', flags=re.I)
}, axis=1).unstack()
Consonant Vowels
0 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 7 8
0 3.0 5.0 5.0 1.0 2.0 NaN NaN NaN NaN 1.0 0.0 0.0 0.0 0.0 NaN NaN NaN NaN
1 8.0 5.0 1.0 0.0 0.0 0.0 0.0 0.0 NaN 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 NaN
2 8.0 5.0 1.0 0.0 0.0 0.0 0.0 0.0 NaN 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 NaN
3 8.0 5.0 1.0 0.0 0.0 0.0 0.0 0.0 NaN 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 NaN
4 3.0 5.0 3.0 1.0 0.0 0.0 0.0 0.0 NaN 0.0 0.0 2.0 0.0 0.0 0.0 0.0 0.0 NaN
5 3.0 5.0 3.0 1.0 0.0 0.0 0.0 0.0 NaN 0.0 0.0 2.0 0.0 0.0 0.0 0.0 0.0 NaN
6 3.0 4.0 0.0 1.0 0.0 0.0 0.0 NaN NaN 3.0 1.0 0.0 0.0 0.0 0.0 0.0 NaN NaN
7 3.0 3.0 0.0 1.0 0.0 0.0 0.0 NaN NaN 3.0 1.0 0.0 1.0 0.0 0.0 0.0 NaN NaN
8 3.0 3.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 3.0 1.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0
9 3.0 3.0 0.0 1.0 0.0 0.0 0.0 NaN NaN 3.0 1.0 0.0 1.0 0.0 0.0 0.0 NaN NaN
10 3.0 3.0 0.0 1.0 0.0 0.0 0.0 0.0 NaN 3.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 NaN
11 3.0 3.0 0.0 2.0 2.0 NaN NaN NaN NaN 3.0 0.0 0.0 0.0 0.0 NaN NaN NaN NaN
12 3.0 3.0 0.0 1.0 0.0 0.0 0.0 0.0 NaN 3.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 NaN
13 3.0 3.0 0.0 2.0 2.0 NaN NaN NaN NaN 3.0 1.0 0.0 0.0 0.0 NaN NaN NaN NaN
14 3.0 5.0 0.0 2.0 0.0 0.0 0.0 0.0 0.0 3.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
15 3.0 3.0 0.0 3.0 1.0 NaN NaN NaN NaN 3.0 0.0 0.0 0.0 1.0 NaN NaN NaN NaN
If you want to combine this with the data dataframe, you can do:
stacked = data.stack()
pd.concat({
'Data': data,
'Vowels': stacked.str.count('[aeiou]', flags=re.I),
'Consonant': stacked.str.count('[bcdfghjklmnpqrstvwxzy]', flags=re.I)
}, axis=1).unstack()
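If you would rather have flat column names than the two-level header shown above, one possible follow-up (a self-contained sketch with made-up data standing in for the split Description columns) is to join the two levels:

import re
import pandas as pd

data = pd.DataFrame({0: ['CHEST', 'XR'], 1: ['PAIN', None]})

stacked = data.stack()
counts = pd.concat({
    'Vowels': stacked.str.count('[aeiou]', flags=re.I),
    'Consonant': stacked.str.count('[bcdfghjklmnpqrstvwxzy]', flags=re.I)
}, axis=1).unstack()

# flatten ('Vowels', 0), ('Consonant', 1), ... into 'Vowels_0', 'Consonant_1', ...
counts.columns = [f'{name}_{pos}' for name, pos in counts.columns]
print(counts)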

How to reindex data frame in Pandas?

I'm using pandas in Python, and I have performed some crosstab calculations and concatenations, and I end up with a data frame that looks like this:
ID 5 6 7 8 9 10 11 12 13
Total 87.0 3.0 9.0 6.0 92.0 7.0 3.0 3.0 20.0
Regular 72.0 2.0 8.0 5.0 81.0 7.0 3.0 3.0 18.0
CR 22.0 0.0 0.0 0.0 17.0 0.0 0.0 0.0 3.0
HDG 20.0 0.0 0.0 0.0 24.0 4.0 0.0 0.0 1.0
PPG 30.0 2.0 8.0 5.0 40.0 3.0 3.0 3.0 14.0
Superior 15.0 1.0 1.0 1.0 11.0 0.0 0.0 0.0 2.0
CR 3.0 0.0 0.0 0.0 2.0 0.0 0.0 0.0 0.0
HDG 5.0 1.0 1.0 1.0 4.0 0.0 0.0 0.0 0.0
PPG 7.0 0.0 0.0 0.0 5.0 0.0 0.0 0.0 2.0
The problem is that I want the last 4 rows, the ones starting with Superior, to be placed right after the Total row, before the Regular rows. So, simply, I want to swap the positions of the last 4 rows with the 4 rows that start with Regular. How can I achieve this in pandas, so that I get this:
ID 5 6 7 8 9 10 11 12 13
Total 87.0 3.0 9.0 6.0 92.0 7.0 3.0 3.0 20.0
Superior 15.0 1.0 1.0 1.0 11.0 0.0 0.0 0.0 2.0
CR 3.0 0.0 0.0 0.0 2.0 0.0 0.0 0.0 0.0
HDG 5.0 1.0 1.0 1.0 4.0 0.0 0.0 0.0 0.0
PPG 7.0 0.0 0.0 0.0 5.0 0.0 0.0 0.0 2.0
Regular 72.0 2.0 8.0 5.0 81.0 7.0 3.0 3.0 18.0
CR 22.0 0.0 0.0 0.0 17.0 0.0 0.0 0.0 3.0
HDG 20.0 0.0 0.0 0.0 24.0 4.0 0.0 0.0 1.0
PPG 30.0 2.0 8.0 5.0 40.0 3.0 3.0 3.0 14.0
A more generalized solution using Categorical and argsort; I know this df was ordered, so ffill is safe here:
s=df.ID
s=s.where(s.isin(['Total','Regular','Superior'])).ffill()
s=pd.Categorical(s,['Total','Superior','Regular'],ordered=True)
df=df.iloc[np.argsort(s)]
df
Out[188]:
ID 5 6 7 8 9 10 11 12 13
0 Total 87.0 3.0 9.0 6.0 92.0 7.0 3.0 3.0 20.0
5 Superior 15.0 1.0 1.0 1.0 11.0 0.0 0.0 0.0 2.0
6 CR 3.0 0.0 0.0 0.0 2.0 0.0 0.0 0.0 0.0
7 HDG 5.0 1.0 1.0 1.0 4.0 0.0 0.0 0.0 0.0
8 PPG 7.0 0.0 0.0 0.0 5.0 0.0 0.0 0.0 2.0
1 Regular 72.0 2.0 8.0 5.0 81.0 7.0 3.0 3.0 18.0
2 CR 22.0 0.0 0.0 0.0 17.0 0.0 0.0 0.0 3.0
3 HDG 20.0 0.0 0.0 0.0 24.0 4.0 0.0 0.0 1.0
4 PPG 30.0 2.0 8.0 5.0 40.0 3.0 3.0 3.0 14.0
Here's one way:
import numpy as np
df.iloc[1:,:] = np.roll(df.iloc[1:,:].values, 4, axis=0)
ID 5 6 7 8 9 10 11 12 13
0 Total 87.0 3.0 9.0 6.0 92.0 7.0 3.0 3.0 20.0
1 Superior 15.0 1.0 1.0 1.0 11.0 0.0 0.0 0.0 2.0
2 CR 3.0 0.0 0.0 0.0 2.0 0.0 0.0 0.0 0.0
3 HDG 5.0 1.0 1.0 1.0 4.0 0.0 0.0 0.0 0.0
4 PPG 7.0 0.0 0.0 0.0 5.0 0.0 0.0 0.0 2.0
5 Regular 72.0 2.0 8.0 5.0 81.0 7.0 3.0 3.0 18.0
6 CR 22.0 0.0 0.0 0.0 17.0 0.0 0.0 0.0 3.0
7 HDG 20.0 0.0 0.0 0.0 24.0 4.0 0.0 0.0 1.0
8 PPG 30.0 2.0 8.0 5.0 40.0 3.0 3.0 3.0 14.0
For a specific answer to this question, just use iloc
df.iloc[[0,5,6,7,8,1,2,3,4],:]
For a more generalized solution,
m = (df.ID.eq('Superior') | df.ID.eq('Regular')).cumsum()
pd.concat([df[m==0], df[m==2], df[m==1]])
or
order = (2,1)
pd.concat([df[m==0], *[df[m==c] for c in order]])
where order defines the mapping from previous ordering to new ordering.
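A minimal end-to-end sketch of that mask idea on a made-up frame (just the ID column) may make the grouping clearer:

import pandas as pd

df = pd.DataFrame({'ID': ['Total', 'Regular', 'CR', 'HDG', 'PPG',
                          'Superior', 'CR', 'HDG', 'PPG']})

# block number: 0 for Total, 1 from 'Regular' onwards, 2 from 'Superior' onwards
m = (df.ID.eq('Superior') | df.ID.eq('Regular')).cumsum()

# put block 2 (Superior) before block 1 (Regular)
reordered = pd.concat([df[m == 0], df[m == 2], df[m == 1]])
print(reordered.ID.tolist())
# ['Total', 'Superior', 'CR', 'HDG', 'PPG', 'Regular', 'CR', 'HDG', 'PPG']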
