Pandas read csv with multiple whitespaces and parse dates - python

I have a csv file that looks like
Year Mo Da (01,52)
1950 1 1 0.00
1950 1 2 0.00
1950 1 3 0.05
1950 1 4 0.00
1950 1 5 0.07
1950 1 6 0.07
and I would like to transform it into a dataframe with two columns: a datetime column in YYYYMMDD form (built from the "Year", "Mo", and "Da" columns in the raw data), and the rainfall at the grid point (e.g. 01, 52) as the second column.
A desired output would be:
Datetime Rainfall
19500101 0.00
19500102 0.00
19500103 0.05
I am stuck on two issues: appropriately accounting for the whitespace during the read-in and properly using parse_dates.
The simple read-in command:
df = pd.read_csv(csv_fl)
almost correctly reads in the headers, but splits (01,52) into two separate columns, yielding a trailing NaN that shouldn't be there:
Year Mo Da (01 52)
0 1950 1 1 0.00 NaN
And trying to parse the dates using
df = pd.read_csv(csv_fl, parse_dates={'Datetime':[0,1,2]}, index_col=0)
leads to an IndexError
colnames.append(str(columns[c]))
IndexError: list index out of range
Any guidance is much appreciated.

If you pass delim_whitespace=True and pass the three date columns as a nested list to parse_dates, the last step is just to overwrite the column names:
In [96]:
import pandas as pd
import io
t="""Year Mo Da (01,52)
1950 1 1 0.00
1950 1 2 0.00
1950 1 3 0.05
1950 1 4 0.00
1950 1 5 0.07
1950 1 6 0.07"""
df = pd.read_csv(io.StringIO(t), delim_whitespace=True, parse_dates=[['Year','Mo','Da']])
df.columns = ['Datetime', 'Rainfall']
df
Out[96]:
Datetime Rainfall
0 1950-01-01 0.00
1 1950-01-02 0.00
2 1950-01-03 0.05
3 1950-01-04 0.00
4 1950-01-05 0.07
5 1950-01-06 0.07
So I expect
df = pd.read_csv(csv_fl, delim_whitespace=True, parse_dates=[['Year','Mo','Da']])
should work, followed by overwriting the column names.
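Note that newer pandas releases (2.2+) deprecate both delim_whitespace=True and the list-of-lists form of parse_dates. A minimal sketch of an equivalent read that avoids both, using the same sample data:
import io
import pandas as pd
t = """Year Mo Da (01,52)
1950 1 1 0.00
1950 1 2 0.00
1950 1 3 0.05"""
# A regex separator replaces the deprecated delim_whitespace=True
df = pd.read_csv(io.StringIO(t), sep=r"\s+")
# Build the datetime column explicitly instead of via parse_dates
df["Datetime"] = pd.to_datetime(
    df[["Year", "Mo", "Da"]].rename(columns={"Year": "year", "Mo": "month", "Da": "day"})
)
df = df[["Datetime", "(01,52)"]].rename(columns={"(01,52)": "Rainfall"})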

>>> import pandas as pd
>>> filename = "..."
>>> pd.read_csv(filename,
...             sep=" ",
...             skipinitialspace=True,
...             parse_dates={'Datetime': [0, 1, 2]},
...             usecols=[0, 1, 2, 3],
...             names=["Y", "M", "D", "Rainfall"],
...             skiprows=1)
Datetime Rainfall
0 1950-01-01 0.00
1 1950-01-02 0.00
2 1950-01-03 0.05
3 1950-01-04 0.00
4 1950-01-05 0.07
5 1950-01-06 0.07
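The trick here is skiprows=1 plus explicit names: the original header line, with its awkward (01,52) token, is never parsed at all; usecols keeps only the four real columns, and parse_dates assembles the first three into a single Datetime column.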

Related

Populate Pandas Dataframe Based on Column Values Matching Other Column Names

I'd like to populate one dataframe (df2) based on the column names of df2 matching values within a column in another dataframe (df1). Here is a simplified example:
import numpy as np
import pandas as pd
names = list('abcd')
data = list('aadc')
df1 = pd.DataFrame(data, columns=['data'])
df2 = pd.DataFrame(np.zeros([4, 4]), columns=names)  # zeros (not np.empty) guarantees the all-0.00 frame below
df1:
data
0 a
1 a
2 d
3 c
df2:
a b c d
0 0.00 0.00 0.00 0.00
1 0.00 0.00 0.00 0.00
2 0.00 0.00 0.00 0.00
3 0.00 0.00 0.00 0.00
I'd like to update df2 so that the first row has a 1 under column a and 0 under the other columns; the second row of df2 would be the same; the third row would have a 0 for columns a/b/c and a 1 for column d; and the fourth row a 0 for columns a/b/d and a 1 for column c.
Thanks very much for the help!
You can do numpy broadcasting here:
df2[:] = (df1['data'].values[:,None] == df2.columns.values).astype(int)
Or use get_dummies:
df2[:] = pd.get_dummies(df1['data']).reindex(df2.columns, axis=1)
Output:
a b c d
0 1 0 0 0
1 1 0 0 0
2 0 0 0 1
3 0 0 1 0
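One caveat with the get_dummies route: reindex leaves NaN in any df2 column that never occurs in df1['data']. If that can happen in your data, fill_value plugs the gaps:
df2[:] = pd.get_dummies(df1['data']).reindex(df2.columns, axis=1, fill_value=0)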

Combine different tables with time in minutes columns

I have the following data (just a snippet). They start at 0 min and end at 65 min.
R.Time (min) Intensity 215 Intensity 260 Intensity 280
0 0.00000 0 0 0
1 0.01067 0 0 0
2 0.02133 0 0 0
3 0.03200 0 0 0
and
Time %B c B c KCl
0 16.01 0.00 0.0000 0.00
1 16.01 0.00 0.0000 0.00
2 17.00 0.85 0.0085 4.25
3 18.00 1.70 0.0170 8.50
How can I create a single dataframe with one time [min] column and all the other columns aligned to the correct row for that time? I suppose I need to tell pandas which column is the time column and how to merge, and then it sorts the rows? I also need to combine rows when the time is the same.
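A minimal sketch of one possible approach, assuming both time columns really are in the same units: rename them to a common key, outer-merge, sort, then collapse duplicate times. The frames below only mirror the snippets above.
import pandas as pd
left = pd.DataFrame({"R.Time (min)": [0.00000, 0.01067, 0.02133, 0.03200],
                     "Intensity 215": [0, 0, 0, 0]})
right = pd.DataFrame({"Time": [16.01, 16.01, 17.00, 18.00],
                      "%B": [0.00, 0.00, 0.85, 1.70]})
# Outer merge keeps every timestamp that appears in either table
merged = (left.rename(columns={"R.Time (min)": "time"})
              .merge(right.rename(columns={"Time": "time"}), on="time", how="outer")
              .sort_values("time"))
# Combine rows that share the same time; swap first() for mean() etc. as needed
combined = merged.groupby("time", as_index=False).first()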

Concat two DataFrames on missing indices

I have two DataFrames and want to use the second one only on the rows whose index is not already contained in the first one.
What is the most efficient way to do this?
Example:
df_1
idx val
0 0.32
1 0.54
4 0.26
5 0.76
7 0.23
df_2
idx val
1 10.24
2 10.90
3 10.66
4 10.25
6 10.13
7 10.52
df_final
idx val
0 0.32
1 0.54
2 10.90
3 10.66
4 0.26
5 0.76
6 10.13
7 0.23
Recap: I need to add the rows in df_2 for which the index is not already in df_1.
EDIT
Removed some indices from df_2 to illustrate that not all indices from df_1 are covered by df_2.
You can use reindex on the union of both indexes with combine_first or fillna; reindexing to df_2.index alone would drop idx 0 and 5, which exist only in df_1:
df = df_1.reindex(df_1.index.union(df_2.index)).combine_first(df_2)
print (df)
val
idx
0 0.32
1 0.54
2 10.90
3 10.66
4 0.26
5 0.76
6 10.13
7 0.23
df = df_1.reindex(df_1.index.union(df_2.index)).fillna(df_2)
print (df)
val
idx
0 0.32
1 0.54
2 10.90
3 10.66
4 0.26
5 0.76
6 10.13
7 0.23
You can achieve the wanted output by using the combine_first method of the DataFrame. From the documentation of the method:
Combine two DataFrame objects and default to non-null values in frame calling the method. Result index columns will be the union of the respective indexes and columns
Example usage:
import pandas as pd
df_1 = pd.DataFrame([0.32,0.54,0.26,0.76,0.23], columns=['val'], index=[0,1,4,5,7])
df_1.index.name = 'idx'
df_2 = pd.DataFrame([10.24,10.90,10.66,10.25,10.13,10.52], columns=['val'], index=[1,2,3,4,6,7])
df_2.index.name = 'idx'
df_final = df_1.combine_first(df_2)
This will give the desired result:
In [7]: df_final
Out[7]:
val
idx
0 0.32
1 0.54
2 10.90
3 10.66
4 0.26
5 0.76
6 10.13
7 0.23
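A more literal translation of the recap (add the rows of df_2 whose index is not already in df_1) is to filter df_2 and concatenate; a sketch with the same two frames:
df_final = pd.concat([df_1, df_2[~df_2.index.isin(df_1.index)]]).sort_index()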

Group by one column and find first of two columns pandas

I have one dataframe, geomerge. I need to group by the column grpno., select the first value of the columns MaxOfcount percent and state code, and display grpno. as well. I have renamed them as FirstOfMaxOfState count percent and FirstOfstate code.
My input dataframe:
count percent grpno. state code MaxOfcount percent
0 14.78 1 CA 14.78
1 0.00 2 CA 0.00
2 0.00 2 FL 0.00
3 8.80 3 CA 8.80
4 0.00 6 NC 0.00
5 0.00 5 NC 0.00
6 59.00 4 MA 59.00
My output dataframe:
FirstOfMaxOfState count percent state pool number FirstOfstate code
0 14.78 1 CA
1 0.00 2 CA
2 8.80 3 CA
3 59.00 4 MA
4 0.00 5 NC
5 0.00 6 NC
Can anyone help on this?
Drop the unneeded column, group by grpno, take the first value, and flatten the multi-index:
df2 = df.drop(columns='count percent').groupby('grpno.').take([0]).reset_index(0)
Rename the columns:
mapping = {'state code': 'FirstOfstate code',
           'grpno.': 'state pool number',
           'MaxOfcount percent': 'FirstOfMaxOfState count percent'}
df2 = df2.rename(columns=mapping)
Result:
>>> df2
state pool number FirstOfstate code FirstOfMaxOfState count percent
0 1 CA 14.78
1 2 CA 0.00
3 3 CA 8.80
6 4 MA 59.00
5 5 NC 0.00
4 6 NC 0.00
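If the original row labels don't matter, the same table (with a fresh 0..n index) can be built in one chain; a sketch under the same column names:
df2 = (df.drop(columns='count percent')
         .groupby('grpno.', as_index=False)
         .first()
         .rename(columns={'grpno.': 'state pool number',
                          'state code': 'FirstOfstate code',
                          'MaxOfcount percent': 'FirstOfMaxOfState count percent'}))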

find the max of column group by another column pandas

I have a dataframe with 2 columns:
count percent grpno.
0 14.78 1
1 0.00 2
2 8.80 3
3 9.60 4
4 55.90 4
5 0.00 2
6 0.00 6
7 0.00 5
8 6.90 1
9 59.00 4
I need to get the max of column 'count percent', grouped by column 'grpno.'. I tried doing this with
geostat.groupby(['grpno.'], sort=False)['count percent'].max()
I get the output to be
grpno.
1 14.78
2 0.00
3 8.80
4 59.00
6 0.00
5 0.00
Name: count percent, dtype: float64
But I need the output to be a dataframe with the columns named 'grpno.' and 'MaxOfcount percent'. Can anyone help with this? Thanks
res = df.groupby('grpno.')['count percent'].max().reset_index()
res.columns = ['grpno.', 'MaxOfcount percent']
grpno. MaxOfcount percent
0 1 14.78
1 2 0.00
2 3 8.80
3 4 59.00
4 5 0.00
5 6 0.00
You could also do it in one line:
res = df.groupby('grpno.', as_index=False)['count percent'].max().rename(columns={'count percent': 'MaxOfcount percent'})
You could use groupby with argument as_index=False:
In [119]: df.groupby(['grpno.'], as_index=False)[['count percent']].max()
Out[119]:
grpno. count percent
0 1 14.78
1 2 0.00
2 3 8.80
3 4 59.00
4 5 0.00
5 6 0.00
df1 = df.groupby(['grpno.'], as_index=False)[['count percent']].max()
df1.columns = df1.columns[:-1].tolist() + ['MaxOfcount percent']
In [130]: df1
Out[130]:
grpno. MaxOfcount percent
0 1 14.78
1 2 0.00
2 3 8.80
3 4 59.00
4 5 0.00
5 6 0.00
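In pandas 0.25+, named aggregation folds the rename into the groupby step itself; because the target name contains a space, it has to be passed via ** unpacking:
res = df.groupby('grpno.', as_index=False).agg(**{'MaxOfcount percent': ('count percent', 'max')})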
