Populate Pandas Dataframe Based on Column Values Matching Other Column Names - python

I'd like to populate one dataframe (df2) based on the column names of df2 matching values within a column in another dataframe (df1). Here is a simplified example:
import numpy as np
import pandas as pd

names = list('abcd')
data = list('aadc')
df1 = pd.DataFrame(data, columns=['data'])
df2 = pd.DataFrame(np.empty([4, 4]), columns=names)
df1:
data
0 a
1 a
2 d
3 c
df2:
a b c d
0 0.00 0.00 0.00 0.00
1 0.00 0.00 0.00 0.00
2 0.00 0.00 0.00 0.00
3 0.00 0.00 0.00 0.00
I'd like to update df2 so that the first row has a 1 under column a and 0 under the other columns. The second row of df2 would be the same, the third row would have a 0 for columns a/b/c and a 1 for column d, and the fourth row would have a 0 for columns a/b/d and a 1 for column c.
Thanks very much for the help!

You can do numpy broadcasting here:
df2[:] = (df1['data'].values[:,None] == df2.columns.values).astype(int)
Or use get_dummies:
df2[:] = pd.get_dummies(df1['data']).reindex(df2.columns, axis=1, fill_value=0)
Output:
a b c d
0 1 0 0 0
1 1 0 0 0
2 0 0 0 1
3 0 0 1 0
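
For reference, the broadcasting trick compares a column vector of the data values against the row of column names, giving a row-by-column boolean matrix; a minimal standalone sketch of that intermediate step (same names as above):

import numpy as np
import pandas as pd

data = pd.Series(list('aadc'), name='data')
cols = np.array(list('abcd'))

# shape (4, 1) values compared against shape (4,) names -> (4, 4) boolean matrix
mask = data.values[:, None] == cols
print(mask.astype(int))
# [[1 0 0 0]
#  [1 0 0 0]
#  [0 0 0 1]
#  [0 0 1 0]]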

Related

Rename Pandas column values

I have a pandas dataframe like this:
Column 1   Column 2
1          a
2          a
3          b
4          c
5          d
I want to rename the values in Column 2 as follows:
Column 1   Column 2
1          row1
2          row1
3          row2
4          row3
5          row4
I have tried hard-coded approaches, like renaming each value individually, but in practice I have lots of rows, so hard-coding is not possible. Is there a function or something in Python that can do this task for me?
Let's try Series.factorize
df['Column2'] = (pd.Series(df['Column2'].factorize()[0])
                   .add(1).astype(str).radd('row'))
print(df)
Column1 Column2
0 1 row1
1 2 row1
2 3 row2
3 4 row3
4 5 row4
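
An equivalent sketch without factorize, building the mapping from the order of first appearance (this variant is not from the answer above):

import pandas as pd

df = pd.DataFrame({'Column1': [1, 2, 3, 4, 5],
                   'Column2': list('aabcd')})

# map each distinct value to rowN in order of first appearance
codes = {v: i + 1 for i, v in enumerate(df['Column2'].unique())}
df['Column2'] = df['Column2'].map(codes).astype(str).radd('row')
print(df)
#    Column1 Column2
# 0        1    row1
# 1        2    row1
# 2        3    row2
# 3        4    row3
# 4        5    row4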

Most efficient way to multiply every column of a large pandas dataframe with every other column of the same dataframe

Suppose I have a dataset that looks something like:
INDEX A B C
1 1 1 0.75
2 1 1 1
3 1 0 0.35
4 0 0 1
5 1 1 0
I want to get a dataframe that looks like the following, with the original columns, and all possible interactions between columns:
INDEX A B C A_B A_C B_C
1 1 1 0.75 1 0.75 0.75
2 1 1 1 1 1 1
3 1 0 0.35 0 0.35 0
4 0 0 1 0 0 0
5 1 1 0 1 0 0
My actual datasets are pretty large (~100 columns). What is the fastest way to achieve this?
I could, of course, do a nested loop, or similar, to achieve this but I was hoping there is a more efficient way.
You could use itertools.combinations for this:
>>> import pandas as pd
>>> from itertools import combinations
>>> df = pd.DataFrame({
... "A": [1,1,1,0,1],
... "B": [1,1,0,0,1],
... "C": [.75,1,.35,1,0]
... })
>>> df.head()
A B C
0 1 1 0.75
1 1 1 1.00
2 1 0 0.35
3 0 0 1.00
4 1 1 0.00
>>> for col1, col2 in combinations(df.columns, 2):
...     df[f"{col1}_{col2}"] = df[col1] * df[col2]
...
>>> df.head()
A B C A_B A_C B_C
0 1 1 0.75 1 0.75 0.75
1 1 1 1.00 1 1.00 1.00
2 1 0 0.35 0 0.35 0.00
3 0 0 1.00 0 0.00 0.00
4 1 1 0.00 1 0.00 0.00
If you need to vectorize an arbitrary function over the pairs of columns, you could use:
import numpy as np

def fx(x, y):
    return np.multiply(x, y)

for col1, col2 in combinations(df.columns, 2):
    df[f"{col1}_{col2}"] = np.vectorize(fx)(df[col1], df[col2])
I am not aware of a native pandas function to solve this, but itertools.combinations would be an improvement over a nested loop.
You could do something like:
import pandas as pd
from itertools import combinations

df = pd.DataFrame(data={"A": [1, 1, 1, 0, 1],
                        "B": [1, 1, 0, 0, 1],
                        "C": [0.75, 1, 0.35, 1, 0]})

for comb in combinations(df.columns, 2):
    col_name = comb[0] + "_" + comb[1]
    df[col_name] = df[comb[0]] * df[comb[1]]
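
For wide frames (the ~100 numeric columns mentioned in the question), a fully vectorized variant that builds every pairwise product in a single NumPy operation may avoid the Python-level loop entirely; this is a sketch of my own, not from the answers above, and assumes all columns are numeric:

import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1, 1, 1, 0, 1],
                   "B": [1, 1, 0, 0, 1],
                   "C": [0.75, 1, 0.35, 1, 0]})

arr = df.to_numpy()
i, j = np.triu_indices(arr.shape[1], k=1)   # index pairs for every column combination
prods = arr[:, i] * arr[:, j]               # all pairwise products in one step
names = [f"{df.columns[a]}_{df.columns[b]}" for a, b in zip(i, j)]
out = pd.concat([df, pd.DataFrame(prods, columns=names, index=df.index)], axis=1)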

Pandas add new columns in subloops back to the main dataframe

I have a dataframe that looks like this:
ids value
1 0.1
1 0.2
1 0.14
2 0.22
....
I am trying to loop through each id and calculate new columns for each one.
for id, row in df.groupby('ids'):
    x = row.loc[0, 'value']
    for i in range(len(row)):
        row.loc[i, 'new_col_1'] = i * x
        row.loc[i, 'new_col_2'] = i * x * 10
My goal is to add the 2 new columns for each id back to the original dataframe, so my df would look like this:
ids value new_col_1 new_col_2
1 0.1 0 0
1 0.2 0.2 2
1 0.14 0.28 2.8
2 0.22 0 0
....
cumcount, with a little NumPy broadcasting sprinkled in.
cumcount gives you the per-group counter that your for i in range(len(row)) loop builds:
df.groupby('ids').cumcount()
0 0
1 1
2 2
3 0
dtype: int64
c = df.groupby('ids').cumcount()
v = df.value

df.join(
    pd.DataFrame(
        (c.values * v.values)[:, None] * [1, 10],
        df.index,
    ).rename(columns=lambda x: f"new_col_{x + 1}")
)
ids value new_col_1 new_col_2
0 1 0.10 0.00 0.0
1 1 0.20 0.20 2.0
2 1 0.14 0.28 2.8
3 2 0.22 0.00 0.0
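
The same result can also be written more plainly by assigning the two columns directly from the cumcount; a minimal sketch assuming the frame shown in the question:

import pandas as pd

df = pd.DataFrame({'ids': [1, 1, 1, 2],
                   'value': [0.1, 0.2, 0.14, 0.22]})

c = df.groupby('ids').cumcount()   # 0, 1, 2, ... within each id
df['new_col_1'] = c * df['value']
df['new_col_2'] = c * df['value'] * 10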

Pandas read csv with multiple whitespaces and parse dates

I have a csv file that looks like
Year Mo Da (01,52)
1950 1 1 0.00
1950 1 2 0.00
1950 1 3 0.05
1950 1 4 0.00
1950 1 5 0.07
1950 1 6 0.07
and I would like to transform it into a dataframe with 2 columns: a datetime column in YYYYMMDD form (built from the "Year", "Mo", and "Da" columns in the raw data), and the rainfall at the grid point (e.g. 01, 52) as the second column.
A desired output would be:
Datetime Rainfall
19500101 0.00
19500102 0.00
19500103 0.05
I am stuck on two issues: appropriately accounting for the whitespace during the read-in and properly using parse_dates.
The simple read-in command:
df = pd.read_csv(csv_fl)
Almost correctly reads in the headers, but splits the (01,52) into separate columns, yielding a trailing NaN, which shouldn't be there.
Year Mo Da (01 52)
0 1950 1 1 0.00 NaN
And trying to parse the dates using
df = pd.read_csv(csv_fl, parse_dates={'Datetime':[0,1,2]}, index_col=0)
leads to an IndexError
colnames.append(str(columns[c]))
IndexError: list index out of range
Any guidance is much appreciated.
If you pass delim_whitespace=True and pass the 3 columns as a list to parse_dates, the last step is just to overwrite the column names:
In [96]:
import pandas as pd
import io
t="""Year Mo Da (01,52)
1950 1 1 0.00
1950 1 2 0.00
1950 1 3 0.05
1950 1 4 0.00
1950 1 5 0.07
1950 1 6 0.07"""
df = pd.read_csv(io.StringIO(t), delim_whitespace=True, parse_dates=[['Year','Mo','Da']])
df.columns = ['Datetime', 'Rainfall']
df
Out[96]:
Datetime Rainfall
0 1950-01-01 0.00
1 1950-01-02 0.00
2 1950-01-03 0.05
3 1950-01-04 0.00
4 1950-01-05 0.07
5 1950-01-06 0.07
So I expect df = pd.read_csv(csv_fl, delim_whitespace=True, parse_dates=[['Year','Mo','Da']]) to work, followed by overwriting the column names.
>>> filename = "..."
>>> pd.read_csv(filename,
...             sep=" ",
...             skipinitialspace=True,
...             parse_dates={'Datetime': [0, 1, 2]},
...             usecols=[0, 1, 2, 3],
...             names=["Y", "M", "D", "Rainfall"],
...             skiprows=1)
Datetime Rainfall
0 1950-01-01 0.00
1 1950-01-02 0.00
2 1950-01-03 0.05
3 1950-01-04 0.00
4 1950-01-05 0.07
5 1950-01-06 0.07
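
If the plain YYYYMMDD strings shown in the desired output are actually needed (rather than datetime64 values), a follow-up step along these lines should work, assuming df is the parsed frame from either answer:

df['Datetime'] = df['Datetime'].dt.strftime('%Y%m%d')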

Group by one column and find first of two columns pandas

I have one dataframe, geomerge. I need to group by the column grpno., select the first value of the column MaxOfcount percent and the first value of the column state code, and display grpno. as well. I have renamed them FirstOfMaxOfState count percent and FirstOfstate code.
My input dataframe:
count percent grpno. state code MaxOfcount percent
0 14.78 1 CA 14.78
1 0.00 2 CA 0.00
2 0.00 2 FL 0.00
3 8.80 3 CA 8.80
4 0.00 6 NC 0.00
5 0.00 5 NC 0.00
6 59.00 4 MA 59.00
My output dataframe:
FirstOfMaxOfState count percent state pool number FirstOfstate code
0 14.78 1 CA
1 0.00 2 CA
2 8.80 3 CA
3 59.00 4 MA
4 0.00 5 NC
5 0.00 6 NC
Can anyone help on this?
Drop the unneeded column, group by grpno., take the first row of each group, and flatten the multi-index:
df2 = df.drop('count percent', axis=1).groupby('grpno.').take([0]).reset_index(0)
Rename the columns:
mapping = {'state code': 'FirstOfstate code',
           'grpno.': 'state pool number',
           'MaxOfcount percent': 'FirstOfMaxOfState count percent'}
df2 = df2.rename(columns=mapping)
Result:
>>> df2
state pool number FirstOfMaxOfState count percent FirstOfstate code
0 1 14.78 CA
1 2 0.00 CA
3 3 8.80 CA
6 4 59.00 MA
5 5 0.00 NC
4 6 0.00 NC
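
An alternative sketch (not from the answer above) that reaches the same columns with groupby plus first and a plain rename; note it resets the index rather than keeping the original row labels:

mapping = {'grpno.': 'state pool number',
           'MaxOfcount percent': 'FirstOfMaxOfState count percent',
           'state code': 'FirstOfstate code'}

out = (df.drop(columns='count percent')
         .groupby('grpno.', as_index=False)
         .first()
         .rename(columns=mapping))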
