Populate Pandas Dataframe Based on Column Values Matching Other Column Names - python

I'd like to populate one dataframe (df2) based on the column names of df2 matching values within a column in another dataframe (df1). Here is a simplified example:
import numpy as np
import pandas as pd

names = list('abcd')
data = list('aadc')
df1 = pd.DataFrame(data, columns=['data'])
df2 = pd.DataFrame(np.empty([4, 4]), columns=names)
df1:
data
0 a
1 a
2 d
3 c
df2:
a b c d
0 0.00 0.00 0.00 0.00
1 0.00 0.00 0.00 0.00
2 0.00 0.00 0.00 0.00
3 0.00 0.00 0.00 0.00
I'd like to update df2 so that the first row has a 1 under column a and 0 under the other columns. The second row of df2 would be the same, the third row would have a 0 for columns a/b/c and a 1 for column d, and the fourth row would have a 0 for columns a/b/d and a 1 for column c.
Thanks very much for the help!

You can do numpy broadcasting here:
df2[:] = (df1['data'].values[:,None] == df2.columns.values).astype(int)
Or use get_dummies:
df2[:] = pd.get_dummies(df1['data']).reindex(df2.columns, axis=1, fill_value=0)
Output:
a b c d
0 1 0 0 0
1 1 0 0 0
2 0 0 0 1
3 0 0 1 0
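
For reference, the broadcasting trick compares a column vector of the data values against the row of column names, giving a row-by-column boolean matrix; a minimal standalone sketch of that intermediate step (same names as above):

import numpy as np
import pandas as pd

data = pd.Series(list('aadc'), name='data')
cols = np.array(list('abcd'))

# shape (4, 1) values compared against shape (4,) names -> (4, 4) boolean matrix
mask = data.values[:, None] == cols
print(mask.astype(int))
# [[1 0 0 0]
#  [1 0 0 0]
#  [0 0 0 1]
#  [0 0 1 0]]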

Related

Rename Pandas column values

I have a pandas dataframe like this:
Column 1   Column 2
1          a
2          a
3          b
4          c
5          d
I want to rename the values in Column 2 as follows:
Column 1   Column 2
1          row1
2          row1
3          row2
4          row3
5          row4
I have tried hard-coded approaches, like renaming each value individually, but in practice I have lots of rows, so hard-coding is not possible. Is there a function or something in Python that can do this task for me?
Let's try Series.factorize
df['Column2'] = (pd.Series(df['Column2'].factorize()[0])
                   .add(1).astype(str).radd('row'))
print(df)
Column1 Column2
0 1 row1
1 2 row1
2 3 row2
3 4 row3
4 5 row4
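
An equivalent sketch without factorize, building the mapping from the order of first appearance (this variant is not from the answer above):

import pandas as pd

df = pd.DataFrame({'Column1': [1, 2, 3, 4, 5],
                   'Column2': list('aabcd')})

# map each distinct value to rowN in order of first appearance
codes = {v: i + 1 for i, v in enumerate(df['Column2'].unique())}
df['Column2'] = df['Column2'].map(codes).astype(str).radd('row')
print(df)
#    Column1 Column2
# 0        1    row1
# 1        2    row1
# 2        3    row2
# 3        4    row3
# 4        5    row4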

Most efficient way to multiply every column of a large pandas dataframe with every other column of the same dataframe

Suppose I have a dataset that looks something like:
INDEX A B C
1 1 1 0.75
2 1 1 1
3 1 0 0.35
4 0 0 1
5 1 1 0
I want to get a dataframe that looks like the following, with the original columns, and all possible interactions between columns:
INDEX A B C A_B A_C B_C
1 1 1 0.75 1 0.75 0.75
2 1 1 1 1 1 1
3 1 0 0.35 0 0.35 0
4 0 0 1 0 0 0
5 1 1 0 1 0 0
My actual datasets are pretty large (~100 columns). What is the fastest way to achieve this?
I could, of course, do a nested loop, or similar, to achieve this but I was hoping there is a more efficient way.
You could use itertools.combinations for this:
>>> import pandas as pd
>>> from itertools import combinations
>>> df = pd.DataFrame({
... "A": [1,1,1,0,1],
... "B": [1,1,0,0,1],
... "C": [.75,1,.35,1,0]
... })
>>> df.head()
A B C
0 1 1 0.75
1 1 1 1.00
2 1 0 0.35
3 0 0 1.00
4 1 1 0.00
>>> for col1, col2 in combinations(df.columns, 2):
...     df[f"{col1}_{col2}"] = df[col1] * df[col2]
...
>>> df.head()
A B C A_B A_C B_C
0 1 1 0.75 1 0.75 0.75
1 1 1 1.00 1 1.00 1.00
2 1 0 0.35 0 0.35 0.00
3 0 0 1.00 0 0.00 0.00
4 1 1 0.00 1 0.00 0.00
If you need to vectorize an arbitrary function over the pairs of columns, you could use:
import numpy as np

def fx(x, y):
    return np.multiply(x, y)

for col1, col2 in combinations(df.columns, 2):
    df[f"{col1}_{col2}"] = np.vectorize(fx)(df[col1], df[col2])
I am not aware of a native pandas function to solve this, but itertools.combinations would be an improvement over a nested loop.
You could do something like:
import pandas as pd
from itertools import combinations

df = pd.DataFrame(data={"A": [1, 1, 1, 0, 1],
                        "B": [1, 1, 0, 0, 1],
                        "C": [0.75, 1, 0.35, 1, 0]})

for comb in combinations(df.columns, 2):
    col_name = comb[0] + "_" + comb[1]
    df[col_name] = df[comb[0]] * df[comb[1]]
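
For wide frames (the ~100 numeric columns mentioned in the question), a fully vectorized variant that builds every pairwise product in a single NumPy operation may avoid the Python-level loop entirely; this is a sketch of my own, not from the answers above, and assumes all columns are numeric:

import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1, 1, 1, 0, 1],
                   "B": [1, 1, 0, 0, 1],
                   "C": [0.75, 1, 0.35, 1, 0]})

arr = df.to_numpy()
i, j = np.triu_indices(arr.shape[1], k=1)   # index pairs for every column combination
prods = arr[:, i] * arr[:, j]               # all pairwise products in one step
names = [f"{df.columns[a]}_{df.columns[b]}" for a, b in zip(i, j)]
out = pd.concat([df, pd.DataFrame(prods, columns=names, index=df.index)], axis=1)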

Pandas add new columns in subloops back to the main dataframe

I have a dataframe that looks like this:
ids value
1 0.1
1 0.2
1 0.14
2 0.22
....
I am trying to loop through each id and calculate new columns for each one.
for id, row in df.groupby('ids'):
    x = row.loc[0, 'value']
    for i in range(len(row)):
        row.loc[i, 'new_col_1'] = i * x
        row.loc[i, 'new_col_2'] = i * x * 10
My goal is to add the 2 new columns for each id back to the original dataframe, so my df would look like this:
ids value new_col_1 new_col_2
1 0.1 0 0
1 0.2 0.2 2
1 0.14 0.28 2.8
2 0.22 0 0
....
cumcount, with a little NumPy broadcasting sprinkled in.
cumcount gives you the per-group counter that your for i in range(len(row)) loop builds:
df.groupby('ids').cumcount()
0 0
1 1
2 2
3 0
dtype: int64
c = df.groupby('ids').cumcount()
v = df.value

df.join(
    pd.DataFrame(
        (c.values * v.values)[:, None] * [1, 10],
        df.index,
    ).rename(columns=lambda x: f"new_col_{x + 1}")
)
ids value new_col_1 new_col_2
0 1 0.10 0.00 0.0
1 1 0.20 0.20 2.0
2 1 0.14 0.28 2.8
3 2 0.22 0.00 0.0
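
The same result can also be written more plainly by assigning the two columns directly from the cumcount; a minimal sketch assuming the frame shown in the question:

import pandas as pd

df = pd.DataFrame({'ids': [1, 1, 1, 2],
                   'value': [0.1, 0.2, 0.14, 0.22]})

c = df.groupby('ids').cumcount()   # 0, 1, 2, ... within each id
df['new_col_1'] = c * df['value']
df['new_col_2'] = c * df['value'] * 10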

Pandas read csv with multiple whitespaces and parse dates

I have a csv file that looks like
Year Mo Da (01,52)
1950 1 1 0.00
1950 1 2 0.00
1950 1 3 0.05
1950 1 4 0.00
1950 1 5 0.07
1950 1 6 0.07
and I would like to transform it into a dataframe with 2 columns: a datetime column in YYYYMMDD form (built from the "Year", "Mo", and "Da" columns in the raw data), and the rainfall at the grid point (e.g. 01, 52) as the second column.
A desired output would be:
Datetime Rainfall
19500101 0.00
19500102 0.00
19500103 0.05
I am stuck on two issues: appropriately accounting for the whitespace during the read-in and properly using parse_dates.
The simple read-in command:
df = pd.read_csv(csv_fl)
Almost correctly reads in the headers, but splits the (01,52) into separate columns, yielding a trailing NaN, which shouldn't be there.
Year Mo Da (01 52)
0 1950 1 1 0.00 NaN
And trying to parse the dates using
df = pd.read_csv(csv_fl, parse_dates={'Datetime':[0,1,2]}, index_col=0)
leads to an IndexError
colnames.append(str(columns[c]))
IndexError: list index out of range
Any guidance is much appreciated.
If you pass delim_whitespace=True and pass the 3 columns as a list to parse_dates, the last step is just to overwrite the column names:
In [96]:
import pandas as pd
import io
t="""Year Mo Da (01,52)
1950 1 1 0.00
1950 1 2 0.00
1950 1 3 0.05
1950 1 4 0.00
1950 1 5 0.07
1950 1 6 0.07"""
df = pd.read_csv(io.StringIO(t), delim_whitespace=True, parse_dates=[['Year','Mo','Da']])
df.columns = ['Datetime', 'Rainfall']
df
Out[96]:
Datetime Rainfall
0 1950-01-01 0.00
1 1950-01-02 0.00
2 1950-01-03 0.05
3 1950-01-04 0.00
4 1950-01-05 0.07
5 1950-01-06 0.07
So I expect df = pd.read_csv(csv_fl, delim_whitespace=True, parse_dates=[['Year','Mo','Da']]) to work, followed by overwriting the column names.
>>> filename = "..."
>>> pd.read_csv(filename,
...             sep=" ",
...             skipinitialspace=True,
...             parse_dates={'Datetime': [0, 1, 2]},
...             usecols=[0, 1, 2, 3],
...             names=["Y", "M", "D", "Rainfall"],
...             skiprows=1)
Datetime Rainfall
0 1950-01-01 0.00
1 1950-01-02 0.00
2 1950-01-03 0.05
3 1950-01-04 0.00
4 1950-01-05 0.07
5 1950-01-06 0.07
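
If the plain YYYYMMDD strings shown in the desired output are actually needed (rather than datetime64 values), a follow-up step along these lines should work, assuming df is the parsed frame from either answer:

df['Datetime'] = df['Datetime'].dt.strftime('%Y%m%d')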

Group by one column and find first of two columns pandas

I have one dataframe, geomerge. I need to group by the column grpno., select the first value of the column MaxOfcount percent and the first value of the column state code, and display grpno. as well. I have renamed them FirstOfMaxOfState count percent and FirstOfstate code.
My input dataframe:
count percent grpno. state code MaxOfcount percent
0 14.78 1 CA 14.78
1 0.00 2 CA 0.00
2 0.00 2 FL 0.00
3 8.80 3 CA 8.80
4 0.00 6 NC 0.00
5 0.00 5 NC 0.00
6 59.00 4 MA 59.00
My output dataframe:
FirstOfMaxOfState count percent state pool number FirstOfstate code
0 14.78 1 CA
1 0.00 2 CA
2 8.80 3 CA
3 59.00 4 MA
4 0.00 5 NC
5 0.00 6 NC
Can anyone help on this?
Drop the unneeded column, group by grpno., take the first row of each group, and flatten the multi-index:
df2 = df.drop('count percent', axis=1).groupby('grpno.').take([0]).reset_index(0)
Rename the columns:
mapping = {'state code': 'FirstOfstate code',
           'grpno.': 'state pool number',
           'MaxOfcount percent': 'FirstOfMaxOfState count percent'}
df2 = df2.rename(columns=mapping)
Result:
>>> df2
state pool number FirstOfMaxOfState count percent FirstOfstate code
0 1 14.78 CA
1 2 0.00 CA
3 3 8.80 CA
6 4 59.00 MA
5 5 0.00 NC
4 6 0.00 NC
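
An alternative sketch (not from the answer above) that reaches the same columns with groupby plus first and a plain rename; note it resets the index rather than keeping the original row labels:

mapping = {'grpno.': 'state pool number',
           'MaxOfcount percent': 'FirstOfMaxOfState count percent',
           'state code': 'FirstOfstate code'}

out = (df.drop(columns='count percent')
         .groupby('grpno.', as_index=False)
         .first()
         .rename(columns=mapping))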
