I have data in the following format: Table 1
This data is loaded into a pandas DataFrame, with the date column as the index. How can I make the names become the column headings (each must be unique), with the values lined up against the right dates?
So it would look something like this:
Table 2
Consider the following toy DataFrame:
>>> df = pd.DataFrame({'x': [1,2,3,4], 'y':['0 a','2 a','3 b','0 b']})
>>> df
x y
0 1 0 a
1 2 2 a
2 3 3 b
3 4 0 b
Start by processing each row into a Series:
>>> new_columns = df['y'].apply(lambda x: pd.Series(dict([reversed(x.split())])))
>>> new_columns
a b
0 0 NaN
1 2 NaN
2 NaN 3
3 NaN 0
Alternatively, new columns can be generated using pivot (the effect is the same):
>>> new_columns = df['y'].str.split(n=1, expand=True).pivot(columns=1, values=0)
Finally, concatenate the original and the new DataFrame objects:
>>> df = pd.concat([df, new_columns], axis=1)
>>> df
x y a b
0 1 0 a 0 NaN
1 2 2 a 2 NaN
2 3 3 b NaN 3
3 4 0 b NaN 0
Drop any columns that you don't require:
>>> df.drop(['y'], axis=1)
x a b
0 1 0 NaN
1 2 2 NaN
2 3 NaN 3
3 4 NaN 0
You will need to split out the column’s values, rename the resulting columns, and then pivot() the dataframe. The steps are below:
df = df[0].str.split(' ', expand=True) # assumes you only have the one column, labelled 0
df.columns = ['col_name', 'values'] # use whatever naming convention you like
df.pivot(columns='col_name', values='values')
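A self-contained sketch of these steps, assuming the dataframe has a date index and a single column labelled 0 whose entries are "name value" strings. The sample names, dates, and the variable names raw, parts and wide are made up for illustration:

```python
import pandas as pd

# Hypothetical input: a date index and one column (labelled 0) of "name value" strings.
raw = pd.DataFrame(
    {0: ["alice 10", "bob 20", "alice 30"]},
    index=pd.to_datetime(["2020-01-01", "2020-01-01", "2020-01-02"]),
)

# Split each string once into two columns; the date index is preserved.
parts = raw[0].str.split(" ", n=1, expand=True)
parts.columns = ["col_name", "values"]
parts["values"] = pd.to_numeric(parts["values"])

# One column per unique name, one row per date; missing pairs become NaN.
wide = parts.pivot(columns="col_name", values="values")
print(wide)
```

Note that pivot() raises a ValueError if the same name appears twice on one date; in that case pivot_table() with an aggregation function is the usual fallback.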
Please let me know if this helps.
Let's say I have 3 different columns
Column1 Column2 Column3
0 a 1 NaN
1 NaN 3 4
2 b 6 7
3 NaN NaN 7
and I want to create one final column that takes the first value in each row that isn't NA, resulting in:
Column1
0 a
1 3
2 b
3 7
I would usually do this with custom apply function:
df.apply(lambda x: ...)
I need to do this for many different cases with millions of rows and this becomes very slow. Are there any operations that would take advantage of vectorization to make this faster?
Back-fill the missing values along each row, then select the first column: index with a list ([0]) to get a one-column DataFrame, or with a scalar (0) to get a Series:
df1 = df.bfill(axis=1).iloc[:, [0]]
s = df.bfill(axis=1).iloc[:, 0]
You can chain Series.fillna() for this, as below:
df['Column1'].fillna(df['Column2']).fillna(df['Column3'])
output:
0 a
1 3
2 b
3 7
For more than 3 columns, this can be placed in a for loop as below, with new_col being your output:
new_col = df['Column1']
for col in df.columns:
    new_col = new_col.fillna(df[col])
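To check that the back-fill and the fillna loop agree, here is a minimal runnable sketch on the sample data from the question. Column1 is built with dtype=object as an assumption, so that strings and numbers can share the column; the names via_bfill and via_fillna are illustrative only:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    # object dtype so the column can hold both strings and numbers
    "Column1": pd.Series(["a", np.nan, "b", np.nan], dtype=object),
    "Column2": [1, 3, 6, np.nan],
    "Column3": [np.nan, 4, 7, 7],
})

# Vectorized: back-fill across each row, then keep the leftmost column.
via_bfill = df.bfill(axis=1).iloc[:, 0]

# Column by column: start from the first column and fill gaps from the rest.
via_fillna = df.iloc[:, 0]
for col in df.columns[1:]:
    via_fillna = via_fillna.fillna(df[col])

print(via_bfill.tolist())
print(via_fillna.tolist())
```

Both variants avoid a row-wise apply. The loop runs once per column rather than once per row, so it stays cheap when the frame has millions of rows but only a few columns.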
I am trying to add a column to a dataset by applying a dictionary to one of its existing columns. After running the code below, I get NaN in the new column, even though no values are missing from the column the dictionary is applied to.
Code:
import pandas as pd
df = pd.read_csv('test.csv')
val_dict = {'1':'8','2':'5','3':'3','4':'2'}
df['val2'] = df['val'].map(val_dict)
df
The output I am getting is
val val2
Based on your df, I assume the column val contains integer values, but the dictionary you presented above has str keys, so none of them match and map() returns NaN.
So change the dict keys from str to int (i.e. val_dict = {1:'8',2:'5',3:'3',4:'2'}).
Example 1 (str keys: every lookup misses and gives NaN)
df = pd.DataFrame({'val' : [1,2,2,1,2,3,3,4]})
val_dict = {'1':'8','2':'5','3':'3','4':'2'}
df['val_2'] = df['val'].map(val_dict)
print(df)
val val_2
0 1 NaN
1 2 NaN
2 2 NaN
3 1 NaN
4 2 NaN
5 3 NaN
6 3 NaN
7 4 NaN
Example 2 (int keys: the lookups succeed)
df = pd.DataFrame({'val' : [1,2,2,1,2,3,3,4]})
val_dict = {1:'8',2:'5',3:'3',4:'2'}
df['val_2'] = df['val'].map(val_dict)
print(df)
val val_2
0 1 8
1 2 5
2 2 5
3 1 8
4 2 5
5 3 3
6 3 3
7 4 2
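If the dictionary cannot simply be retyped by hand (for example, it is built elsewhere), the key types and the column dtype can be aligned in code. A minimal sketch, assuming the val column really is integer-typed as in the examples above; the names int_keys and val2_alt are illustrative:

```python
import pandas as pd

df = pd.DataFrame({"val": [1, 2, 2, 1, 2, 3, 3, 4]})
val_dict = {"1": "8", "2": "5", "3": "3", "4": "2"}

# Check what you are matching against before mapping.
print(df["val"].dtype)

# Option 1: convert the dict keys to match the integer column.
int_keys = {int(k): v for k, v in val_dict.items()}
df["val2"] = df["val"].map(int_keys)

# Option 2: convert the column to strings to match the dict keys.
df["val2_alt"] = df["val"].astype(str).map(val_dict)

print(df)
```

Either direction works; the point is only that map() compares keys exactly, so 1 and '1' are different keys.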
I have a dataframe and a list
df=pd.read_csv('aa.csv')
temp=['1','2','3','4','5','6','7']
My dataframe has only 3 rows. I am adding temp as a new column:
df['temp']=pd.Series(temp)
But in the final df I only get the first 3 values of temp; all the others are dropped. Is there any way to add a list that is longer or shorter than the dataframe as a new column?
Thanks
Use DataFrame.reindex to create rows filled with missing values before adding the new column:
df = pd.read_csv('aa.csv')
temp = ['1','2','3','4','5','6','7']
df = df.reindex(range(len(temp)))
df['temp'] = pd.Series(temp)
Sample:
df = pd.DataFrame({'A': [1,2,3]})
print(df)
A
0 1
1 2
2 3
temp = ['1','2','3','4','5','6','7']
df = df.reindex(range(len(temp)))
df['temp']=pd.Series(temp)
print (df)
A temp
0 1.0 1
1 2.0 2
2 3.0 3
3 NaN 4
4 NaN 5
5 NaN 6
6 NaN 7
Or use concat with a named Series; the name becomes the new column name:
s = pd.Series(temp, name='temp')
df = pd.concat([df, s], axis=1)
Similar:
s = pd.Series(temp)
df = pd.concat([df, s.rename('temp')], axis=1)
print (df)
A temp
0 1.0 1
1 2.0 2
2 3.0 3
3 NaN 4
4 NaN 5
5 NaN 6
6 NaN 7
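The concat approach can be wrapped so that it handles lists both longer and shorter than the dataframe. This is only a sketch: add_list_as_column is a hypothetical helper name, and it assumes the existing index can be discarded in favour of a fresh 0..n-1 index:

```python
import pandas as pd

def add_list_as_column(df, values, name):
    """Hypothetical helper: attach `values` as column `name`, padding with NaN."""
    s = pd.Series(values, name=name)
    # Outer-join on a fresh 0..n-1 index so neither side is truncated.
    return pd.concat([df.reset_index(drop=True), s], axis=1)

df = pd.DataFrame({"A": [1, 2, 3]})

# List longer than the frame: column A is padded with NaN.
longer = add_list_as_column(df, ["1", "2", "3", "4", "5", "6", "7"], "temp")

# List shorter than the frame: the new column is padded with NaN.
shorter = add_list_as_column(df, ["1", "2"], "temp")

print(longer)
print(shorter)
```

As in the sample output above, padding an integer column with NaN turns it into floats.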
I want to create a new column in my dataframe that places the name of the column in the row if only that column has a value of 8 in the respective row, otherwise the new column's value for the row would be "NONE". For the dataframe df, the new column df["New_Column"] = ["NONE","NONE","A","NONE"]
df = pd.DataFrame({"A": [1, 2,8,3], "B": [0, 2,4,8], "C": [0, 0,7,8]})
Cool problem.
Find the 8-fields in each row: df==8
Count them: (df==8).sum(axis=1)
Find the rows where the count is 1: (df==8).sum(axis=1)==1
Select just those rows from the original dataframe: df[(df==8).sum(axis=1)==1]
Find the 8-fields again: (df[(df==8).sum(axis=1)==1]==8)
Find the columns that hold the True values with idxmax (because True>False): (df[(df==8).sum(axis=1)==1]==8).idxmax(axis=1)
Fill in the gaps with "NONE"
To summarize:
df["New_Column"] = (df[(df==8).sum(axis=1)==1]==8).idxmax(axis=1)
df["New_Column"] = df["New_Column"].fillna("NONE")
# A B C New_Column
#0 1 0 0 NONE
#1 2 2 0 NONE
#2 8 4 7 A
#3 3 8 8 NONE
# I added another line as a proof of concept
#4 0 8 0 B
You can accomplish this using idxmax and a mask:
out = (df==8).idxmax(axis=1)
m = ~(df==8).any(axis=1) | ((df==8).sum(axis=1) > 1)
df.assign(col=out.mask(m))
A B C col
0 1 0 0 NaN
1 2 2 0 NaN
2 8 4 7 A
3 3 8 8 NaN
Or do:
df2=df[(df==8)]
df['New_Column']=(df2[(df2!=df2.dropna(thresh=2).values[0]).all(1)].dropna(how='all')).idxmax(1)
df['New_Column'] = df['New_Column'].fillna('NONE')
print(df)
This is dropna, dropna again, idxmax and fillna; that's all you need for this.
Output:
A B C New_Column
0 1 0 0 NONE
1 2 2 0 NONE
2 8 4 7 A
3 3 8 8 NONE
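A compact variant of the idxmax approaches above, sketched with Series.where so that "NONE" is filled in directly instead of via a separate fillna. The name is8 is illustrative:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 8, 3], "B": [0, 2, 4, 8], "C": [0, 0, 7, 8]})

# Boolean frame marking every cell equal to 8.
is8 = df.eq(8)

# idxmax gives the first column holding True in each row; keep that label only
# where exactly one column matched, and use "NONE" everywhere else.
df["New_Column"] = is8.idxmax(axis=1).where(is8.sum(axis=1).eq(1), "NONE")

print(df)
```

The mask is computed before New_Column is added, so the new string column does not take part in the comparison.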
Say I have a Dataframe
df = pd.DataFrame({'A':[0,1],'B':[2,3]})
A B
0 0 2
1 1 3
Then I have a Series generated by some other function using inputs from the first row of the df but which has no overlap with the existing df
s = pd.Series ({'C':4,'D':6})
C 4
D 6
Now I want to add s to df.loc[0] with the keys becoming new columns and the values added only to this first row. The end result for df should look like:
A B C D
0 0 2 4 6
1 1 3 NaN NaN
How would I do that? Similar questions I've found only seem to look at doing this for one column or just adding the Series as a new row at the end of the DataFrame but not updating an existing row by adding multiple new columns from a Series.
I've tried df.loc[0,list(['C','D'])] = [4,6] which was suggested in another answer but that only works if ['C','D'] are already existing columns in the Dataframe. df.assign(**s) works but then assigns the Series values to all rows.
join with transpose:
df.join(pd.DataFrame(s).T)
A B C D
0 0 2 4.0 6.0
1 1 3 NaN NaN
Or use concat
pd.concat([df, pd.DataFrame(s).T], axis=1)
A B C D
0 0 2 4.0 6.0
1 1 3 NaN NaN
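Both answers rely on pd.DataFrame(s).T producing a one-row frame whose index label happens to be 0. To target another row, the label can be set explicitly. A minimal sketch, where add_series_to_row is a hypothetical helper name and the Series keys are assumed not to overlap with existing columns (join raises otherwise):

```python
import pandas as pd

def add_series_to_row(df, s, row):
    """Hypothetical helper: add the keys of `s` as new columns, filled only at `row`."""
    # to_frame(name=row).T yields a single-row frame whose index label is `row`.
    return df.join(s.to_frame(name=row).T)

df = pd.DataFrame({"A": [0, 1], "B": [2, 3]})
s = pd.Series({"C": 4, "D": 6})

first = add_series_to_row(df, s, 0)   # same result as the answers above
second = add_series_to_row(df, s, 1)  # values land in the second row instead

print(first)
print(second)
```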