Append data with one column to existing dataframe - python

I want to append a list of data to a dataframe such that the list will appear in a column, i.e.:
#Existing dataframe:
   A  20150901  20150902
0  1         4         5
1  4         2         7
#list of data to append to column A:
data = [8,9,4]
#Required dataframe:
   A  20150901  20150902
0  1         4         5
1  4         2         7
2  8         0         0
3  9         0         0
4  4         0         0
I am using the following:
df_new = df.copy(deep=True)
#I copy and delete the data because the column names are Timestamps, which makes them easier to reuse
df_new.drop(df_new.index, inplace=True)
for item in data:
    df_new = df_new.append([{'A': item}], ignore_index=True)
df_new.fillna(0, inplace=True)
df = pd.concat([df, df_new], axis=0, ignore_index=True)
But doing this in a loop is inefficient, and I get this warning:
Passing list-likes to .loc or [] with any missing label will raise
KeyError in the future, you can use .reindex() as an alternative.
Any ideas on how to avoid this warning and append the two dataframes in one go?

I think you need to concat a new DataFrame containing column A, then reindex if you want to keep the same column order, and finally replace the missing values with fillna:
data = [8,9,4]
df_new = pd.DataFrame({'A':data})
df = (pd.concat([df, df_new], ignore_index=True)
        .reindex(columns=df.columns)
        .fillna(0, downcast='infer'))
print (df)
   A  20150901  20150902
0  1         4         5
1  4         2         7
2  8         0         0
3  9         0         0
4  4         0         0
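Note: recent pandas versions deprecate the downcast argument of fillna. Assuming all columns should end up as integers (as in the sample data), an explicit cast is a minimal alternative:
df = (pd.concat([df, df_new], ignore_index=True)
        .reindex(columns=df.columns)
        .fillna(0)
        .astype(int))  #explicit cast instead of the deprecated downcast='infer'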

I think you could do something like this:
df = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB'))
df2 = pd.DataFrame({'A':[8,9,4]})
df.append(df2).fillna(0)
   A    B
0  1  2.0
1  3  4.0
0  8  0.0
1  9  0.0
2  4  0.0
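Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0; the concat equivalent gives the same result:
pd.concat([df, df2]).fillna(0)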

Maybe you can do it this way:
import numpy as np

new = pd.DataFrame(np.zeros((3, 3)), columns=existed_dataframe.columns)  #create a zero dataframe with matching columns
new['A'] = [8, 9, 4]                                                     #add the values
existed_dataframe = existed_dataframe.append(new, ignore_index=True)     #and merge both dataframes

Related

Python Pandas Change Column to Headings

I have data in the following format: Table 1
This data is loaded into a pandas dataframe, with the date column as its index. How can I make the names become the column headings (they are unique), with the values lined up under the right dates?
So it would look something like this:
Table 2
Consider the following toy DataFrame:
>>> df = pd.DataFrame({'x': [1,2,3,4], 'y':['0 a','2 a','3 b','0 b']})
>>> df
   x    y
0  1  0 a
1  2  2 a
2  3  3 b
3  4  0 b
Start by processing each row into a Series: each 'y' value like '0 a' is split into ['0', 'a'] and reversed, so the letter becomes the column name and the number becomes the value:
>>> new_columns = df['y'].apply(lambda x: pd.Series(dict([reversed(x.split())])))
>>> new_columns
     a    b
0    0  NaN
1    2  NaN
2  NaN    3
3  NaN    0
Alternatively, new columns can be generated using pivot (the effect is the same):
>>> new_columns = df['y'].str.split(n=1, expand=True).pivot(columns=1, values=0)
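On the toy frame this prints the same columns; the only cosmetic difference is that pivot names the columns axis 1:
>>> new_columns
1    a    b
0    0  NaN
1    2  NaN
2  NaN    3
3  NaN    0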
Finally, concatenate the original and the new DataFrame objects:
>>> df = pd.concat([df, new_columns], axis=1)
>>> df
   x    y    a    b
0  1  0 a    0  NaN
1  2  2 a    2  NaN
2  3  3 b  NaN    3
3  4  0 b  NaN    0
Drop any columns that you don't require:
>>> df.drop(['y'], axis=1)
   x    a    b
0  1    0  NaN
1  2    2  NaN
2  3  NaN    3
3  4  NaN    0
You will need to split out the column’s values, then rename your dataframe’s columns, and then you can pivot() the dataframe. I have added the steps below:
df = df['y'].str.split(' ', expand=True)  #split the combined column (here 'y'; the value comes first, then the name)
df.columns = ['values', 'col_name']       #use whatever naming convention you like
df.pivot(columns='col_name', values='values')
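On the toy frame above this prints (the values stay strings, since they come from a split):
col_name    a    b
0           0  NaN
1           2  NaN
2         NaN    3
3         NaN    0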
Please let me know if this helps.

Combine 3 dataframe columns into 1 with priority while avoiding apply

Let's say I have 3 different columns
  Column1  Column2  Column3
0       a        1      NaN
1     NaN        3        4
2       b        6        7
3     NaN      NaN        7
and I want to create 1 final column that would take the first value that isn't NA, resulting in:
  Column1
0       a
1       3
2       b
3       7
I would usually do this with a custom apply function:
df.apply(lambda x: ...)
I need to do this for many different cases with millions of rows and this becomes very slow. Are there any operations that would take advantage of vectorization to make this faster?
Back-fill missing values along the rows, then take the first column: select it with [] and a list for a one-column DataFrame, or without for a Series:
df1 = df.bfill(axis=1).iloc[:, [0]]
s = df.bfill(axis=1).iloc[:, 0]
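A minimal, self-contained sketch, reconstructing the sample frame from the question:
import numpy as np
import pandas as pd

df = pd.DataFrame({'Column1': ['a', np.nan, 'b', np.nan],
                   'Column2': [1, 3, 6, np.nan],
                   'Column3': [np.nan, 4, 7, 7]})

s = df.bfill(axis=1).iloc[:, 0]  #first non-missing value per row: a, 3, b, 7
print(s)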
You can chain fillna() for this, as below:
df['Column1'].fillna(df['Column2']).fillna(df['Column3'])
output:
0    a
1    3
2    b
3    7
For more than 3 columns, this can be placed in a for loop as below, with new_col being your output:
new_col = df['Column1']
for col in df.columns:
    new_col = new_col.fillna(df[col])
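Equivalently, the loop can be folded into a single expression with functools.reduce, a sketch of the same chained-fillna idea:
from functools import reduce

#start from the first column, then fill its gaps from each remaining column in order
new_col = reduce(lambda acc, col: acc.fillna(df[col]),
                 df.columns[1:], df[df.columns[0]])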

Combining a list of tuple dataframes in python

I have a large dataset where every two rows need to be grouped together and combined into one longer row, basically duplicating the headers and appending the 2nd row to the 1st. Here is a small sample:
df = pd.DataFrame({'ID' : [1,1,2,2],'Var1': ['A', 2, 'C', 7], 'Var2': ['B', 5, 'D', 9]})
print(df)
   ID Var1 Var2
0   1    A    B
1   1    2    5
2   2    C    D
3   2    7    9
I have to group the rows by 'ID', so I ran:
grouped = df.groupby(['ID'])
grp_lst = list(grouped)
This resulted in a list of tuples grouped by ID, where the second element of each tuple is the grouped dataframe I would like to combine.
The desired result is a dataframe that looks something like this:
ID Var1 Var2  ID.1 Var1.1 Var2.1
 1    A    B     1      2      5
 2    C    D     2      7      9
I have to do this over a large data set, where 'ID' groups the rows, and I basically want to append each bottom row to the end of the top one.
Any help would be appreciated; I assume there is a much easier way to do this than what I am doing.
Thanks in advance!
Let us try:
i = df.groupby('ID').cumcount().astype(str)
df_out = df.set_index([df['ID'].values, i]).stack().unstack([2, 1])
df_out.columns = df_out.columns.map('.'.join)
Details:
group the dataframe on ID and use cumcount to create a sequential counter that uniquely identifies the rows per ID:
>>> i
0    0
1    1
2    0
3    1
dtype: object
Create a multilevel index in the dataframe with the first level set to the ID values and the second level set to the above sequential counter, then use stack followed by unstack to reshape the dataframe into the desired format:
>>> df_out
  ID Var1 Var2 ID Var1 Var2   #---> Level 0 columns
   0    0    0  1    1    1   #---> Level 1 columns
1  1    A    B  1    2    5
2  2    C    D  2    7    9
Finally, flatten the multilevel columns using Index.map with join:
>>> df_out
  ID.0 Var1.0 Var2.0 ID.1 Var1.1 Var2.1
1    1      A      B    1      2      5
2    2      C      D    2      7      9
Here is another way using numpy: reshape the dataframe first, then tile the columns, and create a new dataframe from the reshaped values and tiled columns:
import numpy as np

s = df.shape[1]
c = np.tile(df.columns, 2) + '.' + (np.arange(s * 2) // s).astype(str)
df_out = pd.DataFrame(df.values.reshape(-1, s * 2), columns=c)
>>> df_out
  ID.0 Var1.0 Var2.0 ID.1 Var1.1 Var2.1
0    1      A      B    1      2      5
1    2      C      D    2      7      9
Note: this method is only applicable if you have exactly two rows per ID and the rows are already sorted by ID.
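If you want to guard those assumptions before reshaping, a quick sketch of a check:
#assumptions from the note above: exactly two rows per ID, rows sorted by ID
assert df['ID'].is_monotonic_increasing
assert df.groupby('ID').size().eq(2).all()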

Adding a new column from a list when the list has more values than rows in the dataframe

I have a dataframe and a list
df=pd.read_csv('aa.csv')
temp = ['1','2','3','4','5','6','7']
My dataframe has only 3 rows. I am adding temp as a new column:
df['temp']=pd.Series(temp)
But in the final df I only get the first 3 values of temp; the rest are dropped. Is there any way to add a list that is larger/smaller than the dataframe as a new column?
Thanks
Use DataFrame.reindex to create rows filled with missing values before creating the new column:
df = pd.read_csv('aa.csv')
temp = ['1','2','3','4','5','6','7']
df = df.reindex(range(len(temp)))
df['temp'] = pd.Series(temp)
Sample:
df = pd.DataFrame({'A': [1,2,3]})
print(df)
   A
0  1
1  2
2  3
temp = ['1','2','3','4','5','6','7']
df = df.reindex(range(len(temp)))
df['temp']=pd.Series(temp)
print (df)
     A temp
0  1.0    1
1  2.0    2
2  3.0    3
3  NaN    4
4  NaN    5
5  NaN    6
6  NaN    7
Or use concat with a Series, specifying a name for the new column:
s = pd.Series(temp, name='temp')
df = pd.concat([df, s], axis=1)
Similar:
s = pd.Series(temp)
df = pd.concat([df, s.rename('temp')], axis=1)
print (df)
     A temp
0  1.0    1
1  2.0    2
2  3.0    3
3  NaN    4
4  NaN    5
5  NaN    6
6  NaN    7
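The question also mentions smaller lists; starting from the original df and temp, the same reindex idea covers both directions if you reindex to the longer of the two lengths, a sketch:
n = max(len(df), len(temp))
df = df.reindex(range(n))     #pads the frame with NaN rows if the list is longer
df['temp'] = pd.Series(temp)  #index alignment pads the column with NaN if the list is shorter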

Adding rows to a Dataframe to unify the length of groups

I would like to add elements to specific groups in a pandas DataFrame in a selective way. In particular, I would like to add zeros so that all groups have the same number of elements. The following is a simple example:
import pandas as pd
df = pd.DataFrame([[1,1], [2,2], [1,3], [2,4], [2,5]], columns=['key', 'value'])
df
   key  value
0    1      1
1    2      2
2    1      3
3    2      4
4    2      5
I would like to have the same number of elements per group (where grouping is by the key column). Group 2 has the most elements: three. Group 1 has only two, so a zero should be added as follows:
   key  value
0    1      1
1    2      2
2    1      3
3    2      4
4    2      5
5    1      0
Note that the index does not matter.
You can create a new level of a MultiIndex with cumcount and then add the missing values via unstack/stack or reindex:
df = (df.set_index(['key', df.groupby('key').cumcount()])['value']
        .unstack(fill_value=0)
        .stack()
        .reset_index(level=1, drop=True)
        .reset_index(name='value'))
Alternative solution:
df = df.set_index(['key', df.groupby('key').cumcount()])
mux = pd.MultiIndex.from_product(df.index.levels, names = df.index.names)
df = df.reindex(mux, fill_value=0).reset_index(level=1, drop=True).reset_index()
print (df)
   key  value
0    1      1
1    1      3
2    1      0
3    2      2
4    2      4
5    2      5
If the order of values is important:
df1 = df.set_index(['key', df.groupby('key').cumcount()])
mux = pd.MultiIndex.from_product(df1.index.levels, names = df1.index.names)
#get appended values
miss = mux.difference(df1.index).get_level_values(0)
#create helper df and add 0 to all columns of original df
df2 = pd.DataFrame({'key':miss}).reindex(columns=df.columns, fill_value=0)
#append to original df
df = pd.concat([df, df2], ignore_index=True)
print (df)
   key  value
0    1      1
1    2      2
2    1      3
3    2      4
4    2      5
5    1      0
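A quick check that the groups now have equal lengths, using the frame produced by either solution:
print(df.groupby('key').size())
key
1    3
2    3
dtype: int64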
