Hello, I am trying to insert 3 empty rows after each row of the current data using pandas, then export the data. For example, a sample of the current data could be:
name profession
Bill cashier
Sam stock
Adam security
Ideally what I want to achieve:
name profession
Bill cashier
NaN NaN
NaN NaN
NaN NaN
Sam stock
NaN NaN
NaN NaN
NaN NaN
Adam security
NaN NaN
NaN NaN
NaN NaN
I have experimented with itertools, however I am not sure how I can precisely get three empty rows after each row using this method. Any help, guidance, or sample would definitely be appreciated!
Using append on a DataFrame is quite inefficient, I believe (it has to reallocate memory for the entire frame each time).
DataFrames were meant for analyzing data and easily adding columns—but not rows.
So I think a good approach would be to create a new DataFrame of the correct size and then transfer the data over to it. The easiest way to do that is with an index.
import numpy as np
import pandas as pd

# Demonstration data
data = 'name profession Bill cashier Sam stock Adam security'
data = np.array(data.split()).reshape((4, 2))
df = pd.DataFrame(data[1:], columns=data[0])

# Add n blank rows after each existing row
n = 3
new_index = pd.RangeIndex(len(df) * (n + 1))
new_df = pd.DataFrame(np.nan, index=new_index, columns=df.columns)

# Positions of the original rows within the new frame
ids = np.arange(len(df)) * (n + 1)
new_df.loc[ids] = df.values
print(new_df)
Output:
name profession
0 Bill cashier
1 NaN NaN
2 NaN NaN
3 NaN NaN
4 Sam stock
5 NaN NaN
6 NaN NaN
7 NaN NaN
8 Adam security
9 NaN NaN
10 NaN NaN
11 NaN NaN
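Since the goal is to export the result afterwards, a to_csv call finishes the job (the filename here is just a placeholder):
new_df.to_csv('output.csv', index=False)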
Another option is to spread out the original index and then reindex to fill the gaps. Note the step has to be insert_rows + 1, since each original row also occupies a slot:
insert_rows = 3 # how many blank rows to insert after each row
df.index = range(0, (insert_rows + 1) * len(df), insert_rows + 1)
# create new_df with the added rows
new_df = df.reindex(index=range((insert_rows + 1) * len(df)))
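A quick check of this approach with the question's sample data (a sketch, assuming the same df as above):
import pandas as pd

df = pd.DataFrame({'name': ['Bill', 'Sam', 'Adam'],
                   'profession': ['cashier', 'stock', 'security']})

insert_rows = 3
df.index = range(0, (insert_rows + 1) * len(df), insert_rows + 1)
new_df = df.reindex(index=range((insert_rows + 1) * len(df)))
print(new_df.head(8))
This prints the first original row followed by three NaN rows, then the next original row, and so on.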
If you provided more information that would be helpful, but one thing that comes to mind is to use this command:
df.append(pd.Series(), ignore_index=True)
This will add an empty row to your DataFrame, though as you can see you have to set ignore_index=True, otherwise the append won't work.
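Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0. On current pandas, a sketch of the same idea with pd.concat (using the question's sample columns) would be:
import numpy as np
import pandas as pd

df = pd.DataFrame({'name': ['Bill', 'Sam', 'Adam'],
                   'profession': ['cashier', 'stock', 'security']})

# build one all-NaN row with the same columns and concatenate it on
blank = pd.DataFrame([[np.nan] * len(df.columns)], columns=df.columns)
df = pd.concat([df, blank], ignore_index=True)
print(df.tail(2))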
The code below includes a function to add empty rows between the existing rows of a dataframe.
This might not be the best approach for what you want to do; it might be better to add the blank rows when you are exporting the data.
import pandas as pd

# note: DataFrame.append was removed in pandas 2.0; use pd.concat there
def add_blank_rows(df, no_rows):
    df_new = pd.DataFrame(columns=df.columns)
    for idx in range(len(df)):
        df_new = df_new.append(df.iloc[idx])
        for _ in range(no_rows):
            df_new = df_new.append(pd.Series(), ignore_index=True)
    return df_new

df = pd.read_csv('test.csv')
df_with_blank_rows = add_blank_rows(df, 3)
print(df_with_blank_rows)
This works:
df_new = pd.DataFrame()
for i, row in df.iterrows():
    df_new = df_new.append(row)
    for _ in range(3):
        df_new = df_new.append(pd.Series(), ignore_index=True)
df here is of course the original DataFrame.
Here is a function to do that with one loop:
import numpy as np
import pandas as pd

def NAN_rows(df):
    rows = df.shape[0]
    x = np.empty((3, 2))  # 3 empty rows and 2 columns; adjust to match your original df
    x[:] = np.nan
    df_x = pd.DataFrame(columns=['name', 'profession'])
    for i in range(rows):
        temp = np.vstack([df.iloc[i].tolist(), x])
        df_x = pd.concat([df_x, pd.DataFrame(temp, columns=['name', 'profession'])], axis=0)
    return df_x

df = pd.DataFrame({
    'name': ['Bill', 'Sam', 'Adam'],
    'profession': ['cashier', 'stock', 'security']
})
print(NAN_rows(df))
# Output:
name profession
0 Bill cashier
1 nan nan
2 nan nan
3 nan nan
0 Sam stock
1 nan nan
2 nan nan
3 nan nan
0 Adam security
1 nan nan
2 nan nan
3 nan nan
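If a sequential index is preferred over the repeating 0-3 blocks, a reset_index(drop=True) on the result fixes that:
df_x = NAN_rows(df).reset_index(drop=True)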
Consider the following dataframe:
df = pd.DataFrame(columns=['[mg]', '[mg] true'], index=range(3))
To filter for the column ending in ], one may use:
print(df.filter(regex=r"\]$"))
[mg]
0 NaN
1 NaN
2 NaN
Next, consider a hierarchical columns dataframe:
df1 = pd.DataFrame(columns=pd.MultiIndex.from_product([[0,1], ['[mg]', '[mg] true']]), index=range(3))
print(df1)
0 1
[mg] [mg] true [mg] [mg] true
0 NaN NaN NaN NaN
1 NaN NaN NaN NaN
2 NaN NaN NaN NaN
When I again attempt to filter for the same columns ending in ], it now fails:
print(df1.filter(regex=r"\]$"))
Empty DataFrame
Columns: []
Index: [0, 1, 2]
Why does this fail, and what can I do to obtain the filtering I desire?
One option is to use str.contains on get_level_values from the columns, then use loc with the resulting boolean mask:
import pandas as pd
df1 = pd.DataFrame(
columns=pd.MultiIndex.from_product([[0, 1], ['[mg]', '[mg] true']]),
index=range(3))
# apply the regex to level 1 of the column index
matches = df1.columns.get_level_values(1).str.contains(r"\]$")
# filter using loc
filtered_df = df1.loc[:, matches]
print(filtered_df)
filtered_df:
0 1
[mg] [mg]
0 NaN NaN
1 NaN NaN
2 NaN NaN
Interesting question. Looking at the pandas source code for .filter, pandas supplies the string form of the labels from DataFrame._get_axis(1) to the regex. In this case, the labels are tuples (matched in string form):
MultiIndex([(0, '[mg]'),
(0, '[mg] true'),
(1, '[mg]'),
(1, '[mg] true')],
)
So to match only [mg], we can extend the regex to include the tuple's closing '):
print(df1.filter(regex=r"mg\]\'\)$"))
Prints:
0 1
[mg] [mg]
0 NaN NaN
1 NaN NaN
2 NaN NaN
NOTE: this is probably very implementation-dependent, so don't rely on it :)
Currently I'm extracting data from PDFs and putting it in a CSV file. I'll explain how this works.
First I create an empty dataframe:
ndataFrame = pandas.DataFrame()
Then I read the data. Assume for simplicity that the data is the same for each PDF:
data = {'shoe': ['a', 'b'], 'fury': ['c','d','e','f'], 'chaos': ['g','h']}
dataFrame = pandas.DataFrame({k:pandas.Series(v) for k, v in data.items()})
Then I append this data to the empty dataframe:
ndataFrame = ndataFrame.append(dataFrame)
This is the output:
shoe fury chaos
0 a c g
1 b d h
2 NaN e NaN
3 NaN f NaN
However, now comes the issue. I need some columns (let's say 4) to be empty between the columns fury and chaos. This is my desired output:
shoe fury chaos
0 a c g
1 b d h
2 NaN e NaN
3 NaN f NaN
I tried some stuff with reindexing but I couldn't figure it out. Any help is welcome.
By the way, my desired output might be confusing. To be clear, I need some columns to be completely empty between fury and chaos (this is because some other data goes in there manually).
Thanks for reading
This answer assumes you have no way to change the way the data is being read in upstream. As always, it is better to handle these types of formatting changes at the source. If that is not possible, here is a way to do it after parsing.
You can use reindex here, using numpy.insert to add your four columns:
import numpy as np

dataFrame.reindex(columns=np.insert(dataFrame.columns, 2, [1, 2, 3, 4]))
shoe fury 1 2 3 4 chaos
0 a c NaN NaN NaN NaN g
1 b d NaN NaN NaN NaN h
2 NaN e NaN NaN NaN NaN NaN
3 NaN f NaN NaN NaN NaN NaN
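If the inserted headers should actually be blank in the exported CSV (the question mentions writing to a CSV file), one option, as a sketch, is to rename the placeholder columns to empty strings before writing; the filename below is illustrative:
import numpy as np
import pandas as pd

data = {'shoe': ['a', 'b'], 'fury': ['c', 'd', 'e', 'f'], 'chaos': ['g', 'h']}
dataFrame = pd.DataFrame({k: pd.Series(v) for k, v in data.items()})

spaced = dataFrame.reindex(columns=np.insert(dataFrame.columns, 2, [1, 2, 3, 4]))
spaced.columns = ['shoe', 'fury', '', '', '', '', 'chaos']  # duplicate blank headers are fine for to_csv
spaced.to_csv('output.csv', index=False)  # 'output.csv' is a placeholder filename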
I have two DataFrames in pandas, shown below. EmpID is a primary key in both.
df_first = pd.DataFrame([[1, 'A',1000], [2, 'B',np.NaN],[3,np.NaN,3000],[4, 'D',8000],[5, 'E',6000]], columns=['EmpID', 'Name','Salary'])
df_second = pd.DataFrame([[1, 'A','HR','Delhi'], [8, 'B','Admin','Mumbai'],[3,'C','Finance',np.NaN],[9, 'D','Ops','Banglore'],[5, 'E','Programming',np.NaN],[10, 'K','Analytics','Mumbai']], columns=['EmpID', 'Name','Department','Location'])
I want to join these two dataframes with EmpID so that
Missing data in one dataframe can be filled with value from another table if exists and key matches
If there are observations with new keys then they should be appended in the resulting dataframe
I've used the code below to achieve this.
merged_df = pd.merge(df_first,df_second,how='outer',on=['EmpID'])
But this code gives me duplicate columns which I don't want, so I used only the unique columns from df_second (plus the key) for merging:
ColNames = list(df_second.columns.difference(df_first.columns))
ColNames.append('EmpID')
merged_df = pd.merge(df_first, df_second[ColNames], how='outer', on=['EmpID'])
Now I don't get duplicate columns, but I don't get values either for observations where the key matches.
I'd really appreciate it if someone could help me with this.
Regards,
Kailash Negi
It seems you need combine_first with set_index to match by the index created from the EmpID column:
df = df_first.set_index('EmpID').combine_first(df_second.set_index('EmpID')).reset_index()
print (df)
EmpID Department Location Name Salary
0 1 HR Delhi A 1000.0
1 2 NaN NaN B NaN
2 3 Finance NaN C 3000.0
3 4 NaN NaN D 8000.0
4 5 Programming NaN E 6000.0
5 8 Admin Mumbai B NaN
6 9 Ops Banglore D NaN
7 10 Analytics Mumbai K NaN
EDIT:
For a specific order of columns, reindex is needed:
# concatenate all column names together and remove dupes
ColNames = pd.Index(np.concatenate([df_second.columns, df_first.columns])).drop_duplicates()
print (ColNames)
Index(['EmpID', 'Name', 'Department', 'Location', 'Salary'], dtype='object')
df = (df_first.set_index('EmpID')
.combine_first(df_second.set_index('EmpID'))
.reset_index()
.reindex(columns=ColNames))
print (df)
EmpID Name Department Location Salary
0 1 A HR Delhi 1000.0
1 2 B NaN NaN NaN
2 3 C Finance NaN 3000.0
3 4 D NaN NaN 8000.0
4 5 E Programming NaN 6000.0
5 8 B Admin Mumbai NaN
6 9 D Ops Banglore NaN
7 10 K Analytics Mumbai NaN
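One side effect worth noting: the NaNs introduced by the outer combination force Salary to float (1000.0 rather than 1000). If integer display matters, pandas' nullable Int64 dtype can restore it:
df['Salary'] = df['Salary'].astype('Int64')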
I have a dataframe to which I would like to add a Mean column for every row, excluding the first column 'DEPT'. So for example row 0 should have 45.007000 instead of NaN.
df2 = df[MatchesWithDept].copy()
df2 = df2.replace(-999.250000, np.NaN)
df2 = df2.assign(Master_GR=df2.loc[:, Matches[:]].mean())
DEPT GRD GRR Master_GR
0 400.0 45.007000 NaN NaN
1 400.5 42.575001 NaN NaN
2 401.0 43.755001 NaN NaN
3 401.5 45.417000 NaN NaN
4 402.0 47.519001 NaN NaN
You can drop the first column before taking the mean:
df['Master_GR'] = df.drop('DEPT', axis=1).mean(axis=1)
Or select all columns except the first with iloc:
df['Master_GR'] = df.iloc[:, 1:].mean(axis=1)
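A quick check against the sample rows from the question (just the first two, as a sketch):
import numpy as np
import pandas as pd

df = pd.DataFrame({'DEPT': [400.0, 400.5],
                   'GRD': [45.007000, 42.575001],
                   'GRR': [np.nan, np.nan]})
df['Master_GR'] = df.iloc[:, 1:].mean(axis=1)
print(df)  # Master_GR is 45.007000 and 42.575001, since NaN is skipped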
I want to create a new column called 'test' in my dataframe that is equal to the sum of all the columns starting from column number 9 to the end of the dataframe. These columns are all datatype float.
Below is the code I tried, but it didn't work; it gives me back all NaN values in the 'test' column:
df_UBSrepscomp['test'] = df_UBSrepscomp.iloc[:, 9:].sum()
If I'm understanding your question, you want the row-wise sum starting at column 9. I believe you want .sum(axis=1). See an example below using column 2 instead of 9 for readability.
import numpy as np
import numpy.random as npr
from pandas import DataFrame

df = DataFrame(npr.rand(10, 5))
df.iloc[0:3, 0:4] = np.nan  # throw in some NA values
df.loc[:, 'test'] = df.iloc[:, 2:].sum(axis=1)
print(df)
0 1 2 3 4 test
0 NaN NaN NaN NaN 0.73046 0.73046
1 NaN NaN NaN NaN 0.79060 0.79060
2 NaN NaN NaN NaN 0.53859 0.53859
3 0.97469 0.60224 0.90022 0.45015 0.52246 1.87283
4 0.84111 0.52958 0.71513 0.17180 0.34494 1.23187
5 0.21991 0.10479 0.60755 0.79287 0.11051 1.51094
6 0.64966 0.53332 0.76289 0.38522 0.92313 2.07124
7 0.40139 0.41158 0.30072 0.09303 0.37026 0.76401
8 0.59258 0.06255 0.43663 0.52148 0.62933 1.58744
9 0.12762 0.01651 0.09622 0.30517 0.78018 1.18156