I am trying to add rows where there is a gap in month_count. For example, row 0 has month_count = 0 and row 1 has month_count = 7: how can I add 6 extra rows with month counts 1, 2, 3, 4, 5, 6? The same situation occurs from row 3 to row 4, where I would like to add 2 extra rows with month_count 10 and 11. What is the best way to go about this?
One way to do this is to iterate over the rows and rebuild the DataFrame with the missing rows inserted. Pandas does not support inserting rows directly at an index, but you can hack together a solution using pd.concat():
import pandas as pd

def pandas_insert(df, idx, row_contents):
    top = df.iloc[:idx]
    bot = df.iloc[idx:]
    inserted = pd.concat([top, row_contents, bot], ignore_index=True)
    return inserted
Here row_contents should be a DataFrame with one (or more) rows. We use ignore_index=True so that the index of the new DataFrame is relabeled 0, 1, …, n-1.
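For the month_count gaps in the question, a minimal sketch of how this helper might be used (the frame below is hypothetical and just mirrors the question; the value column is made up, and any columns missing from the inserted rows come out as NaN):

# Hypothetical frame mirroring the question: a gap between month_count 0 and 7
df = pd.DataFrame({'month_count': [0, 7, 8, 9, 12], 'value': [10, 20, 30, 40, 50]})

# Build the missing rows (month_count 1-6) and insert them after row 0
missing = pd.DataFrame({'month_count': range(1, 7)})
df = pandas_insert(df, 1, missing)
print(df)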
Hi, I am trying to edit rows of the column 'PM2.5' using df.at['0', 'PM2.5'] = 10 for the first row, but instead of editing, it is adding a new row. The columns have named headers while my rows are numbered; how do I get around this? I want to do this for 18 rows and manually add data to the column PM2.5. Thanks!
As discussed in the comments, the problem is that the index values are integers (a RangeIndex), so setting a value needs an integer label too.
So change '0' (a string)
df.at['0', 'PM2.5'] = 10
to 0 (an integer):
df.at[0, 'PM2.5'] = 10
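If the goal is to fill the first 18 rows of PM2.5 in one go rather than making 18 separate .at calls, a sketch like this should also work (the values list here is just a placeholder, and the frame is assumed to have at least 18 rows):

# Hypothetical values for the first 18 rows of PM2.5
values = [10, 12, 9, 11, 14, 13, 10, 8, 15, 16, 12, 11, 9, 10, 13, 14, 12, 11]

# With a RangeIndex, .loc slices by label and includes the end label,
# so 0:17 covers exactly the first 18 rows
df.loc[0:17, 'PM2.5'] = values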
I am trying to pivot a pandas dataframe, but the data follows a strange format that I cannot seem to pivot. The data is structured as below:
Date, Location, Action1, Quantity1, Action2, Quantity2, ... ActionN, QuantityN
<date> 1 Lights 10 CFloor 1 ... Null Null
<date2> 2 CFloor 2 CWalls 4 ... CBasement 15
<date3> 2 CWalls 7 CBasement 4 ... Null Null
Essentially, each action will always have a quantity attached to it (which may be 0), but null actions will never have a quantity (the quantity will just be null). The format I am trying to achieve is the following:
Lights CFloor CBasement CWalls
1 10 1 0 0
2 0 2 19 11
The index of the rows becomes the location while the columns become any unique action found across the multiple activity columns. When pulling the data together, the value of each row/column is the sum of each quantity associated with the action (i.e. Action1 corresponds to Quantity1). Is there a way to do this with the native pandas pivot function?
My current code performs a ravel across all the activity columns to get a list of all unique activities. It will also grab all the unique locations from the Location column. Once I have the unique columns, I create an empty dataframe and fill it with zeros:
Lights CFloor CBasement CWalls
1 0 0 0 0
2 0 0 0 0
I then iterate back over the old data frame with the itertuples() method (I was told it was significantly faster than iterrows()) and populate the new dataframe. This empty dataframe acts as a template that is stored in memory and filled later.
# Creates a template from the dataframe
def create_template(df):
    act_cols = ['Activity01', 'Activity02', 'Activity03', 'Activity04']
    activities = df[act_cols]
    flat_acts = activities.values.ravel('K')
    unique_locations = pd.unique(df['Location'])
    unique_acts = pd.unique(flat_acts)
    pivot_template = pd.DataFrame(index=unique_locations, columns=unique_acts).fillna(0)
    return pivot_template
# Fills the template from the dataframe
def create_pivot(df, pivot_frmt):
    act_cols = ['Activity01', 'Activity02', 'Activity03', 'Activity04']
    quant_cols = ['Quantity01', 'Quantity02', 'Quantity03', 'Quantity04']

    for row in df.itertuples():
        for act, quantity in zip(act_cols, quant_cols):
            act_val = getattr(row, act)
            if pd.notna(act_val):
                quantity_val = getattr(row, quantity)
                location = getattr(row, 'Location')
                pivot_frmt.loc[location, act_val] += quantity_val
    return pivot_frmt
While my solution works, it is incredibly slow when dealing with a large dataset and has taken 10 seconds or more to complete this type of operation. Any help would be greatly appreciated!
After experimenting with various pandas functions, such as melt and pivot on multiple columns simultaneously, I found a solution that worked for me:
For every quantity-activity pair, I build a partial frame of the final dataset and store it in a list. Once every pair has been addressed, I end up with multiple dataframes that all have the same row counts but potentially different column counts. I then concatenate the columns and, where any column names are repeated, sum them to get the final result.
import numpy as np
import pandas as pd

def test_pivot(df):
    act_cols = ['Activity01', 'Activity02', 'Activity03', 'Activity04']
    quant_cols = ['Quantity01', 'Quantity02', 'Quantity03', 'Quantity04']

    dfs = []
    for act, quant in zip(act_cols, quant_cols):
        partial = pd.crosstab(index=df['Location'], columns=df[act],
                              values=df[quant], aggfunc=np.sum).fillna(0)
        dfs.append(partial)

    finalDf = pd.concat(dfs, axis=1)
    finalDf = finalDf.groupby(finalDf.columns, axis=1).sum()
    return finalDf
There are two assumptions that I make with this approach:
- The indexes maintain their order across all partial dataframes
- There are an equivalent number of indexes across all partial dataframes
While this is probably not the most elegant solution, it achieves the desired result and reduced the processing time by a very significant margin (from about 10 s to about 0.2 s on ~4k rows). If anybody has a better way to deal with this type of scenario and do the process outlined above in one shot, then I would love to see your response!
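For reference, the same idea can likely be done in one shot by reshaping each Activity/Quantity pair into long form and pivoting once. This is only a sketch under the same column-name assumptions as above, not tested against the original data:

import pandas as pd

def melt_pivot(df):
    act_cols = ['Activity01', 'Activity02', 'Activity03', 'Activity04']
    quant_cols = ['Quantity01', 'Quantity02', 'Quantity03', 'Quantity04']

    # Stack each Activity/Quantity pair into a single long frame
    parts = [
        df[['Location', act, quant]].rename(columns={act: 'Activity', quant: 'Quantity'})
        for act, quant in zip(act_cols, quant_cols)
    ]
    long_df = pd.concat(parts, ignore_index=True).dropna(subset=['Activity'])

    # A single pivot_table then sums the quantities per Location/Activity
    return long_df.pivot_table(index='Location', columns='Activity',
                               values='Quantity', aggfunc='sum', fill_value=0)

This avoids the template and the column-wise groupby entirely; whether it beats the crosstab approach above on speed would need to be measured.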
I have data as shown below. I would like to select rows based on two conditions.
1) rows that start with digits (1,2,3 etc)
2) previous row of the records that satisfy 1st condition
Here is how the input data looks.
Here is how I expect the output to look.
I tried using the shift(-1) function, but it seems to be throwing an error. I am sure I messed up the logic/syntax. Please find the code I tried below:
# I get the index of all records that start with a number
s = df1.loc[df1['VARIABLE'].str.contains('^\d') == True].index

# now I need to get the previous record of each group, but this is incorrect
df1.loc[((df1['VARIABLE'].shift(-1).str.contains('^\d') == False) &
         (df1['VARIABLE'].str.contains('^\d') == True))].index
Use:
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'VARIABLE':['studyid',np.nan,'age_interview','Gender','1.Male',
                                '2.Female',np.nan, 'dob', 'eth',
                                'Ethnicity','1.Chinese','2.Indian','3.Malay']})

#first remove missing rows by column VARIABLE
df1 = df1.dropna(subset=['VARIABLE'])

#test starting numbers
s = (df1['VARIABLE'].str.contains('^\d'))

#chain shifted values by | for OR
mask = s | s.shift(-1)

#filtering by boolean indexing
df1 = df1[mask]
print (df1)
VARIABLE
3 Gender
4 1.Male
5 2.Female
9 Ethnicity
10 1.Chinese
11 2.Indian
12 3.Malay
I am trying to find the rows, in a very large dataframe, with the highest mean.
Reason: I scan something with laser trackers and used a "higher" point as a reference for where the scan starts. I am trying to find where the object is placed throughout my data.
I have calculated the mean of each row with:
base = df.mean(axis=1)
base.columns = ['index','Mean']
Here is an example of the mean for each row:
0 4.407498
1 4.463597
2 4.611886
3 4.710751
4 4.742491
5 4.580945
This seems to work fine, except that it adds an index column, and gives out columns with an index of type float64.
I then tried this to locate the rows with highest mean:
moy = base.loc[base.reset_index().groupby(['index'])['Mean'].idxmax()]
This gives out this:
index Mean
0 0 4.407498
1 1 4.463597
2 2 4.611886
3 3 4.710751
4 4 4.742491
5 5 4.580945
But it only re-indexes (I now have 3 columns instead of two) and does nothing else. It still shows all rows.
Here is one way without using groupby:
moy = base.sort_values('Mean').tail(1)
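Since base here comes from df.mean(axis=1) and is therefore a Series, another sketch is to skip the intermediate steps and use idxmax to pull the row with the highest mean directly (assuming df is the original frame):

# Label of the row with the highest row mean, then the row itself
best_label = df.mean(axis=1).idxmax()
best_row = df.loc[best_label]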
It looks as though your data is a string or a single column with a space between your two numbers. I suggest splitting the column into two and/or using something like the code below to set the index to your specific column of interest.
import pandas as pd
df = pd.read_csv('testdata.txt', names=["Index", "Mean"], delimiter="\s+")
df = df.set_index("Index")
print(df)
I want to remove every row in my 7000 x 10 data frame where one of the row entries takes a certain value. For example, if I had 600 rows where '20' appeared in the row, how can I delete all of those?
You are usually better off making a new dataframe that fulfils the conditions you need rather than editing the old one in place. When in doubt you can always assign it back to the same name; here's a minimal example:
value = 20
df_filtered = df[(df != value).all(axis=1)]
Find the relevant rows and then create new_df from the rest:
value = 20

# Boolean mask: True for any row that contains `value` anywhere
rows_to_delete = (df == value).any(axis=1)
new_df = df.loc[~rows_to_delete, :]