I'm collecting data over the course of many days, and rather than filling it in for every day, I can elect to say that the data from one day should really be a repeat of another day. I'd like to repeat some of the rows from my existing data frame into the days specified as repeats. I have a column that indicates which day the current day should repeat from, but I am getting stuck with errors.
I have found ways to repeat rows n times based on a column value, but here I am trying to use a column as an index to repeat data from previous rows.
I'd like to copy parts of my "Data" column for Day 1 into the "Data" column for Day 3, using my "Repeat" column as the index. I would like to do this for many more days.
import numpy as np
import pandas as pd

data = [['1', 5, np.nan], ['1', 5, np.nan], ['1', 5, np.nan],
        ['2', 6, np.nan], ['2', 6, np.nan], ['2', 6, np.nan],
        ['3', np.nan, 1], ['3', np.nan, np.nan], ['3', np.nan, np.nan]]
df = pd.DataFrame(data, columns=['Day', 'Data', 'repeat_tag'])
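For the sample frame above, the result I'm after would look roughly like this (Day 3 inherits the Day 1 values):
  Day  Data  repeat_tag
0   1   5.0         NaN
1   1   5.0         NaN
2   1   5.0         NaN
3   2   6.0         NaN
4   2   6.0         NaN
5   2   6.0         NaN
6   3   5.0         1.0
7   3   5.0         NaN
8   3   5.0         NaN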
I slightly extended your test data:
data = [['1', 51, np.nan], ['1', 52, np.nan], ['1', 53, np.nan],
        ['2', 61, np.nan], ['2', 62, np.nan], ['2', 63, np.nan],
        ['3', np.nan, 1], ['3', np.nan, np.nan], ['3', np.nan, np.nan],
        ['4', np.nan, 2], ['4', np.nan, np.nan], ['4', np.nan, np.nan]]
df = pd.DataFrame(data, columns=['Day', 'Data', 'repeat_tag'])
Details:
There are 4 days with observations.
Each observation has a different value (Data).
To avoid "single day copy", values for day '3' are to be copied from
day '1' and for day '4' from day '2'.
I assume that a non-null value of repeat_tag appears in only one
observation for the "target" day.
I also added an obsNo column to identify observations within a particular day:
df['obsNo'] = df.groupby('Day').cumcount().add(1)
(it will be necessary later).
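To visualize it, here is roughly what the rows for days '1' and '3' look like at this point (a partial view of the intermediate frame):
  Day  Data  repeat_tag  obsNo
0   1  51.0         NaN      1
1   1  52.0         NaN      2
2   1  53.0         NaN      3
6   3   NaN         1.0      1
7   3   NaN         NaN      2
8   3   NaN         NaN      3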
The first step of the actual processing is to generate a replDays table, where the Day
column is the target day and repeat_tag is the source day:
replDays = df.query('repeat_tag.notnull()')[['Day', 'repeat_tag']]
replDays.repeat_tag = replDays.repeat_tag.astype(int).apply(str)
A bit of type manipulation was needed for the repeat_tag column.
As this column contains NaN values and the non-null values are int, the column is
coerced to float64. Hence, to get a string type (comparable with Day) it
must be converted:
First to int, to drop the decimal part.
Then to str.
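As a quick standalone sketch of that conversion (independent of the frames above):
import pandas as pd

s = pd.Series([1.0, 2.0])                  # what repeat_tag looks like after coercion
print(s.astype(str).tolist())              # ['1.0', '2.0'] - would not match the Day values
print(s.astype(int).astype(str).tolist())  # ['1', '2']     - comparable with Day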
The result is:
  Day repeat_tag
6   3          1
9   4          2
(fill data for day 3 with data from day 1 and data for day 4 with data from day 2).
The next step is to generate a replData table:
replData = pd.merge(replDays, df, left_on='repeat_tag', right_on='Day',
                    suffixes=('_src', ''))[['Day_src', 'Day', 'Data', 'obsNo']]\
    .set_index(['Day_src', 'obsNo']).drop(columns='Day')
The result is:
               Data
Day_src obsNo
3       1      51.0
        2      52.0
        3      53.0
4       1      61.0
        2      62.0
        3      63.0
As you can see:
There is only one column of replacement data - Data (from days 1 and 2).
The MultiIndex contains both the day and the observation number (both will be
needed for a proper update).
And the final part includes the following steps:
Copy df to res (result), setting index to Day and obsNo
(required for update).
Update this table with data from replData.
Move Day and obsNo from index back to "regular" columns.
The code is:
res = df.copy().set_index(['Day', 'obsNo'])
res.update(replData)
res.reset_index(inplace=True)
If you want, you can also drop the obsNo column.
And a remark concerning the solution by Peter:
If the source data contains different values within a single day, his code fails
with an InvalidIndexError, probably due to the lack of identification of
individual observations within a particular day.
This confirms that my idea to add the obsNo column is valid.
Setup
# Start with Valdi_Bo's expanded example data
data = [['1', 51, np.nan], ['1', 52, np.nan], ['1', 53, np.nan],
        ['2', 61, np.nan], ['2', 62, np.nan], ['2', 63, np.nan],
        ['3', np.nan, 1], ['3', np.nan, np.nan], ['3', np.nan, np.nan],
        ['4', np.nan, 2], ['4', np.nan, np.nan], ['4', np.nan, np.nan]]
df = pd.DataFrame(data, columns=['Day', 'Data', 'repeat_tag'])
# Convert Day to integer data type
df['Day'] = df['Day'].astype(int)
# Spread repeat_tag values into all rows of tagged day
df['repeat_tag'] = df.groupby('Day')['repeat_tag'].ffill()
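After these two steps the frame should look roughly like this (every row of a tagged day now carries its source day):
    Day  Data  repeat_tag
0     1  51.0         NaN
1     1  52.0         NaN
2     1  53.0         NaN
3     2  61.0         NaN
4     2  62.0         NaN
5     2  63.0         NaN
6     3   NaN         1.0
7     3   NaN         1.0
8     3   NaN         1.0
9     4   NaN         2.0
10    4   NaN         2.0
11    4   NaN         2.0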
Solution
# Within each day, assign a number to each row
df['obs'] = df.groupby('Day').cumcount()
# Self-join
filler = (pd.merge(df, df,
                   left_on=['repeat_tag', 'obs'],
                   right_on=['Day', 'obs'])
          .set_index(['Day_x', 'obs'])['Data_y'])
# Fill missing data
df = df.set_index(['Day', 'obs'])
df.loc[df['Data'].isnull(), 'Data'] = filler
df = df.reset_index()
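For intuition, the intermediate filler series should come out roughly as follows - one replacement value per (tagged day, observation) pair:
Day_x  obs
3      0      51.0
       1      52.0
       2      53.0
4      0      61.0
       1      62.0
       2      63.0
Name: Data_y, dtype: float64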
Result
df
Day obs Data repeat_tag
0 1 0 51.0 NaN
1 1 1 52.0 NaN
2 1 2 53.0 NaN
3 2 0 61.0 NaN
4 2 1 62.0 NaN
5 2 2 63.0 NaN
6 3 0 51.0 1.0
7 3 1 52.0 1.0
8 3 2 53.0 1.0
9 4 0 61.0 2.0
10 4 1 62.0 2.0
11 4 2 63.0 2.0
Related
I have a dataframe where the column 'Score' is calculated from values in other columns. I need the Score column to have a missing value if any of the other columns has a missing value for that row.
import numpy as np
import pandas as pd

df = pd.DataFrame({'Score': [71, 63, 23],
                   'Factor_1': [np.nan, '15', '23'],
                   'Factor_2': ['12', np.nan, '45'],
                   'Factor_3': ['3', '5', '7']})
Expected values for column Score: nan, nan, 23 (because Factor_1 is missing in the 1st row and Factor_2 is missing in the 2nd row). So, I should replace the existing values with NAs.
Thank you for your help.
Use DataFrame.filter to select the Factor columns, test for missing values with DataFrame.isna, check whether at least one value per row is missing with DataFrame.any, and set NaN with DataFrame.loc:
df.loc[df.filter(like='Factor').isna().any(axis=1), 'Score'] = np.nan
Or use Series.mask:
df['Score'] = df['Score'].mask(df.filter(like='Factor').isna().any(axis=1))
If you need explicit column names:
cols = ['Factor_1', 'Factor_2', 'Factor_3']
df.loc[df[cols].isna().any(axis=1), 'Score'] = np.nan
df['Score'] = df['Score'].mask(df[cols].isna().any(axis=1))
print (df)
Score Factor_1 Factor_2 Factor_3
0 NaN NaN 12 3
1 NaN 15 NaN 5
2 23.0 23 45 7
I have a large dataframe; the illustration below is simplified for clarity.
pd.DataFrame(df.groupby(['Pclass', 'Sex'])['Age'].median())
Groupby results:
And it has this data that needs to be imputed
Missing Data:
How can I impute these values based on the median of the grouped statistic?
The result that I want is:
# You can use this for reference
import numpy as np
import pandas as pd
mldx_arrays = [np.array([1, 1, 2, 2, 3, 3]),
               np.array(['male', 'female', 'male', 'female',
                         'male', 'female'])]

multiindex_df = pd.DataFrame([34, 29, 24, 40, 18, 25],
                             index=mldx_arrays, columns=['Age'])
multiindex_df.index.names = ['PClass', 'Sex']
multiindex_df
d = {'PClass': [1, 1, 2, 2, 3, 3],
     'Sex': ['male', 'female', 'male', 'female', 'male', 'female'],
     'Age': [np.nan, np.nan, np.nan, np.nan, np.nan, np.nan]}
df = pd.DataFrame(data=d)
If all values are missing, remove the Age column and use DataFrame.join:
df = df.drop('Age', axis=1).join(multiindex_df, on=['PClass','Sex'])
print (df)
PClass Sex Age
0 1 male 34
1 1 female 29
2 2 male 24
3 2 female 40
4 3 male 18
5 3 female 25
If you need to replace only missing values, use DataFrame.join and fill the missing values in the original column:
df = df.join(multiindex_df, on=['PClass','Sex'], rsuffix='_')
df['Age'] = df['Age'].fillna(df.pop('Age_'))
print (df)
PClass Sex Age
0 1 male 34.0
1 1 female 29.0
2 2 male 24.0
3 2 female 40.0
4 3 male 18.0
5 3 female 25.0
If you need to replace missing values by the median per group, use GroupBy.transform:
df['Age'] = df['Age'].fillna(df.groupby(['PClass', 'Sex'])['Age'].transform('median'))
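For example, here is a small sketch with partially missing ages (made-up numbers; the transform-based fill only helps when the same frame already holds some observed values):
import numpy as np
import pandas as pd

df = pd.DataFrame({'PClass': [1, 1, 1, 2, 2, 2],
                   'Sex': ['male', 'male', 'male', 'female', 'female', 'female'],
                   'Age': [30, 40, np.nan, 20, np.nan, 28]})

# Median per (PClass, Sex) group, broadcast back onto the original rows
df['Age'] = df['Age'].fillna(df.groupby(['PClass', 'Sex'])['Age'].transform('median'))
print(df)
#    PClass     Sex   Age
# 0       1    male  30.0
# 1       1    male  40.0
# 2       1    male  35.0
# 3       2  female  20.0
# 4       2  female  24.0
# 5       2  female  28.0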
Given your example case you can simply assign the Series to the dataframe and re-define the column:
df['Age'] = base_df.groupby(['Pclass', 'Sex'])['Age'].median()
Otherwise you need to be careful of positioning, and in case it's not sorted you might want to use sort_index() or sort_values() first, depending on the case.
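One hedged way to make that alignment explicit, reusing multiindex_df from the question's reference code rather than the undefined base_df: index the target frame by the same keys so the assignment matches on labels instead of position.
import numpy as np
import pandas as pd

medians = multiindex_df['Age']   # Series indexed by (PClass, Sex), built in the question

df = pd.DataFrame({'PClass': [1, 1, 2, 2, 3, 3],
                   'Sex': ['male', 'female', 'male', 'female', 'male', 'female'],
                   'Age': [np.nan] * 6})

df = df.set_index(['PClass', 'Sex'])
df['Age'] = medians              # label-based alignment on the shared index
df = df.reset_index()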
Is there a special reason for filling NaN? If not, use reset_index on your result:
df = pd.read_csv('your_file_name.csv') # input your file name or url
df.groupby(['Pclass', 'Sex'])['Age'].median().reset_index()
I am writing a program where I want to count the number of columns in each row, since each file has a different number of columns. In other words, I want to check whether any row is missing a cell, and if it is, I want to report how many cells that row has.
I am using pandas to read the files. I have multiple gzip files, each of which contains a CSV file.
My code for reading the files:
# running this inside a loop
data = pd.read_csv(files,
                   compression='gzip',
                   on_bad_lines='warn',
                   low_memory=False,
                   sep=r'|',
                   header=None,
                   na_values=['NULL', ' ', 'NaN'],
                   keep_default_na=False)
I checked StackOverflow but there's no answer related to this situation. I would be really glad if someone can help me out here.
Not sure if I'm interpreting this right, but if you want to count the number of columns in each pandas dataframe within a loop, there are plenty of options.
1) data.shape[1]
2) len(data.columns)
3) len(list(data))
Here is a minimal reproducible example. Replace "data = pd.DataFrame(...)" with "data = pd.read_csv(...)".
# Import required libraries
import pandas as pd
import numpy as np

# Create dictionaries for the dataframes
dict1 = {'Name': ['Anne', 'Bob', 'Carl'],
         'Age': [22, 20, 22],
         'Marks': [90, 84, 82]}
dict2 = {'Name': ['Dan', 'Ely', 'Fan'],
         'Age': [52, 30, 12],
         'Marks': [40, 54, 42]}

for i in [dict1, dict2]:
    # Read data
    data = pd.DataFrame(i)
    # Get columns
    shape = data.shape  # (3, 3)
    col = shape[1]      # 3
    # Print number of columns
    print(f'Number of columns for file <>: {col}')
"This works fine, but after trying your suggestion I am getting the total number of columns that we have in our data frame. I want to print the number of columns each row contains. For eg: S.no Name 1 Adam 2 George 3 NULL so, 1st row will print 2, the second will be 2, but the third will print one."
– Ramoxx
Below are the updated answers for your specification.
Get counts of non-nulls for each row
data.apply(lambda x: x.count(), axis=1)
data:
A B C
0: 1 2 3
1: 2 nan nan
2: nan nan nan
output:
0: 3
1: 1
2: 0
Add counts of non-nulls for each row to the dataframe
data['count'] = data.apply(lambda x: x.count(), axis=1)
result:
A B C count
0: 1 2 3 3
1: 2 nan nan 1
2: nan nan nan 0
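As a side note, the same per-row counts can be computed without apply, which is usually faster on large frames:
data['count'] = data.notna().sum(axis=1)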
Let's say I have my main DataFrame.
df = pd.DataFrame({'ID': [1, 1, 1, 2, 2, 2, 3, 3, 3],
                   'Date': ['2021-01-01', '2021-01-02', '2021-01-03',
                            '2021-01-01', '2021-01-02', '2021-01-03',
                            '2021-01-01', '2021-01-02', '2021-01-03'],
                   'Values': [11, np.nan, np.nan, 13, np.nan, np.nan, 15, np.nan, np.nan],
                   'Random_Col': [0, 0, 0, 0, 0, 0, 0, 0, 0]})
I want to fill the np.nan values with values from another dataframe that is not the same shape. The values have to match on "ID" and "Date".
new_df = pd.DataFrame({'ID': [1, 1, 2, 2, 3, 3],
                       'Date': ['2021-01-02', '2021-01-03', '2021-01-02',
                                '2021-01-03', '2021-01-02', '2021-01-03'],
                       'Values': [16, 19, 14, 14, 19, 18]})
What's the best way to do this?
I experimented with df.update(), but I'm not sure that works since the dataframes do not have the same number of rows. Am I wrong about this?
I could also use pd.merge(), but then I end up with multiple versions of each column and have to .fillna() each original column from the second column holding the new values. This would be fine if I only had 1 column of data to do this for, but I have dozens.
Is there a simpler way that I haven't considered?
One option is to merge + sort_index + bfill to fill the missing data in df, then reindex with df.columns. Since '\x00' has the lowest value, the sorting should place the same column names next to each other.
out = (df.merge(new_df, on=['ID', 'Date'], how='left', suffixes=('', '\x00'))
         .sort_index(axis=1).bfill(axis=1)[df.columns])
Output:
ID Date Values Random_Col
0 1 2021-01-01 11.0 0
1 1 2021-01-02 16.0 0
2 1 2021-01-03 19.0 0
3 2 2021-01-01 13.0 0
4 2 2021-01-02 14.0 0
5 2 2021-01-03 14.0 0
6 3 2021-01-01 15.0 0
7 3 2021-01-02 19.0 0
8 3 2021-01-03 18.0 0
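As for the df.update() idea in the question: update() aligns on index labels rather than on shape, so the two frames do not need the same number of rows. A rough sketch, reusing df and new_df from above and indexing both by the matching keys:
filled = df.set_index(['ID', 'Date'])
filled.update(new_df.set_index(['ID', 'Date']), overwrite=False)  # with overwrite=False, only NaN cells are replaced
filled = filled.reset_index()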
I am working on spend data where I want to see the last month in which a spend was made in the current and previous year. If there is no spend in these years, then I assume the last month of spend is Dec-2020.
My data looks like this
As shown in the data, the months are already there in the form of columns.
I want to create a new column last_txn_month which gives the last month in which a spend was made. So the output should look like this:
Let's say your DataFrame looks like:
df = pd.DataFrame([[1, np.nan, np.nan, 3, np.nan],
                   [10, 11, 12, 13, 14],
                   [101, 102, np.nan, np.nan, np.nan],
                   [110, np.nan, np.nan, 111, np.nan]],
                  columns=[*'abcde'])
Then you could use notna to create a boolean DataFrame, and apply a lambda function that picks the last column name with a non-NaN value in each row:
df['last'] = df.notna().apply(lambda x: df.columns[x][-1], axis=1)
Output:
a b c d e last
0 1 NaN NaN 3.0 NaN d
1 10 11.0 12.0 13.0 14.0 e
2 101 102.0 NaN NaN NaN b
3 110 NaN NaN 111.0 NaN d
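The original question also wants a default month when a row has no spend at all; with the lambda above, an all-NaN row would raise an IndexError. A hedged variant with a fallback, using 'Dec-2020' as the default label from the question and restricting the check to the value columns:
value_cols = list('abcde')   # the month-like columns in this toy example
df['last'] = df[value_cols].notna().apply(
    lambda x: x.index[x][-1] if x.any() else 'Dec-2020', axis=1)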