pandas - partially updating DataFrame with derived calculations of a subset groupby - python

I have a DataFrame with some NaN records that I want to fill, using a combination of data from the NaN record itself (its index, in this example) and from the non-NaN records. The original DataFrame should be modified.
Details of input/output/code below:
I have an initial DataFrame that contains some pre-calculated data:
Initial Input
import numpy as np
import pandas as pd

raw_data = {'raw': [x for x in range(5)] + [np.nan for x in range(2)]}
source = pd.DataFrame(raw_data)
raw
0 0.0
1 1.0
2 2.0
3 3.0
4 4.0
5 NaN
6 NaN
I want to identify the NaN records and "update" them, where the update is calculated from the non-NaN data together with some data from the NaN records themselves.
In this contrived example I am calculating this as:
Calculate the average/mean of the 'valid' records.
Add this mean to the index number of each 'invalid' record.
Finally, write these values back to the initial DataFrame.
Desired Output
raw valid
0 0.0 1
1 1.0 1
2 2.0 1
3 3.0 1
4 4.0 1
5 7.0 0
6 8.0 0
The current solution I have (below) makes a calculation on a copy then updates the original DataFrame.
# Setup grouping by NaN in 'raw'
source['valid'] = ~np.isnan(source['raw'])*1
subsets = source.groupby('valid')
# Mean of 'valid' is used later to fill 'invalid' records
valid_mean = subsets.get_group(1)['raw'].mean()
# Operate on a copy of group(0), then update the original DataFrame
invalid = subsets.get_group(0).copy()
invalid['raw'] = subsets.get_group(0).index + valid_mean
source.update(invalid)
Is there a less clunky or more efficient way to do this? The real application is on significantly larger DataFrames (and with significantly more involved processing of the NaN rows).
Thanks in advance.

You can use combine_first:
#mean omits NaNs by default
m = source['raw'].mean()
#same as
#m = source['raw'].dropna().mean()
print (m)
2.0
#create valid column if necessary
source['valid'] = source['raw'].notnull().astype(int)
#update NaNs
source['raw'] = source['raw'].combine_first(source.index.to_series() + m)
print (source)
raw valid
0 0.0 1
1 1.0 1
2 2.0 1
3 3.0 1
4 4.0 1
5 7.0 0
6 8.0 0
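If you prefer fillna, an equivalent alternative is to pass the same index-based Series to it; a minimal sketch reusing m and source from above:
#fillna also accepts a Series aligned on the index, so it fills the same rows
source['raw'] = source['raw'].fillna(source.index.to_series() + m)
Both approaches only touch the NaN positions, so the result is identical to the combine_first version.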

Related

Read in .dat file with headers throughout

I'm trying to read in a .dat file, but it's made up of chunks of non-columnar data with headers scattered throughout.
I've tried reading it in with pandas:
new_df = pd.read_csv(os.path.join(pathname, item), delimiter='\t', skiprows = 2)
And it helpfully comes out like this:
Cyclic Acquisition Unnamed: 1 Unnamed: 2 24290-15 Y Unnamed: 4 \
0 Stored at: 100 cycle NaN NaN
1 Points: 2 NaN NaN NaN
2 Ch 2 Displacement Ch 2 Force Time Ch 2 Count NaN
3 in lbf s segments NaN
4 -0.036677472 -149.27879 19.976563 198 NaN
5 0.031659406 149.65636 20.077148 199 NaN
6 Cyclic Acquisition NaN NaN 24290-15 Y NaN
7 Stored at: 200 cycle NaN NaN
8 Points: 2 NaN NaN NaN
9 Ch 2 Displacement Ch 2 Force Time Ch 2 Count NaN
10 in lbf s segments NaN
11 -0.036623772 -149.73801 39.975586 398 NaN
12 0.031438459 149.48193 40.078125 399 NaN
13 Cyclic Acquisition NaN NaN 24290-15 Y NaN
14 Stored at: 300 cycle NaN NaN
Do I need to resort to np.genfromtxt() or is there a panda-riffic way to accomplish this?
I developed a work-around. I needed the Displacement data pairs, as well as some data that was all evenly divisible by 100.
To get at the Displacement data, I first treated 'Cyclic Acquisition' as if it were a valid column name, coerced the values to numeric (with errors='coerce'), and kept only the rows that actually converted to numbers:
displacement = new_df['Cyclic Acquisition'][pd.to_numeric(new_df['Cyclic Acquisition'], errors='coerce').notnull()]
4 -0.036677472
5 0.031659406
11 -0.036623772
12 0.031438459
Then, because the remaining values were paired low and high readings that needed to be operated on together, I selected every other value starting at position 0 for the "low" values, and used the same logic (starting at position 1) for the "high" values. I reset the index because my plan was to build a new DataFrame from these pieces and I wanted the values to keep their relationship to each other.
displacement_low = displacement[::2].reset_index(drop = True)
0 -0.036677472
1 -0.036623772
displacement_high = displacement[1::2].reset_index(drop = True)
0 0.031659406
1 0.031438459
Then, to get the cycles, I followed the same basic principle to reduce that column to just numbers, put the values into a list, used a list comprehension to keep only the values divisible by 100, and converted the result back to a Series.
cycles = new_df['Unnamed: 1'][pd.to_numeric(new_df['Unnamed: 1'], errors='coerce').notnull()].astype('float').tolist()
[100.0, 2.0, -149.27879, 149.65636, 200.0, 2.0, -149.73801, 149.48193...]
cycles = pd.Series([val for val in cycles if val%100 == 0])
0 100.0
1 200.0
...
I then created a new df with that data and named the columns as desired:
df = pd.concat([displacement_low, displacement_high, cycles], axis = 1)
df.columns = ['low', 'high', 'cycles']
low high cycles
0 -0.036677 0.031659 100.0
1 -0.036624 0.031438 200.0
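For what it's worth, a more compact sketch of the same idea, assuming the column names shown above ('Cyclic Acquisition' and 'Unnamed: 1') and the low/high pairing still hold:
# keep only the rows of each column that convert to numbers, then pair them up
disp = pd.to_numeric(new_df['Cyclic Acquisition'], errors='coerce').dropna()
cyc = pd.to_numeric(new_df['Unnamed: 1'], errors='coerce').dropna()
df = pd.concat([disp[::2].reset_index(drop=True).rename('low'),
                disp[1::2].reset_index(drop=True).rename('high'),
                cyc[cyc % 100 == 0].reset_index(drop=True).rename('cycles')], axis=1)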

Trying to fill NaNs with fillna() and groupby()

So I basically have an Airbnb data set with a few columns. Several of them correspond to ratings of different parameters (cleanliness, location, etc.). For those columns I have a bunch of NaNs that I want to fill.
As some of those NaNs correspond to listings from the same owner, I wanted to fill some of the NaNs with the corresponding hosts' rating average for each of those columns.
For example, let's say that for host X, the average value for review_scores_location is 7. What I want to do is, in the review_scores_location column, fill all the NaN values, that correspond to the host X, with 7.
I've tried the following code:
cols=['reviews_per_month','review_scores_rating','review_scores_accuracy','review_scores_cleanliness','review_scores_checkin','review_scores_communication','review_scores_location','review_scores_value']
for i in cols:
    airbnb[i] = airbnb[i].fillna(airbnb.groupby('host_id')[i].mean())
Although it runs without any error, it does not fill the NaN values: when I check how many NaNs are left, the count hasn't changed.
What am I doing wrong?
Thanks for taking the time to read this!
The problem here is that when you pass the series airbnb.groupby('host_id')[i].mean() to fillna, the function tries to align on the index, and the index of airbnb.groupby('host_id')[i].mean() consists of the host_id values rather than the original index values of airbnb, so the fillna does not work as you expect. Several options are possible to do the job; one way is to use transform after the groupby, which aligns the mean value of each group back to the original index values, so the fillna then works as expected:
for i in cols:
    airbnb[i] = airbnb[i].fillna(airbnb.groupby('host_id')[i].transform('mean'))
You can even use this method without a loop:
airbnb = airbnb.fillna(airbnb.groupby('host_id')[cols].transform('mean'))
with an example:
airbnb = pd.DataFrame({'host_id': [1, 1, 1, 2, 2, 2],
                       'reviews_per_month': [4, 5, np.nan, 9, 3, 5],
                       'review_scores_rating': [3, np.nan, np.nan, np.nan, 7, 8]})
print (airbnb)
host_id review_scores_rating reviews_per_month
0 1 3.0 4.0
1 1 NaN 5.0
2 1 NaN NaN
3 2 NaN 9.0
4 2 7.0 3.0
5 2 8.0 5.0
and you get:
cols=['reviews_per_month','review_scores_rating'] # would work with all your columns
print (airbnb.fillna(airbnb.groupby('host_id')[cols].transform('mean')))
host_id review_scores_rating reviews_per_month
0 1 3.0 4.0
1 1 3.0 5.0
2 1 3.0 4.5
3 2 7.5 9.0
4 2 7.0 3.0
5 2 8.0 5.0
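One caveat, not covered above: if a host has only NaNs in a column, transform('mean') returns NaN for that whole group, so those rows stay NaN. A possible follow-up, assuming the overall column mean is an acceptable default:
# fallback fill for hosts whose values are all NaN in a column
airbnb[cols] = airbnb[cols].fillna(airbnb[cols].mean())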

Python Pandas: How to merge based on an "OR" condition?

Let's say I have two dataframes, and the column names for both are:
table 1 columns:
[ShipNumber, TrackNumber, ShipDate, Quantity, Weight]
table 2 columns:
[ShipNumber, TrackNumber, AmountReceived]
I want to merge the two tables based on both ShipNumber and TrackNumber.
However, if I simply use merge in the following way (pseudo code, not real code):
tab1.merge(tab2, "left", on=['ShipNumber','TrackNumber'])
then, that means the values in both ShipNumber and TrackNumber columns from both tables MUST MATCH.
However, in my case, sometimes the ShipNumber column values will match, sometimes the TrackNumber column values will match; as long as one of the two values match for a row, I want the merge to happen.
In other words, if row 1 ShipNumber in tab 1 matches row 3 ShipNumber in tab 2, but the TrackNumber in two tables for the two records do not match, I still want to match the two rows from the two tables.
So basically this is an either/or match condition (pseudo code):
if tab1.ShipNumber == tab2.ShipNumber OR tab1.TrackNumber == tab2.TrackNumber:
then merge
I hope my question makes sense...
Any help is really really appreciated!
As suggested, I looked into this post:
Python pandas merge with OR logic
But it is not quite the same issue, I think, as the OP from that post has a mapping file and so can simply do two merges to solve it. I don't have a mapping file; rather, I have two dataframes with the same key columns (ShipNumber, TrackNumber).
Use merge() and concat(). Then drop any duplicate cases where both A and B match (thanks @Scott Boston for that final step).
df1 = pd.DataFrame({'A':[3,2,1,4], 'B':[7,8,9,5]})
df2 = pd.DataFrame({'A':[1,5,6,4], 'B':[4,1,8,5]})
df1 df2
A B A B
0 3 7 0 1 4
1 2 8 1 5 1
2 1 9 2 6 8
3 4 5 3 4 5
With these data frames we should see:
df1.loc[0] matches A on df2.loc[0]
df1.loc[1] matches B on df2.loc[2]
df1.loc[3] matches both A and B on df2.loc[3]
We'll use suffixes to keep track of what matched where:
suff_A = ['_on_A_match_1', '_on_A_match_2']
suff_B = ['_on_B_match_1', '_on_B_match_2']
df = pd.concat([df1.merge(df2, on='A', suffixes=suff_A),
                df1.merge(df2, on='B', suffixes=suff_B)])
A A_on_B_match_1 A_on_B_match_2 B B_on_A_match_1 B_on_A_match_2
0 1.0 NaN NaN NaN 9.0 4.0
1 4.0 NaN NaN NaN 5.0 5.0
0 NaN 2.0 6.0 8.0 NaN NaN
1 NaN 4.0 4.0 5.0 NaN NaN
Note that the second and fourth rows are duplicate matches (for both data frames, A = 4 and B = 5). We need to remove one of those sets.
dups = (df.B_on_A_match_1 == df.B_on_A_match_2) # also could remove A_on_B_match
df.loc[~dups]
A A_on_B_match_1 A_on_B_match_2 B B_on_A_match_1 B_on_A_match_2
0 1.0 NaN NaN NaN 9.0 4.0
0 NaN 2.0 6.0 8.0 NaN NaN
1 NaN 4.0 4.0 5.0 NaN NaN
I would suggest this alternate way of doing a merge like this; it seems easier to me.
table1["id_to_be_merged"] = table1.apply(
lambda row: row["ShipNumber"] if pd.notnull(row["ShipNumber"]) else row["TrackNumber"], axis=1)
You can add the same column in table2 as well if needed, and then use it in left_on or right_on based on your requirement.
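For completeness, a hedged sketch of what that full flow might look like, assuming the column names from the question and that this coalesced key is really what you want to join on:
# build the same coalesced key in table2, then merge on it
table2["id_to_be_merged"] = table2.apply(
    lambda row: row["ShipNumber"] if pd.notnull(row["ShipNumber"]) else row["TrackNumber"], axis=1)
merged = table1.merge(table2, how="left", on="id_to_be_merged")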

zero index for all rows in python dataframe

I have a problem with indexing a pandas DataFrame that I fill inside a loop. I have simplified it like this:
d = pd.DataFrame(columns=['img', 'time', 'key'])
for i in range(5):
    image = i
    timepoint = i + 1
    key = i + 2
    temp = pd.DataFrame({'img': [image], 'timepoint': [timepoint], 'key': [key]})
    d = pd.concat([d, temp])
The problem is that it shows 0 as the index for every row, so I cannot access a specific row with .loc[]. Does anybody have an idea how I can fix this and get a normal index column?
You may want to use the ignore_index parameter in your concatenation:
d = pd.concat([d, temp], ignore_index=True)
This gives me the following result:
img key time timepoint
0 0.0 2.0 NaN 1.0
1 1.0 3.0 NaN 2.0
2 2.0 4.0 NaN 3.0
3 3.0 5.0 NaN 4.0
4 4.0 6.0 NaN 5.0
Alternatively, if you have already built d with the repeated index, you can reset it afterwards:
d = d.reset_index(drop=True)
PS: It's better practice to build a list of rows and then turn it into a DataFrame; it is much less computationally expensive and it produces a proper index right away.
This list could be a list of lists combined with the columns argument of the DataFrame constructor, or a list of dictionaries with column names as keys. In your case:
list_of_dicts = []
for i in range(5):
    new_row = {'img': i, 'time': i+1, 'key': i+2}
    list_of_dicts.append(new_row)
d = pd.DataFrame(list_of_dicts)
I think it is better to first fill lists with values and then call the DataFrame constructor once:
image, timepoint, key = [], [], []
for i in range(5):
    image.append(i)
    timepoint.append(i+1)
    key.append(i+2)
d = pd.DataFrame({'img': image, 'time': timepoint, 'key': key})
print (d)
img key time
0 0 2 1
1 1 3 2
2 2 4 3
3 3 5 4
4 4 6 5

Pandas manipulate dataframe

I am querying a database and populating a pandas dataframe. I am struggling to aggregate the data (via groupby) and then manipulate the dataframe index such that the dates in the table become the index.
Here is an example of what the data looks like before and after the groupby, and what I am ultimately looking for.
dataframe - populated data
firm | dates | received | Sent
-----------------------------------------
A 10/08/2016 2 8
A 12/08/2016 4 2
B 10/08/2016 1 0
B 11/08/2016 3 5
A 13/08/2016 5 1
C 14/08/2016 7 3
B 14/08/2016 2 5
First I want to group by "firm" and "dates" on "received"/"sent".
Then manipulate the DataFrame so that the dates become the column headers rather than row values.
Finally, add a total for each day.
Some of the firms have no 'activity' on some days, or at least no activity in either received or sent. However, as I want a view of the past X days, empty values aren't acceptable; I need to fill in a zero instead.
dates | 10/08/2016 | 11/08/2016| 12/08/2016| 13/08/2016| 14/08/2016
firm |
----------------------------------------------------------------------
A received 2 0 4 5 0
sent 8 0 2 1 0
B received 1 3 1 0 2
sent 0 5 0 0 5
C received 0 0 2 0 1
sent 0 0 1 2 0
Totals r. 3 3 7 5 3
Totals s. 8 0 3 3 5
I've tried the following code:
df = <mysql query result>
n_received = df.groupby(["firm", "dates"]).received.size()
n_sent = df.groupby(["firm", "dates"]).sent.size()
tables = pd.DataFrame({'received': n_received, 'sent': n_sent},
                      columns=['received', 'sent'])
this = pd.melt(tables,
               id_vars=['dates', 'firm', 'received', 'sent'])
this = this.set_index(['dates', 'firm', 'received', 'sent', 'var'])
this = this.unstack('dates').fillna(0)
this.columns = this.columns.droplevel()
this.columns.name = ''
this = this.transpose()
Basically, I am not getting the result I want with this code.
- How can I achieve this?
- Conceptually, is there a better way of achieving this result? For example, would aggregating in the SQL statement be preferable, or does doing the aggregation in pandas make more sense from an optimisation and logical point of view?
You can use stack (and unstack) to transform data from long to wide format (and back):
import pandas as pd
# calculate the total received and sent grouped by dates
df1 = df.drop('firm', axis = 1).groupby('dates').sum().reset_index()
# add total category as the firm column
df1['firm'] = 'total'
# concatenate the summary data frame and the original data frame; use stack and unstack to
# transform the result so that dates appear as columns while received and sent become rows.
pd.concat([df, df1]).set_index(['firm', 'dates']).stack().unstack(level = 1).fillna(0)
# dates 10/08/2016 11/08/2016 12/08/2016 13/08/2016 14/08/2016
# firm
# A Sent 8.0 0.0 2.0 1.0 0.0
# received 2.0 0.0 4.0 5.0 0.0
# B Sent 0.0 5.0 0.0 0.0 5.0
# received 1.0 3.0 0.0 0.0 2.0
# C Sent 0.0 0.0 0.0 0.0 3.0
# received 0.0 0.0 0.0 0.0 7.0
# total Sent 8.0 5.0 2.0 1.0 8.0
# received 3.0 3.0 4.0 5.0 9.0
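For reference, a hypothetical reconstruction of the question's sample data (column names taken from the question), so the snippet above can be run end to end:
df = pd.DataFrame({'firm': ['A', 'A', 'B', 'B', 'A', 'C', 'B'],
                   'dates': ['10/08/2016', '12/08/2016', '10/08/2016', '11/08/2016',
                             '13/08/2016', '14/08/2016', '14/08/2016'],
                   'received': [2, 4, 1, 3, 5, 7, 2],
                   'Sent': [8, 2, 0, 5, 1, 3, 5]})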
