I am stuck on a problem which looks simple but for which I cannot find a proper solution.
Consider a given pandas DataFrame df, composed of multiple columns A1, A2, etc., and let Ai be one of its columns, filled for example as follows:
Ai
25
30
30
NaN
12
15
15
NaN
I would like to delete all the rows in df whose Ai values lie between a NaN and the next change in value, so that my output (for column Ai) would be:
Ai
25
NaN
12
NaN
Any idea on how to do so would be very much appreciated. Thank you very much in advance.
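For reference, the sample column can be reproduced like this (a minimal sketch, assuming Ai is the only relevant column):
import numpy as np
import pandas as pd

df = pd.DataFrame({'Ai': [25, 30, 30, np.nan, 12, 15, 15, np.nan]})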
Update
Similar to the previous solution, but with a per-group filter to keep the early duplicates:
m = df['Ai'].isna()
df.loc[(m | m.shift(fill_value=True))
       .groupby(df['Ai'].ne(df['Ai'].shift()).cumsum())
       .filter(lambda d: d.sum() > 0)
       .index]
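For reference, a sample input with repeated leading values that reproduces the output below could look like this (my reconstruction, not the asker's actual data):
df = pd.DataFrame({'Ai': [25, 25, 25, 30, 30, np.nan, 30, 30, 12, np.nan]})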
Output:
Ai
0 25.0
1 25.0
2 25.0
5 NaN
6 30.0
7 30.0
9 NaN
Original answer
This is equivalent to selecting the NaNs and the row immediately below each NaN (plus the very first row, thanks to fill_value=True). You could use a mask:
m = df['Ai'].isna()
df[m|m.shift(fill_value=True)]
Output:
Ai
0 25.0
3 NaN
4 12.0
7 NaN
Related
As part of some data cleansing, I want to add the mean of a variable back into a DataFrame, to use when the variable is missing for a particular observation. So I've calculated my averages as follows:
avg = all_data2.groupby("portfolio")["sales"].mean().reset_index(name="sales_mean")
Now I want to add that back into my original DataFrame using a left join, but it doesn't appear to be working. What format is my avg now? I thought it would be a DataFrame, but is it something else?
UPDATED
This is probably the most succinct way to do it:
all_data2.sales = all_data2.sales.fillna(all_data2.groupby('portfolio').sales.transform('mean'))
This is another way to do it:
all_data2['sales'] = all_data2[['portfolio', 'sales']].groupby('portfolio').transform(lambda x: x.fillna(x.mean()))
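For a reproducible run, the input shown below can be built like this (my reconstruction, not the asker's real data):
import numpy as np
import pandas as pd

all_data2 = pd.DataFrame({
    'portfolio': [1, 1, 2, 2, 3, 3, 3],
    'sales': [10.0, 20.0, 30.0, 40.0, 50.0, 60.0, np.nan],
})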
Input:
   portfolio  sales
0          1   10.0
1          1   20.0
2          2   30.0
3          2   40.0
4          3   50.0
5          3   60.0
6          3    NaN
Output:
   portfolio  sales
0          1   10.0
1          1   20.0
2          2   30.0
3          2   40.0
4          3   50.0
5          3   60.0
6          3   55.0
To answer the part of your question that reads "what format is my avg now? I thought it would be a dataframe but is it something else?": avg is indeed a DataFrame, but using it may not be the most direct way to update the missing data in the original DataFrame. For the sample input above, avg looks like this:
   portfolio  sales_mean
0          1        15.0
1          2        35.0
2          3        55.0
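If you do want to keep using avg, one way (a sketch, using the avg built above) is a left merge followed by a fillna from the merged column:
merged = all_data2.merge(avg, on='portfolio', how='left')
merged['sales'] = merged['sales'].fillna(merged['sales_mean'])
all_data2 = merged.drop(columns='sales_mean')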
A related SO question that you may find helpful is here.
If you want to add a new column, you can use this code:
df['sales_mean'] = df[['sales_1', 'sales_2']].mean(axis=1)
I have a problem writing code that sums up prices by the date column.
For example:
31-12-2019
There we have:
39.0 and 62.0
I want to sum them up and get 101.0 in another column, or under the last date in a new row.
All the data in the table widget was loaded thanks to this Excel spreadsheet reader procedure:
df = pd.read_excel('test_cod.xlsx')
self.tableWidget_25.setRowCount(len(df.index))
for i in range(len(df.index)):
    for j in range(len(df.columns)):
        self.tableWidget_25.setItem(i, j, QTableWidgetItem(str(df.iat[i, j])))
test_cod.xlsx is here: spreadsheet
I created a simple csv for this purpose, but you can easily apply this to your dataframe. The dataframe I used is this:
Date value total
0 24-11-2019 10 NaN
1 24-11-2019 20 NaN
2 24-11-2019 50 NaN
3 26-11-2019 14 NaN
4 26-11-2019 23 NaN
5 02-12-2019 15 NaN
6 02-12-2019 15 NaN
This is the solution I came up with. It might not be the most optimal, as I have barely ever used pandas, but it works. It doesn't use a loop within a loop, which I think your draft does, so it should be better in terms of performance.
import pandas as pd

df = pd.read_csv('test.csv')
compareVal = df.Date[0]
total = 0
print(df)
for x, date in enumerate(df.Date):
    if compareVal == date:
        total += df.value[x]
    else:
        df.at[x - 1, 'total'] = total
        total = df.value[x]
        compareVal = df.Date[x]
    if x == len(df) - 1:
        df.at[x, 'total'] = total
print(df)
This is the output:
Date value total
0 24-11-2019 10 NaN
1 24-11-2019 20 NaN
2 24-11-2019 50 80.0
3 26-11-2019 14 NaN
4 26-11-2019 23 37.0
5 02-12-2019 15 NaN
6 02-12-2019 15 30.0
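A more idiomatic alternative (my suggestion, not part of the answer above) reaches the same output with groupby/transform, assuming, as in the sample, that equal dates sit in contiguous blocks:
import pandas as pd

df = pd.read_csv('test.csv')
# sum 'value' per date, aligned back onto every row
totals = df.groupby('Date', sort=False)['value'].transform('sum')
# keep the total only on the last row of each date block
last_of_date = df['Date'] != df['Date'].shift(-1)
df['total'] = totals.where(last_of_date)
print(df)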
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(15).reshape(5, 3))
df1.iloc[:4, 1] = np.nan
df1.iloc[:2, 2] = np.nan
df1.dropna(thresh=1, axis=1)
It seems that no NaN values have been deleted.
0 1 2
0 0 NaN NaN
1 3 NaN NaN
2 6 NaN 8.0
3 9 NaN 11.0
4 12 13.0 14.0
If I run
df1.dropna(thresh=2,axis=1)
why does it give the following?
0 2
0 0 NaN
1 3 NaN
2 6 8.0
3 9 11.0
4 12 14.0
I just don't understand what thresh is doing here. If a column has more than one NaN value, shouldn't the column be deleted?
thresh=N requires that a column has at least N non-NaNs to survive. In the first example, both columns have at least one non-NaN, so both survive. In the second example, only the last column has at least two non-NaNs, so it survives, but the previous column is dropped.
Try setting thresh to 4 to get a better sense of what's happening.
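For instance (a quick check with the df1 defined above):
df1.dropna(thresh=4, axis=1)
# only column 0 survives: it has 5 non-NaN values,
# while column 2 has 3 and column 1 has just 1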
The thresh parameter sets the minimum number of non-NaN values needed for a row (or, with axis=1, a column) not to be dropped.
This will search along each column and check whether the column has at least 1 non-NaN value:
df1.dropna(thresh=1, axis=1)
Column 1 has only one non-NaN value, 13.0, but thresh=2 requires at least 2 non-NaN values, so that column fails the test and gets dropped:
df1.dropna(thresh=2, axis=1)
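For contrast (my addition), with the default axis=0 the same count applies per row; rows 0 and 1 each hold only one non-NaN value, so thresh=2 drops them:
df1.dropna(thresh=2, axis=0)
# keeps rows 2, 3 and 4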
The DataFrame consists of a table whose format is shown in the attached image. I apologize for not being able to type the format here; while trying to type it, it kept getting messed up due to the long decimal values, so I thought to attach a snapshot instead.
Country names are the index of the DataFrame and the cell values consist of the corresponding GDP values. The intent is to calculate the average of all the rows for each country. When np.average was applied:
# name of DataFrame - GDP
def function_average():
    GDP['Average'] = np.average(GDP.iloc[:, 0:])
    return GDP

function_average()
The new column got created, reflecting all the values as NaN. I assumed it's probably due to the inappropriately formatted cell values. I tried truncating them using the following code:
GDP = np.round(GDP, decimals =2)
And yet, there was no change in the values. The code ran successfully, though, and there was no error.
Please advise how to proceed in this case: should I try to make the change in the spreadsheet itself, or attempt to format the cell values in the DataFrame?
I apologize for not being able to provide any other required information at this point; please let me know if any other detail is needed.
The problem is that you need axis=1 to compute the mean per row, and you should change the function to numpy.nanmean or DataFrame.mean, which skip NaNs:
Sample:
import numpy as np
import pandas as pd

np.random.seed(100)
GDP = pd.DataFrame(np.random.randint(10, size=(5, 5)), columns=list('ABCDE'))
GDP.loc[0, 'A'] = np.nan
GDP['Average1'] = np.average(GDP.iloc[:, 0:], axis=1)
GDP['Average2'] = np.nanmean(GDP.iloc[:, 0:], axis=1)
GDP['Average3'] = GDP.iloc[:, 0:].mean(axis=1)
print(GDP)
A B C D E Average1 Average2 Average3
0 NaN 8 3 7 7 NaN 6.25 6.25
1 0.0 4 2 5 2 2.6 2.60 2.60
2 2.0 2 1 0 8 2.6 2.60 2.60
3 4.0 0 9 6 2 4.2 4.20 4.20
4 4.0 1 5 3 4 3.4 3.40 3.40
You got NaN because there is at least one NaN: without axis, np.average computes a single scalar over the whole frame (here nan), which is then broadcast to every row:
print(np.average(GDP.iloc[:, 0:]))
nan
GDP['Average'] = np.average(GDP.iloc[:, 0:])
print(GDP)
A B C D E Average
0 NaN 8 3 7 7 NaN
1 0.0 4 2 5 2 NaN
2 2.0 2 1 0 8 NaN
3 4.0 0 9 6 2 NaN
4 4.0 1 5 3 4 NaN
Let's say I have two dataframes, and the column names for both are:
table 1 columns:
[ShipNumber, TrackNumber, ShipDate, Quantity, Weight]
table 2 columns:
[ShipNumber, TrackNumber, AmountReceived]
I want to merge the two tables based on both ShipNumber and TrackNumber.
However, if I simply use merge in the following way (pseudo code, not real code):
tab1.merge(tab2, "left", on=['ShipNumber', 'TrackNumber'])
then the values in both the ShipNumber and TrackNumber columns from both tables MUST match.
However, in my case, sometimes the ShipNumber column values will match, and sometimes the TrackNumber column values will match; as long as one of the two values matches for a row, I want the merge to happen.
In other words, if row 1's ShipNumber in tab1 matches row 3's ShipNumber in tab2, but the TrackNumbers of the two records do not match, I still want to match the two rows from the two tables.
So basically this is an either/or match condition (pseudo code):
if tab1.ShipNumber == tab2.ShipNumber OR tab1.TrackNumber == tab2.TrackNumber:
    then merge
I hope my question makes sense...
Any help is really really appreciated!
As suggested, I looked into this post:
Python pandas merge with OR logic
But it is not completely the same issue, I think, as the OP of that post has a mapping file, so they can simply do two merges to solve it. I don't have a mapping file; rather, I have two DataFrames with the same key columns (ShipNumber, TrackNumber).
Use merge() and concat(). Then drop any duplicate cases where both A and B match (thanks @Scott Boston for that final step).
import pandas as pd

df1 = pd.DataFrame({'A': [3, 2, 1, 4], 'B': [7, 8, 9, 5]})
df2 = pd.DataFrame({'A': [1, 5, 6, 4], 'B': [4, 1, 8, 5]})
df1               df2
   A  B              A  B
0  3  7           0  1  4
1  2  8           1  5  1
2  1  9           2  6  8
3  4  5           3  4  5
With these data frames we should see:
df1.loc[2] matches A on df2.loc[0] (A = 1)
df1.loc[1] matches B on df2.loc[2] (B = 8)
df1.loc[3] matches both A and B on df2.loc[3] (A = 4, B = 5)
We'll use suffixes to keep track of what matched where:
suff_A = ['_on_A_match_1', '_on_A_match_2']
suff_B = ['_on_B_match_1', '_on_B_match_2']
df = pd.concat([df1.merge(df2, on='A', suffixes=suff_A),
                df1.merge(df2, on='B', suffixes=suff_B)])
     A  A_on_B_match_1  A_on_B_match_2    B  B_on_A_match_1  B_on_A_match_2
0  1.0             NaN             NaN  NaN             9.0             4.0
1  4.0             NaN             NaN  NaN             5.0             5.0
0  NaN             2.0             6.0  8.0             NaN             NaN
1  NaN             4.0             4.0  5.0             NaN             NaN
Note that the second and fourth rows are duplicate matches (for both data frames, A = 4 and B = 5). We need to remove one of those sets.
dups = (df.B_on_A_match_1 == df.B_on_A_match_2) # also could remove A_on_B_match
df.loc[~dups]
     A  A_on_B_match_1  A_on_B_match_2    B  B_on_A_match_1  B_on_A_match_2
0  1.0             NaN             NaN  NaN             9.0             4.0
0  NaN             2.0             6.0  8.0             NaN             NaN
1  NaN             4.0             4.0  5.0             NaN             NaN
I would suggest this alternate way of doing a merge like this, which seems easier to me:
table1["id_to_be_merged"] = table1.apply(
    lambda row: row["ShipNumber"] if pd.notnull(row["ShipNumber"]) else row["TrackNumber"],
    axis=1)
You can add the same column to table2 as well if needed, and then use it in left_on or right_on based on your requirement.
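A hypothetical follow-up along those lines (my sketch, not the answerer's code) builds the same helper column in table2 and merges on it:
table2["id_to_be_merged"] = table2.apply(
    lambda row: row["ShipNumber"] if pd.notnull(row["ShipNumber"]) else row["TrackNumber"],
    axis=1)
merged = table1.merge(table2, on="id_to_be_merged", how="left")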