Add rows based on column date with two-column unique key - pandas - python

So I have a dataframe like:
Number Country StartDate EndDate
12 US 1/1/2023 12/1/2023
12 Mexico 1/1/2024 12/1/2024
And what I am trying to do is:
Number Country Date
12 US 1/1/2023
12 US 2/1/2023
12 US 3/1/2023
12 US 4/1/2023
12 US 5/1/2023
12 US 6/1/2023
12 US 7/1/2023
12 US 8/1/2023
12 US 9/1/2023
12 US 10/1/2023
12 US 11/1/2023
12 US 12/1/2023
12 Mexico 1/1/2024
12 Mexico 2/1/2024
12 Mexico 3/1/2024
12 Mexico 4/1/2024
12 Mexico 5/1/2024
12 Mexico 6/1/2024
12 Mexico 7/1/2024
12 Mexico 8/1/2024
12 Mexico 9/1/2024
12 Mexico 10/1/2024
12 Mexico 11/1/2024
12 Mexico 12/1/2024
This problem is very similar to Adding rows for each month in a dataframe based on column date.
However, that question only handles a unique key made of one column. In this example the unique key is Number together with Country.
This is what I am currently doing; however, it only accounts for one column, 'Number', and I need to include both Number and Country, since together they form the unique key.
df1 = pd.concat([pd.Series(r.Number, pd.date_range(start=r.StartDate, end=r.EndDate, freq='MS'))
                 for r in df1.itertuples()]).reset_index().drop_duplicates()

Create the range, then explode:
df['New'] = [pd.date_range(start=x, end=y, freq='MS') for x, y in zip(df.pop('StartDate'), df.pop('EndDate'))]
df = df.explode('New')
Out[54]:
Number Country New
0 12 US 2023-01-01
0 12 US 2023-02-01
0 12 US 2023-03-01
0 12 US 2023-04-01
0 12 US 2023-05-01
0 12 US 2023-06-01
0 12 US 2023-07-01
0 12 US 2023-08-01
0 12 US 2023-09-01
0 12 US 2023-10-01
0 12 US 2023-11-01
0 12 US 2023-12-01
1 12 Mexico 2024-01-01
1 12 Mexico 2024-02-01
1 12 Mexico 2024-03-01
1 12 Mexico 2024-04-01
1 12 Mexico 2024-05-01
1 12 Mexico 2024-06-01
1 12 Mexico 2024-07-01
1 12 Mexico 2024-08-01
1 12 Mexico 2024-09-01
1 12 Mexico 2024-10-01
1 12 Mexico 2024-11-01
1 12 Mexico 2024-12-01
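For completeness, a minimal end-to-end sketch of this range-then-explode approach on the question's data, with the generated column renamed to Date and the index reset to match the desired output:
import pandas as pd

df = pd.DataFrame({'Number': [12, 12],
                   'Country': ['US', 'Mexico'],
                   'StartDate': ['1/1/2023', '1/1/2024'],
                   'EndDate': ['12/1/2023', '12/1/2024']})

# one month-start (freq='MS') range per row; pop removes the original date columns
df['Date'] = [pd.date_range(start=s, end=e, freq='MS')
              for s, e in zip(df.pop('StartDate'), df.pop('EndDate'))]

# one row per month; Number and Country are carried along unchanged,
# so the two-column unique key survives without extra work
df = df.explode('Date').reset_index(drop=True)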

Related

How to generate a new table from an existing table by grouping with aggregated MODE values

I have a DataFrame in pandas, and I'm trying to generate a new table from the existing one by grouping rows and taking their aggregated mode value.
df
country scores attempts
india 11 6
india 12 3
india 12 3
india 12 7
india 10 3
india 12 3
pakistan 10 4
Pakistan 14 4
pakistan 14 5
srilanka 23 5
srilanka 21 5
srilanka 21 6
srilanka 23 5
srilanka 23 6
srilanka 23 5
The result should look like this:
country scores attempts
0 India 12 3
1 Pakistan 14 4
2 srilanka 23 5
Please help me solve this issue.
Use GroupBy.size first, then get the first mode per group with DataFrameGroupBy.idxmax, which returns the indices of the maximal counts:
print (df)
country scores attempts
0 india 11 6
1 india 12 3
2 india 12 3
3 india 12 7
4 india 10 3
5 india 12 3
6 pakistan 10 4
7 pakistan 14 4
8 pakistan 14 4 <- correct data
9 srilanka 23 5
10 srilanka 21 5
11 srilanka 21 6
12 srilanka 23 5
13 srilanka 23 6
14 srilanka 23 5
df1 = df.groupby(['country','scores','attempts']).size().reset_index(name='count')
print (df1)
country scores attempts count
0 india 10 3 1
1 india 11 6 1
2 india 12 3 3
3 india 12 7 1
4 pakistan 10 4 1
5 pakistan 14 4 2
6 srilanka 21 5 1
7 srilanka 21 6 1
8 srilanka 23 5 3
9 srilanka 23 6 1
df2 = (df1.loc[df1.groupby('country')['count'].idxmax()]
          .drop('count', axis=1)
          .reset_index(drop=True))
print (df2)
country scores attempts
0 india 12 3
1 pakistan 14 4
2 srilanka 23 5
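An equivalent route, shown here as a sketch (it assumes pandas >= 1.1, where DataFrame.value_counts accepts a column subset): value_counts does the size step and the counting column in one call, and the same idxmax trick then picks the first mode per country.
counts = df.value_counts(['country', 'scores', 'attempts']).reset_index(name='count')
# idxmax returns the row label of the highest count within each country
df2 = (counts.loc[counts.groupby('country')['count'].idxmax()]
             .drop('count', axis=1)
             .reset_index(drop=True))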

How to replace the values of a column with another column, only for NaN values?

How can I fill the values of the ["state"] column with the ["country"] column, only where they are NaN?
Like in this Pandas DataFrame:
state country sum
0 NaN China 1
1 Assam India 2
2 Odisa India 3
3 Bihar India 4
4 NaN India 5
5 NaN Srilanka 6
6 NaN Malaysia 7
7 NaN Bhutan 8
8 California US 9
9 Texas US 10
10 Newyork US 11
11 NaN US 12
12 NaN Canada 13
What code should I use to fill the state column with the country column only for NaN values, like this:
state country sum
0 China China 1
1 Assam India 2
2 Odisa India 3
3 Bihar India 4
4 India India 5
5 Srilanka Srilanka 6
6 Malaysia Malaysia 7
7 Bhutan Bhutan 8
8 California US 9
9 Texas US 10
10 Newyork US 11
11 US US 12
12 Canada Canada 13
I can use this code:
df.loc[df['state'].isnull(), 'state'] = df[df['state'].isnull()]['country'].replace(df['country'])
But on a very large dataset with 300K rows it computes for 5-6 minutes and crashes every time, because it replaces one value at a time.
Can anyone help me with more efficient code for this?
Perhaps use fillna, without the isnull() check and replace():
df['state'].fillna(df['country'], inplace=True)
print(df)
Output
state country sum
0 China China 1
1 Assam India 2
2 Odisa India 3
3 Bihar India 4
4 India India 5
5 Srilanka Srilanka 6
6 Malaysia Malaysia 7
7 Bhutan Bhutan 8
8 California US 9
9 Texas US 10
10 Newyork US 11
11 US US 12
12 Canada Canada 13
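One note for newer pandas versions: fillna with inplace=True on a single column can trigger chained-assignment warnings under pandas 2.x (and does not cooperate with copy-on-write), so the plain assignment form is the safer equivalent:
df['state'] = df['state'].fillna(df['country'])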

Python: dropping rows of a DataFrame while keeping a specific group

Let's say that I have this dataframe:
import pandas as pd
Name = ['ID', 'Country', 'IBAN','ID_bal_amt', 'ID_bal_time','Dan_city','ID_bal_mod','Dan_country','ID_bal_type', 'ID_bal_amt', 'ID_bal_time','ID_bal_mod','ID_bal_type' ,'Dan_sex', 'Dan_Age', 'Dan_country','Dan_sex' , 'Dan_city','Dan_country','ID_bal_amt', 'ID_bal_time','ID_bal_mod','ID_bal_type' ]
Value = ['TAMARA_CO', 'GERMANY','FR56', '12','June','Berlin','OPBD', '55','CRDT','432', 'August', 'CLBD','DBT', 'M', '22', 'FRA', 'M', 'Madrid', 'ESP','432','March','FABD','CRDT']
Ccy = ['','','','EUR','EUR','','EUR','','','','EUR','EUR','USD','USD','USD','','CHF', '','DKN','','','USD','CHF']
Group = ['0','0','0','1','1','1','1','1','1','2','2','2','2','2','2','2','3','3','3','4','4','4','4']
df = pd.DataFrame({'Name':Name, 'Value' : Value, 'Ccy' : Ccy,'Group':Group})
print(df)
Name Value Ccy Group
0 ID TAMARA_CO 0
1 Country GERMANY 0
2 IBAN FR56 0
3 ID_bal_amt 12 EUR 1
4 ID_bal_time June EUR 1
5 Dan_city Berlin 1
6 ID_bal_mod OPBD EUR 1
7 Dan_country 55 1
8 ID_bal_type CRDT 1
9 ID_bal_amt 432 2
10 ID_bal_time August EUR 2
11 ID_bal_mod CLBD EUR 2
12 ID_bal_type DBT USD 2
13 Dan_sex M USD 2
14 Dan_Age 22 USD 2
15 Dan_country FRA 2
16 Dan_sex M CHF 3
17 Dan_city Madrid 3
18 Dan_country ESP DKN 3
19 ID_bal_amt 432 4
20 ID_bal_time March 4
21 ID_bal_mod FABD USD 4
22 ID_bal_type CRDT CHF 4
I want to reduce this dataframe. I want to drop the rows whose Name contains the string "bal", except for the group whose mode is "CLBD": that is, I search for the value "CLBD" under the name "ID_bal_mod" and then keep all the other names (ID_bal_amt, ID_bal_time, ID_bal_mod, ID_bal_type) that are in the same group. In this example, those are the names in group 2.
In addition, I want to change their value in the "Group" column to 0.
So at the end I would like to get this new dataframe, where the index is reset too:
Name Value Ccy Group
0 ID TAMARA_CO 0
1 Country GERMANY 0
2 IBAN FR56 0
3 Dan_city Berlin 1
4 Dan_country 55 1
5 ID_bal_amt 432 0
6 ID_bal_time August EUR 0
7 ID_bal_mod CLBD EUR 0
8 ID_bal_type DBT USD 0
9 Dan_sex M USD 2
10 Dan_Age 22 USD 2
11 Dan_country FRA 2
12 Dan_sex M CHF 3
13 Dan_city Madrid 3
14 Dan_country ESP DKN 3
Does anyone have an efficient idea?
Thank you.
Let's try your logic:
rows_with_bal = df['Name'].str.contains('bal')
groups_with_CLBD = ((rows_with_bal & df['Value'].eq('CLBD'))
                    .groupby(df['Group'])
                    .transform('any'))
# set the `Group` to 0 for `groups_with_CLBD`
df.loc[groups_with_CLBD, 'Group'] = 0
# keep the rows without bal or `groups_with_CLBD`
df = df.loc[(~rows_with_bal) | groups_with_CLBD]
Output:
Name Value Ccy Group
0 ID TAMARA_CO 0
1 Country GERMANY 0
2 IBAN FR56 0
5 Dan_city Berlin 1
7 Dan_country 55 1
9 ID_bal_amt 432 0
10 ID_bal_time August EUR 0
11 ID_bal_mod CLBD EUR 0
12 ID_bal_type DBT USD 0
13 Dan_sex M USD 0
14 Dan_Age 22 USD 0
15 Dan_country FRA 0
16 Dan_sex M CHF 3
17 Dan_city Madrid 3
18 Dan_country ESP DKN 3
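The question also asks for the index to be reset at the end; one extra line on top of the code above takes care of that:
df = df.reset_index(drop=True)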

Python Pivot: Can I get the count of columns per row (id/index) and store it in a new column?

I hope you can help me with this.
The df looks like this.
region AMER
country Brazil Canada Columbia Mexico United States
metro Rio de Janeiro Sao Paulo Toronto Bogota Mexico City Monterrey Atlanta Boston Chicago Culpeper Dallas Denver Houston Los Angeles Miami New York Philadelphia Seattle Silicon Valley Washington D.C.
ID
321321 2 1 1 13 15 29 1 2 1 11 6 15 3 2 14 3
23213 3
231 2 2 3 1 5 6 3 3 4 3 3 4
23213 4 1 1 1 4 1 2 27 1
21321 4 2 2 1 14 3 2 4 2
12321 1 2 1 1 1 1 10
123213 2 45 5 1
12321 1
123 1 3 2
I want to get the count of columns that have data, per metro and country per region, for every row (ID/index), and store that count in a new column.
You may want to try:
df['new'] = df.notna().sum(axis=1)  # count of non-empty cells per row
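If a per-country breakdown is wanted rather than one overall count, here is a sketch, under the assumption that the columns form a MultiIndex whose levels are named region, country, and metro, as the header rows suggest:
# count the non-empty metro cells per country in each row; transposing
# first avoids groupby(axis=1), which newer pandas versions deprecate
counts = df.notna().T.groupby(level='country').sum().T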

pandas: filtering by group size and data value

Having grouped data, I want to drop from the results groups that contain only a single observation with the value below a certain threshold.
Initial data:
df = pd.DataFrame(data={'Province': ['ON','QC','BC','AL','AL','MN','ON'],
                        'City': ['Toronto','Montreal','Vancouver','Calgary','Edmonton','Winnipeg','Windsor'],
                        'Sales': [13,6,16,8,4,3,1]})
City Province Sales
0 Toronto ON 13
1 Montreal QC 6
2 Vancouver BC 16
3 Calgary AL 8
4 Edmonton AL 4
5 Winnipeg MN 3
6 Windsor ON 1
Now grouping the data:
df.groupby(['Province', 'City']).sum()
Sales
Province City
AL Calgary 8
Edmonton 4
BC Vancouver 16
MN Winnipeg 3
ON Toronto 13
Windsor 1
QC Montreal 6
Now the part I can't figure out is how to drop provinces with only one city (or generally N observations) whose total sales are less than 10. The expected output should be:
Sales
Province City
AL Calgary 8
Edmonton 4
BC Vancouver 16
ON Toronto 13
Windsor 1
I.e. MN/Winnipeg and QC/Montreal are gone from the results. Ideally they wouldn't be completely gone but combined into a new group called 'Other', but this may be material for another question.
You can do it this way:
In [188]: df
Out[188]:
City Province Sales
0 Toronto ON 13
1 Montreal QC 6
2 Vancouver BC 16
3 Calgary AL 8
4 Edmonton AL 4
5 Winnipeg MN 3
6 Windsor ON 1
In [189]: g = df.groupby(['Province', 'City']).sum().reset_index()
In [190]: g
Out[190]:
Province City Sales
0 AL Calgary 8
1 AL Edmonton 4
2 BC Vancouver 16
3 MN Winnipeg 3
4 ON Toronto 13
5 ON Windsor 1
6 QC Montreal 6
Now we will create a mask for those 'provinces with more than one city':
In [191]: mask = g.groupby('Province').City.transform('count') > 1
In [192]: mask
Out[192]:
0 True
1 True
2 False
3 False
4 True
5 True
6 False
dtype: bool
And cities whose total sales are greater than or equal to 10 also make the cut:
In [193]: g[(mask) | (g.Sales >= 10)]
Out[193]:
Province City Sales
0 AL Calgary 8
1 AL Edmonton 4
2 BC Vancouver 16
4 ON Toronto 13
5 ON Windsor 1
I wasn't satisfied with any of the answers given, so I kept chipping at this until I figured out the following solution:
In [72]: df
Out[72]:
City Province Sales
0 Toronto ON 13
1 Montreal QC 6
2 Vancouver BC 16
3 Calgary AL 8
4 Edmonton AL 4
5 Winnipeg MN 3
6 Windsor ON 1
In [73]: df.groupby(['Province', 'City']).sum().groupby(level=0).filter(lambda x: len(x) > 1 or x.Sales.sum() > 10)
Out[73]:
Sales
Province City
AL Calgary 8
Edmonton 4
BC Vancouver 16
ON Toronto 13
Windsor 1
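As for the 'Other' bucket the question mentions in passing, a hedged sketch building on the mask idea from the first answer: relabel the provinces that fail the test, then group again.
g = df.groupby(['Province', 'City'], as_index=False)['Sales'].sum()
# provinces with more than one city, or single cities with sales >= 10, survive
keep = (g.groupby('Province')['City'].transform('count') > 1) | (g['Sales'] >= 10)
g.loc[~keep, 'Province'] = 'Other'   # QC/Montreal and MN/Winnipeg land here
result = g.groupby(['Province', 'City']).sum()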
