pandas: column formatting issues causing merge problems

I have the following two dataframes:
import pandas as pd

url = 'https://raw.githubusercontent.com/108michael/ms_thesis/master/rgdp_catcode.merge'
df = pd.read_csv(url, index_col=0)
df.head(1)
naics catcode GeoName Description ComponentName year GDP state
0 22 E1600',\t'E1620',\t'A4000',\t'E5000',\t'E3000'... Alabama Utilities Real GDP by state 2004 5205 AL
url = 'https://raw.githubusercontent.com/108michael/ms_thesis/master/mpl.Bspons.merge'
df1 = pd.read_csv(url, index_col=0)
df1.head(1)
state year unemployment log_diff_unemployment id.thomas party type date bills id.fec years_exp session name disposition catcode
0 AK 2006 6.6 -0.044452 1440 Republican sen 2006-05-01 s2686-109 S2AK00010 39 109 National Cable & Telecommunications Association support C4500
Regarding df, I had to manually input the catcode values, and I think that is why the formatting is off. What I would like is simply to have the values without the \t prefix. I want to merge the dfs on catcode, state, and year. In an earlier test, a df1.catcode with only one value per cell matched against values in another df.catcode that had more than one value per cell, and it worked.
So technically, all I need to do is lose the \t before each consecutive value in df.catcode. Additionally, if anyone has done a merge of this sort before, any caveats learned through experience would be appreciated. My merge code looks like this:
mplmerge = pd.merge(df1, df, on=['catcode', 'state', 'year'], how='left')
I think this can be done with the regex method, I'm looking at the documentation now.

Cleaning the catcode column in df is rather straightforward:
catcode_fixed = df.catcode.str.findall(r'[A-Z][0-9]{4}')
This will produce a series with a list of catcodes in every row:
catcode_fixed.head(3)
Out[195]:
0 [E1600, E1620, A4000, E5000, E3000, E1000]
1 [X3000, X3200, L1400, H6000, X5000]
2 [X3000, X3200, L1400, H6000, X5000]
Name: catcode, dtype: object
If I understand correctly what you want, then you need to "ungroup" these lists. Here is the trick, in short:
catcode_fixed = catcode_fixed.apply(pd.Series).stack()
catcode_fixed.index = catcode_fixed.index.droplevel(-1)
So, we've got (note the index values):
catcode_fixed.head(12)
Out[206]:
0 E1600
0 E1620
0 A4000
0 E5000
0 E3000
0 E1000
1 X3000
1 X3200
1 L1400
1 H6000
1 X5000
2 X3000
dtype: object
Now, dropping the old catcode and joining in the new one:
df.drop('catcode', axis=1, inplace=True)
catcode_fixed.name = 'catcode'
df = df.join(catcode_fixed)
By the way, you may also need to use df1.reset_index() when merging the data frames.
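As an aside, if you are on pandas 0.25 or later, Series.explode does the same "ungrouping" in one step, repeating the index for you, so the droplevel trick is not needed. A sketch, starting again from the original df:
# explode() turns each list element into its own row and repeats the index
catcode_fixed = df.catcode.str.findall(r'[A-Z][0-9]{4}').explode()
catcode_fixed.name = 'catcode'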

Related

How to construct new DataFrame based on data from for loops?

I have a data set (datacomplete2), where I have data for each country for two different years. I want to calculate the difference between these years for each country (for values life, health, and lifegdp) and create a new data frame with the results.
The code:
life, health, lifegdp = [], [], []
for i in datacomplete2['Country'].unique():
    life.append(datacomplete2.loc[(datacomplete2['Country'] == i) & (datacomplete2['Year'] == 2016), 'life']
                - datacomplete2.loc[(datacomplete2['Country'] == i) & (datacomplete2['Year'] == 2000), 'life'])
    health.append(datacomplete2.loc[(datacomplete2['Country'] == i) & (datacomplete2['Year'] == 2016), 'health']
                  - datacomplete2.loc[(datacomplete2['Country'] == i) & (datacomplete2['Year'] == 2000), 'health'])
    lifegdp.append(datacomplete2.loc[(datacomplete2['Country'] == i) & (datacomplete2['Year'] == 2016), 'lifegdp']
                   - datacomplete2.loc[(datacomplete2['Country'] == i) & (datacomplete2['Year'] == 2000), 'lifegdp'])
newData = pd.DataFrame([life, health, lifegdp, datacomplete2['Country'].unique()],
                       columns=['life', 'health', 'lifegdp', 'country'])
newData
newData
I think the for loop for calculating is correct, and the problem is in creating the new DataFrame. When I try to run the code, I get an error message: 4 columns passed, passed data had 210 columns.
I have 210 countries, so I assume it is somehow putting those values into the columns?
Here is also a link to a sneak peek of the data I'm using: https://i.imgur.com/jbGFPpk.png
The data as text would look like:
Country Code Year life health lifegdp
0 Algeria DZA 2000 70.292000 3.489033 20.146558
1 Algeria DZA 2016 76.078000 6.603844 11.520259
2 Angola AGO 2000 47.113000 1.908599 24.684593
3 Angola AGO 2016 61.547000 2.713149 22.684710
4 Antigua and Barbuda ATG 2000 73.541000 4.480701 16.412834
... ... ... ... ... ... ...
415 Vietnam VNM 2016 76.253000 5.659194 13.474181
416 World OWID_WRL 2000 67.684998 8.617628 7.854249
417 World OWID_WRL 2016 72.035337 9.978453 7.219088
418 Zambia ZMB 2000 44.702000 7.152371 6.249955
419 Zambia ZMB 2016 61.874000 4.477207 13.819775
Quick help required! I started coding like two weeks ago, so I'm very novice with this stuff.
Anurag Reddy's answer is a good concise solution if you know the years in advance. To present an alternative and slightly more general answer - this problem is a good example use case for pandas.DataFrame.diff.
Note that you don't actually need to sort your example data, but I've included a sort_values() line below to account for unsorted DataFrames.
import pandas as pd
# Read the raw datafile in
df = pd.read_csv("example.csv")
# Sort the data if required
df.sort_values(by=["Country"], inplace=True)
# Remove columns where you don't need the difference
new_df = df.drop(["Code", "Year"], axis=1)
# Group the data by country, take the difference between the rows, remove NaN rows, and reset the index to sequential integers
new_df = new_df.groupby(["Country"], as_index=False).diff().dropna().reset_index(drop=True)
# Add back the country names and codes as columns in the new DataFrame
new_df.insert(loc=0, column="Country", value=df["Country"].unique())
new_df.insert(loc=1, column="Code", value=df["Code"].unique())
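As a side note on the error itself: pd.DataFrame treats a list of lists as rows, so [life, health, lifegdp, countries] becomes 4 rows of 210 values each, which is why pandas complains that 4 column names were passed while the data had 210 columns. If you'd rather keep the loop, a minimal repair (a sketch, assuming every country appears in both years) is to build the frame from a dict and reduce each one-row selection to a scalar:
life, health, lifegdp = [], [], []
for country in datacomplete2['Country'].unique():
    sel16 = datacomplete2.loc[(datacomplete2['Country'] == country) & (datacomplete2['Year'] == 2016)]
    sel00 = datacomplete2.loc[(datacomplete2['Country'] == country) & (datacomplete2['Year'] == 2000)]
    # .squeeze() turns each one-row selection into a scalar so the lists hold plain numbers
    life.append(sel16['life'].squeeze() - sel00['life'].squeeze())
    health.append(sel16['health'].squeeze() - sel00['health'].squeeze())
    lifegdp.append(sel16['lifegdp'].squeeze() - sel00['lifegdp'].squeeze())
# A dict maps each column name to its own list, so no transposing is needed
newData = pd.DataFrame({'life': life, 'health': health, 'lifegdp': lifegdp,
                        'country': datacomplete2['Country'].unique()})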
You could do this instead
country_list = df.Country.unique().tolist()
df = df.drop(columns=['Code'])  # assign the result; drop() is not in-place by default
df_2016 = df.loc[(df['Country'].isin(country_list)) & (df['Year'] == 2016)].reset_index(drop=True)
df_2000 = df.loc[(df['Country'].isin(country_list)) & (df['Year'] == 2000)].reset_index(drop=True)
df_2016 = df_2016.drop(columns=['Year'])
df_2000 = df_2000.drop(columns=['Year'])
df_2016.set_index('Country').subtract(df_2000.set_index('Country'), fill_value=0)

How to append two dataframe objects containing same column data but different column names?

I want to append an expense df to a revenue df but can't properly do so. Can anyone suggest how I might do this?
import pandas as pd
import lxml
from lxml import html
import requests
import numpy as np

symbol = 'MFC'
url = 'https://www.marketwatch.com/investing/stock/' + symbol + '/financials'
df = pd.read_html(url)  # returns a list of dataframes
revenue = pd.concat(df[0:1])  # the revenue dataframe
revenue = revenue.dropna(axis='columns')  # drop NaN columns
header = revenue.iloc[:0]  # revenue df header row
expense = pd.concat(df[1:2])  # the expense dataframe
expense = expense.dropna(axis='columns')  # drop NaN columns
statement = revenue.append(expense)  # results in a dataframe with an added column (Unnamed: 0)
The columns of revenue = pd.concat(df[0:1]) are:
Fiscal year is January-December. All values CAD millions.
2015
2016
2017
2018
2019
and the columns of expense = pd.concat(df[1:2]) are:
Unnamed: 0
2015
2016
2017
2018
2019
How can I append the expense dataframe to the revenue dataframe so that I am left with a single dataframe object?
Thanks,
Rename the columns:
df = df.rename(columns={'old_name': 'new_name'})
Then append them with concat() or append().
I managed to append the dataframes with the following code. Thanks, David, for putting me on the right track. I admit this is not the best way to do it, because in a runtime environment I wouldn't know the value of the text to rename, and I've hard-coded it here. Ideally it would be best to reference a placeholder at df.iloc[:0, 0] instead, but I'm having a tough time getting that to work.
df=pd.read_html(url)
revenue = pd.concat(df[0:1])
revenue = revenue.dropna(axis='columns')
revenue.rename({'Fiscal year is January-December. All values CAD millions.':'LineItem'},axis=1,inplace=True)
header = revenue.iloc[:0]
expense = pd.concat(df[1:2])
expense = expense.dropna(axis='columns')
expense.rename({'Unnamed: 0':'LineItem'}, axis=1, inplace=True)
statement = revenue.append(expense, ignore_index=True)
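On the placeholder idea: instead of hard-coding the header text, you can rename the first column by position, which works regardless of the exact wording MarketWatch uses. A sketch:
# Rename the first column by its position rather than its literal header text
revenue = revenue.rename(columns={revenue.columns[0]: 'LineItem'})
expense = expense.rename(columns={expense.columns[0]: 'LineItem'})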
Using the df = pd.read_html(url) construct, a list of several dataframes is returned when scraping MarketWatch financials. The function below returns a single dataframe of all balance-sheet elements; the same code applies to the quarterly and annual income and cash-flow statements.
def getBalanceSheet(url):
    df = pd.read_html(url)
    # Count the tables whose first column came through as 'Unnamed: 0'
    count = sum([1 for listitem in df if 'Unnamed: 0' in listitem])
    statement = pd.concat(df[0:1])
    statement = statement.dropna(axis='columns')
    if 'q' in url:  # quarterly
        statement.rename({'All values CAD millions.': 'LineItem'}, axis=1, inplace=True)
    else:
        statement.rename({'Fiscal year is January-December. All values CAD millions.': 'LineItem'}, axis=1, inplace=True)
    for rowidx in range(count):
        next_df = pd.concat(df[rowidx + 1:rowidx + 2])
        next_df = next_df.dropna(axis='columns')
        next_df.rename({'Unnamed: 0': 'LineItem'}, axis=1, inplace=True)
        statement = statement.append(next_df, ignore_index=True)
    return statement
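For reference, a hypothetical call; the exact MarketWatch balance-sheet URL path is an assumption:
# Hypothetical usage; the '/balance-sheet' suffix is an assumption
url = 'https://www.marketwatch.com/investing/stock/MFC/financials/balance-sheet'
balance_sheet = getBalanceSheet(url)
print(balance_sheet.head())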

Calculating sum - after rows are grouped using groupby

I want to group a dataframe by a particular column and calculate the sum of the subgroups thus created, while retaining (displaying) all the records in each subgroup.
I am trying to create my own credit card expense tracking program. (I know there are several already available, but the idea is to learn Python.)
I have the usual fields of 'Merchant', 'Date', 'Type', and 'Amount'.
I would like to do one of the following:
Group items by merchant, then within each such group, split the amount under two new columns, 'debit' and 'credit'. I also want to be able to sum the amounts under these columns, and to repeat this for every merchant group.
If it is not possible to split based on the 'Type' of the transaction (that is, as 'debit' and 'credit'), then I want to be able to sum the debits and credits SEPARATELY and also retain the line items (when displaying, that is). Doing a sum() on the 'Amount' column gives just one number for each merchant, and I verified that it is an incorrect amount.
My data frame looks like this:
Posted_Date Amount Type Merchant
0 04/20/2019 -89.70 Debit UNI
1 04/20/2019 -6.29 Debit BOOKM
2 04/20/2019 -36.42 Debit BROOKLYN
3 04/18/2019 -20.95 Debit MTA*METROCARD
4 04/15/2019 -29.90 Debit ZARA
5 04/15/2019 -7.70 Debit STILES
The code I have, after reading into a data frame and marking a transaction as credit or debit is:
merch_new = df_new.groupby(['Merchant', 'Type'])
merch_new.groups
for key, values in merch_new.groups.items():
    df_new['Amount'].sum()
    print(df_new.loc[values], "\n\n")
I was able to split it the way below:
Posted_Date Amount Type Merchant
217 05/23/2019 -41.70 Debit AT
305 04/27/2019 -12.40 Debit AT
Posted_Date Amount Type Merchant
127 07/08/2019 69.25 Credit AT
162 06/21/2019 139.19 Credit AT
Ideally, I would like something like the below:
The line items are displayed along with a total for each subgroup, in this case for merchant 'AT', ideally sorted by date.
Date Merchant Credit Debit
305 4/27/2019 AT 0 -12.4
217 5/23/2019 AT 0 -41.7
162 6/21/2019 AT 139.19 0
127 7/8/2019 AT 69.25 0
208.44 -54.1
It appears simple, but I am unable to format it in this way.
EDIT:
I get an error for rename_axis():
rename_axis() got an unexpected keyword argument 'index'
and if I delete the index argument, I get the same error for 'columns'
I searched a lot for this usage (like Benoit showed) but cannot find any examples; they all used strings or lists. I tried using:
rename_axis(None, None)
and I get the error:
ValueError: No axis named None for object type <class 'pandas.core.frame.DataFrame'>
I don't know if this is because of the python version I am using (3.6.6). I tried on both Spyder and Jupyter. But I get the same error.
I used:
rename_axis(None, axis=1)
and I seem to get the desired results (sort of). But I am unable to understand how this is being interpreted, since no keyword specifies which parameter "None" is read into. Can anyone please explain?
Any help is appreciated!
Thanks a lot!
I think you're trying to achieve something like this:
In [1]:
## Create example
import pandas as pd

cols = ['Posted_Date', 'Amount', 'Type', 'Merchant']
data = [['04/20/2019', -89.70, 'Debit', 'UNI'],
        ['04/20/2019', -6.29, 'Credit', 'BOOKM'],
        ['04/20/2019', -36.42, 'Debit', 'BROOKLYN'],
        ['04/20/2019', -6.29, 'Credit', 'BOOKM'],
        ['04/20/2019', -54.52, 'Credit', 'BROOKLYN'],
        ['04/18/2019', -20.95, 'Credit', 'BROOKLYN']]
df = pd.DataFrame(columns=cols, data=data)

## Pivot table with aggregation function 'sum'
df_final = (pd.pivot_table(df, values='Amount',
                           index=['Posted_Date', 'Merchant'],
                           columns=['Type'], aggfunc='sum')
              .fillna(0)
              .reset_index()
              .rename_axis(index=None, columns=None))
df_final['Total'] = df_final['Debit'] + df_final['Credit']
Out [1]:
Posted_Date Merchant Credit Debit Total
0 04/18/2019 BROOKLYN -20.95 0.00 -20.95
1 04/20/2019 BOOKM -12.58 0.00 -12.58
2 04/20/2019 BROOKLYN -54.52 -36.42 -90.94
3 04/20/2019 UNI 0.00 -89.70 -89.70
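Regarding the rename_axis() error from the EDIT: as far as I can tell, the index= and columns= keyword arguments only exist from pandas 0.24 onward, which would explain why an older installation rejects them. The positional form works everywhere because the first argument is the mapper and axis selects which axis it applies to:
# Positional form: None is the mapper (it clears the axis name) and axis=1
# targets the columns axis; this also works on pandas versions before 0.24
df_final = df_final.rename_axis(None, axis=1)
# Keyword form, equivalent but newer (pandas >= 0.24, an assumption to verify):
# df_final = df_final.rename_axis(columns=None)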

Value in pandas dataframe is 13 but not always recognized

I am working on an assignment for the Coursera Introduction to Data Science course. I have a dataframe with 'Country' as the index and 'Rank' as one of the columns. When I try to reduce the dataframe to include only the rows with countries ranked 1-15, the following works but excludes Iran, which is ranked 13.
df.set_index('Country', inplace=True)
df.loc['Iran', 'Rank'] = 13  # I did this in case there was some sort of corruption in the original data
df_top15 = df.where(df.Rank < 16).dropna().copy()
return df_top15
When I try
df_top15 = df.where(df.Rank == 12).dropna().copy()
I get the row for Spain.
But when I try
df_top15 = df.where(df.Rank == 13).dropna().copy()
I just get the column headers, no row for Iran.
I also tried
df.Rank == 13
and got a series with False for all countries but Iran, which was True.
Any idea what could be causing this?
Your code works fine:
df = pd.DataFrame([['Italy', 5],
                   ['Iran', 13],
                   ['Tinbuktu', 20]],
                  columns=['Country', 'Rank'])
res = df.where(df.Rank < 16).dropna()
print(res)
Country Rank
0 Italy 5.0
1 Iran 13.0
However, I dislike this method because the mask in where converts the unmatched values to NaN, which turns the dtype of your Rank series into float.
A better idea, in my opinion, is to use query or loc. Using either method obviates the need for dropna:
res = df.query('Rank < 16')
res = df.loc[df['Rank'] < 16]
print(res)
Country Rank
0 Italy 5
1 Iran 13
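One thing worth checking in the real data, since the code is fine on its own: dropna() defaults to how='any', so if Iran's row contains a NaN in any other column, df.where(df.Rank == 13) keeps the row through the mask but dropna() then removes it because of that unrelated NaN. Restricting dropna to the Rank column would sidestep this (a sketch, assuming such a NaN is the culprit):
# Drop only the rows that where() actually masked out (Rank became NaN),
# not rows that merely contain NaNs in other, unrelated columns
df_top15 = df.where(df.Rank < 16).dropna(subset=['Rank']).copy()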

Conditional Filling in Missing Values in a Pandas Data frame using non-conventional means

TLDR; How can I improve my code and make it more pythonic?
Hi,
One of the interesting challenges we were given in a tutorial was the following:
"There are X missing entries in the data frame with an associated code but a 'blank' entry next to the code. This is a random occurrence across the data frame. Using your knowledge of pandas, map each missing 'blank' entry to the associated code."
So this looks like the following:
code  name
001   Australia
002   London
...
001   <blank>
The approach I have used is as follows:
Loop through the entire dataframe and identify entries with blank names (""), then replace each blank by copying in the correct name for its associated code.
code_names = ["",
              'Economic management',
              'Public sector governance',
              'Rule of law',
              'Financial and private sector development',
              'Trade and integration',
              'Social protection and risk management',
              'Social dev/gender/inclusion',
              'Human development',
              'Urban development',
              'Rural development',
              'Environment and natural resources management']

df_copy = df_.copy()

# Look through each code/name pair and, if the name is empty,
# store the proper name in its place
for x in range(len(df_copy.mjtheme_namecode)):
    for y in range(len(df_copy.mjtheme_namecode[x])):
        if df_copy.mjtheme_namecode[x][y]['name'] == "":
            df_copy.mjtheme_namecode[x][y]['name'] = code_names[int(df_copy.mjtheme_namecode[x][y]['code'])]
# Print the first 25 entries to check the result
limit = 25
counter = 0
for x in range(len(df_copy.mjtheme_namecode)):
    for y in range(len(df_copy.mjtheme_namecode[x])):
        print(df_copy.mjtheme_namecode[x][y])
        counter += 1
        if counter >= limit:
            break
While the above approach works, is there a better, more pythonic way of achieving what I'm after? I feel my approach is very clunky because my skills are not yet well developed.
Thank you!
Method 1:
One way to do this would be to replace all your "" blanks with NaN, sort the dataframe by code and name, and use fillna(method='ffill'):
Starting with this:
>>> df
code name
0 1 Australia
1 2 London
2 1
You can apply the following:
import numpy as np

new_df = (df.replace({'name': {'': np.nan}})
            .sort_values(['code', 'name'])
            .fillna(method='ffill')
            .sort_index())
>>> new_df
code name
0 1 Australia
1 2 London
2 1 Australia
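A small side note: on recent pandas (2.1+, if I recall correctly), fillna(method='ffill') is deprecated in favor of calling .ffill() directly, so the same chain would read:
# Same operation with the dedicated forward-fill method (newer pandas)
new_df = (df.replace({'name': {'': np.nan}})
            .sort_values(['code', 'name'])
            .ffill()
            .sort_index())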
Method 2:
This is more convoluted, but will work as well:
Using groupby, first, and squeeze, you can create a pd.Series mapping the codes to non-blank names, and use .map to apply that series to your code column:
df['name'] = (df['code']
              .map(df.replace({'name': {'': np.nan}})
                     .sort_values(['code', 'name'])
                     .groupby('code')
                     .first()
                     .squeeze()))
>>> df
code name
0 1 Australia
1 2 London
2 1 Australia
Explanation: The pd.Series map that this creates looks like this:
code
1 Australia
2 London
And it works because it gets the first instance for every code (via the groupby), sorted in such a manner that the NaNs are last. So as long as each code is associated with a name, this method will work.
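A shorter variant of the same idea, for comparison (a sketch on the small example frame): build the code-to-name mapping from the non-blank rows only, then map it over the code column:
# Build a Series mapping each code to its first non-blank name, then map it
mapping = (df.loc[df['name'] != '', ['code', 'name']]
             .drop_duplicates('code')
             .set_index('code')['name'])
df['name'] = df['code'].map(mapping)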
