Pandas reading and sorting a file's content - python

I am reading a file from SIPRI. It reads into pandas, the dataframe is created, and I can display it, but when I try to sort by a column I get a KeyError. Here is the code and the error:
import os
import pandas as pd
os.chdir('C:\\Users\\Student\\Documents')
#Find the top 20 countries in military spending by sorting
data = pd.read_excel('SIPRI-Milex-data-1949-2016.xls',
header = 0, index_col = 0, sheetname = 'Current USD')
data.sort_values(by = '2016', ascending = False)
KeyError: '2016'

You get the KeyError because the column '2016' is not present in the dataframe: in the Excel file the year headers are integers, not strings. You also need to clean the data before you can sort it.
You can skip the top 5 rows and the bottom 8 rows to get just the countries, then replace the whitespace-only and placeholder values with NaN. The following code will help you get that:
import numpy as np
import pandas as pd

# note: in newer pandas versions, 'sheetname' is 'sheet_name' and 'skip_footer' is 'skipfooter'
data = pd.read_excel('./SIPRI-Milex-data-1949-2016.xlsx', header = 0, index_col = 0,
                     sheetname = 'Current USD', skiprows = 5, skip_footer = 8)
data = data.replace(r'\s+', np.nan, regex=True).replace('xxx', np.nan)  # blanks and 'xxx' -> NaN
new_df = data.sort_values(2016, ascending=False)  # 2016 as an integer, not the string '2016'
top_20 = new_df[:20].index.tolist()
Output:
['USA', 'China, P.R.', 'Russian Federation', 'Saudi Arabia', 'India', 'France', 'UK', 'Japan', 'Germany', 'Korea, South', 'Italy', 'Australia', 'Brazil', 'Israel', 'Canada', 'Spain', 'Turkey', 'Iran', 'Algeria', 'Pakistan']

Well this could be helpful, I guess:
data = pd.read_excel('SIPRI-Milex-data-1949-2016.xlsx', skiprows=5, index_col = 0, sheetname = 'Current USD')
data.dropna(inplace=True)
data.sort_values(by=2016, ascending=False, inplace=True)
And to get the top 20 you can use:
data[data[2016].apply(lambda x: isinstance(x, (int, float)))][:20]
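An equivalent filter, as a sketch: coerce the column to numeric so the leftover string cells become NaN, then let nlargest do the sort-and-slice in one step:

data[2016] = pd.to_numeric(data[2016], errors='coerce')  # non-numeric cells -> NaN
data.nlargest(20, 2016)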

I downloaded the file, and it looks like 2016 is not a column header by itself, so you need to modify the dataframe a bit to make the country row the header.
The next thing is that you need to write data.sort_values(by = 2016, ascending = False), i.e. treat the column name as an integer instead of a string:
data = pd.read_excel('SIPRI-Milex-data-1949-2016.xlsx',
header = 0, index_col = 0, sheetname = 'Current USD')
data = data[4:]              # drop the note rows above the real header row
data.columns = data.iloc[0]  # promote the country/year row to the header
data = data[1:]              # drop that row from the data itself
data.sort_values(by = 2016, ascending = False)
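If you are unsure whether the year headers were parsed as integers or strings, it helps to inspect the column labels directly; a quick sketch (the exact dtype depends on the file and the pandas version):

print(data.columns[-5:])       # e.g. Int64Index([2012, 2013, 2014, 2015, 2016], dtype='int64')
print(2016 in data.columns)    # True  -> sort with the integer key
print('2016' in data.columns)  # False -> this is why the string key raises KeyError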

Related

Create multiple dataframes with a loop in Python

So I got this part of code that I want to make shorter:
df_1 = investpy.stocks.get_stock_recent_data('Eco','Colombia',False)
df_2 = investpy.stocks.get_stock_recent_data('JPM','United States',False)
df_3 = investpy.stocks.get_stock_recent_data('TSM','United States',False)
df_5 = investpy.stocks.get_stock_recent_data('CSCO','United States',False)
df_8 = investpy.stocks.get_stock_recent_data('NVDA','United States',False)
df_9 = investpy.stocks.get_stock_recent_data('BLK','United States',False)
As I use the same code and only a few things change from one line to another, I think I might solve this with a function. I created this one:
def _get_asset_data(ticker, country, state):
    investpy.stocks.get_stock_recent_data(ticker, country, state)
So I tried this:
_get_asset_data('TSLA', 'United States', False)
print(_get_asset_data)
<function _get_asset_data at 0x7f323c912560>
However, I do not know how to store each set of data that I receive from this function in a separate dataframe for each company. I tried a for loop but got nowhere.
Any ideas? Thank you in advance for your attention and cooperation!
Here is one approach based on the code given. You should refrain from using it in practice, as it contains redundant code, which makes it hard to maintain; you'll find a more flexible approach below it.
Based on your solution
import investpy
import pandas as pd

def _get_asset_data(ticker, country, state=False):
    return investpy.stocks.get_stock_recent_data(ticker, country, state)

df_1 = _get_asset_data('Eco', 'Colombia')
df_2 = _get_asset_data('JPM', 'United States')
df_3 = _get_asset_data('TSM', 'United States')
df_5 = _get_asset_data('CSCO', 'United States')
df_8 = _get_asset_data('NVDA', 'United States')
df_9 = _get_asset_data('BLK', 'United States')

final = pd.concat([df_1, df_2, df_3, df_5, df_8, df_9], axis=1)
final
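If you want to keep track of which columns belong to which ticker in the combined frame, pd.concat accepts a keys argument that adds an extra column level; a small sketch using the same dataframes:

final = pd.concat([df_1, df_2, df_3, df_5, df_8, df_9], axis=1,
                  keys=['Eco', 'JPM', 'TSM', 'CSCO', 'NVDA', 'BLK'])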
More versatile solution:
import investpy
import pandas as pd

def _get_asset_data(ticker, country, state=False):
    return investpy.stocks.get_stock_recent_data(ticker, country, state)

stocks = [
    ('Eco', 'Colombia'),
    ('JPM', 'United States'),
    ('TSM', 'United States'),
    ('CSCO', 'United States'),
    ('NVDA', 'United States'),
    ('BLK', 'United States'),
]

results = []
for stock in stocks:
    result = _get_asset_data(*stock)
    results.append(result)

final = pd.concat(results, axis=1)
final
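And if you would rather have one dataframe per company instead of a single concatenated frame, a small sketch that collects them in a dict keyed by ticker (assuming the same stocks list and _get_asset_data as above):

frames = {ticker: _get_asset_data(ticker, country) for ticker, country in stocks}
frames['NVDA']  # the individual dataframe for NVDA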

Pandas: Create a new column with column name and cell of matching string

I am searching through a large spreadsheet with 300 columns and over 200k rows. I would like to create a column that holds the column header and the matching cell value, something that looks like "Column||Value". I have the search term and the join aggregator. I can get the row index name, but I'm struggling to get the matching column and the specific cell. Here's my code so far:
df = pd.read_excel(r"Test_file")
mask = df.astype(str).applymap(lambda x: any(y in x for y in ['Chann','Midm'])).any(axis=1)
df['extract'] = df.loc[mask]  # This only gives me the index name. I would like the actual matched cell contents.
df['extract2'] = Column name  # pseudocode: this should be the name of the matching column
df['Match'] = df[['extract', 'extract2']].agg('||'.join, axis=1)
df.drop(['extract', 'extract2'], axis=1)
Final output should look something like this: [screenshot of the desired output]
You can create a mask for a specific column first (I edited your 2nd line a bit), then create a new 'Match' column with all values initialized to 'No Match', and finally change the values to your desired "Column||Value" format for the rows the mask selects. I implemented this in the following sample code:
import pandas as pd

def match_column(df, column_name):
    column_mask = df.astype(str).applymap(lambda x: any(y in x for y in ['Chann','Midm']))[column_name]
    df['Match'] = 'No Match'
    df.loc[column_mask, 'Match'] = column_name + ' || ' + df[column_name]
    return df

df = {
    'Segment': ['Government', 'Government', 'Midmarket', 'Midmarket', 'Government', 'Channel Partners'],
    'Country': ['Canada', 'Germany', 'France', 'Canada', 'France', 'France']
}
df = pd.DataFrame(df)
display(df)

df = match_column(df, 'Segment')
display(df)
Output:

            Segment Country                        Match
0        Government  Canada                     No Match
1        Government Germany                     No Match
2         Midmarket  France         Segment || Midmarket
3         Midmarket  Canada         Segment || Midmarket
4        Government  France                     No Match
5  Channel Partners  France  Segment || Channel Partners
However, this only works for a single column. I don't know what output you want for cases when there are matches in multiple columns (if you can, please specify).
UPDATE:
If you want to use a list of columns as input and match with the first instance, you can use this instead:
def match_first_column(df, column_list):
    df['Match'] = 'No Match'
    # iterate over rows
    for index, row in df.iterrows():
        # iterate over column names
        for column_name in column_list:
            column_value = row[column_name]
            substrings = ['Chann', 'Midm', 'Fran']
            # if a match is found
            if any(x in column_value for x in substrings):
                # add match string
                df.loc[index, 'Match'] = column_name + ' || ' + column_value
                # stop iterating and move to next row
                break
    return df

df = {
    'Segment': ['Government', 'Government', 'Midmarket', 'Midmarket', 'Government', 'Channel Partners'],
    'Country': ['Canada', 'Germany', 'France', 'Canada', 'France', 'France']
}
df = pd.DataFrame(df)
display(df)

column_list = df.columns.tolist()
match_first_column(df, column_list)
Output:

            Segment Country                        Match
0        Government  Canada                     No Match
1        Government Germany                     No Match
2         Midmarket  France         Segment || Midmarket
3         Midmarket  Canada         Segment || Midmarket
4        Government  France            Country || France
5  Channel Partners  France  Segment || Channel Partners
You can try:
mask = df.astype(str).applymap(lambda x: any(y in x for y in ['Chann','Midm'])).any(axis=1)
df.loc[mask, 'Match'] = df['extract'] + '||' + df['extract2']  # row-wise join of your two helper columns
df['Match'].fillna('No Match', inplace=True)
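For completeness, a loop-free sketch of the same "first matching column" idea, using idxmax on the boolean mask (assuming the df and substrings from the sample data above; the column list is an assumption):

substrings = ['Chann', 'Midm']
cols = ['Segment', 'Country']
mask = df[cols].astype(str).applymap(lambda x: any(s in x for s in substrings))
first_col = mask.idxmax(axis=1)  # label of the first matching column in each row
df['Match'] = [f"{col} || {df.at[i, col]}" if hit else 'No Match'
               for i, col, hit in zip(df.index, first_col, mask.any(axis=1))]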

Convert a multi-valued dict into a pandas dataframe

I want to convert this dict into a pandas dataframe where each key becomes a column and the values in the lists become the rows:
my_dict:
{'Last updated': ['2021-05-18T15:24:19.000Z', '2021-05-18T15:24:19.000Z'],
'Symbol': ['BTC', 'BNB', 'XRP', 'ADA', 'BUSD'],
'Name': ['Bitcoin', 'Binance Coin', 'XRP', 'Cardano', 'Binance USD'],
'Rank': [1, 3, 7, 4, 25],
}
The lists in my_dict can also have some missing values, which should appear as NaNs in the dataframe.
This is how I'm currently trying to append it into my dataframe:
df = pd.DataFrame(columns = ['Last updated',
                             'Symbol',
                             'Name',
                             'Rank'])
df = df.append(my_dict, ignore_index=True)
#print(df)
df.to_excel(r'\walletframe.xlsx', index = False, header = True)
But my output only has a single row containing all the values.
The answer was pretty simple. Instead of using
df = df.append(my_dict)
I used
df = pd.DataFrame.from_dict(my_dict, orient='index').T
With orient='index' each key becomes a row and the shorter lists are padded with NaN; transposing then turns the keys back into columns, so no columns are missing values they shouldn't have.
Credits to @Ank who helped me find the solution!
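An equivalent one-liner, as a sketch: wrapping each list in a Series also lets pandas pad the shorter lists with NaN:

df = pd.DataFrame({key: pd.Series(value) for key, value in my_dict.items()})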

Add a new column containing the difference between EACH TWO ROWS of another column of a data frame

I would like to get the difference between each 2 rows of the column Duration and then fill the values into a new column difference, or print it.
So basically I want: row(1)-row(2)=difference1, row(3)-row(4)=difference2, row(5)-row(6)=difference3, ...
Example of a code:
data = {'Profession':['Teacher', 'Banker', 'Teacher', 'Judge','lawyer','Teacher'], 'Gender':['Male','Male', 'Female', 'Male','Male','Female'],'Size':['M','M','L','S','S','M'],'Duration':['5','6','2','3','4','7']}
data2={'Profession':['Doctor', 'Scientist', 'Scientist', 'Banker','Judge','Scientist'], 'Gender':['Male','Male', 'Female','Female','Male','Male'],'Size':['L','M','L','M','L','L'],'Duration':['1','2','9','10','1','17']}
data3 = {'Profession':['Banker', 'Banker', 'Doctor', 'Doctor','lawyer','Teacher'], 'Gender':['Male','Male', 'Female', 'Female','Female','Male'],'Size':['S','M','S','M','L','S'],'Duration':['15','8','5','2','11','10']}
data4={'Profession':['Judge', 'Judge', 'Scientist', 'Banker','Judge','Scientist'], 'Gender':['Female','Female', 'Female','Female','Female','Female'],'Size':['M','S','L','S','M','S'],'Duration':['1','2','9','10','1','17']}
df= pd.DataFrame(data)
df2=pd.DataFrame(data2)
df3=pd.DataFrame(data3)
df4=pd.DataFrame(data4)
DATA=pd.concat([df,df2,df3,df4])
DATA.groupby(['Profession','Size','Gender']).agg('sum')
D=DATA.reset_index()
D['difference']=D['Duration'].diff(-1)
I tried using diff(-1), but it's not exactly what I'm looking for. Any ideas?
Is that what you wanted?
D["Neighbour"]=D["Duration"].shift(-1)
# fill empty lines with 0
D["Neighbour"] = D["Neighbour"].fillna(0)
# convert columns "Neighbour" and "Duration" to numeric
D["Neighbour"] = pd.to_numeric(D["Neighbour"])
D["Duration"] = pd.to_numeric(D["Duration"])
# get difference
D["difference"]=D["Duration"] - D["Neighbour"]
# remove "Neighbour" column
D = D.drop(columns=["Neighbour"], axis=1)
# blank out the odd lines so only the pairwise differences remain
D.loc[1::2,"difference"] = None
# print D
D
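An alternative sketch that computes only the pairwise differences via slicing (this assumes an even number of rows):

d = pd.to_numeric(D['Duration'])
pairwise = d.iloc[::2].to_numpy() - d.iloc[1::2].to_numpy()  # row0-row1, row2-row3, ...
D.loc[D.index[::2], 'difference'] = pairwise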

Check whether all unique values of column B are mapped to all unique values of column A

I need a little help. I know it's very easy, but I tried and didn't reach the goal.
# Import pandas library
import pandas as pd
data1 = [['India', 350], ['India', 600], ['Bangladesh', 350],['Bangladesh', 600]]
df1 = pd.DataFrame(data1, columns = ['Country', 'Bottle_Weight'])
data2 = [['India', 350], ['India', 600],['India', 200], ['Bangladesh', 350],['Bangladesh', 600]]
df2 = pd.DataFrame(data2, columns = ['Country', 'Bottle_Weight'])
data3 = [['India', 350], ['India', 600], ['Bangladesh', 350],['Bangladesh', 600],['Bangladesh', 200]]
df3 = pd.DataFrame(data3, columns = ['Country', 'Bottle_Weight'])
So basically I want to create a function which checks the mapping by comparing all other countries' unique bottle weights with the first country's.
According to the 1st dataframe, it should return text like: all unique values of Bottle_Weight are mapped to all unique countries.
According to the 2nd dataframe, it should return text like: 'Country_name' not mapped with 'Column name' 'value'. In this case: 'Bangladesh' not mapped with 'Bottle_Weight' 200.
According to the 3rd dataframe, it should return text like: all unique values of Bottle_Weight are mapped to all unique countries, and on a new line: 'Country_name' mapped with new value '200'.
It is not a particularly efficient algorithm, but I think this should get you the results you are looking for.
import numpy as np

def check_weights(df):
    success = True
    countries = df['Country'].unique()
    first_weights = df.loc[df['Country'] == countries[0]]['Bottle_Weight'].unique()
    for country in countries[1:]:
        weights = df.loc[df['Country'] == country]['Bottle_Weight'].unique()
        for weight in first_weights:
            if not np.any(weights[:] == weight):
                success = False
                print(f"{country} does not have bottle weight {weight}")
    if success:
        print("All bottle weights are shared with another country")
