Python Pandas compare two dataframes to assign country to phone number

I have two dataframes that I read in via csv. Dataframe one consists of a phone number and some additional data. The second dataframe contains country codes and country names.
I want to take the phone number from the first dataset and compare it to the country codes of the second. Country codes can be between one and four digits long. I go from the longest country code to the shortest. If there is a match, I want to assign the country name to the phone number.
Input longlist:
phonenumber, add_info
34123425209, info1
92654321762, info2
12018883637, info3
6323450001, info4
496789521134, info5
Input country_list:
country;country_code;order_info
Spain;34;1
Pakistan;92;4
USA;1;2
Philippines;63;3
Germany;49;4
Poland;48;1
Norway;47;2
Output should be:
phonenumber, add_info, country, order_info
34123425209, info1, Spain, 1
92654321762, info2, Pakistan, 4
12018883637, info3, USA, 2
6323450001, info4, Philippines, 3
496789521134, info5, Germany, 4
I have already solved it once like this:
#! /usr/bin/python
import csv
import pandas
with open('longlist.csv', 'r') as lookuplist:
    with open('country_list.csv', 'r') as inputlist:
        with open('Outputfile.csv', 'w') as outputlist:
            reader = csv.reader(lookuplist, delimiter=',')
            reader2 = csv.reader(inputlist, delimiter=';')
            writer = csv.writer(outputlist, dialect='excel')
            for i in reader2:
                for xl in reader:
                    if xl[0].startswith(i[1]):
                        zeile = [xl[0], xl[1], i[0], i[1], i[2]]
                        writer.writerow(zeile)
                lookuplist.seek(0)
But I would like to solve this problem using pandas. What I have got to work:
- Read in the csv files
- Remove duplicates from "longlist"
- Sort list of countries / country code
This is what I have working already:
import pandas as pd, numpy as np
longlist = pd.read_csv('path/to/longlist.csv',
                       usecols=[2, 3], names=['PHONENUMBER', 'ADD_INFO'])
country_list = pd.read_csv('path/to/country_list.csv',
                           sep=';', names=['COUNTRY', 'COUNTRY_CODE', 'ORDER_INFO'], skiprows=[0])
# remove duplicates and make phone number an index
longlist = longlist.drop_duplicates('PHONENUMBER')
longlist = longlist.set_index('PHONENUMBER')
# Sort country list, from high to low value and make country code an index
country_list = country_list.sort_values(by='COUNTRY_CODE', ascending=False)
country_list = country_list.set_index('COUNTRY_CODE')
(...)
longlist.to_csv('path/to/output.csv')
But anything I try along the same lines with the DataFrames does not work. I cannot apply startswith (I cannot iterate through the objects, and I cannot apply it to them directly). I would really appreciate your help.

I would do it this way:
import pandas as pd

cl = pd.read_csv('country_list.csv', sep=';', dtype={'country_code': str})
ll = pd.read_csv('phones.csv', skipinitialspace=True, dtype={'phonenumber': str})

lookup = cl['country_code']
lookup.index = cl['country_code']

ll['country_code'] = (
    ll['phonenumber']
      .apply(lambda x: pd.Series([lookup.get(x[:4]), lookup.get(x[:3]),
                                  lookup.get(x[:2]), lookup.get(x[:1])]))
      .apply(lambda x: x.get(x.first_valid_index()), axis=1)
)

# remove `how='left'` parameter if you don't need "unmatched" phone-numbers
result = ll.merge(cl, on='country_code', how='left')
Output:
In [195]: result
Out[195]:
phonenumber add_info country_code country order_info
0 34123425209 info1 34 Spain 1.0
1 92654321762 info2 92 Pakistan 4.0
2 12018883637 info3 1 USA 2.0
3 12428883637 info31 1242 Bahamas 3.0
4 6323450001 info4 63 Philippines 3.0
5 496789521134 info5 49 Germany 4.0
6 00000000000 BAD None NaN NaN
Explanation:
In [216]: (ll['phonenumber']
     .....:     .apply(lambda x: pd.Series([lookup.get(x[:4]), lookup.get(x[:3]),
     .....:                                 lookup.get(x[:2]), lookup.get(x[:1])]))
     .....: )
Out[216]:
0 1 2 3
0 None None 34 None
1 None None 92 None
2 None None None 1
3 1242 None None 1
4 None None 63 None
5 None None 49 None
6 None None None None
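The second .apply then walks each of these rows and keeps the value from the first non-empty column, i.e. the longest matching prefix; a minimal sketch of that step in isolation (reusing the lookup Series defined above):
prefixes = ll['phonenumber'].apply(lambda x: pd.Series(
    [lookup.get(x[:4]), lookup.get(x[:3]), lookup.get(x[:2]), lookup.get(x[:1])]))
# first_valid_index() returns the label of the first non-None column (or None),
# so for the Bahamas row '1242' is chosen over '1'; all-None rows yield None.
ll['country_code'] = prefixes.apply(lambda row: row.get(row.first_valid_index()), axis=1)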
phones.csv (I've intentionally added one Bahamas number (1242...) and one invalid number (00000000000)):
phonenumber, add_info
34123425209, info1
92654321762, info2
12018883637, info3
12428883637, info31
6323450001, info4
496789521134, info5
00000000000, BAD
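If you prefer to stay closer to the original startswith idea, here is a minimal alternative sketch, not the accepted approach, assuming pandas >= 1.1 (for the key argument of sort_values) and the same CSV layout as above: iterate over the country codes from longest to shortest and use the vectorized Series.str.startswith to tag all still-unmatched numbers.
import pandas as pd

cl = pd.read_csv('country_list.csv', sep=';', dtype=str)
ll = pd.read_csv('phones.csv', skipinitialspace=True, dtype=str)

ll['country'] = None
ll['order_info'] = None
# longest codes first, so e.g. '1242' (Bahamas) would win over '1' (USA)
for _, row in cl.sort_values('country_code', key=lambda s: s.str.len(),
                             ascending=False).iterrows():
    hit = ll['country'].isna() & ll['phonenumber'].str.startswith(row['country_code'])
    ll.loc[hit, ['country', 'order_info']] = [row['country'], row['order_info']]
Unmatched numbers simply keep None in the two new columns, which mirrors the how='left' behaviour of the merge above.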

Related

Pandas Taking a CSV and combining/concatenating rows from a previously pivoted csv

All, I have a simple objective.
I have a CSV that has thousands of columns and thousands of rows. I want to take the existing CSV and literally concatenate/combine the values ONLY from Row 1 & Row 2 into one single row, similar to below. The key thing to keep in mind is that some of the values (like "lion", "tiger", "bear") repeat several times, once for each metric. I do not want it to display lion.1, lion.2 etc.; it should just display lion.
Data sample for Rows 1 and 2:
flow color desc lion tiger bear lion tiger bear
flow color desc m1 m1 m1 m2 m2 m2
flavor1 catego1 flavor1 catego1 32 23 34 34 21 24
flavor2 catego2 flavor2 catego2 32 23 34 34 21 24
How I want the data to appear in the CSV in Row 1 (note: we need Row 2 to NOT appear in the file after we combine them):
"flow flow" "color color" "desc desc" "lion m1" "tiger m1" "bear m1" "lion m2" "tiger m2" "bear m2"
"flavor1 catego1" flavor1 catego1 32 23 34 34 21 24
"flavor2 catego2" flavor2 catego2 32 23 34 34 21 24
Sad code attempt:
import pandas as pd
df = pd.read_csv(r"C:Test.csv")
row_one = df.iloc[0]
spacer = " "
row_two = df.iloc[1]
new_header = row_one+spacer+row_two
Updated based on new information.
I'm assuming you haven't included the row index in your output, so I've got what works for my small dataset and I think it would work for yours.
import pandas as pd

def main():
    df = pd.DataFrame([['flow', 'color', 'desc'],
                       ['flavor1 catego1', 'flavor1', 'catego1'],
                       ['flavor2 catego2', 'flavor2', 'catego2']],
                      columns=['flow', 'color', 'desc'])
    print('before')
    print(df)
    df = df.rename(columns={x: f'{x} {y}' for x, y in zip(df.columns, df.iloc[0])})
    print(f'Index of row to delete: {df.iloc[0].name}')
    df.drop(index=df.iloc[0].name, inplace=True)
    print()
    print('after')
    print(df)

if __name__ == '__main__':
    main()
Output
before
flow color desc
0 flow color desc
1 flavor1 catego1 flavor1 catego1
2 flavor2 catego2 flavor2 catego2
Index of row to delete: 0
after
flow flow color color desc desc
1 flavor1 catego1 flavor1 catego1
2 flavor2 catego2 flavor2 catego2
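If the two header rows live in the CSV itself, another option is to let pandas read them as a two-level header and then join the levels. A minimal sketch, assuming a hypothetical Test.csv whose first two rows are the header rows shown in the question:
import pandas as pd

# Read both header rows as a MultiIndex, then glue each (top, bottom) pair
# together with a space, e.g. ('lion', 'm1') -> 'lion m1'.
df = pd.read_csv('Test.csv', header=[0, 1])
df.columns = [' '.join(str(level) for level in col).strip() for col in df.columns]
df.to_csv('Test_combined.csv', index=False)
Because the (top, bottom) pairs are distinct, repeated top-level names like lion never get the lion.1, lion.2 suffixes.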

Removing columns for which the column names are a float

I have data as follows:
import pandas as pd
url_cities="https://population.un.org/wup/Download/Files/WUP2018-F12-Cities_Over_300K.xls"
df_cities = pd.read_excel(url_cities)
i = df_cities.iloc[:, 1].notna().idxmax()
df_cities.columns = df_cities.iloc[i].tolist()
df_cities = df_cities.iloc[i+1:]
df_cities = df_cities.rename(columns={2020.0: 'City_pop'})
print(df_cities.iloc[0:20,])
I want to remove all columns for which the column names (NOT COLUMN VALUES) are floats.
I have looked at a couple of links (A, B, C), but I could not find the answer. Any suggestions?
This will do what your question asks:
df = df[[col for col in df.columns if not isinstance(col, float)]]
Example:
import pandas as pd
df = pd.DataFrame(columns=['a',1.1,'b',2.2,3,True,4.4,'c'],data=[[1,2,3,4,5,6,7,8],[11,12,13,14,15,16,17,18]])
print(df)
df = df[[col for col in df.columns if not isinstance(col, float)]]
print(df)
Initial dataframe:
a 1.1 b 2.2 3 True 4.4 c
0 1 2 3 4 5 6 7 8
1 11 12 13 14 15 16 17 18
Result:
a b 3 True c
0 1 3 5 6 8
1 11 13 15 16 18
Note that 3 is an int, not a float, so its column has not been removed.
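One detail worth noting for the question's actual data: depending on how the header row is extracted, numeric labels may be plain Python floats (e.g. after .tolist()) or numpy.float64, and numpy.float64 subclasses float, so the isinstance test above catches both. A quick check (sketch):
import numpy as np

print(isinstance(np.float64(2020.0), float))   # True: such a column would be dropped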
my_list = list(df_cities.columns)
for i in my_list:
    if type(i) != str:
        df_cities = df_cities.drop(columns=[i], axis=1)
Please, try this code.
I think your basic problem is the call that reads the Excel file.
If you skip the early rows and define the index correctly, you avoid the issue of having to remove float column headers altogether.
So change your call that opens the Excel file to the following:
df_cities = pd.read_excel(url_cities, skiprows=16, index_col=0)
Which yields a df like the following:
Country Code Country or area City Code Urban Agglomeration Note Latitude Longitude 1950 1955 1960 ... 1990 1995 2000 2005 2010 2015 2020 2025 2030 2035
Index
1 4 Afghanistan 20001 Herat NaN 34.348170 62.199670 82.468 85.751 89.166 ... 183.465 207.190 233.991 275.678 358.691 466.703 605.575 752.910 897.041 1057.573
2 4 Afghanistan 20002 Kabul NaN 34.528887 69.172460 170.784 220.749 285.352 ... 1549.320 1928.694 2401.109 2905.178 3289.005 3723.543 4221.532 4877.024 5737.138 6760.500
3 4 Afghanistan 20003 Kandahar NaN 31.613320 65.710130 82.199 89.785 98.074 ... 233.243 263.395 297.456 336.746 383.498 436.741 498.002 577.128 679.278 800.461
4 4 Afghanistan 20004 Mazar-e Sharif NaN 36.709040 67.110870 30.000 37.139 45.979 ... 135.153 152.629 172.372 206.403 283.532 389.483 532.689 681.531 816.040 962.262

Iterating through a dataframe to pull mins and max's as well as other columns based off of those values

I'm fairly new to python and very new to pandas so any help would be appreciated!
I have a dataframe where the data is structured like below:
Batch_Name  Tag 1  Tag 2
2019-01     1      3
2019-02     2      3
I want to iterate through the dataframe and pull the following into a new dataframe:
- The max value for each tag (there are 5 in my full data frame)
- The batch name at that max value
- The min value for that tag
- The batch name at that min value
- The average for that tag
- The std for that tag
I have had a lot of trouble trying to mentally structure this, and I run into errors even trying to create the dataframe with the summary statistics. Below is my first attempt at creating a function for the stats; I wasn't sure how to pull the batch names at all.
def tag_stats(df):
    min_col = {}
    min_col_batch = {}
    max_col = {}
    max_col_batch = {}
    std_col = {}
    avg_col = {}
    for col in range(df.shape[3:]):
        max_col[col] = df[col].max()
        min_col[col] = df[col].min()
        std_col[col] = df[col].std()
        avg_col[col] = df[col].avg()
    result = pd.DataFrame([min_col, max_col, std_col, avg_col], index=['min', 'max', 'std', 'avg'])
    return result
Here is an answer based on your code!
import pandas as pd
import numpy as np

# Slightly modified your function
def tag_stats(df, tag_list):
    df = df.set_index('Batch_Name')
    data = {
        'tag': [],
        'min': [],
        'max': [],
        'min_batch': [],
        'max_batch': [],
        'std': [],
        'mean': [],
    }
    for tag in tag_list:
        values = df[tag]
        data['tag'].append(tag)
        data['min'].append(values.min())
        data['max'].append(values.max())
        data['min_batch'].append(values.idxmin())
        data['max_batch'].append(values.idxmax())
        data['std'].append(values.std())
        data['mean'].append(values.mean())
    result = pd.DataFrame(data)
    return result

# Create a df using some random data
np.random.seed(1)
num_batches = 10
df = pd.DataFrame({
    'Batch_Name': ['batch_{}'.format(i) for i in range(num_batches)],
    'Tag 1': np.random.randint(1, 100, num_batches),
    'Tag 2': np.random.randint(1, 100, num_batches),
    'Tag 3': np.random.randint(1, 100, num_batches),
    'Tag 4': np.random.randint(1, 100, num_batches),
    'Tag 5': np.random.randint(1, 100, num_batches),
})

# Apply your function
cols = ['Tag 1', 'Tag 2', 'Tag 3', 'Tag 4', 'Tag 5']
summary_df = tag_stats(df, cols)
print(summary_df)
Output
tag min max min_batch max_batch std mean
0 Tag 1 2 80 batch_9 batch_6 32.200759 38.0
1 Tag 2 7 85 batch_2 batch_7 28.926919 39.9
2 Tag 3 14 97 batch_9 batch_7 33.297314 63.4
3 Tag 4 1 82 batch_7 batch_9 31.060693 37.1
4 Tag 5 4 89 batch_7 batch_1 31.212711 43.3
The comment from @It_is_Chris is great too; here is an answer based on it:
import pandas as pd
import numpy as np

# Create a df using some random data
np.random.seed(1)
num_batches = 10
df = pd.DataFrame({
    'Batch_Name': ['batch_{}'.format(i) for i in range(num_batches)],
    'Tag 1': np.random.randint(1, 100, num_batches),
    'Tag 2': np.random.randint(1, 100, num_batches),
    'Tag 3': np.random.randint(1, 100, num_batches),
    'Tag 4': np.random.randint(1, 100, num_batches),
    'Tag 5': np.random.randint(1, 100, num_batches),
})

# Convert to a long df indexed by Batch_Name:
# index   | tag   | tag_value
# ---------------------------
# batch_0 | Tag 1 | 38
# batch_1 | Tag 1 | 13
# batch_2 | Tag 1 | 73
long_df = df.melt(
    id_vars='Batch_Name',
    var_name='tag',
    value_name='tag_value',
).set_index('Batch_Name')

# Groupby tag and aggregate to get the columns of interest
summary_df = long_df.groupby('tag').agg(
    max_value=('tag_value', 'max'),
    max_batch=('tag_value', 'idxmax'),
    min_value=('tag_value', 'min'),
    min_batch=('tag_value', 'idxmin'),
    mean_value=('tag_value', 'mean'),
    std_value=('tag_value', 'std'),
).reset_index()
summary_df
Output:
tag max_value max_batch min_value min_batch mean_value std_value
0 Tag 1 80 batch_6 2 batch_9 38.0 32.200759
1 Tag 2 85 batch_7 7 batch_2 39.9 28.926919
2 Tag 3 97 batch_7 14 batch_9 63.4 33.297314
3 Tag 4 82 batch_9 1 batch_7 37.1 31.060693
4 Tag 5 89 batch_1 4 batch_7 43.3 31.212711
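For completeness, a more compact variant of the same idea is possible without melting, by aggregating the wide frame directly. A sketch (variable and column names as in the example above; mixing label-returning reducers like idxmin with numeric ones gives an object-dtype result, which is fine for a summary table):
summary_df = (
    df.set_index('Batch_Name')
      .agg(['min', 'idxmin', 'max', 'idxmax', 'mean', 'std'])  # one column per tag
      .T                                                       # one row per tag
      .rename_axis('tag')
      .reset_index()
)
print(summary_df)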

Skipping the row if there are more than 2 fields are empty

First, skip a row of data if it has more than 2 empty columns. After this step, the rows with more than 2 missing values will be filtered out.
Then, since some rows still have 1 or 2 empty columns, I will fill in the empty columns with the mean value of that row.
I can run the second step with my code below; however, I am not sure how to filter out the rows with more than 2 missing values.
I have tried using dropna but it deleted all the columns of the table.
My code:
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as pp
%matplotlib inline
# high technology exports percentage of manufatory exports
hightech_export = pd.read_csv('hightech_export_1.csv')
#skip the row of data if the columns have more than 2 columns are empty
hightech_export.dropna(axis=1, how='any', thresh=2, subset=None, inplace=False)
# Fill in data with mean value.
m = hightech_export.mean(axis=1)
for i, col in enumerate(hightech_export):
hightech_export.iloc[:, i] = hightech_export.iloc[:, i].fillna(m)
My dataset:
Country Name    2001    2002    2003    2004
Philippines       71
Malta             62      58      60      58
Singapore         60                      56
Malaysia          58      57              55
Ireland           47      41      34      34
Georgia           38      41      24      38
Costa Rica
You can make use of the .isnull() method for your first task.
Replace this:
hightech_export.dropna(axis=1, how='any', thresh=2, subset=None, inplace=False)
with:
hightech_export= hightech_export.loc[hightech_export.isnull().sum(axis=1)<=2]
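For clarity, the same filter broken into steps (a sketch using the asker's variable name):
null_counts = hightech_export.isnull().sum(axis=1)   # missing cells per row
keep = null_counts <= 2                               # at most 2 missing values allowed
hightech_export = hightech_export.loc[keep]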
Ok try this ...
import pandas as pd
import numpy as np

data1 = {'Name': ['Tom', np.NaN, 'Mary', 'Jane'], 'Age': [20, np.NaN, 40, 30], 'Pay': [np.NaN, np.NaN, 20, 25]}
data2 = {'Name': ['Tom', 'Bob', 'Mary'], 'Age': [40, 30, 20]}
df1 = pd.DataFrame.from_records(data1)
Check the df
df1
Age Name Pay
0 20.0 Tom NaN
1 NaN NaN NaN
2 40.0 Mary 20.0
3 30.0 Jane 25.0
The record with index 1 has 3 missing values...
Replace and make missing values None:
df1 = df1.replace({np.nan: None})
Now write a function to count missing values per row and build a list:
def count_na(lst):
    missing = [n for n in lst if not n]
    return len(missing)

missing_data = []
for index, n in df1.iterrows():
    missing_data.append(count_na(list(n)))
Use this list as a new column in the DataFrame:
df1['missing'] = missing_data
df1 should look like this
Age Name Pay missing
0 20 Tom None 1
1 None None None 3
2 40 Mary 20 0
3 30 Jane 25 0
So filtering becomes easy....
# Now only take records with <2 missing
df1[df1.missing<2]
Hope that helps...
A simple way is to compare, on a row basis, the count of values against the number of columns of the dataframe. You can then just replace NaN with the column means of the dataframe.
Code could be:
result = df.loc[df.apply(lambda x: x.count(), axis=1) >= (len(df.columns) - 2)].replace(
    np.nan, df.agg('mean'))
With your example data, it gives as expected:
Country Name 2001 2002 2003 2004
1 Malta 62.0 58.00 60.000000 58.0
2 Singapore 60.0 49.25 39.333333 56.0
3 Malaysia 58.0 57.00 39.333333 55.0
4 Ireland 47.0 41.00 34.000000 34.0
5 Georgia 38.0 41.00 24.000000 38.0
Try this
hightech_export.dropna(thresh=2, inplace=True)
in place of the line of code
hightech_export.dropna(axis=1, how='any', thresh=2, subset=None, inplace=False)
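One caveat worth noting: thresh in dropna is the minimum number of non-missing values a row must have to be kept, not the maximum number of missing ones, so for "at most 2 missing" the equivalent call would be (a sketch):
hightech_export.dropna(thresh=len(hightech_export.columns) - 2, inplace=True)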

Iterate through a Pandas DataFrame with multiple handles, and iteratively append edited rows?

I have 3 dataframes:
- df1 with match history (organized by date)
- df2 with player stats (organized by player name)
- df3 difference between player stats (df2) per match (df1) [in progress]
I want to do something like:
for idx, W_nm, L_nm in df1[['index', 'winner_name', 'loser_name']].values:
    df3.loc[idx] = df2.loc[W_nm] - df2.loc[L_nm]
    # ... edit this row further
Which fails because:
- 'idx' doesn't reference df1's indices
- df3 has no defined columns
Is there a way to reference the indices on the first line?
I've read that iterrows() is 7x slower than .loc[], and I have quite a bit of data to process.
Is there anything cleaner than this:
for idx in df1.index:
    W_nm = df1.loc[idx, 'winner_name']
    L_nm = df1.loc[idx, 'loser_name']
    df3.loc[idx] = df2.loc[W_nm] - df2.loc[L_nm]
    # ... edit this row further
Which doesn't fix the "no defined columns", but gives me my handles.
So I'm expecting something like:
df1
[ 'Loser' 'Winner' 'Score'
0 Harry Hermione 3-7 ...
1 Harry Ron 0-2 ...
2 Ron Voldemort 7-89 ... ]
df2
[ 'Spells' 'Allies'
Harry 23 84 ...
Hermione 94 68 ...
Ron 14 63 ...
Voldemort 97 92 ... ]
then
df3
[ 'Spells' 'Allies'
0 -71 16 ...
1 9 21 ...
2 -83 -29 ... ]
What you need is join:
loser = df1.join(df2, on='Loser').loc[:,['Spells', 'Allies']]
winner = df1.join(df2, on='Winner').loc[:,['Spells', 'Allies']]
df3 = winner - loser
With your example data it gives:
Spells Allies
0 71 -16
1 -9 -21
2 83 29
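For reference, a self-contained sketch reproducing the toy example above with the join approach (frame contents taken from the question):
import pandas as pd

df1 = pd.DataFrame({'Loser': ['Harry', 'Harry', 'Ron'],
                    'Winner': ['Hermione', 'Ron', 'Voldemort'],
                    'Score': ['3-7', '0-2', '7-89']})
df2 = pd.DataFrame({'Spells': [23, 94, 14, 97],
                    'Allies': [84, 68, 63, 92]},
                   index=['Harry', 'Hermione', 'Ron', 'Voldemort'])

# join aligns df2 on the player names stored in the 'Loser'/'Winner' columns
loser = df1.join(df2, on='Loser').loc[:, ['Spells', 'Allies']]
winner = df1.join(df2, on='Winner').loc[:, ['Spells', 'Allies']]
df3 = winner - loser
print(df3)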
