I have the following dataframes (this is just test data). In my real data, some index values are repeated several times in dataframe 1 and dataframe 2, and this causes repeated/duplicate rows in the final dataframe.
DataFrame 1:
pd.DataFrame({'id': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 7, 7: 8, 8: 9, 9: 10},
'first_name': {0: 'Jennee',
1: 'Dagny',
2: 'Correy',
3: 'Pall',
4: 'Julie',
5: 'Janene',
6: 'Lemmy',
7: 'Coleman',
8: 'Beck',
9: 'Che'},
'last_name': {0: 'Strelitzki',
1: 'Dunsire',
2: 'Wickrath',
3: 'Jopp',
4: 'Gheeraert',
5: 'Gawith',
6: 'Farrow',
7: 'Legging',
8: 'Beckwith',
9: 'Burgoin'},
'email': {0: 'jstrelitzki0#google.de',
1: 'ddunsire1#geocities.com',
2: 'cwickrath2#github.com',
3: 'pjopp3#infoseek.co.jp',
4: 'jgheeraert4#theatlantic.com',
5: 'jgawith5#sciencedirect.com',
6: 'lfarrow6#wikimedia.org',
7: 'clegging7#businessinsider.com',
8: 'bbeckwith8#zdnet.com',
9: 'cburgoin9#reference.com'},
'gender': {0: 'Male',
1: 'Female',
2: 'Female',
3: 'Female',
4: 'Female',
5: 'Female',
6: 'Male',
7: 'Female',
8: 'Polygender',
9: 'Male'},
'ip_address': {0: '8.99.68.120',
1: '188.238.129.48',
2: '87.159.243.249',
3: '66.37.174.94',
4: '233.77.128.104',
5: '190.202.131.98',
6: '84.175.231.196',
7: '140.178.100.5',
8: '81.211.179.167',
9: '31.219.69.206'},
'Boolean': {0: False,
1: False,
2: True,
3: True,
4: False,
5: True,
6: True,
7: False,
8: False,
9: False}})
DataFrame 2:
pd.DataFrame({'id': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 7, 7: 8, 8: 9, 9: 10},
'Model': {0: 2005,
1: 2007,
2: 2011,
3: 2003,
4: 1998,
5: 1992,
6: 1992,
7: 1992,
8: 2008,
9: 1996},
'Make': {0: 'Cadillac',
1: 'Lexus',
2: 'Dodge',
3: 'Dodge',
4: 'Oldsmobile',
5: 'Volkswagen',
6: 'Chevrolet',
7: 'Suzuki',
8: 'Ford',
9: 'Mazda'},
'Colour': {0: 'Red',
1: 'Red',
2: 'Crimson',
3: 'Red',
4: 'Purple',
5: 'Crimson',
6: 'Red',
7: 'Aquamarine',
8: 'Puce',
9: 'Maroon'}})
The two dataframes should be joined only on the index values that are common to both. In other words, any index value that does not appear in both dataframes should not appear in the final merged dataframe.
I want to ensure that the final dataframe has unique rows, with one combination of columns per unique index value.
The following code is supposed to perform an inner join on the index values found in both dataframes.
final = pd.merge(df1, df2, left_index=True, right_index=True)
However, when I apply the above merge to my larger (real) pandas dataframes, many rows are repeated/duplicated multiple times. When the merge is repeated with more dataframes, the rows with the same index value get repeated even more often.
I am expecting to see one row per index value (with all the columns from each dataframe).
I am not sure why this happens. I can confirm that there is nothing wrong with the datasets.
Is there a better way to merge these two dataframes on only their common index values while ensuring that no rows (with the same index) are repeated in the final dataframe? The merge often produces a giant final CSV file of around 20GB, even though the source files are only around 15MB in total.
Any help is much appreciated.
My end output should look like this (please copy and use this as Pandas DF):
pd.DataFrame({'id': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 7, 7: 8, 8: 9, 9: 10},
'first_name': {0: 'Jennee',
1: 'Dagny',
2: 'Correy',
3: 'Pall',
4: 'Julie',
5: 'Janene',
6: 'Lemmy',
7: 'Coleman',
8: 'Beck',
9: 'Che'},
'last_name': {0: 'Strelitzki',
1: 'Dunsire',
2: 'Wickrath',
3: 'Jopp',
4: 'Gheeraert',
5: 'Gawith',
6: 'Farrow',
7: 'Legging',
8: 'Beckwith',
9: 'Burgoin'},
'email': {0: 'jstrelitzki0#google.de',
1: 'ddunsire1#geocities.com',
2: 'cwickrath2#github.com',
3: 'pjopp3#infoseek.co.jp',
4: 'jgheeraert4#theatlantic.com',
5: 'jgawith5#sciencedirect.com',
6: 'lfarrow6#wikimedia.org',
7: 'clegging7#businessinsider.com',
8: 'bbeckwith8#zdnet.com',
9: 'cburgoin9#reference.com'},
'gender': {0: 'Male',
1: 'Female',
2: 'Female',
3: 'Female',
4: 'Female',
5: 'Female',
6: 'Male',
7: 'Female',
8: 'Polygender',
9: 'Male'},
'ip_address': {0: '8.99.68.120',
1: '188.238.129.48',
2: '87.159.243.249',
3: '66.37.174.94',
4: '233.77.128.104',
5: '190.202.131.98',
6: '84.175.231.196',
7: '140.178.100.5',
8: '81.211.179.167',
9: '31.219.69.206'},
'Boolean': {0: False,
1: False,
2: True,
3: True,
4: False,
5: True,
6: True,
7: False,
8: False,
9: False},
'Model': {0: 2005,
1: 2007,
2: 2011,
3: 2003,
4: 1998,
5: 1992,
6: 1992,
7: 1992,
8: 2008,
9: 1996},
'Make': {0: 'Cadillac',
1: 'Lexus',
2: 'Dodge',
3: 'Dodge',
4: 'Oldsmobile',
5: 'Volkswagen',
6: 'Chevrolet',
7: 'Suzuki',
8: 'Ford',
9: 'Mazda'},
'Colour': {0: 'Red',
1: 'Red',
2: 'Crimson',
3: 'Red',
4: 'Purple',
5: 'Crimson',
6: 'Red',
7: 'Aquamarine',
8: 'Puce',
9: 'Maroon'}})
This is expected behavior with non-unique index values. Since you have 3 ID1 rows in one df and 2 ID1 rows in the other, you end up with 6 ID1 rows in your merged df. If you add validate="one_to_one" to pd.merge() you will get this error:
MergeError: Merge keys are not unique in either left or right dataset; not a one-to-one merge
All other validations fail as well, except for many-to-many.
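The row multiplication described above can be reproduced on a tiny example. This is only a sketch: the frames `left` and `right` and their values are made up for illustration and are not the asker's data.

```python
import pandas as pd

# Hypothetical toy frames: index value 1 appears 3 times on the left
# and 2 times on the right; index value 2 appears once on each side.
left = pd.DataFrame({"a": [10, 11, 12, 13]}, index=[1, 1, 1, 2])
right = pd.DataFrame({"b": [20, 21, 22]}, index=[1, 1, 2])

merged = pd.merge(left, right, left_index=True, right_index=True)
# Index 1 contributes 3 * 2 = 6 rows, index 2 contributes 1 * 1 = 1 row.
print(len(merged))

# With validate="one_to_one", pandas refuses the merge instead of
# silently multiplying rows.
try:
    pd.merge(left, right, left_index=True, right_index=True,
             validate="one_to_one")
    raised = False
except pd.errors.MergeError:
    raised = True
print(raised)
```

Repeating such a merge across several dataframes multiplies the counts again each time, which would explain how small inputs can grow into a very large output.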
If it makes sense for your data, you can use the left_on, and right_on parameters to find unique combinations and give you a one-to-one if that's what you're after.
Edit after your new data:
Now that you have unique ids, this should work for you. Notice it doesn't throw a validation error.
final = pd.merge(df1, df2, left_on=['id'], right_on=['id'], validate='one_to_one')
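If the real data still has repeated index values, one possible approach is to reduce each frame to one row per index value before merging. The sketch below assumes that keeping the first row for each index value is acceptable, which may not be true for your data; the frames and values are hypothetical.

```python
import pandas as pd

# Hypothetical frames with repeated index values on both sides.
left = pd.DataFrame({"a": [10, 11, 12]}, index=[1, 1, 2])
right = pd.DataFrame({"b": [20, 21, 22]}, index=[1, 2, 2])

# Assumption: the first row seen for each index value is the one to keep.
left_unique = left[~left.index.duplicated(keep="first")]
right_unique = right[~right.index.duplicated(keep="first")]

# Inner join on the index; validate guards against leftover duplicates.
final = pd.merge(left_unique, right_unique,
                 left_index=True, right_index=True,
                 how="inner", validate="one_to_one")
print(final)
```

If a different row should win (for example the most recent one), sort the frames first or pass keep="last" instead.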
I have a huge df with multiple columns, but I only want to read one specific column that interests me.
In the data below, I would like to read only the column 'Type 1'.
import numpy as np
import pandas as pd
data = {'Type 1': {0: 1, 1: 3, 2: 5, 3: 'HH', 4: 9, 5: 11, 6: 13, 7: 15, 8: 17},
'Type 2': {0: 'AA',
1: 'BB',
2: 'np.NaN',
3: '55',
4: '3.14',
5: '-96',
6: 'String',
7: 'FFFFFF',
8: 'FEEE'},
'Type 3': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0, 5: 0, 6: 0, 7: 0, 8: 0},
'Type 4': {0: '23',
1: 'fefe',
2: 'abcd',
3: 'dddd',
4: 'dad',
5: 'cfe',
6: 'cf42',
7: '321',
8: '0'},
'Type 5': {0: -120,
1: -120,
2: -120,
3: -120,
4: -120,
5: -120,
6: -120,
7: -120,
8: -120}}
df = pd.DataFrame(data)
df
int_count = df['Type 1'].count(0,numeric_only = True) # should count only cells that contain integers and return 8
total_count = df['Type 1'].count(0,numeric_only = False) # should count all the cells and return 9
I want to count only the numeric values in a particular column, e.g.:
df['Type 1'].count(0,numeric_only = True) should return 8 (excluding the string 'HH' in the Type 1 column)
df['Type 1'].count(0,numeric_only = False) should return 9 (the total number of cells in the column)
But df['Type 1'].count(0,numeric_only = True/False) does not work as I expect.
I would suggest the below:
int_count = len(df.loc[df['Type 1'].astype(str).str.isnumeric()])
total_count = len(df)
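One caveat with str.isnumeric() is that it returns False for strings such as '-5' or '2.5', so negative numbers and floats would not be counted. A possible alternative, sketched here on the 'Type 1' values from the question, is to coerce with pd.to_numeric and count what survives. Note that this also counts numeric-looking strings such as '55' as numbers, which may or may not be what you want.

```python
import pandas as pd

# The 'Type 1' column from the question, rebuilt as a standalone Series.
s = pd.Series([1, 3, 5, "HH", 9, 11, 13, 15, 17], name="Type 1")

# Anything that cannot be parsed as a number becomes NaN.
as_numbers = pd.to_numeric(s, errors="coerce")

int_count = int(as_numbers.notna().sum())  # cells that parsed as numbers
total_count = len(s)                       # all cells, numeric or not
print(int_count, total_count)
```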
Here is the df:
{'Type 1': {1: 123.0,
2: 123.0,
3: 123.0,
4: 123.0,
5: 123.0,
6: 45.0,
7: 45.0,
8: 45.0,
9: 45.0,
10: 9.5,
11: 9.5,
12: 9.5,
13: 2.34,
14: 2.34,
15: 2.34},
'Type 2': {1: 0,
2: 0,
3: -90,
4: -90,
5: -90,
6: -90,
7: -90,
8: -270,
9: -270,
10: -270,
11: -270,
12: 180,
13: 180,
14: 181,
15: 181},
'Type 3': {1: 0,
2: 0,
3: 0,
4: 0,
5: 55,
6: 55,
7: 55,
8: 55,
9: 55,
10: 9,
11: 9,
12: 3,
13: 3,
14: 3,
15: 3},
'Type 4': {1: 5.0,
2: 5.0,
3: 5.0,
4: 5.0,
5: 10.0,
6: 123.0,
7: 12.0,
8: 23.0,
9: 16.0,
10: 3.14,
11: 0.0,
12: 0.0,
13: 0.0,
14: 0.0,
15: 18.0},
'Type 5': {1: 65536,
2: 65536,
3: 65536,
4: 65536,
5: 78888888,
6: 665,
7: 665,
8: 665,
9: 665,
10: 665,
11: 665,
12: 665,
13: 665,
14: 665,
15: 665},
'Type 6': {1: 3.4124,
2: 3.4124,
3: 3.4124,
4: 3.4124,
5: 3.4124,
6: 3.4124,
7: 3.4124,
8: 3.4124,
9: 3.4124,
10: 3.4124,
11: 3.4124,
12: 3.4124,
13: 3.4124,
14: 3.4124,
15: 3.4124},
'Type 7': {1: 0,
2: 0,
3: 2,
4: 2,
5: 2,
6: 1,
7: 1,
8: 1,
9: 1,
10: 10,
11: 10,
12: 9,
13: 9,
14: -5,
15: -5},
'Type 8': {1: 'convert the string to 0 and non-zero value to 1',
2: 'convert the string to 0 and non-zero value to 1',
3: 'convert the string to 0 and non-zero value to 1',
4: 'convert the string to 0 and non-zero value to 1',
5: 'convert the string to 0 and non-zero value to 1',
6: 'convert the string to 0 and non-zero value to 1',
7: 'convert the string to 0 and non-zero value to 1',
8: 'convert the string to 0 and non-zero value to 1',
9: 'convert the string to 0 and non-zero value to 1',
10: 'convert the string to 0 and non-zero value to 1',
11: 'convert the string to 0 and non-zero value to 1',
12: 'convert the string to 0 and non-zero value to 1',
13: 'convert the string to 0 and non-zero value to 1',
14: 'convert the string to 0 and non-zero value to 1',
15: 'convert the string to 0 and non-zero value to 1'},
'Type 9': {1: 0,
2: 0,
3: 0,
4: 0,
5: 0,
6: 1,
7: 1,
8: 0,
9: 0,
10: 8,
11: 8,
12: 0,
13: 0,
14: 45,
15: 45}}
Each column in the dataframe has a lower and an upper limit, as given in the lists below, e.g.:
lower_limit = [3,-90,0,0,0,1,0,0,0] #Type 1 lower limit is 3...
upper_limit = [100,90,50,100,65535,3,1,1,1] #Type 1 upper limit is 100...
lower_limit = pd.Series(lower_limit)
upper_limit = pd.Series(upper_limit)
df.clip(lower_limit, upper_limit, axis = 1)
But this returns every element as NaN, whereas the expected result is to clip each column based on the upper and lower limits given in the lists.
Using a for loop, I was able to make the necessary changes, but it was extremely slow when the df is huge.
I understand that clip is the faster way to make these changes to the df, but it doesn't work as expected. What mistake am I making, and are there any alternative ways of clipping the columns quickly?
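According to the documentation, lower and upper should be a float or array-like. A Series is aligned on its labels, and the default index 0..8 of these Series does not match the column names, which is most likely why everything comes back as NaN. Plain lists avoid that alignment.
You could do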
lower_limit = [3,-90,0,0,0,1,0,'',0] #Type 1 lower limit is 3...
upper_limit = [100,90,50,100,65535,3,1,'',1] #Type 1 upper limit is 100...
df.clip(lower_limit, upper_limit, axis = 1)
However, column Type 8 holds strings, so clip would leave you with an empty column there. You can work around that with:
lower_limit = [3,-90,0,0,0,1,0,df['Type 8'].min(),0]
upper_limit = [100,90,50,100,65535,3,1,df['Type 8'].max(),1]
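Another option, sketched below, is to keep the limits in Series whose index holds the column names, so that clip can align each bound with its column, and to clip only the numeric columns. The small frame here is a cut-down stand-in for the question's data (three rows, three columns), not the full dataframe.

```python
import pandas as pd

# Cut-down stand-in for the question's df: two numeric columns, one text column.
df = pd.DataFrame({
    "Type 1": [123.0, 45.0, 2.34],
    "Type 2": [0, -270, 181],
    "Type 8": ["text", "text", "text"],
})

# Bounds keyed by column name; the text column is simply left out.
lower = pd.Series({"Type 1": 3, "Type 2": -90})
upper = pd.Series({"Type 1": 100, "Type 2": 90})

numeric_cols = list(lower.index)
clipped = df.copy()
# axis=1 aligns the Series labels with the column names.
clipped[numeric_cols] = df[numeric_cols].clip(lower, upper, axis=1)
print(clipped)
```

Leaving the string column out of the call avoids having to invent bounds for it.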
I'm trying to import data from Baseball Prospectus into a Python table / dictionary (which would be better?).
Below is what I have, based on following along to Automate The Boring Stuff with Python.
I get that my method isn't properly using these functions, but I can't figure out what tools I should be using.
import requests
import webbrowser
import bs4
res = requests.get('https://legacy.baseballprospectus.com/card/70917/trea-turner')
res.raise_for_status()
webpage = bs4.BeautifulSoup(res.text)
table = webpage.select('newstat_career_log_datagrid')
list = []
for item in table:
    list.append(item)
print(list)
Use a pandas DataFrame to fetch the MLB statistics table first, and then convert the dataframe into a dictionary object. If you don't have pandas installed, you can install it with a single command.
pip install pandas
Then use the below code.
import pandas as pd
df=pd.read_html('https://legacy.baseballprospectus.com/card/70917/trea-turner')
data_dict = df[5].to_dict()
print(data_dict)
Output:
{'PA': {0: 44, 1: 324, 2: 447, 3: 740, 4: 15, 5: 1570}, '2B': {0: 1, 1: 14, 2: 24, 3: 27, 4: 1, 5: 67}, 'TEAM': {0: 'WAS', 1: 'WAS', 2: 'WAS', 3: 'WAS', 4: 'WAS', 5: 'Career'}, 'SB': {0: 2, 1: 33, 2: 46, 3: 43, 4: 4, 5: 128}, 'G': {0: 27, 1: 73, 2: 98, 3: 162, 4: 4, 5: 364}, 'HR': {0: 1, 1: 13, 2: 11, 3: 19, 4: 2, 5: 46}, 'FRAA': {0: 0.5, 1: -3.2, 2: 0.2, 3: 7.1, 4: -0.1, 5: 4.5}, 'BWARP': {0: 0.1, 1: 2.4, 2: 2.7, 3: 5.0, 4: 0.1, 5: 10.4}, 'CS': {0: 2, 1: 6, 2: 8, 3: 9, 4: 0, 5: 25}, '3B': {0: 0, 1: 8, 2: 6, 3: 6, 4: 0, 5: 20}, 'H': {0: 9, 1: 105, 2: 117, 3: 180, 4: 5, 5: 416}, 'AGE': {0: '22', 1: '23', 2: '24', 3: '25', 4: '26', 5: 'Career'}, 'OBP': {0: 0.295, 1: 0.37, 2: 0.33799999999999997, 3: 0.344, 4: 0.4, 5: 0.34700000000000003}, 'AVG': {0: 0.225, 1: 0.342, 2: 0.284, 3: 0.271, 4: 0.35700000000000004, 5: 0.289}, 'DRC+': {0: 77, 1: 128, 2: 99, 3: 107, 4: 103, 5: 108}, 'SO': {0: 12, 1: 59, 2: 80, 3: 132, 4: 5, 5: 288}, 'YEAR': {0: '2015', 1: '2016', 2: '2017', 3: '2018', 4: '2019', 5: 'Career'}, 'SLG': {0: 0.325, 1: 0.5670000000000001, 2: 0.451, 3: 0.41600000000000004, 4: 0.857, 5: 0.46}, 'DRAA': {0: -1.0, 1: 11.4, 2: 1.0, 3: 8.5, 4: 0.1, 5: 20.0}, 'HBP': {0: 0, 1: 1, 2: 4, 3: 5, 4: 0, 5: 10}, 'BRR': {0: 0.1, 1: 5.9, 2: 6.8, 3: 2.7, 4: 0.2, 5: 15.7}, 'BB': {0: 4, 1: 14, 2: 30, 3: 69, 4: 1, 5: 118}}
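On the "table or dictionary" part of the question: the output above is the default column-oriented form of to_dict(). If one dictionary per season is easier to work with, orient="records" may be a better fit. The sketch below uses a few values copied from the output above rather than fetching the page.

```python
import pandas as pd

# A small slice of the table shown above (first two seasons, three columns).
stats = pd.DataFrame({"YEAR": ["2015", "2016"], "PA": [44, 324], "HR": [1, 13]})

# Default orientation: {column -> {row index -> value}}, as in the output above.
by_column = stats.to_dict()

# One dictionary per row, which is often easier to loop over.
by_row = stats.to_dict(orient="records")

print(by_column)
print(by_row)
```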
I am trying to write a function that creates a dataframe based on different lists of words that appear in a certain column of another dataframe.
In my example, I want a dataframe built from the rows where the words "tfl" and "electronics" appear in the "description" column of the "uncategorised" dataframe.
The point of the function is that I want to be able to run it on different lists of words, so that I end up with different dataframes containing just the words I want.
words_Telephone = ["tfl", "electronics"]
df_Telephone = pd.DataFrame(columns=['date','description','paid out'])
def categorise(word_list, df_name):
    """ takes the denoted terms from the "uncategorised" df and puts it into new df"""
    for word in word_list:
        df_name = uncategorised[uncategorised['description'].str.contains(word)]
    return(df_name)
#apply the function
categorise(words_Telephone, df_Telephone)
I am expecting a dataframe that contains:
d = {'date': {0: '05/04/2017',
1: '06/04/2017',
2: '08/04/2017',
3: '08/04/2017',
4: '08/04/2017',
5: '10/04/2017',
6: '10/04/2017',
7: '10/04/2017'},
'description': {0: 'tfl',
1: 'tfl',
2: 'tfl',
3: 'tfl',
4: 'ac electronics ',
5: 'ac electronics ',},
'index': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 7, 7: 8, 8: 9, 9: 10},
'paid out': {0: 3.0,
1: 4.3,
2: 6.1,
3: 1.5,
4: 16.39,
5: 20.4,}}
Reproducible df:
d = {'date': {0: '05/04/2017',
1: '06/04/2017',
2: '06/04/2017',
3: '08/04/2017',
4: '08/04/2017',
5: '08/04/2017',
6: '10/04/2017',
7: '10/04/2017',
8: '10/04/2017'},
'description': {0: 'tfl',
1: 'mu subscription',
2: 'tfl',
3: 'tfl',
4: 'tfl',
5: 'ac electronics ',
6: 'itunes',
7: 'ac electronics ',
8: 'google adwords'},
'index': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 7, 7: 8, 8: 9, 9: 10},
'paid out': {0: 3.0,
1: 16.9,
2: 4.3,
3: 6.1,
4: 1.5,
5: 16.39,
6: 12.99,
7: 20.4,
8: 39.68}}
SOLUTION:
def categorise(word_list):
    """ takes the denoted terms from the "uncategorised" df and puts it into new df then deletes from the uncategorised df"""
    global uncategorised
    new_dfs = []
    for word in word_list:
        # collect the rows matching this word, then drop them from uncategorised
        new_dfs.append(uncategorised[uncategorised['description'].str.contains(word)])
        uncategorised = uncategorised[~uncategorised['description'].str.contains(word)]
    return pd.concat(new_dfs).reset_index()
#apply the function
df_Telephone = categorise(words_Telephone)
df_Telephone
words_Telephone = ["tfl", "electronics"]
original_df = pd.DataFrame.from_dict({'date': ['05/04/2017','06/04/2017','06/04/2017','08/04/2017','08/04/2017','08/04/2017','10/04/2017','10/04/2017','10/04/2017'], 'description': ['tfl','mu subscription','tfl','tfl','tfl','ac electronics','itunes','ac electronics','google adwords'], 'paid out' :[ 3.0,16.9, 4.3,6.1,1.5,16.39,12.99,20.4,39.68]})
def categorise(word_list, original_df):
    """ takes the denoted terms from the "uncategorised" df and puts it into new df"""
    new_dfs = []
    for word in word_list:
        # one filtered frame per word, concatenated at the end
        new_dfs.append(original_df[original_df['description'].str.contains(word)])
    # drop=True discards the old row labels instead of adding an 'index' column
    return pd.concat(new_dfs).reset_index(drop=True)
#apply the function
df_Telephone = categorise(words_Telephone, original_df)
print(df_Telephone)
date description paid out
0 05/04/2017 tfl 3.00
1 06/04/2017 tfl 4.30
2 08/04/2017 tfl 6.10
3 08/04/2017 tfl 1.50
4 08/04/2017 ac electronics 16.39
5 10/04/2017 ac electronics 20.40
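A possible variation on the answers above is to combine the words into a single regular expression, so the column is scanned once, rows keep their original order, and a row that matches several words is not duplicated. This is a sketch under the assumption that a plain substring match is what is wanted; returning the non-matching rows as a second value is a design choice made here, not something the original answers do.

```python
import re
import pandas as pd

uncategorised = pd.DataFrame({
    "date": ["05/04/2017", "06/04/2017", "06/04/2017", "08/04/2017", "08/04/2017",
             "08/04/2017", "10/04/2017", "10/04/2017", "10/04/2017"],
    "description": ["tfl", "mu subscription", "tfl", "tfl", "tfl",
                    "ac electronics", "itunes", "ac electronics", "google adwords"],
    "paid out": [3.0, 16.9, 4.3, 6.1, 1.5, 16.39, 12.99, 20.4, 39.68],
})
words_Telephone = ["tfl", "electronics"]

def categorise(word_list, df):
    """Split df into rows whose description contains any word, and the rest."""
    # re.escape keeps words with special characters from being read as regex syntax
    pattern = "|".join(re.escape(word) for word in word_list)
    mask = df["description"].str.contains(pattern, regex=True)
    return df[mask].reset_index(drop=True), df[~mask].reset_index(drop=True)

df_Telephone, remaining = categorise(words_Telephone, uncategorised)
print(df_Telephone)
print(remaining)
```

Returning both frames avoids the global variable: the caller can assign `remaining` back to `uncategorised` before categorising the next word list.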