How to convert multiple columns of a df from hexadecimal to decimal - python

The df has multiple columns, of which only selected columns have to be converted from hexadecimal to decimal.
The selected column names are stored in a list: A = ["Type 2", "Type 4"]
{'Type 1': {0: 1, 1: 3, 2: 5, 3: 7, 4: 9, 5: 11, 6: 13, 7: 15, 8: 17},
'Type 2': {0: 'AA',
1: 'BB',
2: 'CC',
3: '55',
4: '88',
5: '96',
6: 'FF',
7: 'FFFFFF',
8: 'FEEE'},
'Type 3': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0, 5: 0, 6: 0, 7: 0, 8: 0},
'Type 4': {0: '23',
1: 'fefe',
2: 'abcd',
3: 'dddd',
4: 'dad',
5: 'cfe',
6: 'cf42',
7: '321',
8: '0'},
'Type 5': {0: -120,
1: -120,
2: -120,
3: -120,
4: -120,
5: -120,
6: -120,
7: -120,
8: -120}}

Say, you have the string "AA" in hex.
You can convert hex to decimal like this:
str(int("AA", 16))
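Similarly, for a dataframe column that holds hexadecimal strings, you can apply a lambda function to each value.
df['Type 2'] = df['Type 2'].apply(lambda x: str(int(str(x), 16)))
Here df is the name of the imported dataframe. Note that the result is stored as a decimal string; drop the outer str() if you want integers.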

You can use pandas.DataFrame.applymap to cast element-wise (pandas 2.1 and later deprecate applymap in favour of DataFrame.map):
>>> df[["Type 2", "Type 4"]].applymap(lambda n: int(n, 16))
Type 2 Type 4
0 170 35
1 187 65278
2 204 43981
3 85 56797
4 136 3501
5 150 3326
6 255 53058
7 16777215 801
8 65262 0
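Neither answer above writes the converted values back for every column named in the list A. A minimal sketch of that step, assuming the selected columns hold only valid hexadecimal strings (the three-row frame below is a shortened, illustrative copy of the question's data):

```python
import pandas as pd

# Shortened copy of the question's data (illustrative subset of the rows).
df = pd.DataFrame({
    "Type 1": [1, 3, 5],
    "Type 2": ["AA", "BB", "FFFFFF"],
    "Type 4": ["23", "fefe", "0"],
})

A = ["Type 2", "Type 4"]  # columns to convert, as in the question

# Series.map applies the conversion to each value of one column;
# int(s, 16) parses a hexadecimal string in either letter case.
for col in A:
    df[col] = df[col].map(lambda s: int(s, 16))

print(df)
```

Columns not listed in A are left untouched. A value that is not valid hexadecimal would raise a ValueError, so real data may need cleaning first.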

Related

Join pandas dataframes according to common index value only

I have the following dataframes (this is just test data). In my real samples, some index values are repeated a few times inside dataframe 1 and dataframe 2, and this causes repeated/duplicate rows in the final dataframe.
DataFrame 1:
pd.DataFrame({'id': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 7, 7: 8, 8: 9, 9: 10},
'first_name': {0: 'Jennee',
1: 'Dagny',
2: 'Correy',
3: 'Pall',
4: 'Julie',
5: 'Janene',
6: 'Lemmy',
7: 'Coleman',
8: 'Beck',
9: 'Che'},
'last_name': {0: 'Strelitzki',
1: 'Dunsire',
2: 'Wickrath',
3: 'Jopp',
4: 'Gheeraert',
5: 'Gawith',
6: 'Farrow',
7: 'Legging',
8: 'Beckwith',
9: 'Burgoin'},
'email': {0: 'jstrelitzki0#google.de',
1: 'ddunsire1#geocities.com',
2: 'cwickrath2#github.com',
3: 'pjopp3#infoseek.co.jp',
4: 'jgheeraert4#theatlantic.com',
5: 'jgawith5#sciencedirect.com',
6: 'lfarrow6#wikimedia.org',
7: 'clegging7#businessinsider.com',
8: 'bbeckwith8#zdnet.com',
9: 'cburgoin9#reference.com'},
'gender': {0: 'Male',
1: 'Female',
2: 'Female',
3: 'Female',
4: 'Female',
5: 'Female',
6: 'Male',
7: 'Female',
8: 'Polygender',
9: 'Male'},
'ip_address': {0: '8.99.68.120',
1: '188.238.129.48',
2: '87.159.243.249',
3: '66.37.174.94',
4: '233.77.128.104',
5: '190.202.131.98',
6: '84.175.231.196',
7: '140.178.100.5',
8: '81.211.179.167',
9: '31.219.69.206'},
'Boolean': {0: False,
1: False,
2: True,
3: True,
4: False,
5: True,
6: True,
7: False,
8: False,
9: False}})
DataFrame 2:
pd.DataFrame({'id': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 7, 7: 8, 8: 9, 9: 10},
'Model': {0: 2005,
1: 2007,
2: 2011,
3: 2003,
4: 1998,
5: 1992,
6: 1992,
7: 1992,
8: 2008,
9: 1996},
'Make': {0: 'Cadillac',
1: 'Lexus',
2: 'Dodge',
3: 'Dodge',
4: 'Oldsmobile',
5: 'Volkswagen',
6: 'Chevrolet',
7: 'Suzuki',
8: 'Ford',
9: 'Mazda'},
'Colour': {0: 'Red',
1: 'Red',
2: 'Crimson',
3: 'Red',
4: 'Purple',
5: 'Crimson',
6: 'Red',
7: 'Aquamarine',
8: 'Puce',
9: 'Maroon'}})
The two dataframes should be joined only on the index values found in both dataframes. This means that any index value that doesn't match in the two dataframes should not appear in the final combined/merged dataframe.
I want to ensure that the final dataframe is unique, and only captures combinations of columns based on unique index values.
When I use the following code, the output is supposed to be an 'inner join' based on the unique index found in both dataframes.
final = pd.merge(df1, df2, left_index=True, right_index=True)
However, when I apply the above merge technique to my larger (other) pandas dataframes, many rows are repeated/duplicated multiple times. When the merge happens a few times with more dataframes, the rows get repeated very frequently, with the same index value.
I am expecting to see one index value returned per row (with all the column combinations from each dataframe).
I am not sure why this happens. I can confirm that there is nothing wrong with the datasets.
Is there a better technique for merging those two dataframes based only on common index values, while ensuring that I don't repeat any rows (with the same index) in my final dataframe? This merging often creates a giant final CSV file of around 20GB, although the source files are only around 15MB in total.
Any help is much appreciated.
My end output should look like this (please copy and use this as Pandas DF):
pd.DataFrame({'id': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 7, 7: 8, 8: 9, 9: 10},
'first_name': {0: 'Jennee',
1: 'Dagny',
2: 'Correy',
3: 'Pall',
4: 'Julie',
5: 'Janene',
6: 'Lemmy',
7: 'Coleman',
8: 'Beck',
9: 'Che'},
'last_name': {0: 'Strelitzki',
1: 'Dunsire',
2: 'Wickrath',
3: 'Jopp',
4: 'Gheeraert',
5: 'Gawith',
6: 'Farrow',
7: 'Legging',
8: 'Beckwith',
9: 'Burgoin'},
'email': {0: 'jstrelitzki0#google.de',
1: 'ddunsire1#geocities.com',
2: 'cwickrath2#github.com',
3: 'pjopp3#infoseek.co.jp',
4: 'jgheeraert4#theatlantic.com',
5: 'jgawith5#sciencedirect.com',
6: 'lfarrow6#wikimedia.org',
7: 'clegging7#businessinsider.com',
8: 'bbeckwith8#zdnet.com',
9: 'cburgoin9#reference.com'},
'gender': {0: 'Male',
1: 'Female',
2: 'Female',
3: 'Female',
4: 'Female',
5: 'Female',
6: 'Male',
7: 'Female',
8: 'Polygender',
9: 'Male'},
'ip_address': {0: '8.99.68.120',
1: '188.238.129.48',
2: '87.159.243.249',
3: '66.37.174.94',
4: '233.77.128.104',
5: '190.202.131.98',
6: '84.175.231.196',
7: '140.178.100.5',
8: '81.211.179.167',
9: '31.219.69.206'},
'Boolean': {0: False,
1: False,
2: True,
3: True,
4: False,
5: True,
6: True,
7: False,
8: False,
9: False},
'Model': {0: 2005,
1: 2007,
2: 2011,
3: 2003,
4: 1998,
5: 1992,
6: 1992,
7: 1992,
8: 2008,
9: 1996},
'Make': {0: 'Cadillac',
1: 'Lexus',
2: 'Dodge',
3: 'Dodge',
4: 'Oldsmobile',
5: 'Volkswagen',
6: 'Chevrolet',
7: 'Suzuki',
8: 'Ford',
9: 'Mazda'},
'Colour': {0: 'Red',
1: 'Red',
2: 'Crimson',
3: 'Red',
4: 'Purple',
5: 'Crimson',
6: 'Red',
7: 'Aquamarine',
8: 'Puce',
9: 'Maroon'}})
This is expected behavior with non-unique index values. Since you have 3 ID1 rows in one df and 2 ID1 rows in the other, you end up with 6 ID1 rows in your merged df. If you add validate="one_to_one" to pd.merge() you will get this error:
MergeError: Merge keys are not unique in either left or right dataset; not a one-to-one merge
All other validations fail except for many-to-many.
If it makes sense for your data, you can use the left_on, and right_on parameters to find unique combinations and give you a one-to-one if that's what you're after.
Edit after your new data:
Now that you have unique ids, this should work for you. Notice it doesn't throw a validation error.
final = pd.merge(df1, df2, left_on=['id'], right_on=['id'], validate='one_to_one')
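The row multiplication described in this answer can be reproduced on a tiny example. The sketch below uses made-up frames (left, right and their values are illustrative, not the asker's data) and shows one possible remedy, dropping duplicate keys before merging, which is only appropriate when the duplicated rows are truly redundant:

```python
import pandas as pd

# Illustrative frames: id 1 appears 3 times on the left and 2 times on the right.
left = pd.DataFrame({"id": [1, 1, 1, 2], "first_name": ["a", "b", "c", "d"]})
right = pd.DataFrame({"id": [1, 1, 2], "Make": ["x", "y", "z"]})

# Every left row with id 1 pairs with every right row with id 1: 3 * 2 = 6 rows,
# plus one row for id 2.
merged = pd.merge(left, right, on="id")
print(len(merged))

# validate="one_to_one" refuses the merge instead of silently multiplying rows.
try:
    pd.merge(left, right, on="id", validate="one_to_one")
    raised = False
except pd.errors.MergeError:
    raised = True

# Assumption: keeping the first row per id is acceptable for the data at hand.
left_unique = left.drop_duplicates(subset="id")
right_unique = right.drop_duplicates(subset="id")
deduped = pd.merge(left_unique, right_unique, on="id", validate="one_to_one")
print(deduped)
```

Repeating such a merge across several frames multiplies the duplicates each time, which is consistent with a 15MB input growing into a very large output.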

In python pandas, count the integers in a particular column and also count all the elements in particular column

I have a huge df with multiple columns, but I want to read only the specific column that interests me.
In the data below, I would like to read only the column 'Type 1'.
import numpy as np
import pandas as pd
data = {'Type 1': {0: 1, 1: 3, 2: 5, 3: 'HH', 4: 9, 5: 11, 6: 13, 7: 15, 8: 17},
'Type 2': {0: 'AA',
1: 'BB',
2: 'np.NaN',
3: '55',
4: '3.14',
5: '-96',
6: 'String',
7: 'FFFFFF',
8: 'FEEE'},
'Type 3': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0, 5: 0, 6: 0, 7: 0, 8: 0},
'Type 4': {0: '23',
1: 'fefe',
2: 'abcd',
3: 'dddd',
4: 'dad',
5: 'cfe',
6: 'cf42',
7: '321',
8: '0'},
'Type 5': {0: -120,
1: -120,
2: -120,
3: -120,
4: -120,
5: -120,
6: -120,
7: -120,
8: -120}}
df = pd.DataFrame(data)
df
int_count = df['Type 1'].count(0,numeric_only = True) # should count only cells that contain integers and return 8
total_count = df['Type 1'].count(0,numeric_only = False) # should count all the cells and return 9
I want to count only the numeric values in a particular column.
E.g. df['Type 1'].count(0,numeric_only = True) should return 8 (excluding the string 'HH' in the Type 1 column),
and df['Type 1'].count(0,numeric_only = False) should return 9 (the total number of cells in that column),
but df['Type 1'].count(0,numeric_only = True/False) does not work as I expect.
I would suggest the below:
int_count = len(df.loc[df['Type 1'].astype(str).str.isnumeric()])
total_count = len(df)
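One caveat with str.isnumeric is that it returns False for strings such as '-96' or '3.14', so negative numbers and decimals would not be counted. A possible alternative, sketched here on a copy of the 'Type 1' column, is to let pd.to_numeric attempt the conversion and count what survives:

```python
import pandas as pd

# Copy of the 'Type 1' column from the question: eight integers and one string.
s = pd.Series([1, 3, 5, "HH", 9, 11, 13, 15, 17], name="Type 1")

# errors="coerce" turns anything that cannot be parsed as a number into NaN,
# so counting the non-NaN results counts the numeric cells.
numeric_count = int(pd.to_numeric(s, errors="coerce").notna().sum())

# len() counts every cell, numeric or not.
total_count = len(s)

print(numeric_count, total_count)
```

This counts any parseable number (negative values and floats included), which may or may not be what "integers" should mean for a given dataset.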

How to clip multiple columns of a pandas data frame

Here is the df:
{'Type 1': {1: 123.0,
2: 123.0,
3: 123.0,
4: 123.0,
5: 123.0,
6: 45.0,
7: 45.0,
8: 45.0,
9: 45.0,
10: 9.5,
11: 9.5,
12: 9.5,
13: 2.34,
14: 2.34,
15: 2.34},
'Type 2': {1: 0,
2: 0,
3: -90,
4: -90,
5: -90,
6: -90,
7: -90,
8: -270,
9: -270,
10: -270,
11: -270,
12: 180,
13: 180,
14: 181,
15: 181},
'Type 3': {1: 0,
2: 0,
3: 0,
4: 0,
5: 55,
6: 55,
7: 55,
8: 55,
9: 55,
10: 9,
11: 9,
12: 3,
13: 3,
14: 3,
15: 3},
'Type 4': {1: 5.0,
2: 5.0,
3: 5.0,
4: 5.0,
5: 10.0,
6: 123.0,
7: 12.0,
8: 23.0,
9: 16.0,
10: 3.14,
11: 0.0,
12: 0.0,
13: 0.0,
14: 0.0,
15: 18.0},
'Type 5': {1: 65536,
2: 65536,
3: 65536,
4: 65536,
5: 78888888,
6: 665,
7: 665,
8: 665,
9: 665,
10: 665,
11: 665,
12: 665,
13: 665,
14: 665,
15: 665},
'Type 6': {1: 3.4124,
2: 3.4124,
3: 3.4124,
4: 3.4124,
5: 3.4124,
6: 3.4124,
7: 3.4124,
8: 3.4124,
9: 3.4124,
10: 3.4124,
11: 3.4124,
12: 3.4124,
13: 3.4124,
14: 3.4124,
15: 3.4124},
'Type 7': {1: 0,
2: 0,
3: 2,
4: 2,
5: 2,
6: 1,
7: 1,
8: 1,
9: 1,
10: 10,
11: 10,
12: 9,
13: 9,
14: -5,
15: -5},
'Type 8': {1: 'convert the string to 0 and non-zero value to 1',
2: 'convert the string to 0 and non-zero value to 1',
3: 'convert the string to 0 and non-zero value to 1',
4: 'convert the string to 0 and non-zero value to 1',
5: 'convert the string to 0 and non-zero value to 1',
6: 'convert the string to 0 and non-zero value to 1',
7: 'convert the string to 0 and non-zero value to 1',
8: 'convert the string to 0 and non-zero value to 1',
9: 'convert the string to 0 and non-zero value to 1',
10: 'convert the string to 0 and non-zero value to 1',
11: 'convert the string to 0 and non-zero value to 1',
12: 'convert the string to 0 and non-zero value to 1',
13: 'convert the string to 0 and non-zero value to 1',
14: 'convert the string to 0 and non-zero value to 1',
15: 'convert the string to 0 and non-zero value to 1'},
'Type 9': {1: 0,
2: 0,
3: 0,
4: 0,
5: 0,
6: 1,
7: 1,
8: 0,
9: 0,
10: 8,
11: 8,
12: 0,
13: 0,
14: 45,
15: 45}}
Each column in the dataframe has a lower and an upper limit, given in the lists below:
lower_limit = [3,-90,0,0,0,1,0,0,0] #Type 1 lower limit is 3...
upper_limit = [100,90,50,100,65535,3,1,1,1] #Type 1 upper limit is 100...
lower_limit = pd.Series(lower_limit)
upper_limit = pd.Series(upper_limit)
df.clip(lower_limit, upper_limit, axis = 1)
But this returns every element as NaN,
whereas the expected result is each column clipped to its own lower and upper limit from the lists.
Using a for loop I was able to make the necessary change, but it was extremely slow when the df is huge.
I understand that clip is the faster way to change the df, but it doesn't work as expected. What mistake am I making, and is there any other fast way to clip the columns?
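The documentation describes lower and upper as float or array-like. A Series is aligned on its index labels, and here the default labels 0 to 8 do not match the column names 'Type 1' to 'Type 9', which is why every element comes back as NaN. Plain lists are not aligned by label.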
You could do
lower_limit = [3,-90,0,0,0,1,0,'',0] #Type 1 lower limit is 3...
upper_limit = [100,90,50,100,65535,3,1,'',1] #Type 1 upper limit is 100...
df.clip(lower_limit, upper_limit, axis = 1)
but column Type 8 holds strings, so clip would give you an empty column for it. You can fix that with
lower_limit = [3,-90,0,0,0,1,0,df['Type 8'].min(),0]
upper_limit = [100,90,50,100,65535,3,1,df['Type 8'].max(),1]
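If Series limits are preferred, one option is to index them by the column names so that they line up with the frame. The sketch below is a reduced, numeric-only version of the question's data (three columns, three rows, chosen for illustration) with the first three limits from the question:

```python
import pandas as pd

# Reduced, numeric-only sample based on the first three columns of the question.
df = pd.DataFrame({
    "Type 1": [123.0, 45.0, 2.34],
    "Type 2": [0, -270, 181],
    "Type 3": [0, 55, 3],
})

# Index the limits by column name so they align with the frame's columns.
lower_limit = pd.Series([3, -90, 0], index=df.columns)
upper_limit = pd.Series([100, 90, 50], index=df.columns)

# axis=1 applies one lower/upper pair per column.
clipped = df.clip(lower=lower_limit, upper=upper_limit, axis=1)
print(clipped)
```

On the full data, a string column such as Type 8 could be left out by selecting only the numeric columns before clipping, for example with df.select_dtypes("number").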

How can I scrape data from an HTML table into a Python list/dict?

I'm trying to import data from Baseball Prospectus into a Python table / dictionary (which would be better?).
Below is what I have, based on following along to Automate The Boring Stuff with Python.
I get that my method isn't properly using these functions, but I can't figure out what tools I should be using.
import requests
import webbrowser
import bs4
res = requests.get('https://legacy.baseballprospectus.com/card/70917/trea-turner')
res.raise_for_status()
webpage = bs4.BeautifulSoup(res.text)
table = webpage.select('newstat_career_log_datagrid')
list = []
for item in table:
    list.append(item)
print(list)
Use a pandas DataFrame to fetch the MLB statistics table first, and then convert the dataframe into a dictionary object. If you don't have pandas installed, you can install it with a single command.
pip install pandas
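Then use the code below. pd.read_html returns a list of dataframes, one per table found on the page; index 5 picks the statistics table here.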
import pandas as pd
df=pd.read_html('https://legacy.baseballprospectus.com/card/70917/trea-turner')
data_dict = df[5].to_dict()
print(data_dict)
Output:
{'PA': {0: 44, 1: 324, 2: 447, 3: 740, 4: 15, 5: 1570}, '2B': {0: 1, 1: 14, 2: 24, 3: 27, 4: 1, 5: 67}, 'TEAM': {0: 'WAS', 1: 'WAS', 2: 'WAS', 3: 'WAS', 4: 'WAS', 5: 'Career'}, 'SB': {0: 2, 1: 33, 2: 46, 3: 43, 4: 4, 5: 128}, 'G': {0: 27, 1: 73, 2: 98, 3: 162, 4: 4, 5: 364}, 'HR': {0: 1, 1: 13, 2: 11, 3: 19, 4: 2, 5: 46}, 'FRAA': {0: 0.5, 1: -3.2, 2: 0.2, 3: 7.1, 4: -0.1, 5: 4.5}, 'BWARP': {0: 0.1, 1: 2.4, 2: 2.7, 3: 5.0, 4: 0.1, 5: 10.4}, 'CS': {0: 2, 1: 6, 2: 8, 3: 9, 4: 0, 5: 25}, '3B': {0: 0, 1: 8, 2: 6, 3: 6, 4: 0, 5: 20}, 'H': {0: 9, 1: 105, 2: 117, 3: 180, 4: 5, 5: 416}, 'AGE': {0: '22', 1: '23', 2: '24', 3: '25', 4: '26', 5: 'Career'}, 'OBP': {0: 0.295, 1: 0.37, 2: 0.33799999999999997, 3: 0.344, 4: 0.4, 5: 0.34700000000000003}, 'AVG': {0: 0.225, 1: 0.342, 2: 0.284, 3: 0.271, 4: 0.35700000000000004, 5: 0.289}, 'DRC+': {0: 77, 1: 128, 2: 99, 3: 107, 4: 103, 5: 108}, 'SO': {0: 12, 1: 59, 2: 80, 3: 132, 4: 5, 5: 288}, 'YEAR': {0: '2015', 1: '2016', 2: '2017', 3: '2018', 4: '2019', 5: 'Career'}, 'SLG': {0: 0.325, 1: 0.5670000000000001, 2: 0.451, 3: 0.41600000000000004, 4: 0.857, 5: 0.46}, 'DRAA': {0: -1.0, 1: 11.4, 2: 1.0, 3: 8.5, 4: 0.1, 5: 20.0}, 'HBP': {0: 0, 1: 1, 2: 4, 3: 5, 4: 0, 5: 10}, 'BRR': {0: 0.1, 1: 5.9, 2: 6.8, 3: 2.7, 4: 0.2, 5: 15.7}, 'BB': {0: 4, 1: 14, 2: 30, 3: 69, 4: 1, 5: 118}}
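On the question of table versus dictionary: to_dict() with no arguments nests the data as column -> row index -> value, as in the output above. If one dictionary per season is easier to work with, the "records" orientation may be a better fit. A small sketch on a hand-typed subset of the table (three columns and two rows copied from the output above):

```python
import pandas as pd

# Hand-typed subset of the scraped table (values copied from the output above).
df = pd.DataFrame({
    "YEAR": ["2015", "2016"],
    "PA": [44, 324],
    "HR": [1, 13],
})

# Default orientation: {column: {row_index: value}}
by_column = df.to_dict()

# "records" orientation: one dict per row, convenient for looping over seasons.
by_row = df.to_dict(orient="records")

for season in by_row:
    print(season["YEAR"], season["HR"])
```

Keeping the data as a DataFrame is also a reasonable choice if further filtering or arithmetic is planned.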

Function to merge pandas dataframes based on different keywords

I am trying to create a function that creates a dataframe based on different lists of words that come up in a certain column of another dataframe.
In my example, I want a dataframe created on the basis of the words "chandos" and "electronics" coming up in the "description" column of the "uncategorised" dataframe.
The point of the function is that I want to be able to run this on different lists of words so I end up with different dataframes containing just the words I want.
words_Telephone = ["tfl", "electronics"]
df_Telephone = pd.DataFrame(columns=['date','description','paid out'])
def categorise(word_list, df_name):
    """ takes the denoted terms from the "uncategorised" df and puts it into new df"""
    for word in word_list:
        df_name = uncategorised[uncategorised['description'].str.contains(word)]
    return(df_name)
#apply the function
categorise(words_Telephone, df_Telephone)
I am expecting a dataframe that contains:
d = {'date': {0: '05/04/2017',
1: '06/04/2017',
2: '08/04/2017',
3: '08/04/2017',
4: '08/04/2017',
5: '10/04/2017',
6: '10/04/2017',
7: '10/04/2017'},
'description': {0: 'tfl',
1: 'tfl',
2: 'tfl',
3: 'tfl',
4: 'ac electronics ',
5: 'ac electronics ',},
'index': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 7, 7: 8, 8: 9, 9: 10},
'paid out': {0: 3.0,
1: 4.3,
2: 6.1,
3: 1.5,
4: 16.39,
5: 20.4,}}
Reproducible df:
d = {'date': {0: '05/04/2017',
1: '06/04/2017',
2: '06/04/2017',
3: '08/04/2017',
4: '08/04/2017',
5: '08/04/2017',
6: '10/04/2017',
7: '10/04/2017',
8: '10/04/2017'},
'description': {0: 'tfl',
1: 'mu subscription',
2: 'tfl',
3: 'tfl',
4: 'tfl',
5: 'ac electronics ',
6: 'itunes',
7: 'ac electronics ',
8: 'google adwords'},
'index': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 7, 7: 8, 8: 9, 9: 10},
'paid out': {0: 3.0,
1: 16.9,
2: 4.3,
3: 6.1,
4: 1.5,
5: 16.39,
6: 12.99,
7: 20.4,
8: 39.68}}
SOLUTION:
def categorise(word_list):
    """ takes the denoted terms from the "uncategorised" df and puts it into new df then deletes from the uncategorised df"""
    global uncategorised
    new_dfs = []
    for word in word_list:
        # collect the rows that mention this word
        new_dfs.append(uncategorised[uncategorised['description'].str.contains(word)])
        # keep only the rows that do not mention it
        uncategorised = uncategorised[~uncategorised['description'].str.contains(word)]
    return pd.concat(new_dfs).reset_index()
#apply the function
df_Telephone = categorise(words_Telephone)
df_Telephone
import pandas as pd
words_Telephone = ["tfl", "electronics"]
original_df = pd.DataFrame({
    'date': ['05/04/2017','06/04/2017','06/04/2017','08/04/2017','08/04/2017','08/04/2017','10/04/2017','10/04/2017','10/04/2017'],
    'description': ['tfl','mu subscription','tfl','tfl','tfl','ac electronics','itunes','ac electronics','google adwords'],
    'paid out': [3.0, 16.9, 4.3, 6.1, 1.5, 16.39, 12.99, 20.4, 39.68],
})
def categorise(word_list, original_df):
    """ takes the rows of original_df whose description contains any word in word_list and puts them into a new df"""
    new_dfs = []
    for word in word_list:
        new_dfs.append(original_df[original_df['description'].str.contains(word)])
    # drop=True discards the old index so it does not appear as an extra column
    return pd.concat(new_dfs).reset_index(drop=True)
#apply the function
df_Telephone = categorise(words_Telephone, original_df)
print(df_Telephone)
date description paid out
0 05/04/2017 tfl 3.00
1 06/04/2017 tfl 4.30
2 08/04/2017 tfl 6.10
3 08/04/2017 tfl 1.50
4 08/04/2017 ac electronics 16.39
5 10/04/2017 ac electronics 20.40
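The loop-and-concat approach would return a row twice if its description contained two of the keywords. A possible single-pass variant, sketched under the assumption that the keywords are plain text rather than regular expressions, joins them into one pattern (categorise_once is a hypothetical name used only for this sketch):

```python
import re
import pandas as pd

original_df = pd.DataFrame({
    "date": ["05/04/2017", "06/04/2017", "06/04/2017", "08/04/2017", "08/04/2017",
             "08/04/2017", "10/04/2017", "10/04/2017", "10/04/2017"],
    "description": ["tfl", "mu subscription", "tfl", "tfl", "tfl",
                    "ac electronics", "itunes", "ac electronics", "google adwords"],
    "paid out": [3.0, 16.9, 4.3, 6.1, 1.5, 16.39, 12.99, 20.4, 39.68],
})

def categorise_once(word_list, df):
    """Return the rows whose description contains any of the words, each row once."""
    # re.escape keeps characters such as '.' or '+' in a keyword literal.
    pattern = "|".join(re.escape(word) for word in word_list)
    mask = df["description"].str.contains(pattern, regex=True)
    return df[mask].reset_index(drop=True)

df_Telephone = categorise_once(["tfl", "electronics"], original_df)
print(df_Telephone)
```

Rows come back in their original order rather than grouped by keyword, which happens to coincide for this sample.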
