How to write metachars within the pandas DataFrame in column names - python

I have a CSV file that I parse with pandas using ISO-8859-1 encoding. I'm trying to create a DataFrame df_cols that contains only selected columns, but the script fails on execution because some column names contain metachars such as / and ' (for example 'Card Holder's Name' and 'CVV/CVV2'), so I never get the output.
#!/grid/common/pkgs/python/v3.6.1/bin/python3
##### Pandas Display Setting for the complete output on the terminal ####
import pandas as pd
pd.set_option('display.height', None)
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('expand_frame_repr', True)
df_list = pd.read_csv('/docs/Credit_Card.csv', encoding='ISO-8859-1')
df_cols = df_list[['Card Type Full Name', 'Issuing Bank', 'Card Number', 'Card Holder's Name', 'CVV/CVV2', 'Issue Date', 'Expiry Date','Credit Limit']]
print(df_cols)

Try putting the column name in triple quotes:
"""Card Holder's Name"""

Try escaping the single quote character with a backslash (\):
df_cols = df_list[['Card Type Full Name', 'Issuing Bank', 'Card Number', 'Card Holder\'s Name', 'CVV/CVV2', 'Issue Date', 'Expiry Date','Credit Limit']]

As suggested by d_kennetz, we can select the columns by name or by index position directly in read_csv when building df_list, which reduces the time and memory needed compared with reading the whole CSV.
There are two ways to select the columns. The first is by name, where we need to be extra careful about special characters (metachars). The second is by index position, where the names don't matter at all, which makes it slightly more convenient for avoiding this glitch.
df_list = pd.read_csv('/docs/Credit_Card.csv', encoding='ISO-8859-1',usecols=['Card Type Full Name', 'Issuing Bank', 'Card Number', 'Card Holder\'s Name', 'CVV/CVV2', 'Issue Date', 'Expiry Date','Credit Limit'])
OR
df_list = pd.read_csv('/docs/Credit_Card.csv', encoding='ISO-8859-1',usecols=[1, 2, 3, 4, 5, 6, 7, 10])
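The quoting options above can be compared side by side in a small sketch. The CSV text, the holder names, and the variable names below are made up for illustration; only the column names 'Card Holder's Name' and 'CVV/CVV2' come from the question. The slash needs no special treatment inside a string; only the apostrophe clashes with single-quote delimiters.

```python
import io

import pandas as pd

# Hypothetical stand-in for Credit_Card.csv, reduced to four columns.
csv_text = (
    "Card Number,Card Holder's Name,CVV/CVV2,Credit Limit\n"
    "1111,Alice Example,123,5000\n"
    "2222,Bob Example,456,7000\n"
)

df = pd.read_csv(io.StringIO(csv_text))

# Option 1: delimit the name with double quotes, so the apostrophe is plain text.
by_double_quotes = df[["Card Holder's Name", "CVV/CVV2"]]

# Option 2: keep single quotes and escape the apostrophe with a backslash.
by_escape = df[['Card Holder\'s Name', 'CVV/CVV2']]

# Option 3: select while reading, by name or by zero-based position.
by_usecols = pd.read_csv(io.StringIO(csv_text),
                         usecols=["Card Holder's Name", "CVV/CVV2"])
by_position = pd.read_csv(io.StringIO(csv_text), usecols=[1, 2])

print(by_double_quotes)
```

All four selections should describe the same two columns; the positional form simply avoids typing the names at all.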

Related

Why are there no duplicates in pandas dataframe.index?

I just wrote a program for college using pandas to structure some unstructured data. I definitely made it harder than it should be, but I ended up finding something interesting.
Here is the data I parsed:
Center/Daycare
825 23rd Street South
Arlington, VA 22202
703-979-BABY (2229)
22.
Maria Teresa Desaba, Owner/Director; Tony Saba, Org. >Director.
Website: www.mariateresasbabies.com
Serving children 6 wks to 5yrs full-time.
National Science Foundation Child Development Center
23.
4201 Wilson Blvd., Suite 180 22203
703-292-4794
Website: www.brighthorizons.com 112 children, ages 6 wks - 5 yrs.
7:00 a.m. – 6:00 p.m. Summer Camp for children 5 - 9 years.
Here is the code (aggressively commented for school). It is mostly irrelevant, but included for completeness' sake:
import csv
import pandas as pd
lines = []
"""opening the raw data from a text file"""
with open('raw_data.txt') as f:
    lines = f.readlines()
    f.close()
"""removing new line characters"""
for i in range(len(lines)):
    lines[i] = lines[i].rstrip('\n')
df = pd.DataFrame(lines, columns=['info'], index=['business type', 'address', 'location',
                  'phone number', 'unknown', 'owner', 'website', 'description',
                  'null', 'business type', 'unknown', 'address', 'phone number',
                  'website', 'description'])
"""creating more columns with the value at each index. This doesn't contain any duplicates"""
for i in df.index:
    df[i] = ''
"""here I am taking every column and adding corresponding values from the original dataframe
extra data frames chould be garbage collected but this serves for demonstration"""
df.index = df.index.astype('str')
df1 = df[df.index.str.contains('bus')]
df2 = df[df.index.str.contains('address')]
df3 = df[df.index.str.contains('location')]
df4 = df[df.index.str.contains('number')]
df5 = df[df.index.str.contains('know')]
df6 = df[df.index.str.contains('owner')]
df7 = df[df.index.str.contains('site')]
df8 = df[df.index.str.contains('descript')]
df9 = df[df.index.str.contains('null')]
for i in range(len(df1)):
    df['business type'][i] = df1['info'][i]
for i in range(len(df2)):
    df['address'][i] = df2['info'][i]
for i in range(len(df3)):
    df['location'][i] = df3['info'][i]
for i in range(len(df4)):
    df['phone number'][i] = df4['info'][i]
for i in range(len(df5)):
    df['unknown'][i] = df5['info'][i]
for i in range(len(df6)):
    df['owner'][i] = df6['info'][i]
for i in range(len(df7)):
    df['website'][i] = df7['info'][i]
for i in range(len(df8)):
    df['description'][i] = df8['info'][i]
for i in range(len(df9)):
    df['null'][i] = df9['info'][i]
"""dropping unnecessary columns"""
df.drop(columns='info', inplace=True)
df.drop(columns='null', inplace=True)
df.drop(columns='unknown', inplace=True)
"""changing the index values to int to make easier to drop unused rows"""
idx = []
for i in range(0, len(df)):
    idx.append(i)
df.index = idx
"""dropping unused rows"""
for i in range(2, 15):
    df.drop([i], inplace=True)
"""writing to csv and printing to console"""
df.to_csv("new.csv", index=False)
print(df.to_string())
I'm just curious why creating more columns named after each index item, as done here:
df = pd.DataFrame(lines, columns=['info'], index=['business type', 'address', 'location',
                  'phone number', 'unknown', 'owner', 'website', 'description',
                  'null', 'business type', 'unknown', 'address', 'phone number',
                  'website', 'description'])
"""creating more columns with the value at each index. This doesn't contain any duplicates"""
for i in df.index:
    df[i] = ''
doesn't produce any duplicate columns.
When I add
print(df.columns)
I get the output
Index(['info', 'business type', 'address', 'location', 'phone number',
'unknown', 'owner', 'website', 'description', 'null'],
dtype='object')
I'm just generally curious why there are no duplicates, as I'm sure that could be problematic in certain situations. Pandas is interesting, I hardly understand it, and I would like to know more. Also, if you feel extra enthusiastic, any info on a more efficient way to do this would be greatly appreciated. If not, no worries, I'll eventually read the docs.
The pandas DataFrame is designed for tabular data in which all the entries in any one column have the same type (e.g. integer or string). One row usually represents one instance, sample, or individual. So the natural way to parse your data into a DataFrame is to have two rows, one for each institution, and define the columns as what you have called index (perhaps with the address split into several columns), e.g. business type, street, city, state, post code, phone number, etc.
So there would be one row per institution, and the index would be used to assign a unique identifier to each of them. That's why it's desirable for the index to contain no duplicates.
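As for why the new columns contain no duplicates: assigning with df[label] = value creates the column only when the label is not already present; otherwise it overwrites the existing column. Looping over an index with repeated labels therefore yields each column once. A minimal sketch with made-up data (the labels and values are illustrative, not the asker's file):

```python
import pandas as pd

# The index deliberately repeats the label 'address'.
df = pd.DataFrame({"info": ["825 23rd Street South", "703-292-4794", "4201 Wilson Blvd."]},
                  index=["address", "phone number", "address"])

for label in df.index:
    # The first 'address' creates the column; the second one overwrites it.
    df[label] = ""

print(list(df.columns))    # ['info', 'address', 'phone number']
print(df.index.is_unique)  # False: the index itself still holds duplicates
```

So pandas did not remove anything; the index still has its repeated labels, and the loop simply never had a chance to add the same column twice.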

Appending to an empty data frame in Pandas using a for loop

I am writing because I am having an issue with a for loop that fills a dataframe when it is empty. Unfortunately, the posts "Filling empty python dataframe using loops", "Appending to an empty data frame in Pandas?" and "Creating an empty Pandas DataFrame, then filling it?" did not help me solve it.
My attempt first finds the empty dataframes in the list "listDataframe" and then tries to fill them with some chosen columns. I believe my code is clearer than my explanation. What I can't do is save the new dataframe under its original name. Here is my attempt:
for k,j in zip(listOwner,listDataframe):
    for y in j:
        if y.empty:
            data = pd.DataFrame({"Event Date": list_test_2, "Site Group Name" : k, "Impressions" : 0})
            y = pd.concat([data,y])
            #y = y.append(data)
where "listOwner", "listDataframe" and "list_test_2" are, respectively, given by:
listOwner = ['OWNER ONE', 'OWNER TWO', 'OWNER THREE', 'OWNER FOUR']
listDataframe = [df_a,df_b,df_c,df_d]
with
df_a = [df_ap_1, df_di_1, df_er_diret_1, df_er_s_1]
df_b = [df_ap_2, df_di_2, df_er_diret_2, df_er_s_2]
df_c = [df_ap_3, df_di_3, df_er_diret_3, df_er_s_3]
df_d = [df_ap_4, df_di_4, df_er_diret_4, df_er_s_4]
and
list_test_2 = []
for i in range(1,8):
    f = (datetime.today() - timedelta(days=i)).date()
    list_test_2.append(datetime.combine(f, datetime.min.time()))
The empty dataframes were df_ap_1 and df_ap_3. After running the above lines (using either concat or append), these two dataframes are still empty when I call them. Any idea why that happens and how to overcome this issue?
UPDATE
In order to avoid both append and concat, I tried the following attempt (again with no success).
for k,j in zip(listOwner,listDataframe):
    for y in j:
        if y.empty:
            y = pd.DataFrame({"Event Date": list_test_2, "Site Group Name" : k, "Impressions" : 0})
The desired result is two filled dataframes, where the first should be called df_ap_1 and the second df_ap_3.
Thanks in advance.
Drigo
Here's a way to do it:
import pandas as pd
columns = ['Event Date', 'Site Group Name', 'Impressions']
df_ap_1 = pd.DataFrame(columns=columns) #empty dataframe
df_di_1 = pd.DataFrame(columns=columns) #empty dataframe
df_ap_2 = pd.DataFrame({'Event Date':[1], 'Site Group Name':[2], 'Impressions': [3]}) #non-empty dataframe
df_di_2 = pd.DataFrame(columns=columns) #empty dataframe
df_a = [df_ap_1, df_di_1]
df_b = [df_ap_2, df_di_2]
listDataframe = [df_a,df_b]
list_test_2 = 'foo'
listOwner = ['OWNER ONE', 'OWNER TWO']
def appendOwner(df, owner, list_test_2):
    # appends one row for the given owner to the dataframe, in place
    new_row = {'Event Date': list_test_2,
               'Site Group Name': owner,
               'Impressions': 0,
               }
    df.loc[len(df)] = new_row
for owner, dfList in zip(listOwner, listDataframe):
    for df in dfList:
        if df.empty:
            appendOwner(df, owner, list_test_2)
print(listDataframe)
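You can use the appendOwner function to append a row for the matching owner in listOwner to each empty dataframe. It works because df.loc modifies the existing dataframe in place, so the change is visible through the lists and through the original variable names. In the question, y = pd.concat(...) only rebinds the loop variable y to a new object and leaves the original dataframe untouched.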
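If the goal is to replace an empty dataframe with a freshly built one rather than grow it row by row, one option is to write the new object back into the list by position. This is a sketch under assumptions: the frames, owners, and the names groups and dates are illustrative stand-ins for listDataframe and list_test_2, and a standalone name such as df_ap_1 would still point at the old empty object unless it is reassigned from the list afterwards.

```python
from datetime import datetime, timedelta

import pandas as pd

columns = ["Event Date", "Site Group Name", "Impressions"]

# Midnight timestamps for the previous seven days, as in the question.
dates = [
    datetime.combine((datetime.today() - timedelta(days=i)).date(), datetime.min.time())
    for i in range(1, 8)
]

owners = ["OWNER ONE", "OWNER TWO"]
groups = [
    # One empty frame and one frame that already has data (hypothetical values).
    [pd.DataFrame(columns=columns),
     pd.DataFrame({"Event Date": [dates[0]], "Site Group Name": ["OWNER ONE"], "Impressions": [5]})],
    [pd.DataFrame(columns=columns)],
]

for owner, group in zip(owners, groups):
    for pos, frame in enumerate(group):
        if frame.empty:
            # Assigning to group[pos] changes the list itself;
            # assigning to the loop variable `frame` would not.
            group[pos] = pd.DataFrame(
                {"Event Date": dates, "Site Group Name": owner, "Impressions": 0}
            )

print(groups[0][0])
```

Keeping the frames in a dict keyed by name (for example "df_ap_1") instead of separate variables makes it easier to look them up again after the loop.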

Iterative loop in dataframe - creating data dictionary dynamically

I have thousands of rows in the block structure shown below. In each block the first row (Response Comments), the second row (Customer Name) and the last row (Recommended) are fixed. The rest of the fields/rows are not mandatory.
I am trying to write code that, when it reads Column Name = 'Response Comments', sets Key to the Column Values of the next row (Customer Name).
This key should apply to every row from Response Comments to Recommended,
after which the loop breaks and a new key value starts.
The data is from an Excel file:
from pandas import DataFrame
import pandas as pd
import os
import numpy as np
xl = pd.ExcelFile('Filepath')
df = xl.parse('Reviews_Structured')
print(type (df))
RowNum  Column Name        Column Values                Key
1       Response Comments  they have been unresponsive
2       Customer Name      Brian
.
.
.
.
13      Recommended        no
Any help regarding this loop code will be appreciated.
One way to implement your logic is using collections.defaultdict and a nested dictionary structure. Below is an example:
from collections import defaultdict
import numpy as np
import pandas as pd
# input data
df = pd.DataFrame([[1, 'Response Comments', 'they have been unresponsive'],
[2, 'Customer Name', 'Brian'],
.....
[9, 'Recommended', 'yes']],
columns=['RowNum', 'Column Name', 'Column Values'])
# fill Key columns
df['Key'] = df['Column Values'].shift(-1)
df.loc[df['Column Name'] != 'Response Comments', 'Key'] = np.nan
df['Key'] = df['Key'].ffill()
# create defaultdict of dict
d = defaultdict(dict)
# iterate dataframe
for row in df.itertuples():
    d[row[4]].update({row[2]: row[3]})
# defaultdict(dict,
# {'April': {'Customer Name': 'April',
# 'Recommended': 'yes',
# 'Response Comments': 'they have been responsive'},
# 'Brian': {'Customer Name': 'Brian',
# 'Recommended': 'no',
# 'Response Comments': 'they have been unresponsive'},
# 'John': {'Customer Name': 'John',
# 'Recommended': 'yes',
# 'Response Comments': 'they have been very responsive'}})
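The same idea can also be written without an explicit loop over rows. The sketch below is an alternative formulation, not the answerer's code: it builds the Key column with where and ffill, then groups on it. The six input rows are a shortened, illustrative version of the data (two blocks with only the three mandatory fields).

```python
import pandas as pd

df = pd.DataFrame(
    [
        [1, "Response Comments", "they have been unresponsive"],
        [2, "Customer Name", "Brian"],
        [3, "Recommended", "no"],
        [4, "Response Comments", "they have been responsive"],
        [5, "Customer Name", "April"],
        [6, "Recommended", "yes"],
    ],
    columns=["RowNum", "Column Name", "Column Values"],
)

# A block starts at every 'Response Comments' row; the customer name sits one row below.
starts_block = df["Column Name"].eq("Response Comments")
df["Key"] = df["Column Values"].shift(-1).where(starts_block).ffill()

# One inner dictionary per key, mapping field name to field value.
result = {
    key: dict(zip(block["Column Name"], block["Column Values"]))
    for key, block in df.groupby("Key", sort=False)
}

print(result)
```

Note that two blocks with the same customer name would be merged under one key in both versions; if that can happen, a block counter such as starts_block.cumsum() may be a safer grouping key.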
Am I understanding this correctly, that you want a new DataFrame with
columns = ['Response Comments', 'Customer Name', ...]
to reshape your data from the parsed Excel file?
Create an empty DataFrame from the known, mandatory column names, e.g.
df_new = pd.DataFrame(columns=['Response Comments', 'Customer Name', ...])
index = 0
Then iterate over the parsed Excel file row by row and assign your values:
for k, row in df.iterrows():
    if row['Column Name'] in df_new:
        df_new.at[index, row['Column Name']] = row['Column Values']
    if row['Column Name'] == 'Recommended':
        index += 1  # 'Recommended' closes a block, so the next block goes into a new row
Not a beauty, but I'm not quite sure what exactly you're trying to achieve :)

How can I graph this as a stacked bar chart?

I have the code below. It produces my data frame exactly how I want it, but I can't seem to graph it as a bar chart.
I'd like to have the department on the X axis and, on the Y axis, Completed with the Remaining count stacked on top.
import pandas as pd
import matplotlib
data = pd.read_excel('audit.xls', skiprows=2, index_col = 'Employee Department')
data.rename(columns = {'Curriculum Name':'Curriculum','Organization Employee Number':'Employee_Number', 'Employee Name': 'Employee','Employee Email':'Email', 'Employee Status':'Status', 'Date Assigned':'Assigned','Completion Date':'Completed'}, inplace=True)
data.drop(['Employee_Number', 'Employee','Assigned', 'Status', 'Manager Name', 'Manager Email', 'Completion Status','Unnamed: 1', 'Unnamed: 5', 'Unnamed: 6'], axis=1, inplace=True)
new_data = data.query('Curriculum ==["CARB Security Training","OIS Support Training","Legal EO Training"]')
new_data2 = new_data.groupby('Employee Department').count().eval('Remaining = Email - Completed', inplace=False)
new_data2
I assume I need to convert it to a pivot table somehow, since that's how it is in Excel.
Have you tried something like this: new_data2[['Completed','Remaining']].plot.bar(stacked=True)
The following example works for me:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.arange(1,10).reshape(3,3), columns=['Email', 'Completed', 'Remaining'], index=['A', 'B', 'C'])
df[['Completed', 'Remaining']].plot.bar(stacked=True)
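To connect this to the groupby in the question, here is a sketch of the whole path from raw rows to the stacked chart. The records are invented for illustration (department names, e-mail addresses, and dates are assumptions), and assign-style column arithmetic is used in place of eval; the Agg backend is selected only so the sketch runs without a display.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend; drop this line for interactive use

import pandas as pd

# Hypothetical audit rows: one per employee, Completed is empty if not yet done.
records = pd.DataFrame(
    {
        "Employee Department": ["IT", "IT", "IT", "Legal", "Legal"],
        "Email": ["a@example.com", "b@example.com", "c@example.com",
                  "d@example.com", "e@example.com"],
        "Completed": ["2019-01-02", None, "2019-01-05", None, None],
    }
).set_index("Employee Department")

# count() ignores missing values, so Email counts everyone assigned
# and Completed counts only those with a completion date.
counts = records.groupby("Employee Department").count()
counts["Remaining"] = counts["Email"] - counts["Completed"]

# Departments land on the x axis; Remaining is stacked on top of Completed.
ax = counts[["Completed", "Remaining"]].plot.bar(stacked=True)
ax.set_ylabel("Employees")
```

No pivot table is needed here, because the grouped counts already have one row per department and one column per stacked segment.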

ValueError: labels ['timestamp'] not contained in axis

I am learning machine learning and I came across this code.
I am trying to run the file "Recommender-Systems.py" from the above source. But it throws an error
ValueError: labels ['timestamp'] not contained in axis. How can it be removed?
Here's a dropbox link of u.data file.
Your data is missing the headers, so they are being wrongly inferred from the first row.
You need to change Recommender-Systems.py slightly and set the headers manually.
The right header is available in the README file from your data set.
Change your file to something like this:
## Explore the data (line 27)
data = pd.read_table('u.data', header=None) # header=None stops pandas from taking the column names from the first row
data.columns = ['userID', 'itemID',
'rating', 'timestamp'] # Manually set the columns.
data = data.drop('timestamp', axis=1) # Continue with regular work.
...
## Load user information (line 75)
users_info = pd.read_table('u.user', sep='|', header=None)
users_info.columns = ['userID', 'age', 'gender',
                      'occupation', 'zipcode']
users_info = users_info.set_index('userID')
...
## Load movie information (line 88)
movies_info = pd.read_table('u.item', sep='|', header=None)
movies_info.columns = ['movieID', 'movie title', 'release date',
'video release date', 'IMDb URL', 'unknown',
'Action', 'Adventure', 'Animation', "Children's",
'Comedy', 'Crime', 'Documentary', 'Drama',
'Fantasy', 'Film-Noir', 'Horror', 'Musical',
'Mystery', 'Romance', 'Sci-Fi', 'Thriller',
'War', 'Western']
movies_info = movies_info.set_index('movieID')#.drop(low_count_movies)
This should work (but I'm not sure if I got all the right names for the columns).
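The effect of header=None can be checked on a small in-memory sample before touching the real files. In this sketch the two tab-separated rows are illustrative stand-ins for u.data, and the column names are the ones used above; passing names= to read_csv is an equivalent way of setting the columns in one step.

```python
import io

import pandas as pd

# Two illustrative rows in the u.data layout: user, item, rating, timestamp.
sample = "196\t242\t3\t881250949\n186\t302\t3\t891717742\n"

data = pd.read_csv(
    io.StringIO(sample),
    sep="\t",
    header=None,  # the file has no header row
    names=["userID", "itemID", "rating", "timestamp"],
)

# The label now exists, so dropping it no longer raises ValueError.
data = data.drop("timestamp", axis=1)

print(data)
```

Without header=None and names, the first data row would be consumed as column labels, which is why 'timestamp' could not be found.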
