I have thousands of rows in the block structure shown below. In each block, the first row (Response Comments), the second row (Customer Name) and the last row (Recommended) are always present. The remaining fields/rows are optional.
I am trying to write code that finds each row where Column Name = 'Response Comments' and sets Key to the Column Values entry of the next row (Customer Name).
That key should apply to every row from Response Comments through Recommended.
After Recommended, the loop should break and start over with a new key value.
The data is from an Excel file:
from pandas import DataFrame
import pandas as pd
import os
import numpy as np
xl = pd.ExcelFile('Filepath')
df = xl.parse('Reviews_Structured')
print(type (df))
RowNum  Column Name        Column Values                Key
1       Response Comments  they have been unresponsive
2       Customer Name      Brian
.
.
.
.
13      Recommended        no
Any help regarding this loop code will be appreciated.
One way to implement your logic is using collections.defaultdict and a nested dictionary structure. Below is an example:
from collections import defaultdict
import numpy as np
import pandas as pd
# input data
df = pd.DataFrame([[1, 'Response Comments', 'they have been unresponsive'],
[2, 'Customer Name', 'Brian'],
.....
[9, 'Recommended', 'yes']],
columns=['RowNum', 'Column Name', 'Column Values'])
# fill Key columns
df['Key'] = df['Column Values'].shift(-1)
df.loc[df['Column Name'] != 'Response Comments', 'Key'] = np.nan
df['Key'] = df['Key'].ffill()
# create defaultdict of dict
d = defaultdict(dict)
# iterate dataframe
for row in df.itertuples():
    # row[2] = 'Column Name', row[3] = 'Column Values', row[4] = 'Key'
    d[row[4]].update({row[2]: row[3]})
# defaultdict(dict,
# {'April': {'Customer Name': 'April',
# 'Recommended': 'yes',
# 'Response Comments': 'they have been responsive'},
# 'Brian': {'Customer Name': 'Brian',
# 'Recommended': 'no',
# 'Response Comments': 'they have been unresponsive'},
# 'John': {'Customer Name': 'John',
# 'Recommended': 'yes',
# 'Response Comments': 'they have been very responsive'}})
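The same block idea can also be sketched without an explicit loop. This is only an illustration under assumptions: the sample rows (including the optional 'Rating' field) and the helper column name Block are made up for the example. Each 'Response Comments' row is treated as the start of a new block, and the blocks are then pivoted into one row each.

```python
import pandas as pd

# Hypothetical sample: two blocks, the second with an optional 'Rating' row.
df = pd.DataFrame(
    [
        [1, "Response Comments", "they have been unresponsive"],
        [2, "Customer Name", "Brian"],
        [3, "Recommended", "no"],
        [4, "Response Comments", "they have been responsive"],
        [5, "Customer Name", "April"],
        [6, "Rating", "5"],
        [7, "Recommended", "yes"],
    ],
    columns=["RowNum", "Column Name", "Column Values"],
)

# Every 'Response Comments' row opens a new block, so a running count
# of those rows gives each block its own number (1, 2, ...).
df["Block"] = (df["Column Name"] == "Response Comments").cumsum()

# One row per block, one column per field; absent optional fields become NaN.
wide = df.pivot(index="Block", columns="Column Name", values="Column Values")

print(wide)
```

Numbering the blocks instead of keying on the customer name keeps two blocks apart even if the same customer appears twice.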
Am I understanding this correctly, that you want a new DataFrame with
columns = ['Response Comments', 'Customer name', ...]
to reshape your data from the parsed excel file?
Create an empty DataFrame from the known, mandatory column names, e.g.
df_new = pd.DataFrame(columns=['Response Comments', 'Customer Name', ...])
index = 0
Then iterate over the parsed Excel file row by row and assign your values:
for _, row in df.iterrows():
    if row['Column Name'] in df_new.columns:
        df_new.loc[index, row['Column Name']] = row['Column Values']
    # 'Recommended' is the last row of a block, so move on to the next output row
    if row['Column Name'] == 'Recommended':
        index += 1
Not a beauty, but I'm not quite sure what exactly you're trying to achieve :)
Related
I have a large data file as shown below.
Edited to include an updated example:
I wanted to add two new columns (E and F) next to column D and move the suite # when applicable and City/State data in cell D3 and D4 to E2 and F2, respectively. The challenge is not every entry has the suite number. I would need to insert a row first for those entries that don't have the suite number, only for them, not for those that already have the suite information.
I know how to write loops, but I am having trouble defining the conditions. One way is to count the length of the string. How should I get started? I would much appreciate your help!
This is how I would do it. I don't recommend looping when using pandas; it has so many tools that a loop is often not needed. A word of caution: your spreadsheet has NaN values, which I think correspond to numpy's np.nan. You also have blanks, which I am assuming are equivalent to the empty string "".
import pandas as pd
import numpy as np
# dictionary of your data
companies = {
'Comp ID': ['C1', '', np.nan, 'C2', '', np.nan, 'C3',np.nan],
'Address': ['10 foo', 'Suite A','foo city', '11 spam','STE 100','spam town', '12 ham', 'Myhammy'],
'phone': ['888-321-4567', '', np.nan, '888-321-4567', '', np.nan, '888-321-4567',np.nan],
'Type': ['W_sale', '', np.nan, 'W_sale', '', np.nan, 'W_sale',np.nan],
}
# make the frames needed.
df = pd.DataFrame( companies)
df1 = pd.DataFrame() # blank frame for suite and town columns
# Edit here to TEST the data types
for r in range(0, 5):
    v = df['Comp ID'].values[r]
    print(f'this "{v}" is a ', type(v))
# This tells us the data types so we can construct our where(). Back to the prior answer...
# np.where is similar to an IF() statement in Excel
df1['Suite'] = np.where( df['Comp ID']=='', df['Address'], np.nan)
df1['City/State'] = np.where( df['Comp ID'].isna(), df['Address'], np.nan)
# copy values to the rows above (bfill is the current name for backfill)
df1 = df1[['Suite','City/State']].bfill()
# join the frames together on index
df = df.join(df1)
df.drop_duplicates(subset=['City/State'], keep='first', inplace=True)
# set the column order to what you want
df = df[['Comp ID', 'Type', 'Address', 'Suite', 'City/State', 'phone' ]]
output
Comp ID  Type    Address  Suite    City/State  phone
C1       W_sale  10 foo   Suite A  foo city    888-321-4567
C2       W_sale  11 spam  STE 100  spam town   888-321-4567
C3       W_sale  12 ham            Myhammy     888-321-4567
Edit: the numpy where statement:
numpy is brought in by the line import numpy as np at the top. We are creating a calculated column based on the 'Comp ID' column, and numpy does this without loops. Think of where as the Excel IF() function.
df1(return value) = np.where(df[test] > condition, true, false)
The pandas backfill
Sometimes a value sits in a cell below and you want to copy it into the blank cell above it. That is what backfilling does: df1 = df1[['Suite','City/State']].backfill(). In pandas 2.0 and later, backfill() is deprecated in favor of the equivalent bfill().
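As a small illustration of the two steps just described, here is a sketch on a cut-down version of the sample columns. The variable names comp_id, address, suite and suite_filled are chosen for this example only.

```python
import numpy as np
import pandas as pd

comp_id = pd.Series(["C1", "", np.nan, "C2", "", np.nan])
address = pd.Series(["10 foo", "Suite A", "foo city", "11 spam", "STE 100", "spam town"])

# Keep the address only where the Comp ID is an empty string, else NaN.
# (NaN == "" is False, so NaN rows fall into the "else" branch.)
suite = pd.Series(np.where(comp_id == "", address, np.nan))

# Copy each value upward into the missing cells above it.
suite_filled = suite.bfill()

print(suite_filled.tolist())
```

Note that bfill copies from the nearest non-missing value below, even across block boundaries, so a row with no suite of its own can pick up the suite of the next company.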
I have a request that gets me some data that looks like this:
[{'__rowType': 'META',
'__type': 'units',
'data': [{'name': 'units.unit', 'type': 'STRING'},
{'name': 'units.classification', 'type': 'STRING'}]},
{'__rowType': 'DATA', '__type': 'units', 'data': ['A', 'Energie']},
{'__rowType': 'DATA', '__type': 'units', 'data': ['bar', ' ']},
{'__rowType': 'DATA', '__type': 'units', 'data': ['CCM', 'Volumen']},
{'__rowType': 'DATA', '__type': 'units', 'data': ['CDM', 'Volumen']}]
and would like to construct a (Pandas) DataFrame that looks like this:
Things like pd.DataFrame(pd.json_normalize(test)['data']) are close but still put the whole list into one column instead of making separate columns. record_path sounded right, but I can't get it to work correctly either.
Any help?
It's difficult to know how the example generalizes, but for this particular case you could use:
pd.DataFrame([d['data'] for d in test
if d.get('__rowType', None)=='DATA' and 'data' in d],
columns=['unit', 'classification']
)
NB: this assumes test is the input list.
output:
unit classification
0 A Energie
1 bar
2 CCM Volumen
3 CDM Volumen
Instead of just giving you the code, I'll first explain in detail how this works, then show the exact steps and the final code. That way you can adapt it to similar situations.
When you want to create a pandas dataframe with two columns you can do this by creating a dictionary and passing it to DataFrame class:
my_data = {'col1': [1, 2], 'col2': [3, 4]}
df = pd.DataFrame(data=my_data)
This will result in this dataframe:
   col1  col2
0     1     3
1     2     4
So if you want to have the dataframe you specified in your question the my_data dictionary should be like this:
my_data = {
'unit': ['A', 'bar', 'CCM', 'CDM'],
'classification': ['Energie', '', 'Volumen', 'Volumen'],
}
df = pd.DataFrame(data=my_data, )
df.index = np.arange(1, len(df)+1)
df
(Note the df.index = ... line. It is there because the index of the desired dataframe in your question starts at 1.)
So all you have to do is extract these values from the data you provided and convert them into the dictionary shown above (my_data).
You can do that like this:
# This will get the data values like 'bar', 'CCM' and etc from your initial data
values = [x['data'] for x in d if x['__rowType']=='DATA']
# This gets the columns names from meta data
meta = list(filter(lambda x: x['__rowType']=='META', d))[0]
columns = [x['name'].split('.')[-1] for x in meta['data']]
# This line creates the exact dictionary we need to send to DataFrame class.
my_data = {column:[v[i] for v in values] for i, column in enumerate(columns)}
So the whole code would be this:
import numpy as np
import pandas as pd
d = YOUR_DATA
# This will get the data values like 'bar', 'CCM' and etc
values = [x['data'] for x in d if x['__rowType']=='DATA']
# This gets the columns names from meta data
meta = list(filter(lambda x: x['__rowType']=='META', d))[0]
columns = [x['name'].split('.')[-1] for x in meta['data']]
# This line creates the exact dictionary we need to send to DataFrame class.
my_data = {column:[v[i] for v in values] for i, column in enumerate(columns)}
df = pd.DataFrame(data=my_data, )
df.index = np.arange(1, len(df)+1)
df #or print(df)
Note: Of course you can do all of this in one complex line of code, but to avoid confusion I split it into several lines.
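For comparison, a shorter variant is sketched below. It assumes the input has exactly one META row and that every DATA row lists its values in the same order as the META fields. Instead of building a dictionary column by column, it passes the rows straight to the DataFrame constructor together with the column names.

```python
import pandas as pd

# Sample input copied from the question.
test = [
    {"__rowType": "META", "__type": "units",
     "data": [{"name": "units.unit", "type": "STRING"},
              {"name": "units.classification", "type": "STRING"}]},
    {"__rowType": "DATA", "__type": "units", "data": ["A", "Energie"]},
    {"__rowType": "DATA", "__type": "units", "data": ["bar", " "]},
    {"__rowType": "DATA", "__type": "units", "data": ["CCM", "Volumen"]},
    {"__rowType": "DATA", "__type": "units", "data": ["CDM", "Volumen"]},
]

# Column names come from the META row: 'units.unit' -> 'unit'.
meta = next(row for row in test if row["__rowType"] == "META")
columns = [field["name"].split(".")[-1] for field in meta["data"]]

# Each DATA row is already a list of cell values in column order.
values = [row["data"] for row in test if row["__rowType"] == "DATA"]

# Start the index at 1 to match the desired output.
df = pd.DataFrame(values, columns=columns, index=range(1, len(values) + 1))
print(df)
```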
I just wrote a program for college using pandas to structure some unstructured data. I definitely made it harder than it should be, but I ended up finding something interesting.
here is the data I parsed
Center/Daycare
825 23rd Street South
Arlington, VA 22202
703-979-BABY (2229)
22.
Maria Teresa Desaba, Owner/Director; Tony Saba, Org. >Director.
Website: www.mariateresasbabies.com
Serving children 6 wks to 5yrs full-time.
National Science Foundation Child Development Center
23.
4201 Wilson Blvd., Suite 180 22203
703-292-4794
Website: www.brighthorizons.com 112 children, ages 6 wks - 5 yrs.
7:00 a.m. – 6:00 p.m. Summer Camp for children 5 - 9 years.
Here is the code (aggressively commented for school). It is mostly irrelevant but included for completeness' sake:
import csv
import pandas as pd
lines = []
"""opening the raw data from a text file"""
with open('raw_data.txt') as f:
    lines = f.readlines()
"""removing new line characters"""
for i in range(len(lines)):
    lines[i] = lines[i].rstrip('\n')
df = pd.DataFrame(lines, columns=['info'], index=['business type', 'address', 'location',
'phone number', 'unknown', 'owner', 'website', 'description',
'null', 'business type', 'unknown', 'address', 'phone number',
'website', 'description'])
"""creating more columns with the value at each index. This doesn't contain any duplicates"""
for i in df.index:
    df[i] = ''
"""here I am taking every column and adding corresponding values from the original dataframe
extra data frames chould be garbage collected but this serves for demonstration"""
df.index = df.index.astype('str')
df1 = df[df.index.str.contains('bus')]
df2 = df[df.index.str.contains('address')]
df3 = df[df.index.str.contains('location')]
df4 = df[df.index.str.contains('number')]
df5 = df[df.index.str.contains('know')]
df6 = df[df.index.str.contains('owner')]
df7 = df[df.index.str.contains('site')]
df8 = df[df.index.str.contains('descript')]
df9 = df[df.index.str.contains('null')]
for i in range(len(df1)):
    df['business type'][i] = df1['info'][i]
for i in range(len(df2)):
    df['address'][i] = df2['info'][i]
for i in range(len(df3)):
    df['location'][i] = df3['info'][i]
for i in range(len(df4)):
    df['phone number'][i] = df4['info'][i]
for i in range(len(df5)):
    df['unknown'][i] = df5['info'][i]
for i in range(len(df6)):
    df['owner'][i] = df6['info'][i]
for i in range(len(df7)):
    df['website'][i] = df7['info'][i]
for i in range(len(df8)):
    df['description'][i] = df8['info'][i]
for i in range(len(df9)):
    df['null'][i] = df9['info'][i]
"""dropping unnecessary columns"""
df.drop(columns='info', inplace=True)
df.drop(columns='null', inplace=True)
df.drop(columns='unknown', inplace=True)
"""changing the index values to int to make easier to drop unused rows"""
idx = []
for i in range(0, len(df)):
    idx.append(i)
df.index = idx
"""dropping unused rows"""
for i in range(2, 15):
    df.drop([i], inplace=True)
"""writing to csv and printing to console"""
df.to_csv("new.csv", index=False)
print(df.to_string())
I'm just curious why when I create more columns by using the name of the index[i] item here
df = pd.DataFrame(lines, columns=['info'], index=['business type', 'address', 'location',
'phone number', 'unknown', 'owner', 'website', 'description',
'null', 'business type', 'unknown', 'address', 'phone number',
'website', 'description'])
"""creating more columns with the value at each index. This doesn't contain any duplicates"""
for i in df.index:
    df[i] = ''
doesn't contain any duplicates.
when I add
print(df.columns)
I get the output
Index(['info', 'business type', 'address', 'location', 'phone number',
'unknown', 'owner', 'website', 'description', 'null'],
dtype='object')
I'm just generally curious why there are no duplicates as I'm sure that could be problematic in certain situations and also pandas is interesting and I hardly understand it and would like to know more. Also, if you feel extra enthusiastic any info on a more efficient way to do this would be greatly appreciated, but if not no worries, I'll eventually read the docs.
The pandas DataFrame is designed for tabular data in which all the entries in any one column have the same type (e.g. integer or string). One row usually represents one instance, sample, or individual. So the natural way to parse your data into a DataFrame is to have two rows, one for each institution, and define the columns as what you have called index (perhaps with the address split into several columns), e.g. business type, street, city, state, post code, phone number, etc.
So there would be one row per institution, and the index would be used to assign a unique identifier to each of them. That's why it's desirable for the index to contain no duplicates.
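As for why the loop in the question produces no duplicate columns: assigning to a column label that already exists overwrites that column instead of adding a second one, so a repeated index label simply resets the same column again. A minimal sketch with made-up data (the labels and values here are illustrative only):

```python
import pandas as pd

# The index contains 'address' twice, as in the question's data.
df = pd.DataFrame({"info": ["a", "b", "c"]},
                  index=["address", "phone", "address"])

for label in df.index:
    # The second 'address' assignment overwrites the existing column
    # rather than creating another one.
    df[label] = ""

print(list(df.columns))
print(df.index.is_unique)
```

The index itself still holds the duplicate label; only the columns created through assignment end up unique.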
How can I add the outputs of different for loops into one dataframe? For example, I have scraped data from a website and have lists of names, emails and phone numbers, each produced by a loop. I want to put all of the outputs into one table in a single dataframe.
I am able to do it for a single loop but not for multiple loops.
Please look at the code and output in attached images.
When I remove zip from the for loop, it raises the error "too many values to unpack".
Loop
phone = soup.find_all(class_ = "directory_item_phone directory_item_info_item")
for phn in phone:
    print(phn.text.strip())
##Output - List of Numbers
Code for df
df = list()
for name,mail,phn in zip(faculty_name,email,phone):
    df.append(name.text.strip())
    df.append(mail.text.strip())
    df.append(phn.text.strip())
df = pd.DataFrame(df)
df
For loops
Code and Output for df
An efficient way to create a pandas.DataFrame is to first create a dict and then convert it into a DataFrame.
In your case you probably could do :
import pandas as pd
D = {'name': [], 'mail': [], 'phone': []}
for name, mail, phn in zip(faculty_name, email, phone):
    D['name'].append(name.text.strip())
    D['mail'].append(mail.text.strip())
    D['phone'].append(phn.text.strip())
df = pd.DataFrame(D)
Another way with a lambda function :
import pandas as pd
text_strip = lambda s : s.text.strip()
D = {
'name': list(map(text_strip, faculty_name)),
'mail': list(map(text_strip, email)),
'phone': list(map(text_strip, phone))
}
df = pd.DataFrame(D)
If the lists don't all have the same length, you may try this (but I am not sure it is very efficient):
import pandas as pd
columns_names = ['name', 'mail', 'phone']
all_lists = [faculty_name, email, phone]
max_length = max(map(len, all_lists))
# pre-fill every column with None so shorter lists are padded
D = {c_name: [None]*max_length for c_name in columns_names}
for c_name, l in zip(columns_names, all_lists):
    for ind, element in enumerate(l):
        D[c_name][ind] = element.text.strip()
df = pd.DataFrame(D)
Try this,
data = {'name':[name.text.strip() for name in faculty_name],
'mail':[mail.text.strip() for mail in email],
'phn':[phn.text.strip() for phn in phone],}
df = pd.DataFrame.from_dict(data)
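If the scraped lists can differ in length, another option is itertools.zip_longest from the standard library, which pads the shorter lists with None. The sketch below uses plain strings with made-up values in place of the BeautifulSoup tags from the question, so it runs on its own.

```python
from itertools import zip_longest

import pandas as pd

# Hypothetical already-stripped values; the real lists would come from
# tag.text.strip() on each scraped element.
names = ["Ann", "Bob", "Cy"]
emails = ["ann@example.com", "bob@example.com"]
phones = ["555-0100"]

# zip_longest yields one tuple per row and fills missing cells with None.
rows = list(zip_longest(names, emails, phones))

df = pd.DataFrame(rows, columns=["name", "mail", "phone"])
print(df)
```

Unlike zip, which stops at the shortest list, this keeps every name even when an email or phone number is missing.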
I am writing because I am having an issue with a for loop which fills a dataframe when it is empty. Unfortunately, the posts Filling empty python dataframe using loops, Appending to an empty data frame in Pandas?, Creating an empty Pandas DataFrame, then filling it? did not help me to solve it.
My attempt first finds the empty dataframes in the list "listDataframe" and then fills them with some chosen columns. I believe my code is clearer than my explanation. What I can't do is save the new dataframe under its original name. Here is my attempt:
for k,j in zip(listOwner,listDataframe):
    for y in j:
        if y.empty:
            data = pd.DataFrame({"Event Date": list_test_2, "Site Group Name" : k, "Impressions" : 0})
            y = pd.concat([data,y])
            #y = y.append(data)
where "listOwner", "listDataframe" and "list_test_2" are, respectively, given by:
listOwner = ['OWNER ONE', 'OWNER TWO', 'OWNER THREE', 'OWNER FOUR']
listDataframe = [df_a,df_b,df_c,df_d]
with
df_a = [df_ap_1, df_di_1, df_er_diret_1, df_er_s_1]
df_b = [df_ap_2, df_di_2, df_er_diret_2, df_er_s_2]
df_c = [df_ap_3, df_di_3, df_er_diret_3, df_er_s_3]
df_d = [df_ap_4, df_di_4, df_er_diret_4, df_er_s_4]
and
from datetime import datetime, timedelta
list_test_2 = []
for i in range(1,8):
    f = (datetime.today() - timedelta(days=i)).date()
    list_test_2.append(datetime.combine(f, datetime.min.time()))
The empty dataframes were df_ap_1 and df_ap_3. After running the above lines (using either concat or append), these two dataframes are still empty when I inspect them. Any idea why that happens and how to overcome this issue?
UPDATE
To avoid both append and concat, I tried the following instead (again with no success).
for k,j in zip(listOwner,listDataframe):
    for y in j:
        if y.empty:
            y = pd.DataFrame({"Event Date": list_test_2, "Site Group Name" : k, "Impressions" : 0})
The two desired results should be:
where the first dataframe should be called df_ap_1 while the second one df_ap_3.
Thanks in advance.
Drigo
Here's a way to do it:
import pandas as pd
columns = ['Event Date', 'Site Group Name', 'Impressions']
df_ap_1 = pd.DataFrame(columns=columns) #empty dataframe
df_di_1 = pd.DataFrame(columns=columns) #empty dataframe
df_ap_2 = pd.DataFrame({'Event Date':[1], 'Site Group Name':[2], 'Impressions': [3]}) #non-empty dataframe
df_di_2 = pd.DataFrame(columns=columns) #empty dataframe
df_a = [df_ap_1, df_di_1]
df_b = [df_ap_2, df_di_2]
listDataframe = [df_a,df_b]
list_test_2 = 'foo'
listOwner = ['OWNER ONE', 'OWNER TWO']
def appendOwner(df, owner, list_test_2):
    # appends one row for the given owner to the dataframe, in place
    new_row = {'Event Date': list_test_2,
               'Site Group Name': owner,
               'Impressions': 0,
               }
    df.loc[len(df)] = new_row

for owner, dfList in zip(listOwner, listDataframe):
    for df in dfList:
        if df.empty:
            appendOwner(df, owner, list_test_2)
print(listDataframe)
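You can use the appendOwner function to add a row for the matching owner to each empty dataframe. Because df.loc[...] = ... modifies the dataframe in place, the change is also visible through the original names (df_ap_1, df_di_1, ...).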
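The attempts in the question leave the frames empty because y = ... only rebinds the loop variable to a new object. Neither the list nor the original names are touched. The sketch below shows the difference using a flat list, a single hypothetical owner and two made-up dates. If whole-frame replacement is preferred over in-place row appends, assigning through the list index is one way to make it stick.

```python
import pandas as pd

columns = ["Event Date", "Site Group Name", "Impressions"]
dates = ["2024-01-01", "2024-01-02"]  # hypothetical stand-in for list_test_2

frames = [
    pd.DataFrame(columns=columns),  # empty
    pd.DataFrame({"Event Date": [1], "Site Group Name": ["x"], "Impressions": [3]}),
]

# Rebinding the loop variable: the list still holds the old, empty frame.
for y in frames:
    if y.empty:
        y = pd.DataFrame({"Event Date": dates, "Site Group Name": "OWNER ONE", "Impressions": 0})
still_empty = frames[0].empty

# Assigning through the index replaces the element stored in the list.
for i, y in enumerate(frames):
    if y.empty:
        frames[i] = pd.DataFrame({"Event Date": dates, "Site Group Name": "OWNER ONE", "Impressions": 0})

print(still_empty, frames[0].empty)
```

With this variant, a separate variable such as df_ap_1 would still refer to the old empty frame, so the filled frame has to be read back from the list (or the frames kept in a dict keyed by name).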