Python: Multiple keys in a text file

My Data is as below:
Name: Joe
Age: 26
Property: 1 of 3
Item : Car
Make: Toyota
Model: Corolla
Year:2006
Property: 2 of 3
Item : House
Address : new Street
Cost : 20000
Property: 3 of 3
Item: Stocks
Investment: 1000
Name: Blogg
Age: 28
Property: 1 of 2
Item : Bike
BikeMake: Harley
BikeModel: IronRod
BikeYear:2018
Property: 2 of 2
Item: Stocks
Investment: 2000
I need the result to look like below (one row per Property, with Name and Age repeated, and blank cells where a record has no value for that column):
Name   Age  Property  Item    Make    Model    Year  Address     Cost   Investment  BikeMake  BikeModel  BikeYear
Joe    26   1 of 3    Car     Toyota  Corolla  2006
Joe    26   2 of 3    House                          new Street  20000
Joe    26   3 of 3    Stocks                                            1000
Blogg  28   1 of 2    Bike                                                          Harley    IronRod    2018
Blogg  28   2 of 2    Stocks                                            2000
My code is currently:
stuff = {}
index = 0
for line in t:
    print(line)
    key, _, value = line.partition(": ")
    if not value:  # separator was not found
        value = "NA"
    if "Name" in key:
        stuff[index] = {"Reference": [value]}  # Always use a list as value
        current_key = index
        index += 1
    elif key not in stuff[current_key]:  # If key does not exist
        stuff[current_key][key] = [value]  # Create key with value in a list.
    else:
        stuff[current_key][key].append(value)
My current results are being pivoted by the Name key, e.g.:
Name   Age  Property              Item              Make    Model    Year  Address     Cost   Investment  BikeMake  BikeModel  BikeYear
Joe    26   1 of 3,2 of 3,3 of 3  Car,House,Stocks  Toyota  Corolla  2006  New Street  20000  1000
Blogg  28   1 of 2,2 of 2         Bike,Stocks                                                 2000        Harley    IronRod    2018


Reshape a dataframe with internal headers as new columns

I have a data frame as below :
col value
0 companyId 123456
1 company_name small company
2 department IT
3 employee_name Jack
4 rank Grade 8
5 department finance
6 employee_name Tim
7 rank Grade 6
I would like the data frame to be reshaped to a tabular format, ideally like this:
companyId company_name department employee_name rank
0 123456 small company IT Jack Grade 8
1 123456 small company finance Tim Grade 6
Can anyone help me, please? Thanks.
Making two assumptions, you could reshape your data:
1. the companies are determined using headers, and all subsequent rows are data from employees of that company
2. there is a given starting item that defines an employee record (here, department)
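The code below assumes the frame above is available as df; for reference, a sketch that rebuilds it:
import pandas as pd

df = pd.DataFrame({
    'col': ['companyId', 'company_name', 'department', 'employee_name', 'rank',
            'department', 'employee_name', 'rank'],
    'value': ['123456', 'small company', 'IT', 'Jack', 'Grade 8',
              'finance', 'Tim', 'Grade 6'],
})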
headers = ['companyId', 'company_name']
first_item = 'department'

masks = {h: df['col'].eq(h) for h in headers}

df2 = (df
       # move headers as new columns
       .assign(**{h: df['value'].where(m).ffill().bfill() for h, m in masks.items()})
       # and drop their rows
       .loc[~pd.concat(masks, axis=1).any(axis=1)]
       # compute a unique identifier per employee
       .assign(idx=lambda d: d['col'].eq(first_item).cumsum())
       # pivot the data
       .pivot(index=['idx'] + headers, columns='col', values='value')
       .reset_index(headers)
)
output:
companyId company_name department employee_name rank
1 123456 small company IT Jack Grade 8
2 123456 small company finance Tim Grade 6
Example on a more complex input:
col value
0 companyId 123456
1 company_name small company
2 department IT
3 employee_name Jack
4 rank Grade 8
5 department finance
6 employee_name Tim
7 rank Grade 6
8 companyId 67890
9 company_name other company
10 department IT
11 employee_name Jane
12 rank Grade 9
13 department management
14 employee_name Tina
15 rank Grade 12
output:
companyId company_name department employee_name rank
1 123456 small company IT Jack Grade 8
2 123456 small company finance Tim Grade 6
3 67890 other company IT Jane Grade 9
4 67890 other company management Tina Grade 12
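How the header extraction works: where(m) keeps the value only on the header's own row, ffill propagates it down to the employee rows below, and bfill covers any rows above the very first header. A tiny demonstration with hypothetical values:
col = pd.Series(['companyId', 'department', 'employee_name',
                 'companyId', 'department', 'employee_name'])
val = pd.Series(['123456', 'IT', 'Jack', '67890', 'HR', 'Jane'])
mask = col.eq('companyId')
print(val.where(mask).ffill().bfill())
# -> 123456, 123456, 123456, 67890, 67890, 67890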

Python column hierarchy creation for groupby

I have a question I can't get my head around, although I'm sure it's dead simple.
I import an Excel file with sales figures of cars.
I need to be able to report on it by country.
The country is not part of the file, but I have the info of which cars belong to each country. (I can create another DataFrame from it, or a list, or dict...)
My idea was to create a hierarchy in the columns. I just can't figure out how.
import pandas as pd

german = ['BMW', 'Audi', 'Mercedes', 'Volkswagen']
italian = ['Fiat', 'Ferrari']
toclean = pd.DataFrame([['car', '4', '5', 10, 20, 15, 50, 20, 13, 24]],
                       columns=['type', 'wheels', 'seats', 'BMW', 'Audi', 'Mercedes',
                                'Volkswagen', 'Fiat', 'Ferrari', 'SEAT'])
  type  wheels  seats  BMW  Audi  Mercedes  Volkswagen  Fiat  Ferrari  SEAT
0  car       4      5   10    20        15          50    20       13    24
Something like this?
def country(brand):
    if brand in german:
        return 'germany'
    elif brand in italian:
        return 'italy'
    else:
        return None

long_df = toclean.melt()
long_df['country'] = long_df['variable'].map(country)
long_df
variable value country
0 type car None
1 wheels 4 None
2 seats 5 None
3 BMW 10 germany
4 Audi 20 germany
5 Mercedes 15 germany
6 Volkswagen 50 germany
7 Fiat 20 italy
8 Ferrari 13 italy
9 SEAT 24 None
long_df.groupby('country')['value'].sum()
country
germany 95
italy 33
Name: value, dtype: int64
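If you do want the column hierarchy the question had in mind, a sketch assuming the german/italian lists and the toclean frame above: give the brand columns a (country, brand) MultiIndex, then aggregate over the country level.
# label each brand column with its country in a MultiIndex
brands = toclean.columns.drop(['type', 'wheels', 'seats'])
levels = [country(b) or 'other' for b in brands]  # SEAT falls into 'other'
sales = toclean[brands].copy()
sales.columns = pd.MultiIndex.from_arrays([levels, brands],
                                          names=['country', 'brand'])
# transpose so country becomes an index level, then sum per country
print(sales.T.groupby(level='country').sum())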

Get latest value looked up from other dataframe

My first data frame
import pandas as pd

product = pd.DataFrame({
    'Product_ID': [101, 102, 103, 104, 105, 106, 107, 101],
    'Product_name': ['Watch', 'Bag', 'Shoes', 'Smartphone', 'Books', 'Oil', 'Laptop', 'New Watch'],
    'Category': ['Fashion', 'Fashion', 'Fashion', 'Electronics', 'Study', 'Grocery', 'Electronics', 'Electronics'],
    'Price': [299.0, 1350.50, 2999.0, 14999.0, 145.0, 110.0, 79999.0, 9898.0],
    'Seller_City': ['Delhi', 'Mumbai', 'Chennai', 'Kolkata', 'Delhi', 'Chennai', 'Bengalore', 'New York']
})
My 2nd data frame has transactions
customer = pd.DataFrame({
    'id': [1, 2, 3, 4, 5, 6, 7, 8, 9],
    'name': ['Olivia', 'Aditya', 'Cory', 'Isabell', 'Dominic', 'Tyler', 'Samuel', 'Daniel', 'Jeremy'],
    'age': [20, 25, 15, 10, 30, 65, 35, 18, 23],
    'Product_ID': [101, 0, 106, 0, 103, 104, 0, 0, 107],
    'Purchased_Product': ['Watch', 'NA', 'Oil', 'NA', 'Shoes', 'Smartphone', 'NA', 'NA', 'Laptop'],
    'City': ['Mumbai', 'Delhi', 'Bangalore', 'Chennai', 'Chennai', 'Delhi', 'Kolkata', 'Delhi', 'Mumbai']
})
I want the Price from the 1st data frame to come into the merged dataframe, the common element being 'Product_ID'. Note that against Product_ID 101 there are 2 prices: 299.00 and 9898.00. I want the latter one, 9898.0, to come into the merged data set (since this is the latest price).
Currently my code is not giving the right answer; it is giving both:
customerpur = pd.merge(customer,product[['Price','Product_ID']], on="Product_ID", how = "left")
customerpur
id name age Product_ID Purchased_Product City Price
0 1 Olivia 20 101 Watch Mumbai 299.0
1 1 Olivia 20 101 Watch Mumbai 9898.0
There is no explicit timestamp so I assume the index is the order of the dataframe. You can drop duplicates at the end:
customerpur.drop_duplicates(subset = ['id'], keep = 'last')
result:
id name age Product_ID Purchased_Product City Price
1 1 Olivia 20 101 Watch Mumbai 9898.0
2 2 Aditya 25 0 NA Delhi NaN
3 3 Cory 15 106 Oil Bangalore 110.0
4 4 Isabell 10 0 NA Chennai NaN
5 5 Dominic 30 103 Shoes Chennai 2999.0
6 6 Tyler 65 104 Smartphone Delhi 14999.0
7 7 Samuel 35 0 NA Kolkata NaN
8 8 Daniel 18 0 NA Delhi NaN
9 9 Jeremy 23 107 Laptop Mumbai 79999.0
Please note the keep='last' argument, since we are keeping only the last price registered.
Deduplication should be done before merging if you care about performance or the dataset is huge:
product = product.drop_duplicates(subset=['Product_ID'], keep='last')
In your data frame there is no indicator of the latest entry, so you might need to first remove the first entry for Product_ID 101 from the product dataframe as follows:
result_product = product.drop_duplicates(subset=['Product_ID'], keep='last')
It will keep the last entry based on Product_ID and you can do the merge as:
pd.merge(result_product, customer, on='Product_ID')
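Note that pd.merge defaults to an inner join, so customers whose Product_ID has no match in product (the 0 entries) drop out of this result; to keep them, as in the first merge above, put customer on the left and pass how='left'. A combined sketch of the two steps:
latest = product.drop_duplicates(subset=['Product_ID'], keep='last')
customerpur = pd.merge(customer, latest[['Product_ID', 'Price']],
                       on='Product_ID', how='left')
print(customerpur)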

Select rows with EXCEPT in sqlite3

I have a database with a table that contains the columns Name, Award, Winner (1 means won and 0 means did not win) and some other things that are irrelevant for this question.
I want to make a dataframe with the names of people that were nominated for the actress award (all awards with "Actress" in the name count) but never won, using sqlite3 in Python.
These are the first five rows of the dataframe:
Unnamed: 0 CeremonyNumber CeremonyYear CeremonyMonth CeremonyDay FilmYear Award Winner Name FilmDetails
0 0 1 1929 5 16 1927 Actor 1 Emil Jannings The Last Command
1 1 1 1929 5 16 1927 Actor 0 Richard Barthelmess The Noose
2 2 1 1929 5 16 1927 Actress 1 Janet Gaynor 7th Heaven
3 3 1 1929 5 16 1927 Actress 0 Louise Dresser A Ship Comes In
4 4 1 1929 5 16 1927 Actress 0 Gloria Swanson Sadie Thompson
I tried it with this query, but it did not give the correct result.
query = '''
select Name
from oscars
where Award like "Actress%"
except select Name
from oscars
where Award like "Actress%" and Winner == 1
'''
The outcome of this query should be a dataframe like this:
Name
0 Abigail Breslin
1 Adriana Barraza
2 Agnes Moorehead
3 Alfre Woodard
4 Ali MacGraw
In order to select all the actresses who were selected for the award and never won, you should use AND rather than EXCEPT. Something like this should work:
SELECT Name from Oscars WHERE Award LIKE "Actress%" AND Winner = 0
Refer to the sqlite docs at https://www.sqlite.org/index.html for more information.
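One caveat: Winner = 0 alone returns every name with at least one losing nomination, including actresses who also won in another year, which is what the question's EXCEPT form guards against. A GROUP BY/HAVING variant covers that case too; a sketch, assuming the oscars table shown above (the database file name is hypothetical):
import sqlite3
import pandas as pd

con = sqlite3.connect("oscars.db")  # hypothetical database file
query = '''
SELECT Name
FROM oscars
WHERE Award LIKE "Actress%"
GROUP BY Name
HAVING MAX(Winner) = 0
'''
never_won = pd.read_sql_query(query, con)
print(never_won)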

Group by and Count distinct words in Pandas DataFrame

By year and name, I am hoping to count the occurrence of words in a dataframe imported from Excel; the results will also be exported to Excel.
This is the sample code:
import pandas as pd

source = pd.DataFrame({'Name': ['John', 'Mike', 'John', 'John'],
                       'Year': ['1999', '2000', '2000', '2000'],
                       'Message': ['I Love You', 'Will Remember You',
                                   'Love', 'I Love You']})
Expected results are the following in a dataframe. Any ideas?
Year Name Message Count
1999 John I 1
1999 John love 1
1999 John you 1
2000 Mike Will 1
2000 Mike Remember 1
2000 Mike You 1
2000 John Love 2
2000 John I 1
2000 John You 1
I think you can first split the column Message, create a Series by stack and add it to the original source. Last, groupby with size:
# split column Message to a new df, create a Series by stack
s = source.Message.str.split(expand=True).stack()
# drop the inner index level added by stack
s.index = s.index.droplevel(-1)
s.name = 'Message'
print(s)
0 I
0 Love
0 You
1 Will
1 Remember
1 You
2 Love
3 I
3 Love
3 You
Name: Message, dtype: object
# remove old column Message
source = source.drop(['Message'], axis=1)
# join Series s to df source
df = source.join(s)
# aggregate size
print(df.groupby(['Year', 'Name', 'Message']).size().reset_index(name='count'))
Year Name Message count
0 1999 John I 1
1 1999 John Love 1
2 1999 John You 1
3 2000 John I 1
4 2000 John Love 2
5 2000 John You 1
6 2000 Mike Remember 1
7 2000 Mike Will 1
8 2000 Mike You 1
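For pandas 0.25+, explode gives the same result more compactly; a sketch assuming the original source frame from the question (before the Message column was dropped):
out = (source.assign(Message=source['Message'].str.split())
             .explode('Message')
             .groupby(['Year', 'Name', 'Message'])
             .size()
             .reset_index(name='count'))
print(out)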
