Create unique ID from two existing columns - python

My question is: how can I efficiently assign unique IDs from existing ID columns? For example, I have two columns, [household_id] and [person_no]. I am trying to make a new column; the query would be: household_id + '_' + person_no.
Here is a sample:
hh_id pno
682138 1
365348 1
365348 2
and I am trying to get:
unique_id
682138_1
365348_1
365348_2
and add this unique_id as a new column.
I am using Python. My data is very large, so any efficient approach would be great. Thanks!

You can use pandas.
Assuming your data is in a csv file, read in the data:
import pandas as pd
df = pd.read_csv('data.csv', delim_whitespace=True)
Create the new id column:
df['unique_id'] = df.hh_id.astype(str) + '_' + df.pno.astype(str)
Now df looks like this:
hh_id pno unique_id
0 682138 1 682138_1
1 365348 1 365348_1
2 365348 2 365348_2
Write back to a csv file:
df.to_csv('out.csv', index=False)
The file content looks like this:
hh_id,pno,unique_id
682138,1,682138_1
365348,1,365348_1
365348,2,365348_2
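If the string concatenation ever becomes a bottleneck on a very large frame, Series.str.cat is an equivalent vectorized spelling of the same operation (a minimal sketch, assuming the same data.csv as above):
import pandas as pd

df = pd.read_csv('data.csv', delim_whitespace=True)
# concatenate the two columns element-wise, with '_' as the separator
df['unique_id'] = df['hh_id'].astype(str).str.cat(df['pno'].astype(str), sep='_')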

Related

Pandas create two new columns based on 2 existing columns

I have a dataframe like the below:
dummy_dict_existing = {'Email': ['joblogs@gmail.com', 'joblogs@gmail.com'],
                       'Ticket_Category': ['Tier1', 'Tier2'],
                       'Quantity_Purchased': [5, 2],
                       'Total_Price_Paid': [1345.45, 10295.88]}

               Email Ticket_Category  Quantity_Purchased  Total_Price_Paid
0  joblogs@gmail.com           Tier1                   5           1345.45
1  joblogs@gmail.com           Tier2                   2          10295.88
What I'm trying to do is to create 2 new columns "Tier1_Quantity_Purchased" and "Tier2_Quantity_Purchased" based on the existing dataframe, and sum the total of "Total_Price_Paid" as below:
dummy_dict_desired = {'Email': ['joblogs@gmail.com'],
                      'Tier1_Quantity_Purchased': [5],
                      'Tier2_Quantity_Purchased': [2],
                      'Total_Price_Paid': [11641.33]}

               Email  Tier1_Quantity_Purchased  Tier2_Quantity_Purchased  Total_Price_Paid
0  joblogs@gmail.com                         5                         2          11641.33
Any help would be greatly appreciated. I know there is an easy way to do this, just can't figure out how without writing some silly for loop!
What you want to do is to pivot your table, and then add a column with aggregated data from the original table.
df = pd.DataFrame(dummy_dict_existing)
pivot_df = df.pivot(index='Email', columns='Ticket_Category', values='Quantity_Purchased')
pivot_df['total'] = df.groupby('Email')['Total_Price_Paid'].sum()
                   Tier1  Tier2     total
Email
joblogs@gmail.com      5      2  11641.33
For more details on pivoting, take a look at How can I pivot a dataframe?
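If you want the exact column names from the desired output, a small sketch along the same lines (the post-pivot rename is my own addition, not part of the answer above):
import pandas as pd

dummy_dict_existing = {'Email': ['joblogs@gmail.com', 'joblogs@gmail.com'],
                       'Ticket_Category': ['Tier1', 'Tier2'],
                       'Quantity_Purchased': [5, 2],
                       'Total_Price_Paid': [1345.45, 10295.88]}
df = pd.DataFrame(dummy_dict_existing)

out = df.pivot(index='Email', columns='Ticket_Category', values='Quantity_Purchased')
# rename Tier1/Tier2 to the requested column names
out.columns = ['{}_Quantity_Purchased'.format(c) for c in out.columns]
out['Total_Price_Paid'] = df.groupby('Email')['Total_Price_Paid'].sum()
out = out.reset_index()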
import pandas as pd

dummy_dict_existing = {'Email': ['joblogs@gmail.com', 'joblogs@gmail.com'],
                       'Ticket_Category': ['Tier1', 'Tier2'],
                       'Quantity_Purchased': [5, 2],
                       'Total_Price_Paid': [1345.45, 10295.88]}
df = pd.DataFrame(dummy_dict_existing)

# keep only the category and quantity, then transpose so categories become columns
df2 = df[['Ticket_Category', 'Quantity_Purchased']]
df_transposed = df2.T
df_transposed.columns = ['Tier1_purchased', 'Tier2_purchased']
# drop the first row (the old 'Ticket_Category' labels)
df_transposed = df_transposed.iloc[1:]
df_transposed = df_transposed.reset_index()
df_transposed = df_transposed[['Tier1_purchased', 'Tier2_purchased']]

# total price per email
df = df.groupby('Email')[['Total_Price_Paid']].sum()
df = df.reset_index()
df.join(df_transposed)
Output:
               Email  Total_Price_Paid Tier1_purchased Tier2_purchased
0  joblogs@gmail.com          11641.33               5               2

Python - Convert columns with specific base_name into rows

I have the following format of a csv file:
id a_mean_val_1 a_mean_val_2 a_var_val_1 a_var_val_2 b_mean_val_1 b_mean_val_2 b_var_val_1 b_var_val_2
I would like to melt the columns 1 and 2 for all a and b features into rows as follows:
id a_mean a_var b_mean b_var
1 val1 val1 val1 val1
1 val2 val2 val2 val2
I am unsure how to achieve this with the melt function in pandas. I would basically need an expression that keeps the base name (e.g. a_mean) as the root column and melts everything that has a suffix for that variable into rows.
Is there another method I could use to specify these rules?
Thank you
Like this:
import pandas as pd

rows = []
for line in open('mycsv.csv'):
    fields = line.strip().split(',')
    # even-indexed fields form the _1 row, odd-indexed the _2 row
    # (this assumes the file has no id column)
    rows.append(fields[0::2])
    rows.append(fields[1::2])
df = pd.DataFrame(rows, columns=['a_mean', 'a_var', 'b_mean', 'b_var'])
That doesn't provide an ID number. Is the ID part of the CSV file?
I went through the columns and, if a column belonged to one of the base column names, appended its values to a list. Finally, I converted those lists to a dataframe.
So this code works regardless of the order of the columns.
[UPDATED WITH ID]
Since we're adding the entire columns one after the other, the ids will always start from the top, go to the end, and then repeat. So we can take "id" of the original df and multiply that by the number of rows to get the "id" for the new df.
Here's the CSV I used:
id,a_mean_val_1,a_mean_val_2,a_var_val_1,a_var_val_2,b_mean_val_1,b_mean_val_2,b_var_val_1,b_var_val_2
1,a_mean_val_1, a_mean_val_2, a_var_val_1, a_var_val_2, b_mean_val_1 ,b_mean_val_2, b_var_val_1, b_var_val_2
2,a_mean_val_5, a_mean_val_6, a_var_val_5, a_var_val_6, b_mean_val_5 ,b_mean_val_6, b_var_val_5, b_var_val_6
df = pd.read_csv('data_csv.csv')

# ignore the ID column
columns = df.columns.tolist()[1:]
df_dict = {}
base = ['a_mean', 'a_var', 'b_mean', 'b_var']
for bas in base:
    df_dict[bas] = []
    for col in columns:
        # for example, "a_mean" is in "a_mean_val_1", so append that column
        if bas in col:
            df_dict[bas] = df_dict[bas] + df[col].tolist()

ids = df['id'].tolist()
df_new = pd.DataFrame(df_dict)
df_new['id'] = ids * df.shape[0]
a_mean a_var b_mean b_var id
a_mean_val_1 a_var_val_1 b_mean_val_1 b_var_val_1 1
a_mean_val_5 a_var_val_5 b_mean_val_5 b_var_val_5 2
a_mean_val_2 a_var_val_2 b_mean_val_2 b_var_val_2 1
a_mean_val_6 a_var_val_6 b_mean_val_6 b_var_val_6 2
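As an aside, pandas also ships pd.wide_to_long, which handles exactly this stub-plus-suffix pattern; a minimal sketch against the same CSV (column names as in the question):
import pandas as pd

df = pd.read_csv('data_csv.csv')
stubs = ['a_mean_val', 'a_var_val', 'b_mean_val', 'b_var_val']
# 'i' is the identifier column, 'j' collects the numeric suffix (1, 2, ...)
long_df = pd.wide_to_long(df, stubnames=stubs, i='id', j='num', sep='_').reset_index()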

Python: store a value in a variable so that you can recognize each recurrence

If this question is unclear, I am very open to constructive criticism.
I have an excel table with about 50 rows of data, with the first column in each row being a date. I need to access all the data for only one date, and that date appears only about 1-5 times. It is the most recent date so I've already organized the table by date with the most recent being at the top.
So my goal is to store that date in a variable and then have Python look only for that variable (that date) and take only the columns corresponding to that variable. I need to use this code on 100's of other excel files as well, so it would need to arbitrarily take the most recent date (always at the top though).
My current code below simply takes the first 5 rows because I know that's how many times this date occurs.
import os
from numpy import genfromtxt
import pandas as pd

path = 'Z:\\folderwithcsvfile'
for filename in os.listdir(path):
    file_path = os.path.join(path, filename)
    if os.path.isfile(file_path):
        broken_df = pd.read_csv(file_path)
        df3 = broken_df['DATE']
        df4 = broken_df['TRADE ID']
        df5 = broken_df['AVAILABLE STOCK']
        df6 = broken_df['AMOUNT']
        df7 = broken_df['SALE PRICE']
        print(df3)
        # print(df3.head(6))
        print(df4.head(6))
        print(df5.head(6))
        print(df6.head(6))
        print(df7.head(6))
This is a relatively simple filtering operation. You state that you want to "take only the columns" that have the latest date, so I assume that an acceptable result will be a filtered DataFrame with just the matching rows.
Here's a simple CSV that is similar to your structure:
DATE,TRADE ID,AVAILABLE STOCK
10/11/2016,123,123
10/11/2016,123,123
10/10/2016,123,123
10/9/2016,123,123
10/11/2016,123,123
Note that I mixed up the dates a little bit, because it's hacky and error-prone to just assume that the latest dates will be on the top. The following script will filter it appropriately:
import pandas as pd
import numpy as np
df = pd.read_csv('data.csv')
# convert the DATE column to datetimes
df['DATE'] = pd.to_datetime(df['DATE'])
# find the latest datetime
latest_date = df['DATE'].max()
# use boolean filtering to keep only the rows whose DATE equals the latest date
latest_rows = df[df['DATE'] == latest_date]
print(latest_rows)
# now you can perform your operations on latest_rows
In my example, this will print:
DATE TRADE ID AVAILABLE STOCK
0 2016-10-11 123 123
1 2016-10-11 123 123
4 2016-10-11 123 123
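Since the question mentions running this over hundreds of files, the filter drops straight into the question's own directory loop (a sketch, assuming every file has a DATE column):
import os
import pandas as pd

path = 'Z:\\folderwithcsvfile'
for filename in os.listdir(path):
    file_path = os.path.join(path, filename)
    if not os.path.isfile(file_path):
        continue
    df = pd.read_csv(file_path)
    df['DATE'] = pd.to_datetime(df['DATE'])
    # keep only the rows carrying this file's most recent date
    latest_rows = df[df['DATE'] == df['DATE'].max()]
    # ...process latest_rows for this file...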

JSON file: getting a dictionary for every row and need to create a DataFrame from it

I have a .json file and when I convert it into a Data frame by -
df = pd.read_json('tummy.json')
The output looks like -
results
0 {u'objectId': u'06Dig7sXhU', u'SpecialProperti...'
1 {u'objectId': u'07VO1j4gVC', u'SpecialProperti...'
Every row seems to be a dictionary itself. I want to extract every row and create a Data Frame out of it. I would really appreciate some help on how to proceed.
IIUC you can use:
import pandas as pd
s = pd.Series(({u'objectId': u'06Dig7sXhU', u'SpecialProperties': u'456456'},
               {u'objectId': u'07VO1j4gVC', u'SpecialProperties': u'878421'}))
df = pd.DataFrame({'results':s})
print df
results
0 {u'objectId': u'06Dig7sXhU', u'SpecialProperti...
1 {u'objectId': u'07VO1j4gVC', u'SpecialProperti...
print pd.DataFrame([x for x in df['results']], index=df.index)
SpecialProperties objectId
0 456456 06Dig7sXhU
1 878421 07VO1j4gVC
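On recent pandas versions, pd.json_normalize does this expansion in one step (a sketch, assuming the same tummy.json structure as in the question):
import pandas as pd

df = pd.read_json('tummy.json')
# expand each row's dict into its own columns
expanded = pd.json_normalize(df['results'].tolist())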

Slicing Pandas DataFrame based on csv

Let's say I have a Pandas DataFrame like following.
df = pd.DataFrame({'Name': ['A','B','C'],
                   'Country': ['US','UK','SL']})
Country Name
0 US A
1 UK B
2 SL C
And I have a csv like the following.
Name,Extended
A,Jorge
B,Alex
E,Mark
F,Bindu
I need to check whether each df['Name'] is in the csv and, if so, get the "Extended" value; if not, I just need the "Name". So my expected output is like the following.
Country Name Extended
0 US A Jorge
1 UK B Alex
2 SL C C
The following shows what I tried so far.
f = open('mycsv.csv', 'r')
lines = f.readlines()

def parse(x):
    for line in lines:
        if x in line.split(',')[0]:
            return line.strip().split(',')[1]

df['Extended'] = df['Name'].apply(parse)
Name Country Extended
0 A US Jorge
1 B UK Alex
2 C SL None
I cannot figure out how to get the "Name" for C in "Extended" (the else part of the code). Any help?
You can merge the two frames and then use the "fillna" function from pandas to fall back to the "Name" column, like this:
import pandas as pd

df1 = pd.DataFrame({'Name': ['A','B','C'],
                    'Country': ['US','UK','SL']})
df2 = pd.read_csv('mycsv.csv')
df_merge = pd.merge(df1, df2, how="left", on="Name")
# where the csv had no match, fall back to the Name value itself
df_merge["Extended"].fillna(df_merge["Name"], inplace=True)
You could just load the csv as a df and then assign using where:
df['Name'] = df2['Extended'].where(df2['Name'] != df2['Extended'], df2['Name'])
So here we use the boolean condition to test whether 'Name' differs from 'Extended': where it does, we take the 'Extended' value, otherwise we keep 'Name'.
Also is 'Extended' always either different or same as 'Name'? If so why not just assign the value of extended to the dataframe:
df['Name'] = df2['Extended']
This would be a lot simpler.
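A lookup-based variant avoids row-alignment concerns entirely: build a Name-to-Extended mapping from the csv and fall back to Name where there is no match (a sketch, assuming the mycsv.csv from the question):
import pandas as pd

df = pd.DataFrame({'Name': ['A','B','C'],
                   'Country': ['US','UK','SL']})
# Name -> Extended lookup built from the csv
lookup = pd.read_csv('mycsv.csv').set_index('Name')['Extended']
df['Extended'] = df['Name'].map(lookup).fillna(df['Name'])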
