Python: Create a dataframe from an existing pandas dataframe

Now, my dataset looks like this:
tconst Actor1 Actor2 Actor3 Actor4 Actor5 Actor6 Actor7 Actor8 Actor9 Actor10
0 tt0000001 NaN GreaterEuropean, WestEuropean, French GreaterEuropean, British NaN NaN NaN NaN NaN NaN NaN
1 tt0000002 NaN GreaterEuropean, WestEuropean, French NaN NaN NaN NaN NaN NaN NaN NaN
2 tt0000003 NaN GreaterEuropean, WestEuropean, French GreaterEuropean, WestEuropean, French GreaterEuropean, WestEuropean, French NaN NaN NaN NaN NaN NaN
3 tt0000004 NaN GreaterEuropean, WestEuropean, French NaN NaN NaN NaN NaN NaN NaN NaN
4 tt0000005 NaN GreaterEuropean, British GreaterEuropean, British NaN NaN NaN NaN NaN NaN NaN
I used the replace and map functions to get here.
I want to create a dataframe from the above dataframe such that I get the resulting dataframe below.
tconst GreaterEuropean WestEuropean French GreaterEuropean British Arab British ............
tt0000001 2 1 0 4 1 0 2 .....
tt0000002 0 2 4 0 1 3 0 .....
GreaterEuropean, British, WestEuropean, Italian, French, ... represent the number of actors of each ethnicity in a particular movie, identified by tconst.
That would be like a count matrix: for example, for a movie tt00001 there might be 5 Arabs, 2 British, 1 WestEuropean, and so on, i.e. how many actors in a movie belong to each of these ethnicities.
Link to data - https://drive.google.com/open?id=1oNfbTpmLA0imPieRxGfU_cBYVfWN3tZq

import numpy as np
import pandas as pd

df_melted = pd.melt(df, id_vars='tconst',
                    value_vars=df.columns[2:].tolist(),
                    var_name='actor',
                    value_name='ethnicities').dropna()
print(df_melted.ethnicities.str.get_dummies(sep=',').sum())
Output:
British 169
EastAsian 9
EastEuropean 17
French 73
Germanic 9
GreaterEastAsian 13
Hispanic 9
IndianSubContinent 2
Italian 7
Japanese 4
Jewish 25
Nordic 7
WestEuropean 105
Asian 15
GreaterEuropean 316
dtype: int64
This is close to what you wanted, but not exact. To get what you wanted, without typing out the lists of columns or values, is more complicated.
From: https://stackoverflow.com/a/48120674/6672746
def change_column_order(df, col_name, index):
    cols = df.columns.tolist()
    cols.remove(col_name)
    cols.insert(index, col_name)
    return df[cols]

def split_df(dataframe, col_name, sep):
    orig_col_index = dataframe.columns.tolist().index(col_name)
    orig_index_name = dataframe.index.name
    orig_columns = dataframe.columns
    dataframe = dataframe.reset_index()  # we need a natural 0-based index for proper merge
    index_col_name = (set(dataframe.columns) - set(orig_columns)).pop()
    df_split = pd.DataFrame(
        pd.DataFrame(dataframe[col_name].str.split(sep).tolist())
        .stack().reset_index(level=1, drop=1), columns=[col_name])
    df = dataframe.drop(col_name, axis=1)
    df = pd.merge(df, df_split, left_index=True, right_index=True, how='inner')
    df = df.set_index(index_col_name)
    df.index.name = orig_index_name
    # merge adds the column to the last place, so we need to move it back
    return change_column_order(df, col_name, orig_col_index)
Using those excellent functions:
df_final = split_df(df_melted, 'ethnicities', ',')
df_final.set_index(['tconst', 'actor'], inplace=True)
df_final.pivot_table(index=['tconst'],
                     columns='ethnicities',
                     aggfunc=pd.Series.count).fillna(0).astype('int')
Output:
ethnicities British EastAsian EastEuropean French Germanic GreaterEastAsian Hispanic IndianSubContinent Italian Japanese Jewish Nordic WestEuropean Asian GreaterEuropean
tconst
tt0000001 1 0 0 1 0 0 0 0 0 0 0 0 1 0 2
tt0000002 0 0 0 1 0 0 0 0 0 0 0 0 1 0 1
tt0000003 0 0 0 3 0 0 0 0 0 0 0 0 3 0 3
tt0000004 0 0 0 1 0 0 0 0 0 0 0 0 1 0 1
tt0000005 2 0 0 0 0 0 0 0 0 0 0 0 0 0 2
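For what it's worth, a shorter route to the same count matrix is to split and explode the ethnicities and feed them to pd.crosstab. A sketch, assuming df_melted from above and pandas >= 0.25 for explode:
exploded = (df_melted.assign(ethnicities=df_melted['ethnicities'].str.split(','))
                     .explode('ethnicities'))
exploded['ethnicities'] = exploded['ethnicities'].str.strip()  # drop the space after each comma
counts = pd.crosstab(exploded['tconst'], exploded['ethnicities'])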

Pandas has it all.
title_principals["all"] = title_principals["Actor1"].astype(str)+','+title_principals["Actor2"].astype(str)+','+title_principals["Actor3"].astype(str)+','+title_principals["Actor4"].astype(str)+','+title_principals["Actor5"].astype(str)+','+title_principals["Actor6"].astype(str)+','+title_principals["Actor7"].astype(str)+','+title_principals["Actor8"].astype(str)+','+title_principals["Actor9"].astype(str)+','+title_principals["Actor10"].astype(str)
and then, from the string, make the count and drop the other variables.
title_principals["GreaterEuropean"] = title_principals["all"].str.contains(r'GreaterEuropean').sum()

Related

Convert a single row into a different dataframe in pandas python

I am working on a dataframe of shape 146 rows x 48 columns. The columns are
['Region','Rank 2015','Score 2015','Economy 2015','Family 2015','Health 2015','Freedom 2015','Generosity 2015','Trust 2015','Rank 2016','Score 2016','Economy 2016','Family 2016','Health 2016','Freedom 2016','Generosity 2016','Trust 2016','Rank 2017','Score 2017','Economy 2017','Family 2017','Health 2017','Freedom 2017','Generosity 2017','Trust 2017','Rank 2018','Score 2018','Economy 2018','Family 2018','Health 2018','Freedom 2018','Generosity 2018','Trust 2018','Rank 2019','Score 2019','Economy 2019','Family 2019','Health 2019','Freedom 2019','Generosity 2019','Trust 2019','Score Mean','Economy Mean','Family Mean','Health Mean','Freedom Mean','Generosity Mean','Trust Mean']
I want to access a particular row and convert it to the following dataframe:
Year Rank Score Family Health Freedom Generosity Trust
0 2015 NaN NaN NaN NaN NaN NaN NaN
1 2016 NaN NaN NaN NaN NaN NaN NaN
2 2017 NaN NaN NaN NaN NaN NaN NaN
3 2018 NaN NaN NaN NaN NaN NaN NaN
4 2019 NaN NaN NaN NaN NaN NaN NaN
Any help is welcomed & Thank you in advance.
An alternate way:
cols=['Region','Rank 2015','Score 2015','Economy 2015','Family 2015','Health 2015','Freedom 2015','Generosity 2015', 'Trust 2015','Rank 2016','Score 2016','Economy 2016','Family 2016','Health 2016','Freedom 2016','Generosity 2016','Trust 2016', 'Rank 2017','Score 2017','Economy 2017','Family 2017','Health 2017','Freedom 2017','Generosity 2017','Trust 2017','Rank 2018','Score 2018','Economy 2018','Family 2018','Health 2018','Freedom 2018','Generosity 2018','Trust 2018','Rank 2019','Score 2019','Economy 2019','Family 2019','Health 2019','Freedom 2019','Generosity 2019','Trust 2019','Score Mean','Economy Mean','Family Mean','Health Mean','Freedom Mean','Generosity Mean','Trust Mean']
# source dataframe
df1 = pd.DataFrame(columns=cols)
df1.loc[0] = [1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1]
#target dataframe
df2 = pd.DataFrame(columns=['Year','Rank','Score','Family','Health','Freedom','Generosity','Trust','Economy'])
df2['Year']=['2015','2016','2017','2018','2019','Mean']
df2.set_index('Year', inplace=True)
idx = 0  # source row to copy
for col in df1.columns[1:]:
    c, r = col.split(" ")
    df2.at[r, c] = df1.at[idx, col]
print(df2)
Rank Score Family Health Freedom Generosity Trust Economy
Year
2015 1 1 1 1 1 1 1 1
2016 1 1 1 1 1 1 1 1
2017 1 1 1 1 1 1 1 1
2018 1 1 1 1 1 1 1 1
2019 1 1 1 1 1 1 1 1
Mean NaN 1 1 1 1 1 1 1
Here's a solution utilizing list comprehension:
The input:
cols = ['Region','Rank 2015','Score 2015','Economy 2015','Family 2015','Health 2015','Freedom 2015','Generosity 2015','Trust 2015','Rank 2016','Score 2016','Economy 2016','Family 2016','Health 2016','Freedom 2016','Generosity 2016','Trust 2016','Rank 2017','Score 2017','Economy 2017','Family 2017','Health 2017','Freedom 2017','Generosity 2017','Trust 2017','Rank 2018','Score 2018','Economy 2018','Family 2018','Health 2018','Freedom 2018','Generosity 2018','Trust 2018','Rank 2019','Score 2019','Economy 2019','Family 2019','Health 2019','Freedom 2019','Generosity 2019','Trust 2019','Score Mean','Economy Mean','Family Mean','Health Mean','Freedom Mean','Generosity Mean','Trust Mean']
df = pd.DataFrame(np.random.randint(1,10,(3,48)))
df.columns = cols
print(df.iloc[:, :4])
Region Rank 2015 Score 2015 Economy 2015
0 7 9 9 9
1 8 7 2 3
2 3 3 4 5
And the new dataframe would be:
target_cols = ['Rank', 'Score', 'Family', 'Health', 'Freedom', 'Generosity', 'Trust']
years = ['2015', '2016', '2017', '2018', '2019']
newdf = pd.DataFrame([df.loc[1, [x + ' ' + year for x in target_cols]].values for year in years])
newdf.columns = target_cols
newdf['year'] = years
print(newdf)
Rank Score Family Health Freedom Generosity Trust year
0 7 2 6 9 3 4 9 2015
1 2 8 1 1 7 6 1 2016
2 7 4 2 5 1 7 4 2017
3 9 7 1 4 7 5 2 2018
4 5 4 4 9 1 6 2 2019
Assuming that the only target years are those spanning 2015 to 2019, and that the target columns are known, I would proceed as follows:
(1) define the target columns and years
target_columns = ['Rank', 'Score', 'Family', 'Health', 'Freedom', 'Generosity', 'Trust']
target_years = ['2015', '2016', '2017', '2018', '2019']
(2) retrieve the particular row (I assume your starting dataframe is initial_dataframe)
particular_row = initial_dataframe.iloc[0]
(3) retrieve and reshape the information from the particular_row
reshaped_row = {'Year': target_years}
reshaped_row.update({
    column_name: [particular_row[column_name + ' ' + year_name] for year_name in target_years]
    for column_name in target_columns
})
(4) assign the reshaped row to the output_dataframe
output_dataframe = pd.DataFrame(reshaped_row)
Have you tried using a 2D array? I would find that to be the easiest. Otherwise, you could also use a dictionary. https://www.w3schools.com/python/python_dictionaries.asp
I didn't fully understand your question, but I can give you a hint on how to translate the data.
df = pd.DataFrame(li)                           # li is presumably the list of column names shown above
df = df[0].str.split(r"(\d{4})", expand=True)   # split each name into prefix, 4-digit year, remainder
df = df[df[2] == ""]                            # keep only the '<name> <year>' entries
col_name = df[0].unique()
df_new = df.pivot(index=1, columns=0, values=2)
df_new.drop(df_new.index[0], inplace=True)
df_new:
Economy Family Freedom Generosity Health Rank Score Trust
1
2016
2017
2018
2019
You can write your own logic.
It needs a lot of manipulation; a simple idea is to build the required dict first and then make the DataFrame from it.
In [61]: dicts = {}
In [62]: for t in text[1:]:
    ...:     n, y = t.split(" ")
    ...:     if n not in dicts:
    ...:         dicts[n] = []
    ...:     if y != "Mean":
    ...:         if n == 'Rank':
    ...:             dicts[n].append(y)
    ...:         else:
    ...:             dicts[n].append(pd.np.NaN)
    ...:
In [63]: df = pd.DataFrame(dicts)
In [64]: df['Year'] = df['Rank']
In [65]: df['Rank'] = df['Family']
In [66]: df
Out[66]:
Rank Score Economy Family Health Freedom Generosity Trust Year
0 NaN NaN NaN NaN NaN NaN NaN NaN 2015
1 NaN NaN NaN NaN NaN NaN NaN NaN 2016
2 NaN NaN NaN NaN NaN NaN NaN NaN 2017
3 NaN NaN NaN NaN NaN NaN NaN NaN 2018
4 NaN NaN NaN NaN NaN NaN NaN NaN 2019

striding through pandas dataframe

I have a Dataframe of the form
date_time uids
2018-10-16 23:00:00 1000,1321,7654,1321
2018-10-16 23:10:00 7654
2018-10-16 23:20:00 NaN
2018-10-16 23:30:00 7654,1000,7654,1321,1000
2018-10-16 23:40:00 691,3974,3974,323
2018-10-16 23:50:00 NaN
2018-10-17 00:00:00 NaN
2018-10-17 00:10:00 NaN
2018-10-17 00:20:00 27,33,3974,3974,7665,27
This is a very big dataframe containing 10-minute time intervals and the ids that appear during each interval.
I want to iterate over this DataFrame 6 rows at a time (corresponding to 1 hour) and create a DataFrame containing each ID and the number of times it appears during that hour.
The expected output is one dataframe per hour. For example, in the above case the dataframe for the hour 23 - 00 would have this form:
uid 1 2 3 4 5 6
1000 1 0 0 2 0 0
1321 2 0 0 1 0 0
and so on
How can I do this efficiently?
I don't have an exact solution but you could create a pivot table: ids on the index and datetimes on the columns. Then you just have to select the columns you want.
import pandas as pd
import numpy as np
df = pd.DataFrame(
    {
        "date_time": [
            "2018-10-16 23:00:00",
            "2018-10-16 23:10:00",
            "2018-10-16 23:20:00",
            "2018-10-16 23:30:00",
            "2018-10-16 23:40:00",
            "2018-10-16 23:50:00",
            "2018-10-17 00:00:00",
            "2018-10-17 00:10:00",
            "2018-10-17 00:20:00",
        ],
        "uids": [
            "1000,1321,7654,1321",
            "7654",
            np.nan,
            "7654,1000,7654,1321,1000",
            "691,3974,3974,323",
            np.nan,
            np.nan,
            np.nan,
            "27,33,3974,3974,7665,27",
        ],
    }
)
df["date_time"] = pd.to_datetime(df["date_time"])
df = (
    df.set_index("date_time")  # do not use set_index if date_time is already the index
    .loc[:, "uids"]
    .str.extractall(r"(?P<uids>\d+)")
    .droplevel(level=1)
)  # separate all the ids
df["number"] = df.index.minute.astype(float) / 10 + 1  # get the number 1 to 6 depending on the minutes
df_pivot = df.pivot_table(
    values="number",
    index="uids",
    columns=["date_time"],
)  # dataframe with all the uids on the index and all the datetimes in columns
You can apply this to the whole dataframe or just a subset containing 6 rows. Then you rename your columns.
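For instance, to keep only the six columns of one hour and rename them 1 to 6 (a sketch; the timestamps are the ones from the sample data above):
one_hour = df_pivot.loc[:, '2018-10-16 23:00:00':'2018-10-16 23:50:00'].copy()
one_hour.columns = range(1, len(one_hour.columns) + 1)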
You can use the function crosstab:
df['uids'] = df['uids'].str.split(',')
df = df.explode('uids')
df['date_time'] = df['date_time'].dt.minute.floordiv(10).add(1)
pd.crosstab(df['uids'], df['date_time'], dropna=False)
Output:
date_time 1 2 3 4 5 6
uids
1000 1 0 0 2 0 0
1321 2 0 0 1 0 0
27 0 0 2 0 0 0
323 0 0 0 0 1 0
33 0 0 1 0 0 0
3974 0 0 2 0 2 0
691 0 0 0 0 1 0
7654 1 1 0 2 0 0
7665 0 0 1 0 0 0
We can achieve this by extracting the minutes from your datetime column, then using pivot_table to get the wide format:
df['date_time'] = pd.to_datetime(df['date_time'])
df['minute'] = df['date_time'].dt.minute // 10
piv = (df.assign(uids=df['uids'].str.split(','))
         .explode('uids')
         .pivot_table(index='uids', columns='minute', values='minute', aggfunc='size')
      )
minute 0 1 2 3 4
uids
1000 1.0 NaN NaN 2.0 NaN
1321 2.0 NaN NaN 1.0 NaN
27 NaN NaN 2.0 NaN NaN
323 NaN NaN NaN NaN 1.0
33 NaN NaN 1.0 NaN NaN
3974 NaN NaN 2.0 NaN 2.0
691 NaN NaN NaN NaN 1.0
7654 1.0 1.0 NaN 2.0 NaN
7665 NaN NaN 1.0 NaN NaN
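Since the question asks for one table per hour, either of the last two approaches can be wrapped in a groupby over the hour. A sketch starting again from the original frame (string uids, datetime strings) and assuming pandas >= 0.25 for explode:
df['date_time'] = pd.to_datetime(df['date_time'])
exploded = (df.assign(uids=df['uids'].str.split(','))
              .explode('uids')
              .dropna(subset=['uids']))
exploded['slot'] = exploded['date_time'].dt.minute.floordiv(10).add(1)
hourly_tables = {hour: pd.crosstab(grp['uids'], grp['slot'])
                 for hour, grp in exploded.groupby(exploded['date_time'].dt.floor('H'))}
Each value in hourly_tables is then the count matrix for one hour.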

pd.wide_to_long() lost data

I'm very new to Python. I've tried to reshape a data set using pd.wide_to_long. The original dataframe looks like this:
chk1 chk2 chk3 ... chf1 chf2 chf3 id var1 var2
0 3 4 2 ... nan nan nan 1 1 0
1 4 4 4 ... nan nan nan 2 1 0
2 2 nan nan ... 3 4 3 3 0 1
3 3 3 3 ... 3 2 2 4 1 0
I used the following code:
df2 = pd.wide_to_long(df,
                      stubnames=['chk', 'chf'],
                      i=['id', 'var1', 'var2'],
                      j='type')
After running this code, the data looks like this:
chk chf
id var1 var2 egenskap
1 1 0 1 3 nan
2 4 nan
3 2 nan
4 nan nan
5 4 nan
6 nan nan
7 4 nan
8 4 nan
2 1 0 1 4 nan
2 4 nan
3 4 nan
4 5 nan
But when I check the columns in the new data set, it seems that all columns except 'chk' and 'chf' are gone!
df2.columns
Out[47]: Index(['chk', 'chf'], dtype='object')
df2.columns
for col in df2.columns:
    print(col)
chk
chf
From the dataview it looks like 'id', 'var1', 'var2' have been merged into one common index:
Screenprint dataview here
Can someone please help me? :)
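For what it's worth, the columns are not lost: pd.wide_to_long moves the i columns (here id, var1, var2) together with the j column into a MultiIndex, which is exactly what the data view shows. Calling reset_index() on the result turns them back into ordinary columns. A sketch, assuming df2 from the code above:
df2 = df2.reset_index()  # id, var1, var2 and the j column become regular columns again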

Python + Pandas : Copying specific columns of the first row of multiple CSVs and store the rows to a single csv

I have about 190 CSVs, each of which has the same column names. A sample CSV is shared below:
From every CSV, I need to pick only the Item, Predicted_BelRd(D2), Predicted_Ulsoor(D2), Predicted_ChrchStrt(D2), Predicted_BlrClub(D2), Predicted_Indrangr(D1), Predicted_Krmngl(D1), Predicted_KrmnglBkry(D1), Predicted_HSR(D1) columns of only the first row, and store all these rows in a separate CSV. So the final CSV should have 190 rows.
How can I do that?
Edit:
Code so far, as suggested by DavidDR:
path = '/home/hp/products1'
all_files = glob.glob(path + "/*.csv")
#print(all_files)
columns = ['Item', 'Predicted_BelRd(D2)', 'Predicted_Ulsoor(D2)', 'Predicted_ChrchStrt(D2)', 'Predicted_BlrClub(D2)', 'Predicted_Indrangr(D1)', 'Predicted_Krmngl(D1)', 'Predicted_KrmnglBkry(D1)', 'Predicted_HSR(D1)']
rows_list = []
for filename in all_files:
    origin_data = pd.read_csv(filename)
    my_data = origin_data[columns]
    rows_list.append(my_data.head(1))
output = pd.DataFrame(rows_list)
#output.to_csv(file_name, sep='\t', encoding='utf-8')
output.to_csv('smallys_final.csv', encoding='utf-8', index=False)
Edit2 :
The original dataframe:
prod = pd.read_csv('/home/hp/products1/' + 'prod[' + str(0) + '].csv', engine='python')
print(prod)
Output:
Category Item UOM BelRd(D2) Ulsoor(D2) \
0 Food/Bakery BAKING POWDER SPARSH (1KGS) PKT 0 0
1 Food/Bakery BAKING POWDER SPARSH (1KGS) PKT 0 0
2 Food/Bakery BAKING POWDER SPARSH (1KGS) PKT 0 0
3 Food/Bakery BAKING POWDER SPARSH (1KGS) PKT 0 0
4 Food/Bakery BAKING POWDER SPARSH (1KGS) PKT 0 0
ChrchStrt(D2) BlrClub(D2) Indrangr(D1) Krmngl(D1) KrmnglBkry(D1) \
0 0 0 0 0 1
1 0 0 0 0 0
2 0 0 0 0 0
3 0 0 0 0 0
4 0 0 0 0 1
HSR(D1) date Predicted_BelRd(D2) Predicted_Ulsoor(D2) \
0 0 10 FEB 19 0.0 0.0
1 0 17 FEB 19 NaN NaN
2 0 24 FEB 19 NaN NaN
3 0 4 MARCH 19 NaN NaN
4 0 11 MARCH 19 NaN NaN
Predicted_ChrchStrt(D2) Predicted_BlrClub(D2) Predicted_Indrangr(D1) \
0 0.0 0.0 0.0
1 NaN NaN NaN
2 NaN NaN NaN
3 NaN NaN NaN
4 NaN NaN NaN
Predicted_Krmngl(D1) Predicted_KrmnglBkry(D1) Predicted_HSR(D1)
0 0.0 0.0 0.0
1 NaN NaN NaN
2 NaN NaN NaN
3 NaN NaN NaN
4 NaN NaN NaN
Here you go:
def function():
    firstrows = []  # to collect 190 dataframes, each only 1 row
    for filename in csvnames:
        # read CSV, filter for a subset of columns, take only first row
        df = pd.read_csv(filename) \
            .filter(["Item", "Predicted_BelRd(D2)", ...]) \
            .iloc[:1]
        firstrows.append(df)
    return pd.concat(firstrows)
Didn't check but this should work.
You basically read all your CSV files from the same location, then you choose only the relevant columns. Then you pop out the first row and append it to a list of all the first rows. In the end you create a new DataFrame from that list of first rows and save it to one CSV file.
import glob
import pandas as pd
path = # use your path
all_files = glob.glob(path + "/*.csv")
columns = ['Item', 'Predicted_BelRd(D2)', 'Predicted_Ulsoor(D2)', 'Predicted_ChrchStrt(D2)', 'Predicted_BlrClub(D2)', 'Predicted_Indrangr(D1)', 'Predicted_Krmngl(D1)', 'Predicted_KrmnglBkry(D1)', 'Predicted_HSR(D1)']
rows_list = []
for filename in all_files:
    origin_data = pd.read_csv(filename)
    my_data = origin_data[columns]
    rows_list.append(my_data.head(1))
output = pd.concat(rows_list, ignore_index=True)  # concatenate the single-row frames; pd.DataFrame(rows_list) would not stack them correctly
output.to_csv(file_name, sep='\t', encoding='utf-8')

Reindexing after a pivot in pandas

Consider the following dataset:
After running the code:
convert_dummy1 = convert_dummy.pivot(index='Product_Code', columns='Month', values='Sales').reset_index()
The data is in the right form, but my index column is named 'Month', and I cannot seem to remove this at all. I have tried code such as the following, but it does not do anything.
del convert_dummy1.index.name
I can save the dataset to a csv, delete the ID column, and then read the csv - but there must be a more efficient way.
Dataset after reset_index():
convert_dummy1
Month Product_Code 0 1 2 3 4
0 10133.9 0 0 0 0 0
1 10146.9 120 80 60 0 100
convert_dummy1.index = pd.RangeIndex(len(convert_dummy1.index))
del convert_dummy1.columns.name
convert_dummy1
Product_Code 0 1 2 3 4
0 10133.9 0 0 0 0 0
1 10146.9 120 80 60 0 100
Since you pivot with columns="Month", each column in the output corresponds to a month. If you reset the index after the pivot, you can check the column names with convert_dummy1.columns.values, which in your case should return:
array(['Product_Code', 1, 2, 3, 4, 5], dtype=object)
while convert_dummy1.columns.names should return:
FrozenList(['Month'])
So to rename Month, use the rename_axis function:
convert_dummy1.rename_axis('index',axis=1)
Output:
index Product_Code 1 2 3 4 5
0 10133 NaN NaN NaN NaN 0.0
1 10234 NaN 0.0 NaN NaN NaN
2 10245 0.0 NaN NaN NaN NaN
3 10345 NaN NaN NaN 0.0 NaN
4 10987 NaN NaN 1.0 NaN NaN
If you wish to reproduce it, this is my code:
df1=pd.DataFrame({'Product_Code':[10133,10245,10234,10987,10345], 'Month': [1,2,3,4,5], 'Sales': [0,0,0,1,0]})
df2=df1.pivot_table(index='Product_Code', columns='Month', values='Sales').reset_index().rename_axis('index',axis=1)
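If the goal is to drop the stray 'Month' label entirely rather than rename it, setting the columns name to None also works. A sketch, assuming the pivoted frame from the question:
convert_dummy1.columns.name = None  # removes the leftover 'Month' label above the columns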
