I am working on a dataframe of shape 146 rows x 48 columns. The columns are
['Region','Rank 2015','Score 2015','Economy 2015','Family 2015','Health 2015','Freedom 2015','Generosity 2015','Trust 2015','Rank 2016','Score 2016','Economy 2016','Family 2016','Health 2016','Freedom 2016','Generosity 2016','Trust 2016','Rank 2017','Score 2017','Economy 2017','Family 2017','Health 2017','Freedom 2017','Generosity 2017','Trust 2017','Rank 2018','Score 2018','Economy 2018','Family 2018','Health 2018','Freedom 2018','Generosity 2018','Trust 2018','Rank 2019','Score 2019','Economy 2019','Family 2019','Health 2019','Freedom 2019','Generosity 2019','Trust 2019','Score Mean','Economy Mean','Family Mean','Health Mean','Freedom Mean','Generosity Mean','Trust Mean']
I want to access a particular row and convert it to the following dataframe:
Year Rank Score Family Health Freedom Generosity Trust
0 2015 NaN NaN NaN NaN NaN NaN NaN
1 2016 NaN NaN NaN NaN NaN NaN NaN
2 2017 NaN NaN NaN NaN NaN NaN NaN
3 2018 NaN NaN NaN NaN NaN NaN NaN
4 2019 NaN NaN NaN NaN NaN NaN NaN
Any help is welcome. Thank you in advance.
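For reference, pandas has a built-in helper for exactly this kind of wide-to-long reshape. A minimal sketch, assuming the source frame is named df and the row is picked by position; the Region, Economy, and Mean columns are simply dropped at the end:
import pandas as pd

row = df.iloc[[0]].reset_index()  # keep the selection as a one-row DataFrame
stubs = ['Rank', 'Score', 'Family', 'Health', 'Freedom', 'Generosity', 'Trust']
# columns like 'Rank 2015' split into stub 'Rank' and numeric suffix 2015 via sep=' '
out = (pd.wide_to_long(row, stubnames=stubs, i='index', j='Year', sep=' ')
         .reset_index()[['Year'] + stubs])
print(out)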
An alternate way:
import pandas as pd

cols = ['Region','Rank 2015','Score 2015','Economy 2015','Family 2015','Health 2015','Freedom 2015','Generosity 2015', 'Trust 2015','Rank 2016','Score 2016','Economy 2016','Family 2016','Health 2016','Freedom 2016','Generosity 2016','Trust 2016', 'Rank 2017','Score 2017','Economy 2017','Family 2017','Health 2017','Freedom 2017','Generosity 2017','Trust 2017','Rank 2018','Score 2018','Economy 2018','Family 2018','Health 2018','Freedom 2018','Generosity 2018','Trust 2018','Rank 2019','Score 2019','Economy 2019','Family 2019','Health 2019','Freedom 2019','Generosity 2019','Trust 2019','Score Mean','Economy Mean','Family Mean','Health Mean','Freedom Mean','Generosity Mean','Trust Mean']
# source dataframe
df1 = pd.DataFrame(columns=cols)
df1.loc[0] = [1] * len(cols)  # one dummy value per column
#target dataframe
df2 = pd.DataFrame(columns=['Year','Rank','Score','Family','Health','Freedom','Generosity','Trust','Economy'])
df2['Year']=['2015','2016','2017','2018','2019','Mean']
df2.set_index('Year', inplace=True)
idx = 0 # source row to copy
for col in df1.columns[1:]:
    c, r = col.split(" ")
    df2.at[r, c] = df1.at[idx, col]
print(df2)
Rank Score Family Health Freedom Generosity Trust Economy
Year
2015 1 1 1 1 1 1 1 1
2016 1 1 1 1 1 1 1 1
2017 1 1 1 1 1 1 1 1
2018 1 1 1 1 1 1 1 1
2019 1 1 1 1 1 1 1 1
Mean NaN 1 1 1 1 1 1 1
Here's a solution utilizing list comprehension:
The input:
cols = ['Region','Rank 2015','Score 2015','Economy 2015','Family 2015','Health 2015','Freedom 2015','Generosity 2015','Trust 2015','Rank 2016','Score 2016','Economy 2016','Family 2016','Health 2016','Freedom 2016','Generosity 2016','Trust 2016','Rank 2017','Score 2017','Economy 2017','Family 2017','Health 2017','Freedom 2017','Generosity 2017','Trust 2017','Rank 2018','Score 2018','Economy 2018','Family 2018','Health 2018','Freedom 2018','Generosity 2018','Trust 2018','Rank 2019','Score 2019','Economy 2019','Family 2019','Health 2019','Freedom 2019','Generosity 2019','Trust 2019','Score Mean','Economy Mean','Family Mean','Health Mean','Freedom Mean','Generosity Mean','Trust Mean']
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(1, 10, (3, 48)))
df.columns = cols
print(df.iloc[:, :4])
Region Rank 2015 Score 2015 Economy 2015
0 7 9 9 9
1 8 7 2 3
2 3 3 4 5
And the new dataframe would be:
target_cols = ['Rank', 'Score', 'Family', 'Health', 'Freedom', 'Generosity', 'Trust']
years = ['2015', '2016', '2017', '2018', '2019']
newdf = pd.DataFrame([df.loc[1, [x + ' ' + year for x in target_cols]].values for year in years])
newdf.columns = target_cols
newdf['year'] = years
print(newdf)
Rank Score Family Health Freedom Generosity Trust year
0 7 2 6 9 3 4 9 2015
1 2 8 1 1 7 6 1 2016
2 7 4 2 5 1 7 4 2017
3 9 7 1 4 7 5 2 2018
4 5 4 4 9 1 6 2 2019
This assumes that the target years are only those from 2015 to 2019, and that the target columns are known.
I would proceed as follows:
(1) define the target columns and years
target_columns = ['Rank', 'Score', 'Family', 'Health', 'Freedom', 'Generosity', 'Trust']
target_years = ['2015', '2016', '2017', '2018', '2019']
(2) retrieve the particular row (I assume your starting dataframe is named initial_dataframe)
particular_row = initial_dataframe.iloc[0]
(3) retrieve and reshape the information from the particular_row
reshaped_row = {'Year': target_years}
reshaped_row.update({
    column_name: [
        particular_row[column_name + ' ' + year_name]
        for year_name in target_years
    ]
    for column_name in target_columns
})
(4) assign the reshaped row to the output_dataframe
output_dataframe = pd.DataFrame(reshaped_row)
Have you tried using a 2D array? I would find that to be the easiest. Otherwise, you could also use a dictionary. https://www.w3schools.com/python/python_dictionaries.asp
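For concreteness, a minimal sketch of the dictionary idea, assuming the chosen row is already available as a Series named row:
import pandas as pd

years = ['2015', '2016', '2017', '2018', '2019']
fields = ['Rank', 'Score', 'Family', 'Health', 'Freedom', 'Generosity', 'Trust']

data = {'Year': years}  # build a plain dict column by column
for field in fields:
    data[field] = [row[field + ' ' + year] for year in years]

result = pd.DataFrame(data)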
I didn't fully understand your question, but I can give you a hint on how to translate the data.
# li is the list of column names from the question
df = pd.DataFrame(li)
df = df[0].str.split(r"(\d{4})", expand=True)  # split each name into (prefix, year, remainder)
df = df[df[2] == ""]  # keep only the '<name> <year>' entries
col_name = df[0].unique()
df_new = df.pivot(index=1, columns=0, values=2)
df_new.drop(df_new.index[0], inplace=True)
df_new:
Economy Family Freedom Generosity Health Rank Score Trust
1
2016
2017
2018
2019
You can write your own logic.
This needs a fair amount of manipulation; a simple idea is to build the required dict first and then make the df from it (here text is the list of column names):
In [61]: dicts = {}
In [62]: for t in text[1:]:
    ...:     n, y = t.split(" ")
    ...:     if n not in dicts:
    ...:         dicts[n] = []
    ...:     if y != "Mean":
    ...:         if n == 'Rank':
    ...:             dicts[n].append(y)
    ...:         else:
    ...:             dicts[n].append(pd.np.NaN)
    ...:
In [63]: df = pd.DataFrame(dicts)
In [64]: df['Year'] = df['Rank']
In [65]: df['Rank'] = df['Family']
In [66]: df
Out[66]:
Rank Score Economy Family Health Freedom Generosity Trust Year
0 NaN NaN NaN NaN NaN NaN NaN NaN 2015
1 NaN NaN NaN NaN NaN NaN NaN NaN 2016
2 NaN NaN NaN NaN NaN NaN NaN NaN 2017
3 NaN NaN NaN NaN NaN NaN NaN NaN 2018
4 NaN NaN NaN NaN NaN NaN NaN NaN 2019
I have a DataFrame of the form
date_time uids
2018-10-16 23:00:00 1000,1321,7654,1321
2018-10-16 23:10:00 7654
2018-10-16 23:20:00 NaN
2018-10-16 23:30:00 7654,1000,7654,1321,1000
2018-10-16 23:40:00 691,3974,3974,323
2018-10-16 23:50:00 NaN
2018-10-17 00:00:00 NaN
2018-10-17 00:10:00 NaN
2018-10-17 00:20:00 27,33,3974,3974,7665,27
This is a very big DataFrame containing 5-minute time intervals and the IDs that appeared during each interval.
I want to iterate over this DataFrame six rows at a time (corresponding to one hour) and create a DataFrame containing each ID and the number of times it appears in each 10-minute slot of that hour.
The expected output is one dataframe per hour. For example, in the above case the dataframe for the hour 23:00 - 00:00 would have this form:
uid 1 2 3 4 5 6
1000 1 0 0 2 0 0
1321 2 0 0 1 0 0
and so on
How can I do this efficiently?
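One sketch of a direct approach, assuming df holds the two columns shown above (with date_time possibly still as strings): explode the comma-separated ids, label each row with its 10-minute slot, and group by hour with pd.Grouper to get one counts table per hour.
import pandas as pd

df['date_time'] = pd.to_datetime(df['date_time'])
long = (df.assign(uids=df['uids'].str.split(','))
          .explode('uids')
          .dropna(subset=['uids']))  # drop the intervals with no ids
long['slot'] = long['date_time'].dt.minute // 10 + 1  # 1..6 within the hour

# one counts table per hour
for hour, grp in long.groupby(pd.Grouper(key='date_time', freq='H')):
    if not grp.empty:
        print(hour)
        print(pd.crosstab(grp['uids'], grp['slot']))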
I don't have an exact solution but you could create a pivot table: ids on the index and datetimes on the columns. Then you just have to select the columns you want.
import pandas as pd
import numpy as np
df = pd.DataFrame(
    {
        "date_time": [
            "2018-10-16 23:00:00",
            "2018-10-16 23:10:00",
            "2018-10-16 23:20:00",
            "2018-10-16 23:30:00",
            "2018-10-16 23:40:00",
            "2018-10-16 23:50:00",
            "2018-10-17 00:00:00",
            "2018-10-17 00:10:00",
            "2018-10-17 00:20:00",
        ],
        "uids": [
            "1000,1321,7654,1321",
            "7654",
            np.nan,
            "7654,1000,7654,1321,1000",
            "691,3974,3974,323",
            np.nan,
            np.nan,
            np.nan,
            "27,33,3974,3974,7665,27",
        ],
    }
)
df["date_time"] = pd.to_datetime(df["date_time"])
df = (
    df.set_index("date_time")  # do not use set_index if date_time is already the index
    .loc[:, "uids"]
    .str.extractall(r"(?P<uids>\d+)")
    .droplevel(level=1)
)  # separate all the ids
df["number"] = df.index.minute.astype(float) / 10 + 1  # get the number 1 to 6 depending on the minutes
df_pivot = df.pivot_table(
    values="number",
    index="uids",
    columns=["date_time"],
)  # dataframe with all the uids on the index and all the datetimes in columns
You can apply this to the whole dataframe or just a subset containing 6 rows. Then you rename your columns.
You can use the function crosstab:
df['date_time'] = pd.to_datetime(df['date_time'])  # make sure date_time is a datetime column
df['uids'] = df['uids'].str.split(',')
df = df.explode('uids')
df['date_time'] = df['date_time'].dt.minute.floordiv(10).add(1)
pd.crosstab(df['uids'], df['date_time'], dropna=False)
Output:
date_time 1 2 3 4 5 6
uids
1000 1 0 0 2 0 0
1321 2 0 0 1 0 0
27 0 0 2 0 0 0
323 0 0 0 0 1 0
33 0 0 1 0 0 0
3974 0 0 2 0 2 0
691 0 0 0 0 1 0
7654 1 1 0 2 0 0
7665 0 0 1 0 0 0
We can achieve this by extracting the minute from your datetime column and then using pivot_table to get the wide format:
df['date_time'] = pd.to_datetime(df['date_time'])
df['minute'] = df['date_time'].dt.minute // 10
piv = (df.assign(uids=df['uids'].str.split(','))
         .explode('uids')
         .pivot_table(index='uids', columns='minute', values='minute', aggfunc='size')
      )
minute 0 1 2 3 4
uids
1000 1.0 NaN NaN 2.0 NaN
1321 2.0 NaN NaN 1.0 NaN
27 NaN NaN 2.0 NaN NaN
323 NaN NaN NaN NaN 1.0
33 NaN NaN 1.0 NaN NaN
3974 NaN NaN 2.0 NaN 2.0
691 NaN NaN NaN NaN 1.0
7654 1.0 1.0 NaN 2.0 NaN
7665 NaN NaN 1.0 NaN NaN
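To match the integer layout shown in the question, the missing combinations can be filled in afterwards (a small follow-up to the snippet above):
piv = piv.fillna(0).astype(int)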
I have about 190 CSVs, each of which has the same column names. A sample CSV is shared below.
From every CSV, I need to pick only the Item, Predicted_BelRd(D2), Predicted_Ulsoor(D2), Predicted_ChrchStrt(D2), Predicted_BlrClub(D2),
Predicted_Indrangr(D1), Predicted_Krmngl(D1), Predicted_KrmnglBkry(D1), and Predicted_HSR(D1) columns of only the first row, and I need to store all these rows in a separate CSV. So the final CSV should have 190 rows.
How can I do that?
Edit:
Code so far, as suggested by DavidDR:
import glob
import pandas as pd

path = '/home/hp/products1'
all_files = glob.glob(path + "/*.csv")
#print(all_files)
columns = ['Item', 'Predicted_BelRd(D2)', 'Predicted_Ulsoor(D2)', 'Predicted_ChrchStrt(D2)', 'Predicted_BlrClub(D2)', 'Predicted_Indrangr(D1)', 'Predicted_Krmngl(D1)', 'Predicted_KrmnglBkry(D1)', 'Predicted_HSR(D1)']
rows_list = []
for filename in all_files:
    origin_data = pd.read_csv(filename)
    my_data = origin_data[columns]
    rows_list.append(my_data.head(1))
output = pd.concat(rows_list, ignore_index=True)  # rows_list holds one-row DataFrames, so concatenate them
#output.to_csv(file_name, sep='\t', encoding='utf-8')
output.to_csv('smallys_final.csv', encoding='utf-8', index=False)
Edit 2:
The original dataframe:
prod = pd.read_csv('/home/hp/products1/' + 'prod[' + str(0) + '].csv', engine='python')
print(prod)
Output:
Category Item UOM BelRd(D2) Ulsoor(D2) \
0 Food/Bakery BAKING POWDER SPARSH (1KGS) PKT 0 0
1 Food/Bakery BAKING POWDER SPARSH (1KGS) PKT 0 0
2 Food/Bakery BAKING POWDER SPARSH (1KGS) PKT 0 0
3 Food/Bakery BAKING POWDER SPARSH (1KGS) PKT 0 0
4 Food/Bakery BAKING POWDER SPARSH (1KGS) PKT 0 0
ChrchStrt(D2) BlrClub(D2) Indrangr(D1) Krmngl(D1) KrmnglBkry(D1) \
0 0 0 0 0 1
1 0 0 0 0 0
2 0 0 0 0 0
3 0 0 0 0 0
4 0 0 0 0 1
HSR(D1) date Predicted_BelRd(D2) Predicted_Ulsoor(D2) \
0 0 10 FEB 19 0.0 0.0
1 0 17 FEB 19 NaN NaN
2 0 24 FEB 19 NaN NaN
3 0 4 MARCH 19 NaN NaN
4 0 11 MARCH 19 NaN NaN
Predicted_ChrchStrt(D2) Predicted_BlrClub(D2) Predicted_Indrangr(D1) \
0 0.0 0.0 0.0
1 NaN NaN NaN
2 NaN NaN NaN
3 NaN NaN NaN
4 NaN NaN NaN
Predicted_Krmngl(D1) Predicted_KrmnglBkry(D1) Predicted_HSR(D1)
0 0.0 0.0 0.0
1 NaN NaN NaN
2 NaN NaN NaN
3 NaN NaN NaN
4 NaN NaN NaN
Here you go:
def function():
    firstrows = []  # to collect 190 dataframes, each only 1 row
    for filename in csvnames:
        # read CSV, filter for a subset of columns, take only the first row
        df = pd.read_csv(filename) \
            .filter(["Item", "Predicted_BelRd(D2)", ...]) \
            .iloc[:1]
        firstrows.append(df)
    return pd.concat(firstrows)
Didn't check but this should work.
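A possible refinement (my suggestion, not part of the original answer): read_csv can be told to parse only the wanted columns and only the first data row, so the full files are never loaded:
import glob
import pandas as pd

columns = ['Item', 'Predicted_BelRd(D2)', 'Predicted_Ulsoor(D2)',
           'Predicted_ChrchStrt(D2)', 'Predicted_BlrClub(D2)',
           'Predicted_Indrangr(D1)', 'Predicted_Krmngl(D1)',
           'Predicted_KrmnglBkry(D1)', 'Predicted_HSR(D1)']

files = glob.glob('/home/hp/products1/*.csv')  # path taken from the question
output = pd.concat((pd.read_csv(f, usecols=columns, nrows=1) for f in files),
                   ignore_index=True)
output.to_csv('smallys_final.csv', index=False)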
You basically read all your CSV files from the same location, then choose only the relevant columns. Then you take the first row of each and append it to a list of first rows. In the end you concatenate that list into a new DataFrame and save it to one CSV file.
import glob
import pandas as pd
path = 'your/path'  # use your path
all_files = glob.glob(path + "/*.csv")
columns = ['Item', 'Predicted_BelRd(D2)', 'Predicted_Ulsoor(D2)', 'Predicted_ChrchStrt(D2)', 'Predicted_BlrClub(D2)', 'Predicted_Indrangr(D1)', 'Predicted_Krmngl(D1)', 'Predicted_KrmnglBkry(D1)', 'Predicted_HSR(D1)']
rows_list = []
for filename in all_files:
    origin_data = pd.read_csv(filename)
    my_data = origin_data[columns]
    rows_list.append(my_data.head(1))
output = pd.concat(rows_list, ignore_index=True)  # concatenate the one-row DataFrames
output.to_csv(file_name, sep='\t', encoding='utf-8')  # file_name: path of the output CSV
Consider the following dataset:
After running the code:
convert_dummy1 = convert_dummy.pivot(index='Product_Code', columns='Month', values='Sales').reset_index()
The data is in the right form, but my index column is named 'Month', and I cannot seem to remove it. I have tried code such as the following, but it does nothing.
del convert_dummy1.index.name
I can save the dataset to a csv, delete the ID column, and then read the csv - but there must be a more efficient way.
Dataset after reset_index():
convert_dummy1
Month Product_Code 0 1 2 3 4
0 10133.9 0 0 0 0 0
1 10146.9 120 80 60 0 100
convert_dummy1.index = pd.RangeIndex(len(convert_dummy1.index))
del convert_dummy1.columns.name
convert_dummy1
Product_Code 0 1 2 3 4
0 10133.9 0 0 0 0 0
1 10146.9 120 80 60 0 100
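A side note (not from the original answer): the stray 'Month' is the columns-axis name left over from the pivot, not a real index column, and in current pandas versions assigning None to it is the usual spelling:
convert_dummy1.columns.name = None  # drop the leftover axis name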
Since you pivot with columns="Month", each column in the output corresponds to a month. If you reset the index after the pivot, you should check the column names with convert_dummy1.columns.values, which in your case should return:
array(['Product_Code', 1, 2, 3, 4, 5], dtype=object)
while convert_dummy1.columns.names should return:
FrozenList(['Month'])
So to rename Month, use the rename_axis function:
convert_dummy1.rename_axis('index', axis=1)
Output:
index Product_Code 1 2 3 4 5
0 10133 NaN NaN NaN NaN 0.0
1 10234 NaN 0.0 NaN NaN NaN
2 10245 0.0 NaN NaN NaN NaN
3 10345 NaN NaN NaN 0.0 NaN
4 10987 NaN NaN 1.0 NaN NaN
If you wish to reproduce it, this is my code:
df1=pd.DataFrame({'Product_Code':[10133,10245,10234,10987,10345], 'Month': [1,2,3,4,5], 'Sales': [0,0,0,1,0]})
df2=df1.pivot_table(index='Product_Code', columns='Month', values='Sales').reset_index().rename_axis('index',axis=1)
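If the goal is to drop the name entirely rather than relabel it as 'index', rename_axis also accepts None (a small variation on the code above):
df2 = df2.rename_axis(None, axis=1)  # no columns-axis name at all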