I have a dataframe like the one shown below:
tdf = pd.DataFrame(
    # a dict literal can't hold duplicate keys, so the frame is built
    # from rows plus a columns list with the duplicated names
    [['Region', 'Name', 'target_achieved', 'target_set', 'score',
      'target_achieved', 'target_set', 'score'],
     ['Asean', 'DEF', 2345, 3000, 86, 245, 300, 76],
     ['Asean', 'GHI', 5678, 6000, 55, 578, 600, 45],
     ['Asean', 'JKL', 7890, 8000, 90, 790, 800, 70],
     ['Asean', 'MNO', 1234, 1500, 65, 123, 150, 55],
     ['Asean', 'PQR', 6789, 7000, 90, 689, 700, 60],
     ['Asean', 'STU', 5454, 5500, 87, 454, 500, 77]],
    columns=['Unnamed: 0', 'Unnamed: 1',
             '2017Q1', '2017Q1', '2017Q1',
             '2017Q2', '2017Q2', '2017Q2'])
As you can see, my column names are duplicated: there are three columns named 2017Q1 and three named 2017Q2. (A Python dict can't hold duplicate keys, so repeated keys in a dict literal are silently collapsed.)
I tried the line below to get my expected output:
tdf.columns = tdf.iloc[0]  # but this still ignores the columns with duplicate names
Update: after reading the Excel file as in jezrael's answer, I get the display below.
I expect my output to look like the table shown below.
First create a MultiIndex for both the columns and the index:
df = pd.read_excel(file, header=[0,1], index_col=[0,1])
If that is not possible, here is an alternative using your sample data: convert the column names together with the first row of data into a MultiIndex for the columns, and the first two columns into a MultiIndex for the index:
tdf = pd.read_excel(file)
tdf.columns = pd.MultiIndex.from_arrays([tdf.columns, tdf.iloc[0]])
df = (tdf.iloc[1:]
.set_index(tdf.columns[:2].tolist())
.rename_axis(index=['Region','Name'], columns=['Year',None]))
print (df.index)
MultiIndex([('Asean', 'DEF'),
('Asean', 'GHI'),
('Asean', 'JKL'),
('Asean', 'MNO'),
('Asean', 'PQR'),
('Asean', 'STU')],
names=['Region', 'Name'])
print (df.columns)
MultiIndex([('2017Q1', 'target_achieved'),
('2017Q1', 'target_set'),
('2017Q1', 'score'),
('2017Q2', 'target_achieved'),
('2017Q2', 'target_set'),
('2017Q2', 'score')],
names=['Year', None])
And then reshape:
df1 = df.stack(0).reset_index()
print (df1)
Region Name Year score target_achieved target_set
0 Asean DEF 2017Q1 86 2345 3000
1 Asean DEF 2017Q2 76 245 300
2 Asean GHI 2017Q1 55 5678 6000
3 Asean GHI 2017Q2 45 578 600
4 Asean JKL 2017Q1 90 7890 8000
5 Asean JKL 2017Q2 70 790 800
6 Asean MNO 2017Q1 65 1234 1500
7 Asean MNO 2017Q2 55 123 150
8 Asean PQR 2017Q1 90 6789 7000
9 Asean PQR 2017Q2 60 689 700
10 Asean STU 2017Q1 87 5454 5500
11 Asean STU 2017Q2 77 454 500
EDIT: The solution for the edited question is similar:
df = pd.read_excel(file, header=[0,1], index_col=[0,1])
df1 = df.rename_axis(index=['Region','Name'], columns=['Year',None]).stack(0).reset_index()
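Since I don't have the Excel file, here is a self-contained sketch of the same pipeline using a trimmed copy of the sample data (two names only, for brevity):

```python
import pandas as pd

# Trimmed stand-in for the Excel sheet: the header holds the quarters
# and the first data row holds the sub-level names.
tdf = pd.DataFrame(
    [['Region', 'Name', 'target_achieved', 'target_set', 'score',
      'target_achieved', 'target_set', 'score'],
     ['Asean', 'DEF', 2345, 3000, 86, 245, 300, 76],
     ['Asean', 'GHI', 5678, 6000, 55, 578, 600, 45]],
    columns=['Unnamed: 0', 'Unnamed: 1',
             '2017Q1', '2017Q1', '2017Q1',
             '2017Q2', '2017Q2', '2017Q2'])

# Pair each column name with the first data row to build MultiIndex
# columns, move the two label columns into the index, then stack the
# quarter level into rows.
tdf.columns = pd.MultiIndex.from_arrays([tdf.columns, tdf.iloc[0]])
df = (tdf.iloc[1:]
         .set_index(tdf.columns[:2].tolist())
         .rename_axis(index=['Region', 'Name'], columns=['Year', None]))
df1 = df.stack(0).reset_index()
print(df1)
```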
I have a dataframe df1:
df1 = pd.DataFrame([['a','Yes','abc def msg1'],
['b', 'No', 'ghi jkl msg2'],
['c','Yes','mno pqr msg3'],
['d', 'No', 'stu vwx msg4'],
['a', 'Yes', 'bcd efg msg5'],
['c','No','hij klm msg6'],
['a','No','nop qrs msg7'],
['b','No','tuv wxy msg8']],
columns=['unit_name','is_required','dummy_column'])
unit_name is_required dummy_column
a Yes abc def msg1
b No ghi jkl msg2
c Yes mno pqr msg3
d No stu vwx msg4
a Yes bcd efg msg5
c No hij klm msg6
a No nop qrs msg7
b No tuv wxy msg8
The rows where unit_name == 'a' and is_required == 'Yes' are used to derive another dataframe df2:
dummy1 dummy2 msg_column value
abc def msg1 val1
bcd efg msg5 val2
Now I want to add the value column of df2 to df1. The rows that don't have the value must contain '-'. So the expected output I want is:
unit_name is_required dummy_column value
a Yes abc def msg1 val1
b No ghi jkl msg2 -
c Yes mno pqr msg3 -
d No stu vwx msg4 -
a Yes bcd efg msg5 val2
c No hij klm msg6 -
a No nop qrs msg7 -
b No tuv wxy msg8 -
In order to do this, I tried the lines below:
df1.loc[(df1.unit_name=='a') & (df1.is_required=='Yes'),'value'] = df2['value']
df1 = df1.fillna('-')
But I'm getting the result:
unit_name is_required dummy_column value
a Yes abc def msg1 val1
b No ghi jkl msg2 -
c Yes mno pqr msg3 -
d No stu vwx msg4 -
a Yes bcd efg msg5 -
c No hij klm msg6 -
a No nop qrs msg7 -
b No tuv wxy msg8 -
Now I understand that this happens because, when assigning one column to another, pandas aligns the values on the right-hand side with the index of the rows on the left-hand side.
How do I get the output I need? Any ideas are welcome. Thanks in advance!
The problem is that the two DataFrames have different indices.
A possible solution, if the number of filtered rows matches the number of rows in df2, is to first verify the counts:
print (((df1.unit_name=='a') & (df1.is_required=='Yes')).sum(), len(df2.index))
If they match, convert the right-hand side to an array so pandas assigns by position instead of aligning by index:
df1.loc[(df1.unit_name=='a') & (df1.is_required=='Yes'),'value'] = df2['value'].to_numpy()
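A minimal end-to-end sketch (with a trimmed df1, and a hypothetical df2 standing in for the derived frame) shows why to_numpy() fixes the alignment:

```python
import pandas as pd

df1 = pd.DataFrame([['a', 'Yes', 'abc def msg1'],
                    ['b', 'No',  'ghi jkl msg2'],
                    ['a', 'Yes', 'bcd efg msg5'],
                    ['a', 'No',  'nop qrs msg7']],
                   columns=['unit_name', 'is_required', 'dummy_column'])
# df2's index (0, 1) differs from the positions of the matching
# rows in df1 (0, 2), so plain assignment would misalign.
df2 = pd.DataFrame({'value': ['val1', 'val2']})

mask = (df1.unit_name == 'a') & (df1.is_required == 'Yes')
assert mask.sum() == len(df2)          # row counts must match
df1.loc[mask, 'value'] = df2['value'].to_numpy()
df1 = df1.fillna('-')
print(df1)
```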
I have a dataframe that looks like the following:
print(df):
Text
John Smith abc def ghi jkl
Michael Smith abc def ghi jkl
Liz Jones abc def ghi jkl
I also have a predefined list of people whom I want to find, so that the contents above are split into two columns:
names = ('John Smith','Michael Smith','Liz Jones')
I am hoping to get the following:
print(df):
Name | Information
John Smith | abc def ghi jkl
Michael Smith | abc def ghi jkl
Liz Jones | abc def ghi jkl
I have tried:
df['Name','Information'] = df['Text'].str.split(names)
but str.split expects a string or regex pattern, not a tuple of names. Is there any way to split a column based on a predefined list?
Any help would be much appreciated. Thanks very much.
Use Series.str.extract with all the names joined by | (regex alternation) to capture the name, plus a second group for everything that follows:
names = ('John Smith','Michael Smith','Liz Jones')
df = df['Text'].str.extract(f'(?P<Name>{"|".join(names)})(?P<Information>.*)')
print (df)
Name Information
0 John Smith abc def ghi jkl
1 Michael Smith abc def ghi jkl
2 Liz Jones abc def ghi jkl
If you want to remove the Text column and keep all the other columns of the original, use DataFrame.pop to extract the column and DataFrame.join to attach the result:
df = df.join(df.pop('Text').str.extract(f'(?P<Name>{"|".join(names)})(?P<Information>.*)'))
Or:
df[['Name','Information']] = df.pop('Text').str.extract(f'({"|".join(names)})(.*)')
print (df)
Name Information
0 John Smith abc def ghi jkl
1 Michael Smith abc def ghi jkl
2 Liz Jones abc def ghi jkl
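Putting it together as a runnable sketch (adding \s* to the pattern is my own tweak, so the separating space doesn't end up at the front of Information):

```python
import pandas as pd

df = pd.DataFrame({'Text': ['John Smith abc def ghi jkl',
                            'Michael Smith abc def ghi jkl',
                            'Liz Jones abc def ghi jkl']})
names = ('John Smith', 'Michael Smith', 'Liz Jones')

# Alternation of the known names, then capture whatever follows;
# \s* swallows the space between the name and the rest of the text.
pattern = rf'(?P<Name>{"|".join(names)})\s*(?P<Information>.*)'
df = df.join(df.pop('Text').str.extract(pattern))
print(df)
```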
I get a Pandas series:
countrypat = asiaselect.groupby('Country')['Pattern'].value_counts().groupby(level=0).head(3)
The output looks like this:
China abc 1055
def 778
ghi 612
Malaysia def 554
abc 441
ghi 178
[...]
How do I insert a new column (do I have to convert this to a dataframe?) containing the ratio of the numeric column to the sum of the numbers for that country? For China, the first row would contain 1055/(1055+778+612). I have tried unstack() and to_frame() but was unsure of the next steps.
I created a dataframe on my side, but excluded the .head(3) of your assignment:
countrypat = asiaselect.groupby('Country')['Pattern'].value_counts().groupby(level=0)
The following will give you the proportions with a simple apply to your groupby object:
countrypat.apply(lambda x: x / float(x.sum()))
The only 'problem' is that doing so returns a series, so I would store the intermediate results in two separate series and combine them at the end:
series1 = asiaselect.groupby('Country')['Pattern'].value_counts()
series2 = asiaselect.groupby('Country')['Pattern'].value_counts().groupby(level=0).apply(lambda x: x / float(x.sum()))
pd.DataFrame([series1, series2]).T
China abc 1055.0 0.431493
def 778.0 0.318200
ghi 612.0 0.250307
Malaysia def 554.0 0.472293
abc 441.0 0.375959
ghi 178.0 0.151748
To get the top three rows, simply add .groupby(level=0).head(3) to each of series1 and series2:
series1_top = series1.groupby(level=0).head(3)
series2_top = series2.groupby(level=0).head(3)
pd.DataFrame([series1_top, series2_top]).T
I tested with a dataframe containing more than three rows per country, and it seems to work. I started with the following df:
China abc 1055
def 778
ghi 612
yyy 5
xxx 3
zzz 3
Malaysia def 554
abc 441
ghi 178
yyy 5
xxx 3
zzz 3
and ends like this:
China abc 1055.0 0.429560
def 778.0 0.316775
ghi 612.0 0.249186
Malaysia def 554.0 0.467905
abc 441.0 0.372466
ghi 178.0 0.150338
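For completeness, the same result can be sketched more compactly with a level-0 transform instead of two separate series (the counts below are made up, not the original data):

```python
import pandas as pd

asiaselect = pd.DataFrame({
    'Country': ['China'] * 4 + ['Malaysia'] * 3,
    'Pattern': ['abc', 'abc', 'def', 'ghi', 'def', 'def', 'abc']})

counts = asiaselect.groupby('Country')['Pattern'].value_counts()
# Divide each count by its country's total via a level-0 transform,
# then put counts and ratios side by side in one DataFrame.
out = counts.to_frame('count')
out['ratio'] = counts / counts.groupby(level=0).transform('sum')
top3 = out.groupby(level=0).head(3)
print(top3)
```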
The SQL operation is as below:
UPDATE table_A t, table_B s
SET t.stat_fips = s.stat_fips
WHERE t.stat_code = s.stat_code;
If a similar operation needs to be done on CSV A, comparing a value from CSV B, how can this be achieved in Python?
Data:
Let's assume:
CSV A
col1 stat_code name
abc WY ABC
def NA DEF
ghi AZ GHI
CSV B
stat_fips stat_code
2234 WY
4344 NA
4588 AZ
Resulting CSV :
col1 stat_code name stat_fips
abc WY ABC 2234
def NA DEF 4344
ghi AZ GHI 4588
Adding the code attempted so far:
df = pd.read_csv('fin.csv', sep='\t', quotechar="'")
df = df.set_index('col1').stack(dropna=False).reset_index()
df1['stat_fips'] = df1['stat_code']
print(df1)
(Not really sure about pandas; still learning the basics.)
Judging by your example data, this looks like a merge operation on your stat_code column:
import pandas as pd
df_a = pd.DataFrame([["abc", "WY", "ABC"], ["def", "NA", "DEF"]], columns= ["col1", "stat_code", "name"])
df_b = pd.DataFrame([[2234, "WY"], [4344, "NA"]], columns=["stat_fips", "stat_code"])
merged_df = pd.merge(df_a, df_b, on="stat_code", how="left")
print(merged_df)
col1 stat_code name stat_fips
0 abc WY ABC 2234
1 def NA DEF 4344
It seems you need to map by a dict d:
d = df2.set_index('stat_code')['stat_fips'].to_dict()
df1['stat_fips'] = df1['stat_code'].map(d)
print (df1)
col1 stat_code name stat_fips
0 abc WY ABC 2234
1 def NaN DEF 4344
2 ghi AZ GHI 4588
Or use merge with a left join:
df3 = pd.merge(df1, df2, on='stat_code', how='left')
print (df3)
col1 stat_code name stat_fips
0 abc WY ABC 2234
1 def NaN DEF 4344
2 ghi AZ GHI 4588
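One caveat worth noting: the stat_code 'NA' in the sample will be parsed as a missing value by read_csv unless you pass keep_default_na=False, which would break both the map and the merge. A sketch with inline CSV text (the file contents are assumed from the sample tables):

```python
import io
import pandas as pd

csv_a = "col1,stat_code,name\nabc,WY,ABC\ndef,NA,DEF\nghi,AZ,GHI\n"
csv_b = "stat_fips,stat_code\n2234,WY\n4344,NA\n4588,AZ\n"

# keep_default_na=False stops read_csv from turning the literal
# string 'NA' into a missing value, so the join key survives intact.
df1 = pd.read_csv(io.StringIO(csv_a), keep_default_na=False)
df2 = pd.read_csv(io.StringIO(csv_b), keep_default_na=False)

df3 = df1.merge(df2, on='stat_code', how='left')
print(df3)
```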