This is my first dataframe.
Date 0 1 2 3
2003-01-31 CA KY ID CO
2003-02-28 CA KY HI CO
2003-03-31 CA KY CO HI
This is my second dataframe.
Date CA KY ID CO HI
2003-01-31 5 3 4 5 1
2003-02-28 2 7 8 4 5
2003-03-31 6 3 9 3 5
How do I get this dataframe to print as output?
Date 0 1 2 3
2003-01-31 5 3 4 5
2003-02-28 2 7 5 4
2003-03-31 6 3 3 5
I am wondering if there is a way to use the whole dataframe as an index to another instead of having to loop through all the dates/columns.
You can use df.lookup with df.apply here.
# If `Date` is not the index:
# df1 = df1.set_index('Date')
# 0 1 2 3
# Date
# 2003-01-31 CA KY ID CO
# 2003-02-28 CA KY HI CO
# 2003-03-31 CA KY CO HI
# df2 = df2.set_index('Date')
# CA KY ID CO HI
# Date
# 2003-01-31 5 3 4 5 1
# 2003-02-28 2 7 8 4 5
# 2003-03-31 6 3 9 3 5
def f(x):
    return df2.lookup(x.index, x)

df1.apply(f)
# df1.apply(lambda x: df2.lookup(x.index, x))
0 1 2 3
Date
2003-01-31 5 3 4 5
2003-02-28 2 7 5 4
2003-03-31 6 3 3 5
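Note: DataFrame.lookup was deprecated in pandas 1.2 and removed in 2.0. On newer versions, a minimal sketch of an equivalent (the lookup helper here is our own, not a pandas API):
def lookup(df, row_labels, col_labels):
    # Translate labels into positions, then pull values with numpy fancy indexing.
    rows = df.index.get_indexer(row_labels)
    cols = df.columns.get_indexer(col_labels)
    return df.to_numpy()[rows, cols]

df1.apply(lambda x: lookup(df2, x.index, x))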
This will print the values of dataframe df with the column names of dataframe df1.
print(df.rename(columns={i:j for i,j in zip(df.columns.tolist(),df1.columns.tolist())}))
If you want to make the changes permanent, add the parameter inplace=True.
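If the two frames are guaranteed to have the same number of columns, a terser equivalent (a sketch under that assumption) is set_axis:
print(df.set_axis(df1.columns, axis=1))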
Related
I have a pandas DataFrame (ignore the indices of the DataFrame)
Tab Ind Com Val
4 BAS 1 1 10
5 BAS 1 2 5
6 BAS 2 1 20
8 AIR 1 1 5
9 AIR 1 2 2
11 WTR 1 1 2
12 WTR 2 1 1
And a pandas series
Ind
1 1.208333
2 0.857143
dtype: float64
I want to multiply each element of the Val column of the DataFrame with the element of the series that has the same Ind value. How would I approach this? pandas.DataFrame.mul only matches on index, but I don't want to transform the DataFrame.
Looks like pandas.DataFrame.join could solve your problem:
temp = df.join(the_series, on='Ind', lsuffix='_orig')
df['ans'] = temp.Val * temp.Ind
Output
Tab Ind Com Val ans
4 BAS 1 1 10 12.083330
5 BAS 1 2 5 6.041665
6 BAS 2 1 20 17.142860
8 AIR 1 1 5 6.041665
9 AIR 1 2 2 2.416666
11 WTR 1 1 2 2.416666
12 WTR 2 1 1 0.857143
Or, another way to achieve the same with more compact syntax (thanks W-B):
df['New'] = df.Ind.map(the_series).values * df.Val
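For reference, a minimal self-contained reproduction (note that DataFrame.join requires the series to have a name; it is assumed to be named 'Ind' here, which is why lsuffix='_orig' is needed above):
import pandas as pd

df = pd.DataFrame({'Tab': ['BAS', 'BAS', 'BAS', 'AIR', 'AIR', 'WTR', 'WTR'],
                   'Ind': [1, 1, 2, 1, 1, 1, 2],
                   'Com': [1, 2, 1, 1, 2, 1, 1],
                   'Val': [10, 5, 20, 5, 2, 2, 1]})
# The multiplier series, keyed by Ind and (assumed) named 'Ind'.
the_series = pd.Series([1.208333, 0.857143], index=[1, 2], name='Ind')
the_series.index.name = 'Ind'

df['New'] = df.Ind.map(the_series).values * df.Val
print(df)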
Here is the dataframe:
A B val val2 loc
1 march 3 2 NY
1 april 5 1 NY
1 may 12 4 NY
2 march 4 1 NJ
2 april 7 5 NJ
2 may 12 1 NJ
3 march 1 8 CA
3 april 54 6 CA
3 may 2 9 CA
I'd like to transform this into:
march march april april may may
val val2 val val2 val val2
A loc
1 NY 3 2 5 1 12 4
2 NJ 4 1 7 5 12 1
3 CA 1 8 54 6 2 9
I'm looking into pivot tables and stacking/unstacking, but I'm truly stuck and not sure where to start.
With pd.pivot_table, and some swapping of levels:
new_df = (pd.pivot_table(df, values=['val', 'val2'], index=['A', 'loc'], columns=['B'])
          .sort_index(axis=1, level=1)
          .swaplevel(0, axis=1))
>>> new_df
B april march may
val val2 val val2 val val2
A loc
1 NY 5 1 3 2 12 4
2 NJ 7 5 4 1 12 1
3 CA 54 6 1 8 2 9
If the ordering of your columns is important (as in, you need march, april, may), you can use an ordered categorical:
new_df = (pd.pivot_table(df, values=['val', 'val2'], index=['A', 'loc'],
                         columns=[pd.Categorical(df.B, categories=['march', 'april', 'may'],
                                                 ordered=True)])
          .dropna(how='all')
          .sort_index(axis=1, level=1)
          .swaplevel(0, axis=1))
>>> new_df
B march april may
val val2 val val2 val val2
A loc
1 NY 3.0 2.0 5.0 1.0 12.0 4.0
2 NJ 4.0 1.0 7.0 5.0 12.0 1.0
3 CA 1.0 8.0 54.0 6.0 2.0 9.0
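If you'd rather not build a Categorical, another option (a sketch, assuming the month labels are exactly these three) is to reorder the month level afterwards with reindex:
new_df = (pd.pivot_table(df, values=['val', 'val2'], index=['A', 'loc'], columns=['B'])
          .sort_index(axis=1, level=1)
          .swaplevel(0, axis=1)
          .reindex(columns=['march', 'april', 'may'], level=0))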
How do I select rows from a DataFrame based on string values in a column in pandas? I just want to display the states, which are in all CAPS.
Each state row holds the totals for its cities.
import pandas as pd
import matplotlib.pyplot as plt
%pylab inline
d = pd.read_csv("states.csv")
print(d)
# States/cities B C D
# 0 FL 3 5 6
# 1 Orlando 1 2 3
# 2 Miami 1 1 3
# 3 Jacksonville 1 2 0
# 4 CA 8 3 2
# 5 San diego 3 1 0
# 6 San Francisco 5 2 2
# 7 WA 4 2 1
# 8 Seattle 3 1 0
# 9 Tacoma 1 1 1
How do I display it like so?
# States/Cites B C D
# 0 FL 3 5 6
# 4 CA 8 3 2
# 7 WA 4 2 1
You can write a function to be applied to each value in the States/cities column. Have the function return either True or False, and the result of applying the function can act as a Boolean filter on your DataFrame.
This is a common pattern when working with pandas. In your particular case, you could check for each value in States/cities whether it's made of only uppercase letters.
So for example:
def is_state_abbrev(string):
    return string.isupper()

mask = d['States/cities'].apply(is_state_abbrev)
filtered_df = d[mask]
Here mask will be a pandas Series with True and False values.
You can also achieve the same result by using a lambda expression, as in:
filtered_df = d[d['States/cities'].apply(lambda x: x.isupper())]
This does essentially the same thing.
Consider pandas.Series.str.match, passing a regex that allows only [A-Z]:
states[states['States/cities'].str.match('^[A-Z]+$')]
# States/cities B C D
# 0 FL 3 5 6
# 4 CA 8 3 2
# 7 WA 4 2 1
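On pandas 1.1+, Series.str.fullmatch expresses the same intent and anchors both ends of the pattern for you:
states[states['States/cities'].str.fullmatch('[A-Z]+')]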
Data
from io import StringIO
import pandas as pd
txt = '''"States/cities" B C D
0 FL 3 5 6
1 Orlando 1 2 3
2 Miami 1 1 3
3 Jacksonville 1 2 0
4 CA 8 3 2
5 "San diego" 3 1 0
6 "San Francisco" 5 2 2
7 WA 4 2 1
8 Seattle 3 1 0
9 Tacoma 1 1 1'''
states = pd.read_table(StringIO(txt), sep=r"\s+")
You can get the rows with all uppercase values in the column States/cities like this:
df.loc[df['States/cities'].str.isupper()]
States/cities B C D
0 FL 3 5 6
4 CA 8 3 2
7 WA 4 2 1
Just to be safe, you can add a condition so that it only returns the rows where 'States/cities' is uppercase and only 2 characters long (in case you had a value that was SEATTLE or something like that):
df.loc[df['States/cities'].str.isupper() & (df['States/cities'].str.len() == 2)]
You can use str.contains to filter out any row that contains lowercase letters:
df[~df['States/cities'].str.contains('[a-z]')]
States/cities B C D
0 FL 3 5 6
4 CA 8 3 2
7 WA 4 2 1
If we assume each state is always followed by the cities in that state, we can use where and dropna:
df['States/cities'] = df['States/cities'].where(df['States/cities'].isin(['FL', 'CA', 'WA']))
df = df.dropna()
df
States/cities B C D
0 FL 3 5 6
4 CA 8 3 2
7 WA 4 2 1
Or we can use str.len:
df[df['States/cities'].str.len()==2]
Out[39]:
States/cities B C D
0 FL 3 5 6
4 CA 8 3 2
7 WA 4 2 1
I have a dataframe df1, and I have a list which contains names of several columns of df1.
df1:
User_id month day Age year CVI ZIP sex wgt
0 1 7 16 1977 2 NA M NaN
1 2 7 16 1977 3 NA M NaN
2 3 7 16 1977 2 DM F NaN
3 4 7 16 1977 7 DM M NaN
4 5 7 16 1977 3 DM M NaN
... ... ... ... ... ... ... ... ...
35544 35545 12 31 2002 15 AH NaN NaN
35545 35546 12 31 2002 15 AH NaN NaN
35546 35547 12 31 2002 10 RM F 14
35547 35548 12 31 2002 7 DO M 51
35548 35549 12 31 2002 5 NaN NaN NaN
list= [u"User_id", u"day", u"ZIP", u"sex"]
I want to make a new dataframe df2 which will contain only those columns which are in the list, and a dataframe df3 which will contain the columns which are not in the list.
Here I found that I need to do:
df2=df1[df1[df1.columns[1]].isin(list)]
But as a result I get:
Empty DataFrame
Columns: []
Index: []
[0 rows x 9 columns]
What am I doing wrong, and how can I get the result I need? Why "9 columns" if it is supposed to be 4?
Solution with Index.difference:
L = [u"User_id", u"day", u"ZIP", u"sex"]
df2 = df1[L]
df3 = df1[df1.columns.difference(df2.columns)]
print (df2)
User_id day ZIP sex
0 0 7 NaN M
1 1 7 NaN M
2 2 7 DM F
3 3 7 DM M
4 4 7 DM M
print (df3)
Age CVI month wgt year
0 16 2 1 NaN 1977
1 16 3 2 NaN 1977
2 16 2 3 NaN 1977
3 16 7 4 NaN 1977
4 16 3 5 NaN 1977
Or:
df2 = df1[L]
df3 = df1[df1.columns.difference(pd.Index(L))]
print (df2)
User_id day ZIP sex
0 0 7 NaN M
1 1 7 NaN M
2 2 7 DM F
3 3 7 DM M
4 4 7 DM M
print (df3)
Age CVI month wgt year
0 16 2 1 NaN 1977
1 16 3 2 NaN 1977
2 16 2 3 NaN 1977
3 16 7 4 NaN 1977
4 16 3 5 NaN 1977
Never name a list "list"; it shadows the built-in type.
my_list= [u"User_id", u"day", u"ZIP", u"sex"]
df2 = df1[df1.keys()[df1.keys().isin(my_list)]]
or
df2 = df1[df1.columns[df1.columns.isin(my_list)]]
You can try:
df2 = df1[list]  # selects (projects) the columns contained in the list
df3 = df1[[col for col in df1.columns if col not in list]]
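A variant of the same idea without the list comprehension (pandas 0.21+), using a better-named list:
cols = [u"User_id", u"day", u"ZIP", u"sex"]
df2 = df1[cols]
df3 = df1.drop(columns=cols)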
I have a Pandas dataframe groupby object which looks like the following:
ID
2014-11-30 1
2
3
2014-12-31 1
2
3
4
2015-01-31 2
3
4
2015-02-28 1
3
4
5
2015-03-31 1
2
4
5
6
2015-04-30 3
4
5
6
What I want to do is create another dataframe where the values for group date x are the values that appear in each of the n previous group dates. So for instance, if n=1 and x is '2015-04-30', you would check against '2015-03-31'. If n=2 and x is '2015-02-28', you would check against ['2015-01-31', '2014-12-31'].
The resulting dataframe from the above would look like this for n=1:
ID
2014-12-31 1
2
3
2015-01-31 2
3
4
2015-02-28 3
4
2015-03-31 1
4
5
2015-04-30 4
5
6
The resulting dataframe for n=2 would be:
2015-01-31 2
3
2015-02-28 3
4
2015-03-31 4
2015-04-30 4
5
Looking forward to some pythonic solutions!
This would seem to work:
def filter_unique(df, n):
    # Collect the IDs for each group date, in order.
    data_by_date = df.groupby('date')['ID'].apply(lambda x: x.tolist())
    filtered_data = {}
    previous = []
    for i, (date, data) in enumerate(data_by_date.items()):
        if i >= n:
            # Keep only IDs present in each of the n previous groups.
            filtered_data[date] = list(set(data).intersection(*[set(x) for x in previous[i-n:]]))
        else:
            # Not enough history yet; pass the group through unfiltered.
            filtered_data[date] = data
        previous.append(data)
    result = pd.DataFrame.from_dict(filtered_data, orient='index').stack()
    result.index = result.index.droplevel(1)
    return result

filter_unique(df, 2)
1/31/15 2
1/31/15 3
1/31/15 4
11/30/14 1
11/30/14 2
11/30/14 3
12/31/14 2
12/31/14 3
2/28/15 1
2/28/15 3
3/31/15 1
3/31/15 4
4/30/15 4
4/30/15 5
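Note the listing above is ordered lexically because the dates are plain strings; converting the index to real datetimes (assuming they parse) restores chronological order:
out = filter_unique(df, 2)
out.index = pd.to_datetime(out.index)
out = out.sort_index()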