My DataFrame has 6 columns of dates, which I want to combine into 1 column.
[DataFrame image here]
The code I tried for making another column is below:
df['Mega'] = df['Mega'].append(df['RsWeeks','RsMonths','RsDays','PsWeeks','PsMonths','PsDays'])
I am new to Python and pandas and would like to learn more, so please point me to some sources too; I am really bad at debugging, as I have no programming background.
The pandas documentation is a great source of good examples, with plenty of visuals.
For your particular case:
We construct a sample DataFrame:
import pandas as pd
df = pd.DataFrame([
    {"RsWeeks": "2015-11-10", "RsMonths": "2016-08-01"},
    {"RsWeeks": "2015-11-11", "RsMonths": "2015-12-30"}
])
print("DataFrame preview:")
print(df)
Output:
DataFrame preview:
RsWeeks RsMonths
0 2015-11-10 2016-08-01
1 2015-11-11 2015-12-30
We concatenate the columns RsWeeks and RsMonths to create a Series:
my_series = pd.concat([df["RsWeeks"], df["RsMonths"]], ignore_index=True)
print("\nSeries preview:")
print(my_series)
Output:
Series preview:
0 2015-11-10
1 2015-11-11
2 2016-08-01
3 2015-12-30
Edit
If you really need to add the new Series as a column to your DataFrame, you can do the following:
df2 = pd.DataFrame({"Mega": my_series})
df = pd.concat([df, df2], axis=1)
print("\nDataFrame preview:")
print(df)
Output:
DataFrame preview:
RsWeeks RsMonths Mega
0 2015-11-10 2016-08-01 2015-11-10
1 2015-11-11 2015-12-30 2015-11-11
2 NaN NaN 2016-08-01
3 NaN NaN 2015-12-30
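If your real DataFrame has all six date columns from your question, the same idea generalizes; here is a sketch (column names taken from your post, so adjust them if they differ):
cols = ['RsWeeks', 'RsMonths', 'RsDays', 'PsWeeks', 'PsMonths', 'PsDays']
# stack all six columns end to end into one long Series
mega = pd.concat([df[c] for c in cols], ignore_index=True)
# attach it as a new column; rows beyond the original length get NaN
df = pd.concat([df, mega.rename('Mega')], axis=1)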
Data:
df = pd.DataFrame({"name" : 'Dav Las Oms'.split(),
'age' : [25, 50, 70]})
df['Name'] = list(['a', 'M', 'm'])
df:
name age Name
0 Dav 25 a
1 Las 50 M
2 Oms 70 m
df = pd.DataFrame(df.astype(str).apply('|'.join, axis=1))
df:
0
0 Dav|25|a
1 Las|50|M
2 Oms|70|m
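If you later need the original columns back, here is a sketch using str.split (this assumes '|' never occurs inside a value):
# the joined frame has a single integer-named column, 0
recovered = df[0].str.split('|', expand=True)
recovered.columns = ['name', 'age', 'Name']
print(recovered)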
You can use pd.melt(), which reshapes your DataFrame from wide to long:
df_reshaped = pd.melt(df, id_vars=['id_1', 'id_2', 'id_3'], var_name='new_name', value_name='Mega')
Here id_1, id_2 and id_3 are placeholders for whatever identifier columns you want to keep; a runnable sketch follows.
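A minimal, self-contained sketch with two of the question's column names; value_vars lists the columns to stack:
import pandas as pd

df = pd.DataFrame({"RsWeeks": ["2015-11-10", "2015-11-11"],
                   "RsMonths": ["2016-08-01", "2015-12-30"]})

# stack the listed columns into a single 'Mega' column; the source
# column name of each value is kept in 'new_name'
df_reshaped = pd.melt(df, value_vars=["RsWeeks", "RsMonths"],
                      var_name="new_name", value_name="Mega")
print(df_reshaped)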
Related
I'm trying to read data from https://download.bls.gov/pub/time.series/ee/ee.industry using pandas, like this:
import pandas as pd
url = 'https://download.bls.gov/pub/time.series/ee/ee.industry'
df = pd.read_csv(url, sep='\t')
Also tried getting the separator:
import pandas as pd
url = 'https://download.bls.gov/pub/time.series/ee/ee.industry'
reader = pd.read_csv(url, sep = None, iterator = True)
inferred_sep = reader._engine.data.dialect.delimiter
df = pd.read_csv(url, sep=inferred_sep)
However, the data is not well formatted. The columns of the DataFrame are right:
>>> df.columns
Index(['industry_code', 'SIC_code', 'publishing_status', 'industry_name'], dtype='object')
But the data does not correspond to the columns; it seems all the data is merged into the first two columns, and the last two do not have any data.
Any suggestion/idea on a better approach to get this data?
EDIT
The expected result should be something like:
industry_code  SIC_code  publishing_status  industry_name
000000         N/A       B                  Total nonfarm 1 T 1
The reader works well but you don’t have the right number of columns in your header. You can get the other columns back using .reset_index() and then rename the columns:
>>> df = pd.read_csv(url, sep='\t')
>>> n_missing_headers = df.index.nlevels
>>> cols = df.columns.to_list() + [f'col{n}' for n in range(n_missing_headers)]
>>> df.reset_index(inplace=True)
>>> df.columns = cols
>>> df.head()
industry_code SIC_code publishing_status industry_name col0 col1 col2
0 0 NaN B Total nonfarm 1 T 1
1 5000 NaN A Total private 1 T 2
2 5100 NaN A Goods-producing 1 T 3
3 100000 10-14 A Mining 2 T 4
4 101000 10 A Metal mining 3 T 5
You can then keep the first 4 columns if you want:
>>> df.iloc[:, :-n_missing_headers].head()
industry_code SIC_code publishing_status industry_name
0 0 NaN B Total nonfarm
1 5000 NaN A Total private
2 5100 NaN A Goods-producing
3 100000 10-14 A Mining
4 101000 10 A Metal mining
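To see why df.index.nlevels equals the number of missing header names, here is a tiny self-contained sketch (made-up data, not the BLS file): when the header row has fewer fields than the data rows, pandas parses the surplus leading fields as a MultiIndex.
import io
import pandas as pd

# header row has 2 names, but each data row has 4 fields; pandas
# turns the 2 surplus leading fields into a 2-level index
data = "a\tb\n1\t2\t3\t4\n5\t6\t7\t8\n"
df = pd.read_csv(io.StringIO(data), sep="\t")
print(df.index.nlevels)  # 2 -> the number of headerless columns
print(df.reset_index())  # recovers them as ordinary columns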
How do I merge a column from one pandas DataFrame into another DataFrame when the new column has fewer rows? Specifically, I need the new column to be filled with NaN in the first few rows of the merged DataFrame instead of the last few rows. Please refer to the picture. Thanks.
Use:
df1 = pd.DataFrame({
    'A': list('abcdef'),
    'B': [4, 5, 4, 5, 5, 4],
})
df2 = pd.DataFrame({
    'SMA': list('rty')
})
df3 = df1.join(df2.set_index(df1.index[-len(df2):]))
Or:
df3 = pd.concat([df1, df2.set_index(df1.index[-len(df2):])], axis=1)
print (df3)
A B SMA
0 a 4 NaN
1 b 5 NaN
2 c 4 NaN
3 d 5 r
4 e 5 t
5 f 4 y
How it works:
First, the last len(df2) index values of df1 are selected:
print (df1.index[-len(df2):])
RangeIndex(start=3, stop=6, step=1)
Then df2's existing index is overwritten with those values via DataFrame.set_index:
print (df2.set_index(df1.index[-len(df2):]))
SMA
3 r
4 t
5 y
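An equivalent sketch that leaves df2's index alone and instead assigns through .loc on the tail of df1's index (the .to_numpy() call sidesteps index alignment):
df3 = df1.copy()
# write df2's values into the last len(df2) rows; the rest stay NaN
df3.loc[df1.index[-len(df2):], 'SMA'] = df2['SMA'].to_numpy()
print(df3)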
I am facing a problem that I am incapable of finding a way around.
I find it very difficult to explain what I am trying to do, so hopefully a small example will help.
I have df1 as such:
Id product_1 product_2
Date
1 0.1855672 0.8855672
2 0.1356667 0.0356667
3 1.1336686 1.7336686
4 0.9566671 0.6566671
and I have df2 as such:
product_1 Month
Date
2018-03-30 11.0 3
2018-04-30 18.0 4
2019-01-29 14.0 1
2019-02-28 22.0 2
and what I am trying to achieve is this in df2:
product_1 Month seasonal_index
Date
2018-03-30 11.0 3 1.1336686
2018-04-30 18.0 4 0.9566671
2019-01-29 14.0 1 0.1855672
2019-02-28 22.0 2 0.1356667
So what I am trying to do is match the product name in df2 with the corresponding column in df1, and then, for each row, get the value whose index matches the month number in df2.
I have tried doing things like:
for i in df1:
df2['seasonal_index'] = df1.loc[df1.iloc[:,i] == df2['Month']]
but with no success. Hopefully someone has a clue on how to unblock the situation.
Here you are, my friend; this produces exactly the output you specified.
import pandas as pd
# replicate df1
data1 = [[0.1855672, 0.8855672],
         [0.1356667, 0.0356667],
         [1.1336686, 1.7336686],
         [0.9566671, 0.6566671]]
index1 = [1, 2, 3, 4]
df = pd.DataFrame(data=data1,
                  index=index1,
                  columns=['product_1', 'product_2'])
df.columns.name = 'Id'
df.index.name = 'Date'
# replicate df2
data2 = [[11.0, 3],
         [18.0, 4],
         [14.0, 1],
         [22.0, 2]]
index2 = [pd.Timestamp('2018-03-30'),
          pd.Timestamp('2018-04-30'),
          pd.Timestamp('2019-01-29'),
          pd.Timestamp('2019-02-28')]
df2 = pd.DataFrame(data=data2, index=index2,
                   columns=['product_1', 'Month'])
df2.index.name = 'Date'
# Merge your data
df3 = pd.merge(left=df2, right=df[['product_1']],
               left_on='Month',
               right_index=True,
               how='outer',
               suffixes=('', '_df2'))
df3 = df3.rename(columns={'product_1_df2': 'seasonal_index'})
print(df3)
If you are interested in learning why this works, take a look at this link explaining the pandas.merge function. Notice specifically that for your dataframes, the key for df2 is one of its columns (so we use the left_on parameter in pd.merge) and the key for df is its index (so we use the right_index parameter in pd.merge).
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html
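The same join can also be written in method form; how='left' keeps df2's row order, which matches the output you specified (names as in the answer above):
df3 = (df2.merge(df[['product_1']], left_on='Month', right_index=True,
                 how='left', suffixes=('', '_seasonal'))
          .rename(columns={'product_1_seasonal': 'seasonal_index'}))
print(df3)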
I have two dataframes
The code for the two dfs is below:
import pandas as pd
df1 = pd.DataFrame({'income1': [-13036.0, 1200.0, -12077.5, 1100.0],
                    'income2': [-30360.0, 2000.0, -2277.5, 1500.0]})
df2 = pd.DataFrame({'name1': ['abc', 'deb', 'hghg', 'gfgf'],
                    'name2': ['dfd', 'dfd1', 'df3df', 'fggfg']})
I want to combine the 2 dfs to get a single df with names against the respective income values, as shown below. Any help is appreciated. Please note that I want it in the same sequence as shown in my output.
It is possible to convert the values to numpy arrays, flatten them with np.ravel, and pass them to the DataFrame constructor:
import numpy as np

df = pd.DataFrame({'Name': np.ravel(df2.to_numpy()),
                   'Income': np.ravel(df1.to_numpy())})
print (df)
Name Income
0 abc -13036.0
1 dfd -30360.0
2 deb 1200.0
3 dfd1 2000.0
4 hghg -12077.5
5 df3df -2277.5
6 gfgf 1100.0
7 fggfg 1500.0
Or use concat with DataFrame.stack and Series.reset_index for default index values:
df = pd.concat([df2.stack().reset_index(drop=True),
                df1.stack().reset_index(drop=True)],
               axis=1, keys=['Name', 'Income'])
print (df)
Name Income
0 abc -13036.0
1 dfd -30360.0
2 deb 1200.0
3 dfd1 2000.0
4 hghg -12077.5
5 df3df -2277.5
6 gfgf 1100.0
7 fggfg 1500.0
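The stack route gives the interleaved sequence you asked for because stack emits values row by row; a quick look at the intermediate result shows this:
print(df2.stack())
# 0  name1    abc
#    name2    dfd
# 1  name1    deb
#    name2    dfd1
# ...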
Try this:
incomes = pd.concat([df1.income1, df1.income2], ignore_index=True)
names = pd.concat([df2.name1, df2.name2], ignore_index=True)
df = pd.DataFrame({'Name': names, 'Incomes': incomes})
ignore_index=True gives both Series a clean 0..7 index so they line up in the constructor. Note this stacks all of name1 before name2, so the row order differs from the interleaved output you specified.
Is there a better way than bdate_range() to measure business days between two columns of dates via pandas?
df = pd.DataFrame({'A': ['1/1/2013', '2/2/2013', '3/3/2013'],
                   'B': ['1/12/2013', '4/4/2013', '3/3/2013']})
print(df)
df['A'] = pd.to_datetime(df['A'])
df['B'] = pd.to_datetime(df['B'])
f = lambda x: len(pd.bdate_range(x['A'], x['B']))
df['DIFF'] = df.apply(f, axis=1)
print(df)
With output of:
A B
0 1/1/2013 1/12/2013
1 2/2/2013 4/4/2013
2 3/3/2013 3/3/2013
A B DIFF
0 2013-01-01 00:00:00 2013-01-12 00:00:00 9
1 2013-02-02 00:00:00 2013-04-04 00:00:00 44
2 2013-03-03 00:00:00 2013-03-03 00:00:00 0
Thanks!
brian_the_bungler was onto the most efficient way of doing this using numpy's busday_count:
import numpy as np
A = [d.date() for d in df['A']]
B = [d.date() for d in df['B']]
df['DIFF'] = np.busday_count(A, B)
print(df)
On my machine this is 300x faster on your test case, and thousands of times faster on much larger arrays of dates.
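A self-contained variant that passes the columns directly as datetime64[D] arrays, skipping the Python-level list comprehensions. Note that np.busday_count counts days in the half-open interval [A, B), so when B falls on a business day the result is one less than the inclusive len(pd.bdate_range(A, B)):
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': pd.to_datetime(['1/1/2013', '2/2/2013', '3/3/2013']),
                   'B': pd.to_datetime(['1/12/2013', '4/4/2013', '3/3/2013'])})
# busday_count requires day precision, hence the datetime64[D] cast
df['DIFF'] = np.busday_count(df['A'].values.astype('datetime64[D]'),
                             df['B'].values.astype('datetime64[D]'))
print(df)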
You can use pandas' Bday offset to step through business days between two dates like this:
new_column = some_date - pd.tseries.offsets.BDay(15)
Read more in this conversation: https://stackoverflow.com/a/44288696
It also works if some_date is a single date value, not a series.
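A minimal sketch of the offset in action (the date is arbitrary):
import pandas as pd

some_date = pd.Timestamp('2013-01-16')
# steps back 15 business days, skipping weekends (but not holidays)
print(some_date - pd.tseries.offsets.BDay(15))  # 2012-12-26 00:00:00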