How to stack columns/series vertically in pandas - python

I currently have a dataframe that looks something like this:
entry_1 entry_2
2022-01-21 2022-02-01
2022-03-23 NaT
2022-04-13 2022-06-06
However, I need to stack my two columns vertically to get something like this:
entry
2022-01-21
2022-03-23
2022-04-13
2022-02-01
NaT
2022-06-06
I've tried using df['entry'] = df['entry_1'].append(df['entry_2']).reset_index(drop=True) without success.

I recommend that you use
pd.DataFrame(df.values.ravel(order='F'), columns=['all_entries'])
This returns the flattened underlying data as an ndarray; order='F' flattens column by column, so all the entry_1 values come before the entry_2 values, matching your desired output. Wrapping that in pd.DataFrame() converts it back to a dataframe with the column name "all_entries".
For more information please visit the pandas doc: https://pandas.pydata.org/docs/reference/api/pandas.Series.ravel.html
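A minimal, self-contained sketch of this approach, assuming the sample data from the question:
import pandas as pd

df = pd.DataFrame({
    'entry_1': pd.to_datetime(['2022-01-21', '2022-03-23', '2022-04-13']),
    'entry_2': pd.to_datetime(['2022-02-01', None, '2022-06-06']),
})

# order='F' flattens column-major, so all entry_1 values come
# before all entry_2 values rather than being interleaved row by row
result = pd.DataFrame(df.values.ravel(order='F'), columns=['all_entries'])
print(result)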

You can use pd.concat on the Series and collect the result in a dataframe like:
df['entry_1'] = pd.to_datetime(df['entry_1'])
df['entry_2'] = pd.to_datetime(df['entry_2'])
df_result = pd.DataFrame({
'entry':pd.concat([df['entry_1'], df['entry_2']], ignore_index=True)
})
or
entry_cols = ['entry_1', 'entry_2']
df_result = pd.DataFrame({
'entry':pd.concat([df[col] for col in entry_cols], ignore_index=True)
})
print(df_result)
entry
0 2022-01-21
1 2022-03-23
2 2022-04-13
3 2022-02-01
4 NaT
5 2022-06-06

Related

Subtract one datetime column after a groupby with a time reference for each group from a second Pandas dataframe

I have one dataframe df1 with one admissiontime for each id.
id admissiontime
1 2117-04-03 19:15:00
2 2117-10-18 22:35:00
3 2163-10-17 19:15:00
4 2149-01-08 15:30:00
5 2144-06-06 16:15:00
And another dataframe df2 with several datetimes for each id
id datetime
1 2135-07-28 07:50:00.000
1 2135-07-28 07:50:00.000
2 2135-07-28 07:57:15.900
3 2135-07-28 07:57:15.900
3 2135-07-28 07:57:15.900
I would like to subtract, for each id, its specific admissiontime from the datetimes, in a column of the second dataframe.
I think I have to use df2.groupby('id')['datetime'] - something, but I am struggling to connect it with df1.
Use Series.sub, mapping each id to its admissiontime from the other DataFrame with Series.map:
df1['admissiontime'] = pd.to_datetime(df1['admissiontime'])
df2['datetime'] = pd.to_datetime(df2['datetime'])
df2['diff'] = df2['datetime'].sub(df2['id'].map(df1.set_index('id')['admissiontime']))
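A runnable sketch of this with a cut-down version of the sample frames above:
import pandas as pd

df1 = pd.DataFrame({
    'id': [1, 2, 3],
    'admissiontime': ['2117-04-03 19:15:00', '2117-10-18 22:35:00',
                      '2163-10-17 19:15:00'],
})
df2 = pd.DataFrame({
    'id': [1, 1, 2, 3, 3],
    'datetime': ['2135-07-28 07:50:00.000'] * 2 + ['2135-07-28 07:57:15.900'] * 3,
})

df1['admissiontime'] = pd.to_datetime(df1['admissiontime'])
df2['datetime'] = pd.to_datetime(df2['datetime'])

# map looks up each row's admissiontime by id; sub takes the difference
df2['diff'] = df2['datetime'].sub(df2['id'].map(df1.set_index('id')['admissiontime']))
print(df2)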

How do I delete specific dataframe rows based on a columns value?

I have a pandas dataframe with 2 columns ("Date" and "Gross Margin"). I want to delete rows based on the value in the "Date" column. This is my dataframe:
Date Gross Margin
0 2021-03-31 44.79%
1 2020-12-31 44.53%
2 2020-09-30 44.47%
3 2020-06-30 44.36%
4 2020-03-31 43.69%
.. ... ...
57 2006-12-31 49.65%
58 2006-09-30 52.56%
59 2006-06-30 49.86%
60 2006-03-31 46.20%
61 2005-12-31 40.88%
I want to delete every row where the "Date" value doesn't end with "12-31". I read some similar posts on this, and the pandas DataFrame.drop() method seemed to be the solution, but I haven't figured out how to use it for this specific case.
Please leave any suggestions as to what I should do.
You can try the following code, which matches on the month and day:
df['Date'] = pd.to_datetime(df['Date'], format='%Y-%m-%d')
df = df[df['Date'].dt.strftime('%m-%d') == '12-31']
Assuming you have the dates formatted as year-month-day:
df = df[df['Date'].str.endswith('12-31')]
If the dates use a consistent format, you can also do it like this:
df = df[df['Date'].str.contains("12-31", regex=False)]
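For reference, a self-contained sketch of the datetime-based approach, using a small slice of the sample data:
import pandas as pd

df = pd.DataFrame({
    'Date': ['2021-03-31', '2020-12-31', '2020-09-30', '2005-12-31'],
    'Gross Margin': ['44.79%', '44.53%', '44.47%', '40.88%'],
})

df['Date'] = pd.to_datetime(df['Date'], format='%Y-%m-%d')
# keep only rows whose month and day are December 31st
df = df[df['Date'].dt.strftime('%m-%d') == '12-31']
print(df)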

How to reshape dataframe by extracting partial name of header in python?

I am not sure how to split a dataframe into several dataframes with certain columns based on part of the headers' names.
Here is data frame I've got.
Date 990986_125001_AA1234 990986_125002_AB2586 990986_125003_AA1234
2020-01-01 439.9 398.9 435.8
2020-05-25 443.8 390.9 438.8
2020-09-11 438.9 387.9 436.8
2020-03-27 435.2 399.2 431.5
2020-07-30 434.6 387.2 422.5
2020-08-05 432.7 377.1 432.7
I want to form three separate dataframes based on the headers.
For example, df1 should only contain columns starting with 990986_125001_******
df2 should only contain columns starting with 990986_125002_******
df3 should only contain columns starting with 990986_125003_******
The separator is the middle number (12500*), so df1's ends with 1, df2's ends with 2, and df3's ends with 3.
I have around 100 columns.
The desired output will be
df1
Date 990986_125001_AA1234
2020-01-01 439.9
2020-05-25 443.8
2020-09-11 438.9
2020-03-27 435.2
2020-07-30 434.6
2020-08-05 432.7
second dataframe
df2
Date 990986_125002_AB2586
2020-01-01 398.9
2020-05-25 390.9
2020-09-11 387.9
2020-03-27 399.2
2020-07-30 387.2
2020-08-05 377.1
third data frame
df3
Date 990986_125003_AA1234
2020-01-01 435.8
2020-05-25 438.8
2020-09-11 436.8
2020-03-27 431.5
2020-07-30 422.5
2020-08-05 432.7
I have searched on Google and Stack Overflow, but the answers only show how to reshape columns by calling the header's name, index, or iloc.
Can someone please help me reshape the dataframe with this condition?
Thanks
import pandas as pd

df = pd.read_csv("data.csv")
dict1 = {}
# start=1 so the keys line up with df1, df2, df3 in the desired output
for index, item in enumerate(df.columns.tolist()[1:], start=1):
    if "1250" in item.split("_")[1]:
        # keep the Date column alongside each data column
        dict1["df" + str(index)] = df[['Date', item]]
for key in dict1:
    print(key)
    print(dict1[key])
This will create the desired dataframes for the given test data.
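If several columns can share the same middle number, a variant that groups the columns by that token may scale better to 100 columns (a sketch, assuming the same data.csv and naming pattern):
import pandas as pd

df = pd.read_csv("data.csv")

# collect the column names that belong to each middle number
groups = {}
for col in df.columns[1:]:
    key = col.split("_")[1]  # e.g. "125001"
    groups.setdefault(key, ['Date']).append(col)

# one dataframe per middle number, each keeping the Date column
dfs = {key: df[cols] for key, cols in groups.items()}
for key, sub in dfs.items():
    print(key)
    print(sub)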

Any way to flag bad lines in pandas when reading an excel file?

pandas.read_csv has bad-lines handling (warn, error). I can't see anything similar for pandas.read_excel. Is there a reason? For example, if I wanted to read an Excel file where a column is supposed to be a datetime and the pandas.read_excel function encounters an int or str in one or a few of the rows, do I need to handle this myself?
In short, no, I do not believe there is a way to do this automatically with a parameter you pass to read_excel(). Here is how to solve your problem, though:
Let's say that when you read in your dataframe it looks like this:
df = pd.read_excel('Desktop/Book1.xlsx')
df
Date
0 2020-09-13 00:00:00
1 2
2 abc
3 2020-09-14 00:00:00
You can pass errors='coerce' to pd.to_datetime():
df['Date'] = pd.to_datetime(df['Date'], errors='coerce')
df
Date
0 2020-09-13
1 NaT
2 NaT
3 2020-09-14
Finally, you can drop those rows with:
df = df[df['Date'].notnull()]
df
Date
0 2020-09-13
3 2020-09-14
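If the goal is to flag the bad rows rather than silently drop them, the same errors='coerce' trick can isolate them first (a sketch, assuming the unparseable values are strings like those above):
import pandas as pd

df = pd.DataFrame({'Date': ['2020-09-13 00:00:00', 'abc', '??',
                            '2020-09-14 00:00:00']})
parsed = pd.to_datetime(df['Date'], errors='coerce')

# rows where parsing failed are the "bad lines"
print(df[parsed.isna()])

df['Date'] = parsed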

Pandas - finding the row with least value of one of the levels of a multiindex

So, I have a DataFrame with a multiindex which looks like this:
info1 info2 info3
abc-8182 2012-05-08 10:00:00 1 6.0 "yeah!"
2012-05-08 10:01:00 2 25.0 ":("
pli-9230 2012-05-08 11:00:00 1 30.0 "see yah!"
2012-05-08 11:15:00 1 30.0 "see yah!"
...
The index is an id and a datetime representing when that info about that id was recorded. What we needed to do was find, for each id, the earliest record. We tried a lot of the dataframe methods, but we ended up doing it by looping through the DataFrame:
import pandas

df = pandas.read_csv(...)
empty = pandas.DataFrame()
ids = df.index.get_level_values(0)
for id in ids:
    minDate = df.xs(id).index.min()
    row = df.xs(id).xs(minDate)
    mindf = pandas.DataFrame(row).transpose()
    mindf.index = pandas.MultiIndex.from_tuples([(id, minDate)])
    empty = empty.append(mindf)
print(empty.groupby(lambda x: x).first())
Which gives me:
x0 x1 x2
('abc-8182', <Timestamp: 2012-05-08 10:00:00>) 1 6 yeah!
('pli-9230', <Timestamp: 2012-05-08 11:00:00>) 1 30 see yah!
I feel that there must be a simple, "pandas idiomatic", very immediate way to do this without looping through the data frame like this. Is there? :)
Thanks.
To get the first item in each group, you can do:
df.reset_index(level=1).groupby(level=0).first()
which moves the datetime level of the index into a column before the rows are grouped by groupby, so the datetime remains as a column in the result.
If you need to ensure the earliest time is kept, you can sort, before you call first:
df.reset_index(level=1).sort_values(by="datetime").groupby(level=0).first()
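A runnable sketch of the idiom, assuming the index levels are named 'id' and 'datetime' as in the question:
import pandas as pd

idx = pd.MultiIndex.from_tuples([
    ('abc-8182', pd.Timestamp('2012-05-08 10:00:00')),
    ('abc-8182', pd.Timestamp('2012-05-08 10:01:00')),
    ('pli-9230', pd.Timestamp('2012-05-08 11:00:00')),
    ('pli-9230', pd.Timestamp('2012-05-08 11:15:00')),
], names=['id', 'datetime'])

df = pd.DataFrame({
    'info1': [1, 2, 1, 1],
    'info2': [6.0, 25.0, 30.0, 30.0],
    'info3': ['yeah!', ':(', 'see yah!', 'see yah!'],
}, index=idx)

# move datetime out of the index, sort so the earliest record comes
# first in each group, then take the first row per id
earliest = df.reset_index(level=1).sort_values('datetime').groupby(level=0).first()
print(earliest)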
