How to join all columns in dataframe? [duplicate] - python

This question already has answers here:
Pandas: Multiple columns into one column
(4 answers)
How to stack/append all columns into one column in Pandas? [duplicate]
(4 answers)
Closed 10 months ago.
I would like one column that combines all the other columns of the dataframe.
Here is what the dataframe looks like:
0 1 2
0 123 321 231
1 232 321 231
2 432 432 432
The dataframe is named task_ba.
I would like it to look like this:
0
0 123
1 232
2 432
3 321
4 321
5 432
6 231
7 231
8 432

Easiest and fastest option: use the underlying NumPy array:
df2 = pd.DataFrame(df.to_numpy().ravel(order='F'))
NB: order='F' flattens column-major (all of column 0, then column 1, and so on). If you prefer a Series, use pd.Series instead of pd.DataFrame.
Output:
0
0 123
1 232
2 432
3 321
4 321
5 432
6 231
7 231
8 432
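For reference, here is the same approach as a self-contained sketch (the frame construction is inferred from the question's sample data):
import pandas as pd

task_ba = pd.DataFrame({0: [123, 232, 432],
                        1: [321, 321, 432],
                        2: [231, 231, 432]})

# order='F' flattens column-major: all of column 0, then column 1, ...
df2 = pd.DataFrame(task_ba.to_numpy().ravel(order='F'))
print(df2)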

You can use pd.DataFrame.melt() and then drop the variable column:
>>> df
0 1 2
0 123 321 231
1 232 321 231
2 432 432 432
>>> df.melt().drop("variable", axis=1) # Drops the 'variable' column
value
0 123
1 232
2 432
3 321
4 321
5 432
6 231
7 231
8 432
Or if you want 0 as your column name:
>>> df.melt(value_name=0).drop("variable", axis=1)
0
0 123
1 232
2 432
3 321
4 321
5 432
6 231
7 231
8 432
You can learn all this (and more!) in the official documentation.
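If a Series is acceptable, a shorter equivalent of the same idea (assuming the same df) is:
>>> df.melt()["value"]
which returns the stacked values as a Series named value.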

Related

Get max calls by a person Pandas Python

Let's say I have a number A, and each number A calls several people B:
A B
123 987
123 987
123 124
435 567
435 789
653 876
653 876
999 654
999 654
999 654
999 123
I want to find whom each number in A has called the most, and also the number of times.
OUTPUT:
A B Count
123 987 2
435 567 or 789 1
653 876 2
999 654 3
One way to think of it is:
A B
123 987 2
124 1
435 567 1
789 1
653 876 2
999 654 3
123 1
Can somebody help me out on how to do this?
Try this
# count the unique values in rows
df.value_counts(['A','B']).sort_index()
A B
123 124 1
987 2
435 567 1
789 1
653 876 2
999 123 1
654 3
dtype: int64
To get the highest values for each unique A:
v = df.value_counts(['A','B'])
# remove duplicated rows
v[~v.reset_index(level=0).duplicated('A').values]
A B
999 654 3
123 987 2
653 876 2
435 567 1
dtype: int64
Use SeriesGroupBy.value_counts, which sorts the counts in descending order by default, then take the first row per A with GroupBy.head:
df = df.groupby('A')['B'].value_counts().groupby(level=0).head(1).reset_index(name='Count')
print (df)
A B Count
0 123 987 2
1 435 567 1
2 653 876 2
3 999 654 3
Another idea:
df = df.value_counts(['A','B']).reset_index(name='Count').drop_duplicates('A')
print (df)
A B Count
0 999 654 3
1 123 987 2
2 653 876 2
4 435 567 1
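For reference, a self-contained sketch of the drop_duplicates variant (the frame construction is inferred from the question's sample):
import pandas as pd

df = pd.DataFrame({'A': [123, 123, 123, 435, 435, 653, 653, 999, 999, 999, 999],
                   'B': [987, 987, 124, 567, 789, 876, 876, 654, 654, 654, 123]})

# value_counts sorts descending, so the first row kept per A is its most frequent B
out = (df.value_counts(['A', 'B'])
         .reset_index(name='Count')
         .drop_duplicates('A'))
print(out)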

Pandas - how to sort week and year numbers formatted as strings?

I have a pandas dataframe like this, sorted as follows:
>>> weekly_count.sort_values(by='date_in_weeks', inplace=True)
>>> weekly_count.loc[:9,:]
date_in_weeks count
0 1-2013 362
1 1-2014 378
2 1-2015 201
3 1-2016 294
4 1-2017 300
5 1-2018 297
6 10-2013 329
7 10-2014 314
8 10-2015 324
9 10-2016 322
In the above data, every value in the first column, date_in_weeks, is simply "week number of the year-year". I now want to sort it like this:
date_in_weeks count
0 1-2013 362
6 10-2013 329
1 1-2014 378
7 10-2014 314
2 1-2015 201
8 10-2015 324
3 1-2016 294
9 10-2016 322
4 1-2017 300
5 1-2018 297
How do I do this?
Use Series.argsort on the values converted to datetimes with format %W (week number of the year):
df = df.iloc[pd.to_datetime(df['date_in_weeks'] + '-0',format='%W-%Y-%w').argsort()]
print (df)
date_in_weeks count
0 1-2013 362
6 10-2013 329
1 1-2014 378
7 10-2014 314
2 1-2015 201
8 10-2015 324
3 1-2016 294
9 10-2016 322
4 1-2017 300
5 1-2018 297
You can also convert to datetime, assign it to the df, then sort the values and drop the helper column (using the same %W trick, since %M would parse minutes, not weeks):
s = pd.to_datetime(df['date_in_weeks'] + '-0', format='%W-%Y-%w')
final = df.assign(dt=s).sort_values(['dt','count']).drop(columns='dt')
print(final)
date_in_weeks count
0 1-2013 362
6 10-2013 329
1 1-2014 378
7 10-2014 314
2 1-2015 201
8 10-2015 324
3 1-2016 294
9 10-2016 322
4 1-2017 300
5 1-2018 297
You can try using auxiliary columns:
import pandas as pd
df = pd.DataFrame({'date_in_weeks':['1-2013','1-2014','1-2015','10-2013','10-2014'],
'count':[362,378,201,329,314]})
df['aux'] = df['date_in_weeks'].str.split('-')    # ['week', 'year']
df['aux_2'] = df['aux'].str.get(1).astype(int)    # year
df['aux'] = df['aux'].str.get(0).astype(int)      # week number
df = df.sort_values(['aux_2','aux'], ascending=True)  # year first, then week
df = df.drop(columns=['aux','aux_2'])
print(df)
Output:
date_in_weeks count
0 1-2013 362
3 10-2013 329
1 1-2014 378
4 10-2014 314
2 1-2015 201
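Since pandas 1.1 you can also pass a key callable to sort_values, which avoids the helper columns entirely; a hedged sketch using the same %W conversion as above:
import pandas as pd

df = pd.DataFrame({'date_in_weeks': ['1-2013', '1-2014', '1-2015', '10-2013', '10-2014'],
                   'count': [362, 378, 201, 329, 314]})

# the key converts each 'week-year' string to a real date ('-0' pins the weekday)
df = df.sort_values('date_in_weeks',
                    key=lambda s: pd.to_datetime(s + '-0', format='%W-%Y-%w'))
print(df)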

Removing duplicates based on repeated column indices Python

I have a dataframe that has rows with repeated values in sequences.
For example:
df_raw
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14....
220 450 451 456 470 224 220 223 221 340 224 220 223 221 340.....
234 333 453 460 551 226 212 115 117 315 226 212 115 117 315.....
As you can see, the columns 0-6 are unique in this example, and then we have the repeated sequence [220 223 221 340 224] in row 1 from columns 6-10 and again from column 11 onward.
This pattern is the same for row 2.
I'd like to remove the repeated sequences for each row of my dataframe (more than just 2) for an output like this:
df_clean
0 1 2 3 4 5 6 7 8 9.....
220 450 451 456 470 224 220 223 221 340.....
234 333 453 460 551 226 212 115 117 315.....
I trail off with ...... because the rows are long and each has multiple repetitions. I also cannot assume that each row has the exact same number of repeated sequences, nor that each sequence starts or ends at the exact same index.
Is there an easy way to do this with pandas or even a numpy array?
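There is no accepted answer here, but a brute-force sketch is possible, assuming each row is a unique prefix followed by one block that repeats verbatim to the end of the row:
import pandas as pd

def trim_repeats(row, min_len=2):
    # Find the earliest position where some block of length >= min_len
    # starts and then tiles the rest of the row; keep only its first copy.
    vals = list(row)
    n = len(vals)
    for start in range(n):
        for k in range(min_len, (n - start) // 2 + 1):
            block = vals[start:start + k]
            rest = vals[start + k:]
            if rest and all(rest[i] == block[i % k] for i in range(len(rest))):
                return vals[:start + k]
    return vals

df_raw = pd.DataFrame([
    [220, 450, 451, 456, 470, 224, 220, 223, 221, 340, 224, 220, 223, 221, 340],
    [234, 333, 453, 460, 551, 226, 212, 115, 117, 315, 226, 212, 115, 117, 315],
])
df_clean = pd.DataFrame([trim_repeats(r) for r in df_raw.to_numpy()])
print(df_clean)
If trimmed rows end up with different lengths, the shorter ones are padded with NaN; the quadratic scan is fine for modest row widths but would need a smarter periodicity check for very wide frames.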

Read 4 lines of data into one row of pandas data frame

I have txt file with such values:
108,612,620,900
168,960,680,1248
312,264,768,564
516,1332,888,1596
I need to read all of this into a single row of data frame.
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
0 108 612 620 900 168 960 680 1248 312 264 768 564 516 1332 888 1596
I have many such files and so I'll keep appending rows to this data frame.
I believe we need some kind of regex but I'm not able to figure it out. For now this is what I have:
df = pd.read_csv(f,sep=",| ", header = None)
But this takes , and (space) as separators, whereas I want it to take the newline as the separator.
First, read the data:
df = pd.read_csv('test/t.txt', header=None)
It gives you a DataFrame shaped like the CSV. Then concatenate:
s = pd.concat((df.loc[i] for i in df.index), ignore_index=True)
It gives you a Series:
0 108
1 612
2 620
3 900
4 168
5 960
6 680
7 1248
8 312
9 264
10 768
11 564
12 516
13 1332
14 888
15 1596
dtype: int64
Finally, if you really want a horizontal DataFrame:
pd.DataFrame([s])
Gives you:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
0 108 612 620 900 168 960 680 1248 312 264 768 564 516 1332 888 1596
Since you've mentioned in a comment that you have many such files, you should simply store all the Series in a list, and construct a DataFrame with all of them at once when you're finished loading them all.
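A hedged sketch of that pattern (the file location and glob pattern are hypothetical):
import glob
import pandas as pd

rows = []
for path in glob.glob('data/*.txt'):  # hypothetical location of the input files
    grid = pd.read_csv(path, header=None)
    # flatten the grid row-major into one long Series, as above
    rows.append(pd.Series(grid.to_numpy().ravel()))

# build one DataFrame, one row per file; much cheaper than appending row by row
result = pd.DataFrame(rows).reset_index(drop=True)
print(result)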

Python pandas pivot matplotlib

I created the following python pandas pivot table.
df_pv = pd.pivot_table(df,index=["Fiscal_Week"],columns=["Year"],values=["Category","Sales","Traffic"],
aggfunc={"Category":len,"Sales":np.sum,"Traffic":np.sum},fill_value=0)
            Category          Sales                  Traffic
Year        2014 2015 2016    2014  2015  2016       2014 2015 2016
Fiscal_Week
FW01 4 3 4 35678 654654 47547 567 231 765
FW02 2 6 7 6565 4686 34554 297 464 564
FW03 4 4 5 5867 56856 34346 287 45 324
FW04 2 5 3 8568 45745 3564 546 765 978
FW05 2 5 5 5685 3464 4754 325 235 654
FW06 4 3 2 56765 35663 3643 456 935 936
FW07 1 6 2 8686 2454 2463 324 728 598
FW08 6 2 3 34634 34543 4754 198 436 234
I would like to create the two following plots:
Scatterplot: number of campaigns vs. sales, with each year in its own color.
The second graph should be Traffic by Fiscal_Week.
I tried this, unsuccessfully:
df_pv.plot(x="Fiscal_Week", y="Sales")
KeyError: 'Fiscal_Week'
Is there a better way, for example to skip the pivot and instead specify the aggregations within the plotting call?
You're trying to use the index as a normal column, which is not possible.
Ways to overcome this:
Reset the index: reset_index()
Use the index explicitly: .plot(x=df_pv.index, y="Sales")
Use the index implicitly: .plot(y="Sales", use_index=True)
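A hedged sketch of both plots using the first option (this assumes the pivot's Year level holds the integers 2014-2016, and uses Category as the campaign count):
import matplotlib.pyplot as plt

flat = df_pv.reset_index()  # 'Fiscal_Week' becomes a regular column

fig, ax = plt.subplots()
for year in [2014, 2015, 2016]:
    # ('Category', year) and ('Sales', year) index into the pivot's column MultiIndex
    ax.scatter(flat[('Category', year)], flat[('Sales', year)], label=str(year))
ax.set_xlabel('Number of campaigns')
ax.set_ylabel('Sales')
ax.legend(title='Year')

# Traffic by fiscal week, one line per year, using the index implicitly
df_pv['Traffic'].plot()
plt.show()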
