Read 4 lines of data into one row of pandas data frame - python

I have txt file with such values:
108,612,620,900
168,960,680,1248
312,264,768,564
516,1332,888,1596
I need to read all of this into a single row of a DataFrame:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
0 108 612 620 900 168 960 680 1248 312 264 768 564 516 1332 888 1596
I have many such files, so I'll keep appending rows to this DataFrame.
I believe we need some kind of regex, but I'm not able to figure it out. For now, this is what I have:
df = pd.read_csv(f,sep=",| ", header = None)
But this takes , and (space) as separators, whereas I want it to take the newline as a separator as well.

First, read the data:
df = pd.read_csv('test/t.txt', header=None)
It gives you a DataFrame shaped like the CSV. Then concatenate:
s = pd.concat((df.loc[i] for i in df.index), ignore_index=True)
It gives you a Series:
0 108
1 612
2 620
3 900
4 168
5 960
6 680
7 1248
8 312
9 264
10 768
11 564
12 516
13 1332
14 888
15 1596
dtype: int64
Finally, if you really want a horizontal DataFrame:
pd.DataFrame([s])
Gives you:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
0 108 612 620 900 168 960 680 1248 312 264 768 564 516 1332 888 1596
Since you've mentioned in a comment that you have many such files, you should simply store all the Series in a list and construct the DataFrame from all of them at once when you're finished loading, rather than appending row by row (each append copies the whole frame).
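A minimal sketch of that batch workflow (the test/*.txt glob pattern is an assumption; adjust it to your file layout). Note that grid.to_numpy().ravel() flattens the grid row by row, which is equivalent to the concat above:

import glob
import pandas as pd

rows = []
for path in glob.glob('test/*.txt'):  # hypothetical file pattern
    grid = pd.read_csv(path, header=None)
    # flatten the grid row by row into one long Series
    rows.append(pd.Series(grid.to_numpy().ravel()))

# one DataFrame row per file
result = pd.DataFrame(rows)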

How to sort dataframe rows by multiple columns [duplicate]

This question already has answers here:
How to sort a dataFrame in python pandas by two or more columns?
(3 answers)
Closed last year.
I'm having trouble formatting a dataframe in a specific style. I want the data pertaining to one S/N all clumped together. My ultimate goal with the dataset is to plot Dis vs Rate for all the S/Ns. I've tried iterating over rows to slice the data, but that hasn't worked. What would be the best (easiest) approach to this formatting? Thanks!
For example: S/N 332 has Dis 4.6 and Rate of 91.2 in the first row, immediately after that I want it to have S/N 332 with Dis 9.19 and Rate 76.2 and so on for all rows with S/N 332.
S/N Dis Rate
0 332 4.6030 91.204062
1 445 5.4280 60.233917
2 999 4.6030 91.474156
3 332 9.1985 76.212943
4 445 9.7345 31.902842
5 999 9.1985 76.212943
6 332 14.4405 77.664282
7 445 14.6015 36.261851
8 999 14.4405 77.664282
9 332 20.2005 76.725955
10 445 19.8630 40.705467
11 999 20.2005 76.725955
12 332 25.4780 31.597510
13 445 24.9050 4.897008
14 999 25.4780 31.597510
15 332 30.6670 74.096975
16 445 30.0550 35.217889
17 999 30.6670 74.096975
Edit: Tried using sort as @Ian Kenney suggested, but that doesn't help, because now the Dis values are no longer in ascending order:
0 332 4.6030 91.204062
15 332 30.6670 74.096975
3 332 9.1985 76.212943
6 332 14.4405 77.664282
9 332 20.2005 76.725955
12 332 25.4780 31.597510
1 445 5.4280 60.233917
4 445 9.7345 31.902842
7 445 14.6015 36.261851
16 445 30.0550 35.217889
10 445 19.8630 40.705467
13 445 24.9050 4.897008
Use sort_values, which can accept a list of sorting targets. In this case it sounds like you want to sort by S/N, then Dis, then Rate:
df = df.sort_values(['S/N', 'Dis', 'Rate'])
# S/N Dis Rate
# 0 332 4.6030 91.204062
# 3 332 9.1985 76.212943
# 6 332 14.4405 77.664282
# 9 332 20.2005 76.725955
# 12 332 25.4780 31.597510
# 15 332 30.6670 74.096975
# 1 445 5.4280 60.233917
# 4 445 9.7345 31.902842
# 7 445 14.6015 36.261851
# 10 445 19.8630 40.705467
# 13 445 24.9050 4.897008
# 16 445 30.0550 35.217889
# ...
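As an aside, since the stated goal is plotting Dis vs Rate per S/N, you can group and plot directly, sorted or not; a small sketch assuming matplotlib is available:

import matplotlib.pyplot as plt

# one line per serial number; groupby collects the rows for each S/N
for sn, grp in df.groupby('S/N'):
    plt.plot(grp['Dis'], grp['Rate'], label=f'S/N {sn}')
plt.xlabel('Dis')
plt.ylabel('Rate')
plt.legend()
plt.show()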
There are several ways to achieve this. Another approach, building on the existing answer, is to sort in place:
df.sort_values(by=['S/N', 'Dis', 'Rate'], inplace=True)
df
Output:
S/N Dis Rate
0 332 4.6030 91.204062
3 332 9.1985 76.212943
6 332 14.4405 77.664282
9 332 20.2005 76.725955
12 332 25.4780 31.597510
15 332 30.6670 74.096975
1 445 5.4280 60.233917
4 445 9.7345 31.902842
7 445 14.6015 36.261851
10 445 19.8630 40.705467
13 445 24.9050 4.897008
16 445 30.0550 35.217889
2 999 4.6030 91.474156
5 999 9.1985 76.212943
8 999 14.4405 77.664282
11 999 20.2005 76.725955
14 999 25.4780 31.597510
17 999 30.6670 74.096975
Here, the inplace argument of sort_values makes the changes directly in the source DataFrame, which eliminates the need to create another DataFrame to store the sorted output.
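One more note: both answers keep the original row labels (0, 3, 6, ...). If you want a clean 0..n-1 index after sorting, chain reset_index:

df = df.sort_values(['S/N', 'Dis', 'Rate']).reset_index(drop=True)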

Replace -1 in pandas series with unique values

I have a pandas series that can have non-negative integers (0, 8, 10, etc.) and -1s:
id values
1137 -1
1097 -1
201 8
610 -1
594 -1
727 -1
970 21
300 -1
243 0
715 -1
946 -1
548 4
Name: cluster, dtype: int64
I want to replace those -1s with values that don't already exist in the series and that are unique among themselves; in other words, I can't fill the same value (for example, 90) twice. What's the most Pythonic way to do that?
Here is the expected output:
id values
1137 1
1097 2
201 8
610 3
594 5
727 6
970 21
300 7
243 0
715 9
946 10
548 4
Name: cluster, dtype: int64
The idea is to generate a pool of candidate values with np.arange, sized to cover the rows to fill plus the existing positive values, then take the set difference with the existing positives and assign the result to the filtered rows:
m = df['values'] != -1
s = np.setdiff1d(np.arange(len(df) + m.sum()), df.loc[m, 'values'])
df.loc[~m, 'values'] = s[:(~m).sum()]
print(df)
id values
0 1137 1
1 1097 2
2 201 8
3 610 3
4 594 5
5 727 6
6 970 21
7 300 7
8 243 0
9 715 9
10 946 10
11 548 4
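Why the pool is big enough: np.arange(len(df) + m.sum()) generates len(df) + m.sum() candidates, and the set difference removes at most m.sum() of them, so at least len(df) values always survive. A sketch wrapping the same idea in a reusable helper (fill_unique and its sentinel argument are names of my own invention):

import numpy as np

def fill_unique(s, sentinel=-1):
    # replace sentinel entries with values not already present in s
    s = s.copy()
    mask = s != sentinel
    # pool sized so the set difference can never run dry
    pool = np.setdiff1d(np.arange(len(s) + mask.sum()), s[mask])
    s[~mask] = pool[:(~mask).sum()]
    return s

df['values'] = fill_unique(df['values'])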

Removing duplicates based on repeated column indices Python

I have a dataframe that has rows with repeated values in sequences.
For example:
df_raw
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14....
220 450 451 456 470 224 220 223 221 340 224 220 223 221 340.....
234 333 453 460 551 226 212 115 117 315 226 212 115 117 315.....
As you can see, columns 0-5 are unique in this example, and then we have the repeated sequence [220 223 221 340 224] for row 1 from columns 6-10 and again from 11-14.
This pattern is the same for row 2.
I'd like to remove the repeated sequences from each row of my dataframe (there can be more than just two repeats) to get an output like this:
df_clean
0 1 2 3 4 5 6 7 8 9.....
220 450 451 456 470 224 220 223 221 340.....
234 333 453 460 551 226 212 115 117 315.....
I trail with ...... because the rows are long and contain multiple repetitions each. I also cannot assume that each row has exactly the same number of repeated sequences, nor that each sequence starts or ends at the same index.
Is there an easy way to do this with pandas or even a numpy array?
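No answer appears above, but here is one possible sketch, assuming each row is a unique prefix followed by whole repeats of a single block (a truncated final repeat is not handled). It scans each row for the earliest position where the remaining tail is one block tiled, and keeps everything up to the first copy:

import numpy as np
import pandas as pd

def trim_repeats(row):
    arr = np.asarray(row)
    n = len(arr)
    # earliest start wins; within a start, the shortest block wins
    for start in range(n):
        for k in range(1, (n - start) // 2 + 1):
            tail = arr[start:]
            block = arr[start:start + k]
            if len(tail) % k == 0 and np.array_equal(tail, np.tile(block, len(tail) // k)):
                return arr[:start + k]
    return arr

# rows may shrink to different lengths; shorter rows are padded with NaN
df_clean = pd.DataFrame([pd.Series(trim_repeats(r)) for _, r in df_raw.iterrows()])

The double loop is O(n^3) in the worst case, which is fine for modest row widths but worth optimizing for very wide frames.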

Displaying columns from csv to pandas

I've managed to display the columns from the CSV in pandas on Python 3. However, the columns are split across three lines. Is it possible to fit all the columns onto a single line? This was done in a Jupyter notebook.
import pandas as pd
import numpy as np
raw = pd.read_csv("D:/Python/vitamin.csv")
print(raw.head())
Result
RowID Gender BMI Energy_Actual VitaminA_Actual VitaminC_Actual \
0 1 F 18.0 1330 206 15
1 2 F 25.0 1792 469 59
2 3 F 21.6 1211 317 18
3 4 F 23.9 1072 654 24
4 5 F 24.3 1534 946 118
Calcium_Actual Iron_Actual Energy_DRI VitaminA_DRI VitaminC_DRI \
0 827 22 1604 700 65
1 900 12 2011 700 65
2 707 7 2242 700 75
3 560 11 1912 700 75
4 851 12 1895 700 65
Calcium_DRI Iron_DRI
0 1300 15
1 1300 15
2 1000 8
3 1000 18
4 1300 15
Add the following line at the beginning of your code; see the documentation for pandas.set_option:
pd.set_option('display.expand_frame_repr', False)
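Two related display options may also help (both are standard pandas settings): display.width controls the total line width the repr may use, and display.max_columns controls how many columns are shown before truncation:

pd.set_option('display.width', None)        # auto-detect the available width
pd.set_option('display.max_columns', None)  # never hide columns behind '...'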

Selectively remove deprecated rows in a pandas dataframe

I have a DataFrame containing data that looks like this:
p,g,a,s,v
15,196,1399,16,5
15,196,948,5,1
15,196,1894,5,1
15,196,1616,5,1
15,196,1742,3,1
15,196,1742,4,4
15,196,1742,5,1
15,195,732,9,2
15,195,1765,11,7
15,196,1815,9,1
15,196,1399,11,8
15,196,1958,0,1
15,195,767,9,1
15,195,1765,11,8
15,195,886,9,1
15,195,1765,11,9
15,196,1958,5,1
15,196,1697,1,1
15,196,1697,4,1
Given multiple entries that have the same p, g, a, and s, I need to drop all but the one with the highest v. The reason is that the original source of this data is a kind of event log, and each line corresponds to a "new total". If it matters, the source data is ordered by time and includes a timestamp index, which I removed for brevity. The entry with the latest date would be the same as the entry with the highest v, as v only increases.
Pulling an example out of the above data, given this:
p,g,a,s,v
15,195,1765,11,7
15,195,1765,11,8
15,195,1765,11,9
I need to drop the first two rows and keep the last one.
If I understand correctly, you want the following: perform a groupby on your columns of interest, take the max of column 'v', and then call reset_index:
In [103]:
df.groupby(['p', 'g', 'a', 's'])['v'].max().reset_index()
Out[103]:
p g a s v
0 15 195 732 9 2
1 15 195 767 9 1
2 15 195 886 9 1
3 15 195 1765 11 9
4 15 196 948 5 1
5 15 196 1399 11 8
6 15 196 1399 16 5
7 15 196 1616 5 1
8 15 196 1697 1 1
9 15 196 1697 4 1
10 15 196 1742 3 1
11 15 196 1742 4 4
12 15 196 1742 5 1
13 15 196 1815 9 1
14 15 196 1894 5 1
15 15 196 1958 0 1
16 15 196 1958 5 1
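If you need to keep whole rows rather than just the grouped columns (for example, to retain the timestamp index mentioned in the question), a sketch of an alternative: sort by v and keep the last row of each group:

# within each (p, g, a, s) group, only the row with the highest v survives
df_latest = df.sort_values('v').drop_duplicates(['p', 'g', 'a', 's'], keep='last')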
