Convert text file rows into CSV columns in Python

I have a text file containing data like this:
A 34 45 7789 3475768 443 67 8999 3343 656 8876 802 383358 873 36789 2374859 485994 86960 32838459 3484549 24549 58423
T 3445 574649 68078 59348604 45959 64585304 56568 595 49686 656564 55446 665 677 778 433 545 333 65665 3535
and so on.
I want to make a CSV file from this text file, with A and T as the column headings and the numbers below them, like this:
A T
34 3445
45 574649
7789 68078
3475768 59348604
443 45959

EDIT (a much simpler solution, inspired by Michael Butscher's comment):
import pandas as pd

# read_csv takes the first line (the A row) as the header and the T row
# as the single data row; transposing turns those two rows into columns
df = pd.read_csv("filename.txt", delimiter=" ")
df.T.to_csv("filename.csv", header=False)
Here is the code:
import pandas as pd

# Read file
with open("filename.txt", "r") as f:
    data = f.read()

# Split data by lines and remove empty lines
columns = data.split("\n")
columns = [x.split() for x in columns if x != ""]

# Row sizes are different in your example so find max number of rows
column_lengths = [len(x) for x in columns]
max_col_length = max(column_lengths)

data = {}
for i in columns:
    # Add None to end for columns that have fewer values
    if len(i) < max_col_length:
        i += [None] * (max_col_length - len(i))
    data[i[0]] = i[1:]

# Create dataframe
df = pd.DataFrame(data)

# Create csv
df.to_csv("filename.csv", index=False)
Output should look like this:
A T
0 34 3445
1 45 574649
2 7789 68078
3 3475768 59348604
4 443 45959
5 67 64585304
6 8999 56568
7 3343 595
8 656 49686
9 8876 656564
10 802 55446
11 383358 665
12 873 677
13 36789 778
14 2374859 433
15 485994 545
16 86960 333
17 32838459 65665
18 3484549 3535
19 24549 None
20 58423 None
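For what it's worth, the same padding can be done with the standard library alone: itertools.zip_longest transposes the rows and fills the short ones in a single pass. A minimal sketch, assuming the same filename.txt layout as above:

import csv
from itertools import zip_longest

# Read the space-separated rows, skipping blank lines
with open("filename.txt") as f:
    rows = [line.split() for line in f if line.strip()]

# zip_longest(*rows) transposes rows into columns: the leading A/T labels
# become the header row, and shorter rows are padded with ""
with open("filename.csv", "w", newline="") as out:
    csv.writer(out).writerows(zip_longest(*rows, fillvalue=""))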

Here is my code:
import pandas as pd

data = pd.read_csv("text (3).txt", header=None)
Our_Data = pd.DataFrame(data)
for rows in Our_Data:
    # Split each whitespace-joined row, then transpose so rows become columns
    New_Data = pd.DataFrame(Our_Data[rows].str.split(' ').tolist()).T
    # Use the first row (the A/T labels) as the column headings
    New_Data.columns = New_Data.iloc[0]
    New_Data = New_Data[1:]
New_Data.to_csv("filename.csv", index=False)
The output:
A T
1 34 3445
2 45 574649
3 7789 68078
4 3475768 59348604
5 443 45959
6 67 64585304
7 8999 56568
8 3343 595
9 656 49686
10 8876 656564
11 802 55446
12 383358 665
13 873 677
14 36789 778
15 2374859 433
16 485994 545
17 86960 333
18 32838459 65665
19 3484549 3535
20 24549 None
21 58423 None

Related

Plot moving average with data [duplicate]

I am trying to calculate and plot a moving average along with the data it is calculated from:
def movingAvg(df):
    window_size = 7
    i = 0
    moving_averages = []
    while i < len(df) - window_size + 1:
        current_window = df[i : i + window_size]
        window_average = current_window.mean()
        moving_averages.append(window_average)
        i += 1
    return moving_averages
dates = df_valid['dateTime']
startDay = dates.iloc[0]
lastDay = dates.iloc[-1]
fig, ax = plt.subplots(figsize=(20, 10))
ax.autoscale()
#plt.xlim(startDay, lastDay)
df_valid.sedentaryActivityMins.reset_index(drop=True, inplace=True)
df_moving = pd.DataFrame(movingAvg(df_valid['sedentaryActivityMins']))
df_nan = [np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan]
df_nan = pd.DataFrame(df_nan)
df_moving = pd.concat([df_nan, df_moving])
plt.plot(df_valid.sedentaryActivityMins)
plt.plot(df_moving)
#plt.show()
But because the moving average uses a window of 7, the list of moving averages is shorter than the data, and therefore the two plots do not follow each other correctly.
I tried putting 7 NaN values at the front of the moving average list, but those are ignored when I plot.
The plot is as follows:
[plot omitted: both lines currently start at index 0]
But I would like the orange line to start 7 steps ahead, so it looks like this:
[desired plot omitted]
df_valid.sedentaryActivityMins.head(40)
0 608
1 494
2 579
3 586
4 404
5 750
6 573
7 466
8 389
9 604
10 351
11 553
12 768
13 572
14 616
15 522
16 675
17 607
18 229
19 529
20 746
21 646
22 625
23 590
24 572
25 462
26 708
27 662
28 649
29 626
30 485
31 509
32 561
33 664
34 517
35 587
36 602
37 601
38 495
39 352
Name: sedentaryActivityMins, dtype: int64
Any ideas as to how?
Thanks in advance!
When you do a concat, the indexes don't change, so the NaNs end up with the same indices as the first 7 observations of your series. Either reset the index after the concat or set ignore_index=True, as follows:
df_moving = pd.concat([df_nan, df_moving], ignore_index=True)
plt.plot(df_valid.sedentaryActivityMins)
plt.plot(df_moving)
This gives the expected output (plot omitted).
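As a side note, pandas can compute the moving average directly with Series.rolling, which keeps the original index and leaves the first window-1 positions as NaN, so the two lines align without any manual NaN padding. A sketch, assuming the same df_valid frame as above:

import matplotlib.pyplot as plt

# rolling(7).mean() is NaN for the first 6 positions, so the
# moving-average line starts only once a full window is available
rolling_avg = df_valid['sedentaryActivityMins'].rolling(window=7).mean()

plt.plot(df_valid['sedentaryActivityMins'])
plt.plot(rolling_avg)
plt.show()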

How can I loop through pandas groupby and manipulate data?

I am trying to work out the time delta between values in a grouped pandas df.
My df looks like this:
Location ID Item Qty Time
0 7 202545942 100130 1 07:19:46
1 8 202545943 100130 1 07:20:08
2 11 202545950 100130 1 07:20:31
3 13 202545955 100130 1 07:21:08
4 15 202545958 100130 1 07:21:18
5 18 202545963 100130 3 07:21:53
6 217 202546320 100130 1 07:22:43
7 219 202546324 100130 1 07:22:54
8 229 202546351 100130 1 07:23:32
9 246 202546376 100130 1 07:24:09
10 273 202546438 100130 1 07:24:37
11 286 202546464 100130 1 07:24:59
12 296 202546490 100130 1 07:25:16
13 297 202546491 100130 1 07:25:24
14 310 202546516 100130 1 07:25:59
15 321 202546538 100130 1 07:26:17
16 329 202546549 100130 1 07:28:09
17 388 202546669 100130 1 07:29:02
18 420 202546717 100130 2 07:30:01
19 451 202546766 100130 1 07:30:19
20 456 202546773 100130 1 07:30:27
(...)
42688 458 202546777 999969 1 06:51:16
42689 509 202546884 999969 1 06:53:09
42690 567 202546977 999969 1 06:54:21
42691 656 202547104 999969 1 06:57:27
I have grouped this using the following method:
ndf = df.groupby(['ID','Location','Time'])
If I add .size() to the end of the above and print(ndf) I get the following output:
(...)
ID Location Time
995812 696 07:10:36 1
730 07:11:41 1
761 07:12:30 1
771 07:20:49 1
995820 381 06:55:07 1
761 07:12:44 1
(...)
This is as desired.
My challenge is that I need to work out the time delta between each time per Item and add this as a column in the dataframe grouping. It should give me the following:
ID Location Time Delta
(...)
995812 696 07:10:36 0
730 07:11:41 00:01:05
761 07:12:30 00:00:49
771 07:20:49 00:08:19
995820 381 06:55:07 0
761 07:12:44 00:17:37
(...)
I am pulling my hair out trying to work out a method of doing this, so I'm turning to the greats.
Please help. Thanks in advance.
Convert the Time column to timedeltas with to_timedelta, sort by all 3 columns with DataFrame.sort_values, get the difference per group with DataFrameGroupBy.diff, and replace the missing values with a 0 timedelta using Series.fillna:
# if the Time values are already strings, astype(str) can be omitted
df['Time'] = pd.to_timedelta(df['Time'].astype(str))
df = df.sort_values(['ID','Location','Time'])
df['Delta'] = df.groupby('ID')['Time'].diff().fillna(pd.Timedelta(0))
It is also possible to convert the timedeltas to seconds by adding Series.dt.total_seconds:
df['Delta_sec'] = df.groupby('ID')['Time'].diff().dt.total_seconds().fillna(0)
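A quick self-contained check of that pipeline on a couple of made-up rows (the values here are illustrative, not taken from the real data):

import pandas as pd

df = pd.DataFrame({
    'ID': [995812, 995812, 995820, 995820],
    'Location': [696, 730, 381, 761],
    'Time': ['07:10:36', '07:11:41', '06:55:07', '07:12:44'],
})

df['Time'] = pd.to_timedelta(df['Time'])
df = df.sort_values(['ID', 'Location', 'Time'])
# diff() is NaT on the first row of each ID group; fillna turns that into 0
df['Delta'] = df.groupby('ID')['Time'].diff().fillna(pd.Timedelta(0))
print(df)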
If you just want to iterate over the groupby object, as your original question title suggests, you can do it like this:
for (x, y) in df.groupby(['ID','Location','Time']):
    print("{0}, {1}".format(x, y))
    # your logic
However, while this works fine for 10,000 or 100,000 rows, it does not scale well to 10^6 rows or more.

How to apply a function to all the columns in a dataframe and get the output as a dataframe in Python

I have two functions that do some calculation and give me results. For now, I am able to apply each one to a single column and get the result as a dataframe.
I need to know how I can apply a function to all the columns in the dataframe and likewise get the results as a dataframe.
Say I have a data frame like the one below; I need to apply the function to each column and get a dataframe with the corresponding results for all the columns.
A B C D E F
1456 6744 9876 374 65413 1456
654 2314 674654 2156 872 6744
875 653 36541 345 4963 9876
6875 7401 3654 465 3547 374
78654 8662 35 6987 6874 65413
658 94512 687 489 8756 5854
Results
A B C D E F
2110 9058 684530 2530 66285 8200
1529 2967 711195 2501 5835 16620
7750 8054 40195 810 8510 10250
85529 16063 3689 7452 10421 65787
Here is a simple example:
df
A B C D
0 10 11 12 13
1 20 21 22 23
2 30 31 32 33
3 40 41 42 43
# Assume your user-defined function is:
def mul(x, y):
    return x * y
which multiplies two values.
Let's say you want to multiply the first column 'A' by 3:
df['A'].apply(lambda x: mul(x,3))
0 30
1 60
2 90
3 120
Now, to apply the mul function to all columns of the dataframe and create a new dataframe with the results:
df1 = df.applymap(lambda x: mul(x, 3))
df1
A B C D
0 30 33 36 39
1 60 63 66 69
2 90 93 96 99
3 120 123 126 129
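Note the distinction: applymap works element-wise, while apply passes whole columns (as Series) to the function. A small sketch of the column-wise route for the same multiplication:

# apply (axis=0 by default) hands each column as a Series to the lambda;
# multiplying a Series by 3 is vectorized, so the result matches applymap
df2 = df.apply(lambda col: col * 3)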
The pd.DataFrame object also has its own apply method.
From the example given in its documentation:
>>> import numpy as np
>>> df = pd.DataFrame([[4, 9],] * 3, columns=['A', 'B'])
>>> df
A B
0 4 9
1 4 9
2 4 9
>>> df.apply(np.sqrt)
A B
0 2.0 3.0
1 2.0 3.0
2 2.0 3.0
Conclusion: you should be able to apply your function to the whole dataframe.
It looks like this is what you are trying to do, judging by your output:
df = pd.DataFrame(
    [[1456, 6744, 9876, 374, 65413, 1456],
     [654, 2314, 674654, 2156, 872, 6744],
     [875, 653, 36541, 345, 4963, 9876],
     [6875, 7401, 3654, 465, 3547, 374],
     [78654, 8662, 35, 6987, 6874, 65413],
     [658, 94512, 687, 489, 8756, 5854]],
    columns=list('ABCDEF'))

def fn(col):
    return col[:-2].values + col[1:-1].values
Apply the function as mentioned in previous answers:
>>> df.apply(fn)
A B C D E F
0 2110 9058 684530 2530 66285 8200
1 1529 2967 711195 2501 5835 16620
2 7750 8054 40195 810 8510 10250
3 85529 16063 3689 7452 10421 65787
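If I read fn right, each output row is the sum of two consecutive input rows, so the same result should also be reachable without apply by shifting the whole frame once; a hedged alternative:

# df.shift(-1) moves every row up by one, so row i of the sum is
# df[i] + df[i+1]; iloc[:-2] trims to match fn's col[:-2] + col[1:-1]
result = (df + df.shift(-1)).iloc[:-2].astype(int)
print(result)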

Python pdist: Setting an array element with a sequence

I have written the following code
arr_coord = []
for chains in structure:
    for chain in chains:
        for residue in chain:
            for atom in residue:
                x = atom.get_coord()
                arr_coord.append({'X': [x[0]], 'Y': [x[1]], 'Z': [x[2]]})

coord_table = pd.DataFrame(arr_coord)
print(coord_table)
This generates the following dataframe:
X Y Z
0 [-5.43] [28.077] [-0.842]
1 [-3.183] [26.472] [1.741]
2 [-2.574] [22.752] [1.69]
3 [-1.743] [21.321] [5.121]
4 [0.413] [18.212] [5.392]
5 [0.714] [15.803] [8.332]
6 [4.078] [15.689] [10.138]
7 [5.192] [12.2] [9.065]
8 [4.088] [12.79] [5.475]
9 [5.875] [16.117] [4.945]
10 [8.514] [15.909] [2.22]
11 [12.235] [15.85] [2.943]
12 [13.079] [16.427] [-0.719]
13 [10.832] [19.066] [-2.324]
14 [12.327] [22.569] [-2.163]
15 [8.976] [24.342] [-1.742]
16 [7.689] [25.565] [1.689]
17 [5.174] [23.336] [3.467]
18 [2.339] [24.135] [5.889]
19 [0.9] [22.203] [8.827]
20 [-1.217] [22.065] [11.975]
21 [0.334] [20.465] [15.09]
22 [0.0] [20.066] [18.885]
23 [2.738] [21.762] [20.915]
24 [4.087] [19.615] [23.742]
25 [7.186] [21.618] [24.704]
26 [8.867] [24.914] [23.91]
27 [11.679] [27.173] [24.946]
28 [10.76] [30.763] [25.731]
29 [11.517] [33.056] [22.764]
.. ... ... ...
431 [8.093] [34.654] [68.474]
432 [7.171] [32.741] [65.298]
433 [5.088] [35.626] [63.932]
434 [7.859] [38.22] [64.329]
435 [10.623] [35.908] [63.1]
436 [12.253] [36.776] [59.767]
437 [10.65] [35.048] [56.795]
438 [7.459] [34.084] [58.628]
439 [4.399] [35.164] [56.713]
440 [0.694] [35.273] [57.347]
441 [-1.906] [34.388] [54.667]
442 [-5.139] [35.863] [55.987]
443 [-8.663] [36.808] [55.097]
444 [-9.629] [40.233] [56.493]
445 [-12.886] [42.15] [56.888]
446 [-12.969] [45.937] [56.576]
447 [-14.759] [47.638] [59.485]
448 [-14.836] [51.367] [60.099]
449 [-11.607] [51.863] [58.176]
450 [-9.836] [48.934] [59.829]
451 [-8.95] [45.445] [58.689]
452 [-9.824] [42.599] [61.073]
453 [-8.559] [39.047] [60.598]
454 [-11.201] [36.341] [60.195]
455 [-11.561] [32.71] [59.077]
456 [-7.786] [32.216] [59.387]
457 [-5.785] [29.886] [61.675]
458 [-2.143] [29.222] [62.469]
459 [-0.946] [25.828] [61.248]
460 [2.239] [25.804] [63.373]
[461 rows x 3 columns]
What I intend to do is create a Euclidean distance matrix from these X, Y, and Z values. I tried to do this using the pdist function:
dist = pdist(coord_table, metric = 'euclidean')
distance_matrix = squareform(dist)
print(distance_matrix)
However, the interpreter gives the following error
ValueError: setting an array element with a sequence.
I am not sure how to interpret this error or how to fix it.
Change your loop:
arr_coord = []
for chains in structure:
    for chain in chains:
        for residue in chain:
            for atom in residue:
                x = atom.get_coord()
                # no need for single-element lists here
                arr_coord.append({'X': x[0], 'Y': x[1], 'Z': x[2]})
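With scalars in each dictionary, every cell holds a single float instead of a one-element list, so pdist receives the 2-D numeric array it expects. A sketch of the rest of the pipeline under that assumption:

import pandas as pd
from scipy.spatial.distance import pdist, squareform

coord_table = pd.DataFrame(arr_coord)  # columns X, Y, Z, one float per cell

# pdist wants a plain numeric matrix: n observations x 3 coordinates
dist = pdist(coord_table.values, metric='euclidean')
distance_matrix = squareform(dist)
print(distance_matrix)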

Format Pandas Pivot Table

I ran into a problem formatting a pivot table created by pandas.
I made a matrix table between two columns (A, B) from my source data using pandas.pivot_table, with A as the columns and B as the index.
>>> import pandas as PD
>>> import numpy as NUM
>>> df = PD.read_excel("data.xls")
>>> table = PD.pivot_table(df, index=["B"],
...                        values='Count', columns=["A"], aggfunc=[NUM.sum],
...                        fill_value=0, margins=True, dropna=True)
>>> table
It returns:
sum
A 1 2 3 All
B
1 23 52 0 75
2 16 35 12 65
3 56 0 0 56
All 95 87 12 196
And I hope to have a format like this:
A All_B
1 2 3
1 23 52 0 75
B 2 16 35 12 65
3 56 0 0 56
All_A 95 87 12 196
How should I do this? Thanks very much in advance.
The table returned by pd.pivot_table is very convenient to work with (it has a single-level index/column) and normally does NOT require any further format manipulation. But if you insist on changing the format to the one you mentioned in the post, then you need to construct a multi-level index/column using pd.MultiIndex. Here is an example of how to do it.
Before manipulation:
import pandas as pd
import numpy as np
np.random.seed(0)
a = np.random.randint(1, 4, 100)
b = np.random.randint(1, 4, 100)
df = pd.DataFrame(dict(A=a,B=b,Val=np.random.randint(1,100,100)))
table = pd.pivot_table(df, index='A', columns='B', values='Val', aggfunc=sum, fill_value=0, margins=True)
print(table)
B 1 2 3 All
A
1 454 649 770 1873
2 628 576 467 1671
3 376 247 481 1104
All 1458 1472 1718 4648
After:
multi_level_column = pd.MultiIndex.from_arrays([['A', 'A', 'A', 'All_B'], [1,2,3,'']])
multi_level_index = pd.MultiIndex.from_arrays([['B', 'B', 'B', 'All_A'], [1,2,3,'']])
table.index = multi_level_index
table.columns = multi_level_column
print(table)
A All_B
1 2 3
B 1 454 649 770 1873
2 628 576 467 1671
3 376 247 481 1104
All_A 1458 1472 1718 4648
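If the end goal is to share this formatted table, note that to_excel should preserve the two-level headers as merged cells, while to_csv flattens a MultiIndex into tuple-like strings; a short usage note (the output filename is just an example):

# Excel keeps the two-level header; CSV would flatten it
table.to_excel("formatted_pivot.xlsx")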
