I've managed to display the columns from the CSV with pandas on Python 3, but the output is being wrapped across three lines. Is it possible to fit all the columns onto a single line? This was done in a Jupyter notebook.
import pandas as pd
import numpy as np
raw = pd.read_csv("D:/Python/vitamin.csv")
print(raw.head())
Result
RowID Gender BMI Energy_Actual VitaminA_Actual VitaminC_Actual \
0 1 F 18.0 1330 206 15
1 2 F 25.0 1792 469 59
2 3 F 21.6 1211 317 18
3 4 F 23.9 1072 654 24
4 5 F 24.3 1534 946 118
Calcium_Actual Iron_Actual Energy_DRI VitaminA_DRI VitaminC_DRI \
0 827 22 1604 700 65
1 900 12 2011 700 65
2 707 7 2242 700 75
3 560 11 1912 700 75
4 851 12 1895 700 65
Calcium_DRI Iron_DRI
0 1300 15
1 1300 15
2 1000 8
3 1000 18
4 1300 15
Add the following line at the beginning of your script; see pandas.set_option for details:
pd.set_option('display.expand_frame_repr', False)
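If the frame still wraps, the display width and column limits may also need adjusting. A short sketch of two related options (both documented under pandas.set_option):
pd.set_option('display.width', None)         # auto-detect the display width
pd.set_option('display.max_columns', None)   # never truncate columns with '...'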
I'm new to Python and trying to understand data manipulation.
df
Alpha AlphaComboCount
12-99 8039
22-99 1792
12-99,138-99 1776
12-45,138-45 1585
21-99 1225
123-99 1145
121-99 1102
21-581 1000
121-99,22-99 909
32-99 814
21-141 75
12-581,12-99 711
347-99 685
2089-281 685
123-49,121-29,22-79 626
121-99,123-99,22-99 4
As you can see above, there are two columns. Alpha is a string holding one or more codes separated by commas, where each code is two parts joined by '-'. My objective is to find the aggregate percentage of AlphaComboCount by the first part of each code.
For example, for subcode 21:
Alpha AlphaComboCount Percent
21-99 1225 53%
21-141 75 3.2%
21-581 1000 43.3%
The objective, as you see above, is to get a corresponding percentage, since the total aggregate for subcode 21 here is 2300.
Where it gets more complicated is for combination codes:
123-49,121-29,22-79 626 99%
121-99,123-99,22-99 4 0.6%
As you see above, all the first subcodes are the same but rearranged. This is also a valid case for computing percentages, as long as the combination of first subcodes (the parts before '-') is the same. How can I go about getting the percentage values for all Alpha combinations? Is there an algorithm for this?
First separate the codes within a cell, then extract the first codes and group by them:
# separate the codes
tmp = df.assign(FirstCode=df.Alpha.str.split(','))
# extract the first code
tmp['FirstCode'] = [tuple(sorted(set(x.split('-')[0] for x in cell)))
                    for cell in tmp.FirstCode]
# sum per each first codes with groupby
sum_per_code = tmp['AlphaComboCount'].groupby(tmp['FirstCode']).transform('sum')
# percentage is just a simple division
tmp['Percent'] = tmp['AlphaComboCount']/sum_per_code
# let's print the output:
print(tmp.sort_values('FirstCode'))
Output:
Alpha AlphaComboCount FirstCode Percent
0 12-99 8039 (12,) 0.918743
11 12-581,12-99 711 (12,) 0.081257
2 12-99,138-99 1776 (12, 138) 0.528414
3 12-45,138-45 1585 (12, 138) 0.471586
6 121-99 1102 (121,) 1.000000
14 123-49,121-29,22-79 626 (121, 123, 22) 0.993651
15 121-99,123-99,22-99 4 (121, 123, 22) 0.006349
8 121-99,22-99 909 (121, 22) 1.000000
5 123-99 1145 (123,) 1.000000
13 2089-281 685 (2089,) 1.000000
4 21-99 1225 (21,) 0.532609
7 21-581 1000 (21,) 0.434783
10 21-141 75 (21,) 0.032609
1 22-99 1792 (22,) 1.000000
9 32-99 814 (32,) 1.000000
12 347-99 685 (347,) 1.000000
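The Percent column above is a raw fraction; if you want strings like the ones in your expected output, one possible final step:
# format 0.532609 as '53.3%' (purely cosmetic)
tmp['Percent'] = (tmp['Percent'] * 100).map('{:.1f}%'.format)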
If the Alpha column can hold several codes in different orders, one possible solution is to extract one of them (e.g. the minimal one), take the part before '-', save it in a new column and use it in further processing. Note that this keys each row by its minimal first code only, so single codes such as '121-99' land in the same group as the multi-code combinations starting with 121:
df['Alpha_1'] = df.Alpha.str.split(',')\
    .apply(lambda lst: min(lst)).str.split('-', expand=True)[0]
The result is:
Alpha AlphaComboCount Alpha_1
0 12-99 8039 12
1 22-99 1792 22
2 12-99,138-99 1776 12
3 12-45,138-45 1585 12
4 21-99 1225 21
5 123-99 1145 123
6 121-99 1102 121
7 21-581 1000 21
8 121-99,22-99 909 121
9 32-99 814 32
10 21-141 75 21
11 12-581,12-99 711 12
12 347-99 685 347
13 2089-281 685 2089
14 123-49,121-29,22-79 626 121
15 121-99,123-99,22-99 4 121
To compute the percentage of AlphaComboCount within each group (each particular value of Alpha_1), define the following function:
def proc(grp):
    return (grp.AlphaComboCount / grp.AlphaComboCount.sum()
            * 100).apply('{0:.2f}%'.format)
Group df by Alpha_1 and apply this function, saving the result in a Grp_pct column:
df['Grp_pct'] = df.groupby('Alpha_1').apply(proc).reset_index(level=0, drop=True)
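An equivalent one-liner using transform instead of a custom function gives the same result (a sketch, using the same column names):
df['Grp_pct'] = (df.AlphaComboCount
                 / df.groupby('Alpha_1').AlphaComboCount.transform('sum')
                 * 100).map('{0:.2f}%'.format)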
To inspect the result easily, with rows from each group kept together, print df sorted by Alpha_1:
print(df.sort_values('Alpha_1'))
getting:
Alpha AlphaComboCount Alpha_1 Grp_pct
0 12-99 8039 12 66.38%
2 12-99,138-99 1776 12 14.66%
3 12-45,138-45 1585 12 13.09%
11 12-581,12-99 711 12 5.87%
6 121-99 1102 121 41.73%
8 121-99,22-99 909 121 34.42%
14 123-49,121-29,22-79 626 121 23.70%
15 121-99,123-99,22-99 4 121 0.15%
5 123-99 1145 123 100.00%
13 2089-281 685 2089 100.00%
4 21-99 1225 21 53.26%
7 21-581 1000 21 43.48%
10 21-141 75 21 3.26%
1 22-99 1792 22 100.00%
9 32-99 814 32 100.00%
12 347-99 685 347 100.00%
Now compare, for example, the section for Alpha_1 == 21 with your expected result for subcode 21.
I'm learning how to use Python for data analysis, and I have my first few dataframes to work with, pulled from video games I play.
The dataframe I'm currently working with uses the header row for all the player names (8 players), and all the statistics are in the first column.
Is it better practice to have these positions reversed, i.e. should all the players be in the first column instead of the first row?
Arctic Shat Sly Snky Nanm Zax zack Sorn Cort
Statistics
Assists 470 415 388 182 212 92 40 5 4
Avg Damage Dealt 203.82 167.37 165.2 163.45 136.3 85.08 114.96 128.72 26.71
Boosts 1972 1807 1790 668 1392 471 103 7 33
Damage Dealt 236222.66 239680.08 164373.73 74696.195 99904.48 27991.652 13910.629 901.01385 1228.7041
Days 206 234 218 78 157 94 29 3 10
Head Shot Kills 395 307 219 119 130 29 12 0 0
Headshot % 26.37% 18.65% 18.96% 23.85% 19.58% 16.11% 17.14% 0% 0%
Heals 3139 4385 2516 1326 2007 749 382 15 78
K/D 1.36 1.2 1.22 1.13 0.95 0.58 0.59 0.57 0.07
Kills 1498 1646 1155 499 664 180 70 4 3
Longest Kill 461.77765 430.9177 410.534 292.18732 354.3065 287.72366 217.98175 110.25433 24.15225
Longest Time Survived 2051.842 2180.98 1984.259 1948.513 2064.065 1979.101 2051.846 1486.288 1670.048
Losses 1117 1376 959 448 709 320 119 7 46
Max Kill Streaks 4 4 4 3 4 3 3 1 1
Most Survival Time 2051.842 2180.98 1984.259 1948.513 2064.065 1979.101 2051.846 1486.288 1670.048
Revives 281 455 155 104 221 83 19 2 2
Ride Distance 1610093.4 2157408.8 1572710 486170.5 714986.3 524297 204585.53 156.07877 63669.613
Road Kills 1 4 5 4 0 0 0 0 0
Round Most Kills 9 8 9 7 9 6 5 2 1
Rounds Played 1159 1432 995 457 733 329 121 7 46
Suicides 16 42 14 6 10 4 4 0 2
Swim Distance 2830.028 4966.6914 2703.0044 1740.3292 2317.7866 1035.3792 395.86472 0 92.01848
Team Kills 22 47 23 9 15 4 5 0 2
Time Survived 969792.2 1284232.6 930141.94 328190.22 637273.3 284434.3 109724.04 4580.869 37748.414
Top10s 531 654 509 196 350 187 74 2 28
Vehicle Destroys 23 9 29 4 15 3 1 0 0
Walk Distance 1545281.6 1975185 1517812 505191 1039509.8 461860.53 170913.25 9665.322 63900.125
Weapons Acquired 5043 7226 4683 1551 2909 1514 433 23 204
Wins 55 63 48 17 32 19 3 0 3
dBNOs 1489 1575 1058 488 587 179 78 5 8
Yes, it is better to transpose.
The current best practice is to have one row for each observation (in your case, a player) and one column for each variable.
This is called "tidy data" (from the paper published by Hadley Wickham). Tidy data works more or less like a set of guidelines for us data scientists, much like normalization rules for relational database people.
Also, most frameworks, programs and data structures are implemented with this organization in mind. For instance, with a pandas DataFrame holding your data transposed this way, checking the average headshot kills would be just df['Head Shot Kills'].mean().
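For completeness, a minimal sketch of the transpose in pandas, assuming df holds the table above with players as columns and the statistics as the index:
# one row per player, one column per statistic
tidy = df.T
# per-statistic aggregates become one-liners; pd.to_numeric is needed here
# if the values were read in as strings (the '%' rows need separate cleanup)
print(pd.to_numeric(tidy['Head Shot Kills']).mean())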
I have two lists, id and st:
id = [243,2352,474, 84,443]
st = [1,3,5,9,2,6,7]
I wish to create a pandas dataframe df from them so that each value of the id list is paired with every value of the st list.
My expected output is like:
id st
243 1
243 3
243 5
243 9
243 2
243 6
243 7
2352 1
2352 3
2352 5
2352 9
2352 2
2352 6
2352 7
and so on...
How can I create this pandas dataframe?
Use itertools.product with the DataFrame constructor:
from itertools import product
# pandas 0.24+
df = pd.DataFrame(product(id, st), columns=['id', 'st'])
# pandas below 0.24
# df = pd.DataFrame(list(product(id, st)), columns=['id', 'st'])
print(df)
id st
0 243 1
1 243 3
2 243 5
3 243 9
4 243 2
5 243 6
6 243 7
7 2352 1
8 2352 3
9 2352 5
10 2352 9
11 2352 2
12 2352 6
13 2352 7
14 474 1
15 474 3
16 474 5
17 474 9
18 474 2
19 474 6
20 474 7
21 84 1
22 84 3
23 84 5
24 84 9
25 84 2
26 84 6
27 84 7
28 443 1
29 443 3
30 443 5
31 443 9
32 443 2
33 443 6
34 443 7
Use a list comprehension with the pandas.DataFrame constructor:
df = pd.DataFrame([(i, s) for i in id for s in st], columns=['id', 'st'])
[out]
id st
0 243 1
1 243 3
2 243 5
3 243 9
4 243 2
5 243 6
6 243 7
7 2352 1
8 2352 3
9 2352 5
...
25 84 2
26 84 6
27 84 7
28 443 1
29 443 3
30 443 5
31 443 9
32 443 2
33 443 6
34 443 7
The code below also works; note that sorted() orders the ids numerically, so the row order differs from the expected output above:
pd.DataFrame({'id': sorted(id * len(st)), 'st': st * len(id)})
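One caveat that applies to all three answers: id shadows Python's built-in id() function. A minimal, self-contained variant of the product approach with neutral names:
from itertools import product
import pandas as pd

ids = [243, 2352, 474, 84, 443]
sts = [1, 3, 5, 9, 2, 6, 7]
# Cartesian product: every id paired with every st value, in list order
df = pd.DataFrame(list(product(ids, sts)), columns=['id', 'st'])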
I have a txt file with values like these:
108,612,620,900
168,960,680,1248
312,264,768,564
516,1332,888,1596
I need to read all of this into a single row of a data frame.
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
0 108 612 620 900 168 960 680 1248 312 264 768 564 516 1332 888 1596
I have many such files, so I'll keep appending rows to this data frame.
I believe we need some kind of regex, but I'm not able to figure it out. For now this is what I have:
df = pd.read_csv(f, sep=",| ", header=None)
But this takes ',' and ' ' (space) as separators, whereas I want it to treat newlines as separators too.
First, read the data:
df = pd.read_csv('test/t.txt', header=None)
It gives you a DataFrame shaped like the CSV. Then concatenate:
s = pd.concat((df.loc[i] for i in df.index), ignore_index=True)
It gives you a Series:
0 108
1 612
2 620
3 900
4 168
5 960
6 680
7 1248
8 312
9 264
10 768
11 564
12 516
13 1332
14 888
15 1596
dtype: int64
Finally, if you really want a horizontal DataFrame:
pd.DataFrame([s])
Gives you:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
0 108 612 620 900 168 960 680 1248 312 264 768 564 516 1332 888 1596
Since you've mentioned in a comment that you have many such files, you should simply store all the Series in a list, and construct a DataFrame with all of them at once when you're finished loading them all.
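A sketch of that loop, assuming your file paths are collected in a list called paths:
rows = []
for path in paths:
    tmp = pd.read_csv(path, header=None)
    # flatten each file's values into a single Series
    rows.append(pd.concat((tmp.loc[i] for i in tmp.index), ignore_index=True))
# build the frame once at the end; appending row by row is much slower
result = pd.DataFrame(rows).reset_index(drop=True)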
I have a dataframe containing strings, as read from a sloppy csv:
id Total B C ...
0 56 974 20 739 34 482
1 29 479 10 253 16 704
2 86 961 29 837 43 593
3 52 687 22 921 28 299
4 23 794 7 646 15 600
What I want to do: convert every cell in the frame into a number, ignoring the whitespace but putting NaN where a cell contains something really strange.
I probably know how to do it with terribly slow manual looping and value replacement, but I was wondering if there's a nice and clean way to do this.
You can use read_csv with the regex separator \s{2,} (two or more whitespace characters) and the thousands parameter:
import pandas as pd
from io import StringIO

temp = u"""id   Total    B        C
0    56 974   20 739   34 482
1    29 479   10 253   16 704
2    86 961   29 837   43 593
3    52 687   22 921   28 299
4    23 794   7 646    15 600 """
# after testing, replace StringIO(temp) with 'filename.csv'
df = pd.read_csv(StringIO(temp), sep=r"\s{2,}", engine='python', thousands=' ')
print(df)
id Total B C
0 0 56974 20739 34482
1 1 29479 10253 16704
2 2 86961 29837 43593
3 3 52687 22921 28299
4 4 23794 7646 15600
print(df.dtypes)
id int64
Total int64
B int64
C int64
dtype: object
And then, if necessary, apply pd.to_numeric with errors='coerce', which replaces non-numeric values with NaN:
df = df.apply(pd.to_numeric, errors='coerce')
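If you already have the DataFrame of strings and re-reading the file is not an option, a possible cleanup, assuming spaces are the only thousands separators:
# strip embedded spaces, then coerce; anything non-numeric becomes NaN
df = df.apply(lambda col: pd.to_numeric(
    col.astype(str).str.replace(' ', '', regex=False), errors='coerce'))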