Create list headings in pandas - Python

I have just started with notebooks and Python. When I print my list I just get the values and the number of times they occur.
I would like to know if I could get hour as a header above 12 and 8, and count as a header above 7 and 3.
x=df['hour'].value_counts()
print(x[0:2])
12 7
8 3
How I want it:
hour  count
12    7
8     3
At the moment I am getting this below my results: Name: hour, dtype: int64
/Lisa

I am not sure, since the question is not entirely clear, but as I understood it, here is my solution:
df[['hour', '_column_tobe_counted_']].groupby(['hour']).agg(['count']).reset_index()
Hope this helps.
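As an alternative sketch (the data below is made up to stand in for Lisa's dataframe), you could turn the value_counts Series into a small DataFrame so the printout shows hour and count headers instead of the Name/dtype footer:
import pandas as pd

# Made-up data standing in for the real dataframe.
df = pd.DataFrame({'hour': [12] * 7 + [8] * 3})

# value_counts returns a Series; naming the index and resetting it into a
# column gives "hour" / "count" headers and drops the Name/dtype footer.
counts = df['hour'].value_counts().rename_axis('hour').reset_index(name='count')
print(counts.head(2))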

Related

Cluster similar - but not identical - digits in pandas dataframe

I have a pandas dataframe with 2M+ rows. One of the columns, pin, contains a series of 14 digits.
I'm trying to cluster similar — but not identical — digits. Specifically, I want to match the first 10 digits without regard to the final four. The pin column was imported as an int then converted to a string.
Put another way, rows should be grouped when their first 10 digits match, regardless of the final four, and exact-matching duplicate pins should be dropped.
For example these should all be grouped together:
17101110141403
17101110141892
17101110141763
17101110141199
17101110141788
17101110141851
17101110141831
17101110141487
17101110141914
17101110141843
Desired output:
Biggest cluster | other columns
Second biggest cluster | other columns
...and so on | other columns
I've tried using a combination of groupby and regex without success.
pat2 = r'1710111014\d\d\d\d'
pat = r'\d\d\d\d\d\d\d\d\d\d\d\d\d\d'
grouped = df2.groupby(df2['pin'].str.extract(pat, expand=False), axis=1)
and
df.groupby(['pin']).filter(lambda group: re.match > 1)
Here's a link to the original data set: https://datacatalog.cookcountyil.gov/Property-Taxation/Assessor-Parcel-Sales/wvhk-k5uv
It's not clear why you need regex for this; what about the following (assuming pin is stored as a string)? Note that you haven't included your expected output.
pin
0 17101110141403
1 17101110141892
2 17101110141763
3 17101110141199
4 17101110141788
5 17101110141851
6 17101110141831
7 17101110141487
8 17101110141914
9 17101110141843
df.groupby(df['pin'].str[:10]).size()
pin
1710111014 10
dtype: int64
If you want this information appended back to your original dataframe, you can use
df['size']=df.groupby(df['pin'].astype(str).str[:10])['pin'].transform(len)
pin size
0 17101110141403 10
1 17101110141892 10
2 17101110141763 10
3 17101110141199 10
4 17101110141788 10
5 17101110141851 10
6 17101110141831 10
7 17101110141487 10
8 17101110141914 10
9 17101110141843 10
Then, assuming you have more columns, you can sort your dataframe by cluster size, biggest first, with
df.sort_values('size', ascending=False)
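Putting the pieces together, here is a minimal sketch of the whole flow (drop exact-duplicate pins, cluster on the first 10 digits, biggest cluster on top). The stand-in data below replaces the linked Cook County file, and treating pin as a string is an assumption:
import pandas as pd

# Stand-in data; in practice df would be read from the linked Cook County CSV.
df = pd.DataFrame({'pin': ['17101110141403', '17101110141403',   # exact duplicate
                           '17101110141892', '17101110141763',
                           '20010120010001', '20010120010002']})

df = df.drop_duplicates(subset='pin')                       # drop exact-matching pins
cluster = df['pin'].str[:10]                                # first 10 digits define a cluster
df['size'] = df.groupby(cluster)['pin'].transform('size')   # cluster size per row
df = df.sort_values('size', ascending=False)                # biggest cluster first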

Merging two dataframes based on index

I've been on this all night and just can't figure it out, even though I know it should be simple. So, my sincerest apologies for the following incantation from a sleep-deprived fellow:
I have four fields: Employee ID, Name, Station and Shift (ID is a non-null integer, the rest are strings or null).
I have about 10 dataframes, all indexed by ID, each containing only two columns: either (Name and Station) or (Name and Shift).
Now, of course, I want to combine all of this into one dataframe which has a unique row for each ID.
But I'm really frustrated by it at this point (especially because I can't find a way to directly check how many unique indices my final dataframe ends up with).
After messing around with some very ugly ways of using .merge(), I finally found .concat(). But it keeps making multiple rows per ID: when I check in Excel, the indices are like Table1/1234, Table2/1234, etc. One row has the shift, the other has the station, which is precisely what I'm trying to avoid.
How do I compile all my data into one dataframe, with exactly one row per ID? Preferably without using 9 different merge statements, as I have to scale up later.
If I understand your question correctly, this is what you want.
For example, with these 3 dataframes:
In [1]: df1
Out[1]:
0 1 2
0 3.588843 3.566220 6.518865
1 7.585399 4.269357 4.781765
2 9.242681 7.228869 5.680521
3 3.600121 3.931781 4.616634
4 9.830029 9.177663 9.842953
5 2.738782 3.767870 0.925619
6 0.084544 6.677092 1.983105
7 5.229042 4.729659 8.638492
8 8.575547 6.453765 6.055660
9 4.386650 5.547295 8.475186
In [2]: df2
Out[2]:
0 1
0 95.013170 90.382886
2 1.317641 29.600709
4 89.908139 21.391058
6 31.233153 3.902560
8 17.186079 94.768480
In [3]: df
Out[3]:
0 1 2
0 0.777689 0.357484 0.753773
1 0.271929 0.571058 0.229887
2 0.417618 0.310950 0.450400
3 0.682350 0.364849 0.933218
4 0.738438 0.086243 0.397642
5 0.237481 0.051303 0.083431
6 0.543061 0.644624 0.288698
7 0.118142 0.536156 0.098139
8 0.892830 0.080694 0.084702
9 0.073194 0.462129 0.015707
You can do
pd.concat([df,df1,df2], axis=1)
This produces
In [6]: pd.concat([df,df1,df2], axis=1)
Out[6]:
0 1 2 0 1 2 0 1
0 0.777689 0.357484 0.753773 3.588843 3.566220 6.518865 95.013170 90.382886
1 0.271929 0.571058 0.229887 7.585399 4.269357 4.781765 NaN NaN
2 0.417618 0.310950 0.450400 9.242681 7.228869 5.680521 1.317641 29.600709
3 0.682350 0.364849 0.933218 3.600121 3.931781 4.616634 NaN NaN
4 0.738438 0.086243 0.397642 9.830029 9.177663 9.842953 89.908139 21.391058
5 0.237481 0.051303 0.083431 2.738782 3.767870 0.925619 NaN NaN
6 0.543061 0.644624 0.288698 0.084544 6.677092 1.983105 31.233153 3.902560
7 0.118142 0.536156 0.098139 5.229042 4.729659 8.638492 NaN NaN
8 0.892830 0.080694 0.084702 8.575547 6.453765 6.055660 17.186079 94.768480
9 0.073194 0.462129 0.015707 4.386650 5.547295 8.475186 NaN NaN
For more details you might want to see the documentation for pd.concat.
Just a tip: putting simple illustrative data in your question always helps in getting an answer.
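For the original setup (roughly 10 dataframes indexed by ID, each holding Name plus either Station or Shift), one hedged option is to fold them together with combine_first, which aligns on the index and fills in missing columns, so each ID ends up on exactly one row. The small dataframes below are hypothetical stand-ins:
from functools import reduce
import pandas as pd

# Hypothetical stand-ins for the ~10 ID-indexed dataframes.
df_station = pd.DataFrame({'Name': ['Ann', 'Bob'], 'Station': ['North', 'South']},
                          index=[1234, 1235])
df_shift = pd.DataFrame({'Name': ['Ann', 'Cid'], 'Shift': ['Day', 'Night']},
                        index=[1234, 1236])
frames = [df_station, df_shift]  # ...extend with the remaining dataframes

# combine_first keeps the left value where both frames have one and fills the
# gaps from the right, so every ID appears on exactly one row.
combined = reduce(lambda left, right: left.combine_first(right), frames)
print(combined.index.nunique(), "unique IDs")  # also checks the unique-index count directly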

how to flip columns in reverse order using shell scripting/python

Dear experts, I have a small problem where I just want to reverse the columns. For example, I have a data set arranged in columns and I need to put the last column first, and so on in reverse order. How can this be done? I hope some expert will answer my question. Thanks.
Input data example:
1 2 3 4 5
6 7 8 9 0
3 4 5 2 1
5 6 7 2 3
I need output like below:
5 4 3 2 1
0 9 8 7 6
1 2 5 4 3
3 2 7 6 5
Perl to the rescue!
perl -lane 'print join " ", reverse @F' < input-file
-n reads the file line by line, running the code specified after -e for each line
-l removes newlines from input and adds them to output
-a splits the input line on whitespace, populating the @F array
reverse reverses the array
join turns a list to a string
What is the type of your data? In Python, if your data is a NumPy array, then just do data[:, ::-1]. Slicing with [::-1] also works on a plain list, but only along the first dimension, obviously. In fact, this is the general behaviour of a Python slice (start, stop, stride) with start and stop omitted; it works with any object that supports indexing.
But if this is the only data manipulation you have to do, it may be overkill to use Python for it. That said, it may be more efficient than raw string manipulation (using Perl or whatever) depending on the size of your data.
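As a rough illustration of that NumPy route (the file names are assumptions, and the data is taken to be purely numeric, as in the example):
import numpy as np

# Load the whitespace-separated columns, reverse the column order, write them back out.
data = np.loadtxt("input-file")
np.savetxt("reversed-file", data[:, ::-1], fmt="%g")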

How to replace 3 largest values sorted by index in a column

I have an easy question, but I've been struggling with the answer. I have a DataFrame in which I want to replace the 3 largest values with their 7-day rolling means, but in index order. So for a DataFrame like this one:
Sales
2
4
6
8
10
12
14
100
100
200
I want to first replace the two rows with 100 in Sales, and then the row with 200. I tried the following:
df.Sales.replace(df.Sales.nlargest(3).sort_index(),df.Sales.rolling(window=7).mean())
But it raises the following error:
AttributeError: 'numpy.float64' object has no attribute 'replace'
I know that this works:
df.Sales.replace(df.Sales.max(),df.Sales.rolling(window=7).mean())
And I could do that 3 times, but I have the problem that it would replace 200 first, and then the others, so it isn't exactly what I need.
I guess something like this would work:
for i in df.Sales.nlargest(3).sort_index():
    df.Sales.replace(i, df.Sales.rolling(window=7).mean())
But I would rather avoid loops. Is that possible?
EDIT: expected output would be:
Sales
2
4
6
8
10
12
14
8
8.86
9.55
In other words, replace the first 100 with the average of 2 through 14, which is 8. Then replace the second 100 with the average of 4 through the new 8, which is 8.86, and so on.
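A minimal sketch of the sequential replacement described above. Since each new value feeds into the next rolling mean, a short loop seems hard to avoid; the shift(1), which makes the window cover the seven rows before the one being replaced, is an assumption inferred from the 8, 8.86 and 9.55 figures:
import pandas as pd

df = pd.DataFrame({'Sales': [2, 4, 6, 8, 10, 12, 14, 100, 100, 200]})

# Walk through the 3 largest values in index order; each replacement is used
# by the rolling mean of the next one, so the updates happen sequentially.
for idx in df['Sales'].nlargest(3).sort_index().index:
    # Mean of the 7 values preceding this row (shift(1) excludes the row itself).
    df.loc[idx, 'Sales'] = df['Sales'].rolling(window=7).mean().shift(1).loc[idx]

print(df['Sales'].round(2))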

counting T/F values for several conditions

I am a beginner using pandas.
I'm looking for mutations in several patients. I have 16 different conditions. I wrote code for it, but how can I do this with a for loop? I try to find the changes in the MUT column and set them as True and False, and then count the True/False values. I have only done 4 of them so far.
Can you suggest a simpler way, instead of writing the same code 16 times?
s1=df["MUT"]
A_T= s1.str.contains("A:T")
ATnum= A_T.value_counts(sort=True)
s2=df["MUT"]
A_G=s2.str.contains("A:G")
AGnum=A_G.value_counts(sort=True)
s3=df["MUT"]
A_C=s3.str.contains("A:C")
ACnum=A_C.value_counts(sort=True)
s4=df["MUT"]
A__=s4.str.contains("A:-")
A_num=A__.value_counts(sort=True)
I'm not an expert at using pandas, so I don't know if there's a cleaner way of doing this, but perhaps the following might work?
chars = 'TGC-'
nums = {}
for char in chars:
    s = df["MUT"]
    A = s.str.contains("A:" + char)
    num = A.value_counts(sort=True)
    nums[char] = num

ATnum = nums['T']
AGnum = nums['G']
# ...etc
Basically, go through each unique character (T, G, C, -), pull out the values that you need, and stick the numbers in a dictionary. Then, once the loop is finished, you can fetch whatever numbers you need back out of the dictionary.
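Extending that idea to all 16 conditions at once, a hedged sketch using a dictionary comprehension (the MUT values below are made up):
import pandas as pd

# Made-up MUT values; the real column comes from the patients' data.
df = pd.DataFrame({'MUT': ['A:T', 'A:G', 'A:T', 'G:C', 'A:-', 'T:-']})

# Count the rows containing each of the 16 "ref:alt" patterns in one pass.
counts = {f'{ref}:{alt}': int(df['MUT'].str.contains(f'{ref}:{alt}', regex=False).sum())
          for ref in 'ATGC'
          for alt in ['A', 'T', 'G', 'C', '-']
          if alt != ref}
print(counts)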
Just use value_counts; this will give you a count of all unique values in your column, so there is no need to create 16 variables:
In [5]:
df = pd.DataFrame({'MUT':np.random.randint(0,16,100)})
df['MUT'].value_counts()
Out[5]:
6 11
14 10
13 9
12 9
1 8
9 7
15 6
11 6
8 5
5 5
3 5
2 5
10 4
4 4
7 3
0 3
dtype: int64
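If each MUT entry embeds the substitution inside a longer string rather than being exactly one of the 16 patterns, a hedged variant is to extract the ref:alt token first and then let value_counts tally everything; the regex and the example strings below are assumptions about how the mutations are written:
import pandas as pd

# Made-up entries where the substitution sits inside a longer annotation string.
df = pd.DataFrame({'MUT': ['chr1 A:T snp', 'chr2 A:G snp', 'chr1 A:T snp', 'chr3 C:- del']})

# Pull out the "ref:alt" token, then count every distinct substitution in one call.
print(df['MUT'].str.extract(r'([ATGC]:[ATGC-])', expand=False).value_counts())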
