Sum up / group by and export to csv - python

Suppose I have the following dataframe called "df":
ID;Type
1;A
1;A
1;A
1;A
1;A
2;A
2;A
3;B
4;A
4;B
Now I want to sum up / group by this dataframe by ID and Type and export the result as a csv.
The result should look like:
ID;Type;Count
1;A;5
2;A;2
3;B;1
4;A;1
4;B;1
My code for summing up / group by is as follows:
import pandas as pd
frameexport=df.groupby(["ID", "Type"])["Type"].count()
print(frameexport)
ID Type
1 A 5
2 A 2
3 B 1
4 A 1
B 1
This is not exaclty what I hoped for. The last entry for ID 4 is not repeating the "4" again.
I try to export it to a csv:
frameexport.to_csv(r'C:\Path\file.csv', index=False, sep=";")
But it doesn't work, it just looks like this:
Type
5
2
1
1
1
Problem here is that I call it on an object and obviously to_csv doesn't work like this.
How can I get the desired result:
ID Type Count
1 A 5
2 A 2
3 B 1
4 A 1
4 B 1
exported to a csv?

There is MultiIndex Serie, so need remove index=False and for correct column name is used rename:
frameexport.rename('Count').to_csv(r'C:\Path\file.csv', sep=";")
Or create DataFrame by Series.reset_index, then default index is necessary omit, so index=False is correct:
frameexport.reset_index(name='Count').to_csv(r'C:\Path\file.csv', index=False, sep=";")

Related

Adding values in a cell in Pandas

I am trying to add values in cells of one column in Pandas Dataframe. The dataframe was created:
data = [['ID_123456', 'example=1(abc)'], ['ID_123457', 'example=1(def)'], ['ID_123458', 'example=1(try)'], ['ID_123459', 'example=1(try)'], ['ID_123460', 'example=1(try),2(test)'], ['ID_123461', 'example=1(try),2(test),9(yum)'], ['ID_123462', 'example=1(try)'], ['ID_123463', 'example=1(try),7(test)']]
df = pd.DataFrame(data, columns = ['ID', 'occ'])
display(df)
The table looks like this:
ID occ
ID_123456 example=1(abc)
ID_123457 example=1(def)
ID_123458 example=1(try)
ID_123459 example=1(test)
ID_123460 example=1(try),2(test)
ID_123461 example=1(try),2(test),9(yum)
ID_123462 example=1(test)
ID_123463 example=1(try),7(test)
The following link is related to it but I was unable to run the command on my dataframe.
Sum all integers in a PANDAS DataFrame "cell"
The command gives an error of "string index out of range".
The output should look like this:
ID occ count
ID_123456 example=1(abc) 1
ID_123457 example=1(def) 1
ID_123458 example=1(try) 1
ID_123459 example=1(test) 1
ID_123460 example=1(try),2(test) 3
ID_123461 example=1(try),2(test),9(yum) 12
ID_123462 example=1(test) 1
ID_123463 example=1(try),7(test) 8
If want sum all numbers on column occ use Series.str.extractall, convert to integers with sum:
df['count'] = df['occ'].str.extractall('(\d+)')[0].astype(int).sum(level=0)
print (df)
ID occ count
0 ID_123456 example=1(abc) 1
1 ID_123457 example=1(def) 1
2 ID_123458 example=1(try) 1
3 ID_123459 example=1(try) 1
4 ID_123460 example=1(try),2(test) 3
5 ID_123461 example=1(try),2(test),9(yum) 12
6 ID_123462 example=1(try) 1
7 ID_123463 example=1(try),7(test) 8

How to add a MultiIndex after loading csv data into a pandas dataframe?

I am trying to add additional index rows to an existing pandas dataframe after loading csv data into it.
So let's say I load my data like this:
columns = ['Relative_Pressure','Volume_STP']
df = pd.read_csv(StringIO(contents), skiprows=4, delim_whitespace=True,index_col=False,header=None)
df.columns = columns
where contents is a string in csv format. The resulting DataFrame might look something like this:
For clarity reasons I would now like to add additional index rows to the DataFrame as shown here:
However in the link these multiple index rows are generated right when the DataFrame is created. I would like to add e.g. rows for unit or descr to the columns.
How could I do this?
You can create a MultiIndex on the columns by specifically creating the index and then assigning it to the columns separately from reading in the data.
I'll use the example from the link you provided. The first method is to create the MultiIndex when you make the dataframe:
df = pd.DataFrame({('A',1,'desc A'):[1,2,3],('B',2,'desc B'):[4,5,6]})
df.columns.names=['NAME','LENGTH','DESCRIPTION']
df
NAME A B
LENGTH 1 2
DESCRIPTION desc A desc B
0 1 4
1 2 5
2 3 6
As stated, this is not what you are after. Instead, you can make the dataframe (from your file for example) and then make the MultiIndex from a set of lists and then assign it to the columns:
df = pd.DataFrame({'desc A':[1,2,3], 'desc B':[4,5,6]})
# Output
desc A desc B
0 1 4
1 2 5
2 3 6
# Create a multiindex from lists
index = pd.MultiIndex.from_arrays((['A', 'B'], [1, 2], ['desc A', 'desc B']))
# Assign to the columns
df.columns = index
# Output
A B
1 2
desc A desc B
0 1 4
1 2 5
2 3 6
# Name the columns
df.columns.names = ['NAME','LENGTH','DESCRIPTION']
# Output
NAME A B
LENGTH 1 2
DESCRIPTION desc A desc B
0 1 4
1 2 5
2 3 6
There are other ways to construct a MultiIndex, for example, from_tuples and from_product. You can read more about Multi Indexes in the documentation.

Pandas DataFrames: Extract Information and Collapse Columns

I have a pandas DataFrame which contains information in columns which I would like to extract into a new column.
It is best explained visually:
df = pd.DataFrame({'Number Type 1':[1,2,np.nan],
'Number Type 2':[np.nan,3,4],
'Info':list('abc')})
The Table shows the initial DataFrame with Number Type 1 and NumberType 2 columns.
I would like to extract the types and create a new Type column, refactoring the DataFrame accordingly.
basically, Numbers are collapsed into the Number columns, and the types extracted into the Type column. The information in the Info column is bound to the numbers (f.e. 2 and 3 have the same information b)
What is the best way to do this in Pandas?
Use melt with dropna:
df = df.melt('Info', value_name='Number', var_name='Type').dropna(subset=['Number'])
df['Type'] = df['Type'].str.extract('(\d+)')
df['Number'] = df['Number'].astype(int)
print (df)
Info Type Number
0 a 1 1
1 b 1 2
4 b 2 3
5 c 2 4
Another solution with set_index and stack:
df = df.set_index('Info').stack().rename_axis(('Info','Type')).reset_index(name='Number')
df['Type'] = df['Type'].str.extract('(\d+)')
df['Number'] = df['Number'].astype(int)
print (df)
Info Type Number
0 a 1 1
1 b 1 2
2 b 2 3
3 c 2 4

how to write the pivot_table to txt file by python

I have get the pivot_table as follows:
there are spaces in the table,
what i want to write to txt is:
how to get it ?
chaoshidishi=pd.pivot_table(clsc,index="故障发生地市",values="工单号",aggfunc=len)
chaoshidishi=chaoshidishi.to_frame()
f=open('E:\gaotie\dishi.txt','w')
for row in chaoshidishi:
f.write(row[0]+row[1])
f.close()
Following up on #shanmuga's comment, you should be able to use to_csv() without first using to_frame().
First, here's some sample data that seems to reflect your setup:
import pandas as pd
group = ['a','a','b','c','c']
value = [1,2,3,4,5]
df = pd.DataFrame({'group':group,'value':value})
print(df)
group value
0 a 1
1 a 2
2 b 3
3 c 4
4 c 5
Now apply pivot_table():
df.pivot_table(columns='group', values='value', aggfunc=len)
group
a 2
b 1
c 2
Name: value, dtype: int64
You can save to file directly from this output. If you don't want to preserve index and column names, use header=None on load:
(df.pivot_table(columns='group', values='value', aggfunc=len)
.to_csv('foo.txt'))
newdf = pd.read_csv('foo.txt', header=None)
print(newdf)
0 1
0 a 2
1 b 1
2 c 2
To preserve column and index names, use the header argument on save, and the index_col argument on load:
(df.pivot_table(columns='group', values='value', aggfunc=len)
.to_csv('foo.txt', header='group'))
newdf = pd.read_csv('foo.txt', index_col='group')
print(newdf)
value
group
a 2
b 1
c 2

How to read all rows of a csv file using pandas in python?

I am using the pandas module for reading the data from a .csv file.
I can write out the following code to extract the data belonging to an individual column as follows:
import pandas as pd
df = pd.read_csv('somefile.tsv', sep='\t', header=0)
some_column = df.column_name
print some_column # Gives the values of all entries in the column
However, the file that I am trying to read now has more than 5000 columns and writing out the statement
some_column = df.column_name
is now not feasible. How can I get all the column values so that I can access them using indexing?
e.g to extract the value present at the 100th row and the 50th column, I should be able to write something like this:
df([100][50])
Use DataFrame.iloc or DataFrame.iat, but python counts from 0, so need 99 and 49 for select 100. row and 50. column:
df = df.iloc[99,49]
Sample - select 3. row and 4. column:
df = pd.DataFrame({'A':[1,2,3],
'B':[4,5,6],
'C':[7,8,9],
'D':[1,3,10],
'E':[5,3,6],
'F':[7,4,3]})
print (df)
A B C D E F
0 1 4 7 1 5 7
1 2 5 8 3 3 4
2 3 6 9 10 6 3
print (df.iloc[2,3])
10
print (df.iat[2,3])
10
Combination for selecting by column name and position of row is possible by Series.iloc or Series.iat:
print (df['D'].iloc[2])
10
print (df['D'].iat[2])
10
Pandas has indexing for dataframes, so you can use
df.iloc[[index]]["column header"]
the index is in a list as you can pass multiple indexes at one in this way.

Categories

Resources