Have a dataframe restart its count for a particular column - python

The Overview:
In our project, we are working with a CSV file that contains some data. We will call it smal.csv. It is a bit of a chunky file that will later be used by some other algorithms. (Here is the gist in case the link to smal.csv is too badly formatted for your browser.)
The file is loaded like this:
filename = "smal.csv"
keyname = "someKeyname"
self.data[keyname] = spectral_data(pd.read_csv(filename, header=[0, 1], verbose=True))
The spectral_data class looks like this. As you can see, we do not actually keep the dataframe as-is.
class spectral_data(object):
    def __init__(self, df):
        try:
            uppercols = df.columns.levels[0]
            lowercols = list(df.columns.levels[1].values)
        except:
            df.columns = pd.MultiIndex.from_tuples(list(df.columns))
            uppercols = df.columns.levels[0]
            lowercols = list(df.columns.levels[1].values)
        for i, val in enumerate(lowercols):
            try:
                lowercols[i] = float(val)
            except:
                lowercols[i] = val
        levels = [uppercols, lowercols]
        df.columns.set_levels(levels, inplace=True)
        self.df = df
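(Aside: on newer pandas releases the inplace argument of set_levels has been removed, so on pandas 2.x the set_levels line would be written as a plain reassignment:)
df.columns = df.columns.set_levels(levels)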
After we've loaded it, we'd like to concatenate it with another set of data that was loaded the same way smal.csv was.
Our concatenation is done like this:
new_df = pd.concat([self.data[dataSet1].df, self.data[dataSet2].df], ignore_index=True)
However, ignore_index=True does not do what we want, because the running count we need to restart is stored in a regular column, not in the index. We also cannot simply remove that column; it is needed by other parts of our program.
The Objective:
I'm trying to concatenate a couple of data frames together; however, what I thought was the index is not actually the index of the data frame. Thus the command
pd.concat([df1.df, df2.df], ignore_index=True)
will not work. I thought maybe using iloc to change each individual cell would work, but that does not feel like the most intuitive way to approach this.
How can I take a data frame that looks like this
[396 rows x 6207 columns]
Unnamed: 0_level_0 meta ... wvl
Unnamed: 0_level_1 Sample ... 932.695 932.89
0 1 NaN ... -12.33 9.67
1 2 NaN ... 11.94 3.94
2 3 NaN ... -2.67 28.33
3 4 NaN ... 53.22 -13.78
4 1 NaN ... 43.28
5 2 NaN ... 41.33 47.33
6 3 NaN ... -21.94 12.06
7 4 NaN ... -30.94 -1.94
8 5 NaN ... -24.78 40.22
and turn it into this?
[396 rows x 6207 columns]
Unnamed: 0_level_0 meta ... wvl
Unnamed: 0_level_1 Sample ... 932.695 932.89
0 1 NaN ... -12.33 9.67
1 2 NaN ... 11.94 3.94
2 3 NaN ... -2.67 28.33
3 4 NaN ... 53.22 -13.78
4 5 NaN ... 43.28
5 6 NaN ... 41.33 47.33
6 7 NaN ... -21.94 12.06
7 8 NaN ... -30.94 -1.94
8 9 NaN ... -24.78 40.22
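One way to restart the count after concatenating — a minimal sketch, assuming the running count lives in the ('meta', 'Sample') column shown in the two-level header above:
import pandas as pd

# Stack the two frames; ignore_index=True rebuilds the positional index 0..n-1.
new_df = pd.concat([df1.df, df2.df], ignore_index=True)

# Overwrite the running count kept in the ('meta', 'Sample') column
# so it no longer restarts where the second frame began.
new_df[('meta', 'Sample')] = list(range(1, len(new_df) + 1))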

Related

Pandas dataframe merge row by addition

I want to create a dataframe from census data. I want to calculate the number of people that filed a tax return for each specific earnings group.
For now, I wrote this:
census_df = pd.read_csv('../zip code data/19zpallagi.csv')
sub_census_df = census_df[['zipcode', 'agi_stub', 'N02650', 'A02650', 'ELDERLY', 'A07180']].copy()
num_of_returns = ['Number_of_returns_1_25000', 'Number_of_returns_25000_50000', 'Number_of_returns_50000_75000',
                  'Number_of_returns_75000_100000', 'Number_of_returns_100000_200000', 'Number_of_returns_200000_more']
for i, column_name in zip(range(1, 7), num_of_returns):
    sub_census_df[column_name] = sub_census_df[sub_census_df['agi_stub'] == i]['N02650']
I have 6 groups attached to a specific zip code. I want to get one row per zip code, with the number of returns for that zip code appearing just once, each in its own column. I already tried changing NaNs to 0 and using groupby('zipcode').sum(), but the sum for zip code 0 comes out around 50 million, where it seems only around 800k should exist.
Here is the dataframe that I currently get:
zipcode agi_stub N02650 A02650 ELDERLY A07180 Number_of_returns_1_25000 Number_of_returns_25000_50000 Number_of_returns_50000_75000 Number_of_returns_75000_100000 Number_of_returns_100000_200000 Number_of_returns_200000_more Amount_1_25000 Amount_25000_50000 Amount_50000_75000 Amount_75000_100000 Amount_100000_200000 Amount_200000_more
0 0 1 778140.0 10311099.0 144610.0 2076.0 778140.0 NaN NaN NaN NaN NaN 10311099.0 NaN NaN NaN NaN NaN
1 0 2 525940.0 19145621.0 113810.0 17784.0 NaN 525940.0 NaN NaN NaN NaN NaN 19145621.0 NaN NaN NaN NaN
2 0 3 285700.0 17690402.0 82410.0 9521.0 NaN NaN 285700.0 NaN NaN NaN NaN NaN 17690402.0 NaN NaN NaN
3 0 4 179070.0 15670456.0 57970.0 8072.0 NaN NaN NaN 179070.0 NaN NaN NaN NaN NaN 15670456.0 NaN NaN
4 0 5 257010.0 35286228.0 85030.0 14872.0 NaN NaN NaN NaN 257010.0 NaN NaN NaN NaN NaN 35286228.0 NaN
And here is what I want to get:
zipcode Number_of_returns_1_25000 Number_of_returns_25000_50000 Number_of_returns_50000_75000 Number_of_returns_75000_100000 Number_of_returns_100000_200000 Number_of_returns_200000_more
0 0 778140.0 525940.0 285700.0 179070.0 257010.0 850.0
Here is one way to do it, using groupby and summing the desired columns:
num_of_returns = ['Number_of_returns_1_25000', 'Number_of_returns_25000_50000', 'Number_of_returns_50000_75000',
                  'Number_of_returns_75000_100000', 'Number_of_returns_100000_200000', 'Number_of_returns_200000_more']
df.groupby('zipcode', as_index=False)[num_of_returns].sum()
zipcode Number_of_returns_1_25000 Number_of_returns_25000_50000 Number_of_returns_50000_75000 Number_of_returns_75000_100000 Number_of_returns_100000_200000 Number_of_returns_200000_more
0 0 778140.0 525940.0 285700.0 179070.0 257010.0 0.0
This question needs more information to give a proper answer. For example, you leave out what is meant by certain columns in your data frame:
- `N1: Number of returns`
- `agi_stub: Size of adjusted gross income`
According to the IRS, agi_stub (size of adjusted gross income) has the following levels:
0 = 'No AGI Stub'
1 = 'Under $1'
2 = '$1 under $10,000'
3 = '$10,000 under $25,000'
4 = '$25,000 under $50,000'
5 = '$50,000 under $75,000'
6 = '$75,000 under $100,000'
7 = '$100,000 under $200,000'
8 = '$200,000 under $500,000'
9 = '$500,000 under $1,000,000'
10 = '$1,000,000 or more'
I got the above from https://www.irs.gov/pub/irs-soi/16incmdocguide.doc
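For reference, these levels could be attached to the data with a plain dict (a hypothetical sketch; agi_stub_label is a made-up column name):
AGI_STUB_LABELS = {
    0: 'No AGI Stub', 1: 'Under $1', 2: '$1 under $10,000',
    3: '$10,000 under $25,000', 4: '$25,000 under $50,000',
    5: '$50,000 under $75,000', 6: '$75,000 under $100,000',
    7: '$100,000 under $200,000', 8: '$200,000 under $500,000',
    9: '$500,000 under $1,000,000', 10: '$1,000,000 or more',
}
# Map each row's agi_stub code to its human-readable label.
data['agi_stub_label'] = data['agi_stub'].map(AGI_STUB_LABELS)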
With this information, I think what you want to find is the number of people who filed a tax return for each of the income levels of agi_stub. If that is what you mean, it can be achieved by:
import pandas as pd
data = pd.read_csv("./data/19zpallagi.csv")
## select only the desired columns
data = data[['zipcode', 'agi_stub', 'N1']]
## solution to your problem?
df = data.pivot_table(
    index='zipcode',
    values='N1',
    columns='agi_stub',
    aggfunc=['sum']
)
## bit of cleaning up.
PREFIX = 'agi_stub_level_'
df.columns = [PREFIX + level for level in df.columns.get_level_values(1).astype(str)]
Here's the output.
In [77]: df
Out[77]:
agi_stub_level_1 agi_stub_level_2 ... agi_stub_level_5 agi_stub_level_6
zipcode ...
0 50061850.0 37566510.0 ... 21938920.0 8859370.0
1001 2550.0 2230.0 ... 1420.0 230.0
1002 2850.0 1830.0 ... 1840.0 990.0
1005 650.0 570.0 ... 450.0 60.0
1007 1980.0 1530.0 ... 1830.0 460.0
... ... ... ... ... ...
99827 470.0 360.0 ... 170.0 40.0
99833 550.0 380.0 ... 290.0 80.0
99835 1250.0 1130.0 ... 730.0 190.0
99901 1960.0 1520.0 ... 1030.0 290.0
99999 868450.0 644160.0 ... 319880.0 142960.0
[27595 rows x 6 columns]

How to merge two Excel columns in Python using Colab

I'm working on a data-oriented project; we have some cancer measurements and want to classify them with the K-means algorithm.
Now I have two basic example datasets, each with two relevant columns, but K-means needs a single two-column input, so I decided to concatenate them into one table. How can I do it?
For example, the first dataset looks like this:
0 2713.9 566.42
1 2718.9 566.42
2 2723.3 566.25
3 2729.5 565.99
4 2735.9 565.83
and the second one looks like this:
0 6571.5 959.12
1 6571.6 959.13
2 6571.7 959.12
3 6571.7 959.16
4 6571.7 959.15
And I want something like this (without the row numbers, of course):
0 2713.9 566.42
1 2718.9 566.42
2 2723.3 566.25
3 2729.5 565.99
4 2735.9 565.83
0 6571.5 959.12
1 6571.6 959.13
2 6571.7 959.12
3 6571.7 959.16
4 6571.7 959.15
I tried with this:
X = ds1[ds1.columns[2:4]].append(ds2[ds2.columns[2:4]])
X
and got this:
0 2713.9 566.42 NaN NaN
1 2718.9 566.42 NaN NaN
2 2723.3 566.25 NaN NaN
3 2729.5 565.99 NaN NaN
4 2735.9 565.83 NaN NaN
... ... ... ... ...
44 NaN NaN 6571.8 959.01
45 NaN NaN 6571.7 959.00
46 NaN NaN 6571.7 958.98
47 NaN NaN 6571.5 959.00
48 NaN NaN 6571.4 959.01
I also got the same result with this code:
X = pd.concat([ds1[ds1.columns[2:4]], ds2[ds2.columns[2:4]]], axis=0, join='outer', ignore_index=False)
How can I do this? Is there any method for it, or do I have to transform the data in Excel?
Try via vstack():
import numpy as np
out = pd.DataFrame(np.vstack((ds1[ds1.columns[2:4]].values, ds2[ds2.columns[2:4]].values)))
OR
via concatenate():
out = pd.DataFrame(np.concatenate((ds1[ds1.columns[2:4]].values, ds2[ds2.columns[2:4]].values)))
OR
out = ds1[ds1.columns[2:4]].append(ds2[ds2.columns[2:4]]).T.agg(sorted, key=pd.isnull).dropna().T
OR
You can also rename the columns of one dataset so that both frames share the same column names, and then concat() or append() them; a sketch of that approach follows.
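A minimal sketch of the renaming approach (assuming, as in your snippets, that the relevant data sit in columns 2 and 3 of each frame; the names x and y are placeholders):
import pandas as pd

# Give the two relevant columns of each frame matching names.
left = ds1[ds1.columns[2:4]].set_axis(['x', 'y'], axis=1)
right = ds2[ds2.columns[2:4]].set_axis(['x', 'y'], axis=1)

# With identical column names, concat stacks the rows instead of
# producing NaN-filled side-by-side blocks.
X = pd.concat([left, right], ignore_index=True)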

Pandas: how to preserve all values of a dataframe in a csv?

I want to convert the HTML to CSV using pandas functions.
This is part of what I read into the dataframe df:
0 1
0 sequence 2
1 trainNo K805
2 trainNumber K805
3 departStation 鹰潭
4 departStationPy yingtan
5 arriveStation 南昌
6 arriveStationPy nanchang
7 departDate 2020-05-24
8 departTime 03:55
9 arriveDate 2020-05-24
10 arriveTime 05:44
11 isStartStation False
12 isEndStation False
13 runTime 1小时49分钟
14 preSaleTime NaN
15 takeDays 0
16 isBookable True
17 seatList seatNamepriceorderPriceinventoryisBookablebutt...
18 curSeatIndex 0
seatName price orderPrice inventory isBookable buttonDisplayName buttonType
0 硬座 23.5 23.5 99 True NaN 0
1 硬卧 69.5 69.5 99 True NaN 0
2 软卧 104.5 104.5 4 True NaN 0
0 1
0 departDate 2020-05-23
1 departStationList NaN
2 endStationList NaN
3 departStationFilterMap NaN
4 endStationFilterMap NaN
5 departCityName 上海
6 arriveCityName 南昌
7 gtMinPrice NaN
My code is like this:
for i, df in enumerate(pd.read_html(html, encoding='utf-8')):
    df.to_csv(".\other.csv", index=True, encoding='utf-8-sig')
To preserve the characters in the CSV, I need to use utf-8-sig encoding. But I don't know how to write the format symbol % to give each table its own file name.
,0,1
0,departDate,2020-05-23
1,departStationList,
2,endStationList,
3,departStationFilterMap,
4,endStationFilterMap,
5,departCityName,上海
6,arriveCityName,南昌
7,gtMinPrice,
This is what I got in the CSV file; only the last table is preserved.
The dataframes are correct, but the CSV needs correction. Can you show me how to produce the correct output?
You're saving each dataframe to the same file, so each one is overwritten until only the last remains.
Note the addition of the f-string to change the save file name, e.g. f".\other_{i}.csv".
Each dataframe is a different shape, so they won't all fit together properly in one file.
To CSV
for i, df in enumerate(pd.read_html(html, encoding='utf-8')):
    df.to_csv(f".\other_{i}.csv", index=True, encoding='utf-8-sig')
To Excel
with pd.ExcelWriter('output.xlsx', mode='w') as writer:
    for i, df in enumerate(pd.read_html(html, encoding='utf-8')):
        df.to_excel(writer, sheet_name=f'Sheet{i}')

Is there a possibility to use a bigger list in Python?

For school I have to make a project about wifi signals, and I am trying to put the data in a dataframe.
There are 208,000 rows of data.
When it comes to the code below, the code does not complete; it behaves as if it were stuck in an infinite loop.
But when I use only 1000 rows, my program works. So I think my lists are too small, if that is possible.
Do bigger lists exist in Python? Or is it because of bad coding on my part?
Thanks in advance.
Edit 1:
(data is the original dataframe and WifiInfo is a column of it.)
I have this format:
df = pd.DataFrame(columns=['Sender','Time','Date','Place','X','Y','Bezetting','SSID','BSSID','Signal'])
I am trying to fill SSID, BSSID and Signal from the column WifiInfo; to do this I have to split the data.
This is what one WifiInfo value looks like:
ODISEE#88-1d-fc-41-dc-50:-83,ODISEE#88-1d-fc-2c-c0-00:-72,ODISEE#88-1d-fc-41-d2-d0:-82,CiscoC5976#58-6d-8f-19-14-38:-78,CiscoC5959#58-6d-8f-19-13-f4:-93,SNB#c8-d7-19-6f-be-b7:-99,ODISEE#88-1d-fc-2c-c5-70:-94,HackingDemo#58-6d-8f-19-11-48:-156,ODISEE#88-1d-fc-30-d4-40:-85,ODISEE#88-1d-fc-41-ac-50:-100
My current approach looks like this:
for index, row in data.iterrows():
    bezettingList = list()
    ssidList = list()
    bssidList = list()
    signalList = list()

    # WifiInfo splitting
    wifis = row.WifiInfo.split(',')
    for wifi in wifis:
        # split wifi and add to lists
        ssid, bssid = wifi.split('#')
        bssid, signal = bssid.split(':')
        ssidList.append(ssid)
        bssidList.append(bssid)
        signalList.append(int(signal))

    # add bezettingen to list
    bezettingen = row.Bezetting.split(',')
    for bezetting in bezettingen:
        bezettingList.append(bezetting)

    # add lists to dataframe
    df.loc[index, 'SSID'] = ssidList
    df.loc[index, 'BSSID'] = bssidList
    df.loc[index, 'Signal'] = signalList
    df.loc[index, 'Bezetting'] = bezettingList
df.head()
IIUC, you need to first explode the row by commas so that this:
SSID BSSID Signal WifiInfo
0 NaN NaN NaN ODISEE#88-1d-fc-41-dc-50:-83,ODISEE#88- ...
becomes this:
SSID BSSID Signal WifiInfo
0 NaN NaN NaN ODISEE#88-1d-fc-41-dc-50:-83
1 NaN NaN NaN ODISEE#88-1d-fc-2c-c0-00:-72
2 NaN NaN NaN ODISEE#88-1d-fc-41-d2-d0:-82
3 NaN NaN NaN CiscoC5976#58-6d-8f-19-14-38:-78
4 NaN NaN NaN CiscoC5959#58-6d-8f-19-13-f4:-93
5 NaN NaN NaN SNB#c8-d7-19-6f-be-b7:-99
6 NaN NaN NaN ODISEE#88-1d-fc-2c-c5-70:-94
7 NaN NaN NaN HackingDemo#58-6d-8f-19-11-48:-156
8 NaN NaN NaN ODISEE#88-1d-fc-30-d4-40:-85
9 NaN NaN NaN ODISEE#88-1d-fc-41-ac-50:-100
# use `.explode`
data = data.assign(WifiInfo=data.WifiInfo.str.split(',')).explode('WifiInfo')
Now you could use .str.extract:
data['SSID'] = data['WifiInfo'].str.extract(r'(.*)#')
data['BSSID'] = data['WifiInfo'].str.extract(r'#(.*):')
data['Signal'] = data['WifiInfo'].str.extract(r':(.*)')
SSID BSSID Signal WifiInfo
0 ODISEE 88-1d-fc-41-dc-50 -83 ODISEE#88-1d-fc-41-dc-50:-83
1 ODISEE 88-1d-fc-2c-c0-00 -72 ODISEE#88-1d-fc-2c-c0-00:-72
2 ODISEE 88-1d-fc-41-d2-d0 -82 ODISEE#88-1d-fc-41-d2-d0:-82
3 CiscoC5976 58-6d-8f-19-14-38 -78 CiscoC5976#58-6d-8f-19-14-38:-78
4 CiscoC5959 58-6d-8f-19-13-f4 -93 CiscoC5959#58-6d-8f-19-13-f4:-93
5 SNB c8-d7-19-6f-be-b7 -99 SNB#c8-d7-19-6f-be-b7:-99
6 ODISEE 88-1d-fc-2c-c5-70 -94 ODISEE#88-1d-fc-2c-c5-70:-94
7 HackingDemo 58-6d-8f-19-11-48 -156 HackingDemo#58-6d-8f-19-11-48:-156
8 ODISEE 88-1d-fc-30-d4-40 -85 ODISEE#88-1d-fc-30-d4-40:-85
9 ODISEE 88-1d-fc-41-ac-50 -100 ODISEE#88-1d-fc-41-ac-50:-100
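(A single extract with named groups could also pull all three fields in one pass — a sketch:)
# One pass: everything before '#' is the SSID, between '#' and ':' the BSSID,
# and everything after ':' the signal strength.
data[['SSID', 'BSSID', 'Signal']] = data['WifiInfo'].str.extract(
    r'(?P<SSID>[^#]*)#(?P<BSSID>[^:]*):(?P<Signal>.*)'
)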
If you want to keep data grouped after column explosion, I'd assign an ID for each group of entries first:
data['Group'] = pd.factorize(data['WifiInfo'])[0]+1
SSID BSSID Signal WifiInfo Group
0 NaN NaN NaN ODISEE#88-1d-fc-41-dc-50:-83,ODISEE#88- ... 1
1 NaN NaN NaN ASD#22-1d-fc-41-dc-50:-83,QWERTY#88- ... 2
# after you explode the column
SSID BSSID Signal WifiInfo Group
ODISEE 88-1d-fc-41-dc-50 -83 ODISEE#88-1d-fc-41-dc-50:-83 1
ODISEE 88-1d-fc-2c-c0-00 -72 ODISEE#88-1d-fc-2c-c0-00:-72 1
...
...
ASD 22-1d-fc-41-dc-50 -83 ASD#88-1d-fc-41-dc-50:-83 2
QWERTY 88-1d-fc-2c-c0-00 -72 QWERTY#88-1d-fc-2c-c0-00:-72 2

Append records to bottom of dataframe

I have two dataframes like the ones sampled below. I'm trying to append the records from one dataframe to the bottom of the other, so the final data frame should have only two columns. Instead, I seem to be appending the columns of one dataframe onto the right side of the other. Does anyone see what I'm doing wrong?
Code:
appendDf = df1.append(df2)
df1
28343 \
0 42267
1 157180
2 186320
https://s.m.com/is/ime/M/ts/mized/5_fpx.tif
0 https://sl.com/is/i/M/...
1 https://sl.com/is/i/M/…
2 https://sl.com/is/im/M/...
df2
454 \
0 223
1 155
2 334
https://s.m.com/is/ime/M/ts/mized/5.tif
0 https://slret.com/is/i/M/...
1 https://slfdsd.com/is/i/M/…
2 https://slfd.com/is/im/M/...
appendDf.head()
28343 https://s.m.com/is/ime/M/ts/mized/5_fpx.tif 454 https://s.m.com/is/ime/M/ts/mized/5.tif
Your DataFrames do not seem to have column headers (I imagine the first row of your data is being used as the column headers), which is likely the root of your issue. When you append the second DataFrame, the program doesn't know which columns the data correspond to, so it adds them as new columns. See the following example:
import pandas as pd
df1 = pd.DataFrame([[28343, 'http://link1'], [42267, 'http://link2'],
                    [157180, 'http://link3'], [186320, 'http://link4']], columns=['ID', 'Link'])
df2 = pd.DataFrame([[454, 'http://link5'], [223, 'http://link6'],
                    [155, 'http://link7'], [334, 'http://link8']])
appendedDF = df1.append(df2)
Yields:
ID Link 0 1
0 28343.0 http://link1 NaN NaN
1 42267.0 http://link2 NaN NaN
2 157180.0 http://link3 NaN NaN
3 186320.0 http://link4 NaN NaN
0 NaN NaN 454.0 http://link5
1 NaN NaN 223.0 http://link6
2 NaN NaN 155.0 http://link7
3 NaN NaN 334.0 http://link8
Correct implementation:
import pandas as pd
df1 = pd.DataFrame([[28343, 'http://link1'], [42267, 'http://link2'],
                    [157180, 'http://link3'], [186320, 'http://link4']], columns=['ID', 'Link'])
df2 = pd.DataFrame([[454, 'http://link5'], [223, 'http://link6'],
                    [155, 'http://link7'], [334, 'http://link8']], columns=['ID', 'Link'])
appendedDF = df1.append(df2).reset_index(drop=True)
Yields:
ID Link
0 28343 http://link1
1 42267 http://link2
2 157180 http://link3
3 186320 http://link4
4 454 http://link5
5 223 http://link6
6 155 http://link7
7 334 http://link8
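Note that DataFrame.append is deprecated in recent pandas (and removed in 2.0); the same stacking can be written with pd.concat:
# Equivalent to df1.append(df2).reset_index(drop=True) on modern pandas.
appendedDF = pd.concat([df1, df2], ignore_index=True)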
