I am trying to read a JSON file, however I get a "trailing data" ValueError with
df = pd.read_json('pathto\json')
I learned that I have to use the lines=True argument, however this leads to a strange combination of a dataframe and JSON:
df = pd.read_json('pathto\json', lines=True)
print(df)
0 ... 6128
0 {'rt': 6014.89999999851, 'stimulus': 'Welcome ... ... None
1 {'rt': 1458.9000000003725, 'stimulus': 'Welcom... ... {'view_history': [{'page_index': 0, 'viewing_t...
2 {'rt': 5663.199999988079, 'stimulus': 'Welcome... ... None
3 {'rt': 2920.300000011921, 'stimulus': 'Welcome... ... None
[4 rows x 6129 columns]
Does somebody know how to solve this? I also tried manually removing linebreaks, but it leads to the same result.
Edit: Also, if I only read one line of the file, it loads properly (this works for every single line):
with open('pathto\json', 'r') as f:
    data = f.readlines()[3]
print(pd.read_json(data))
rt ... followsCheck
0 2920.3 ... NaN
1 90552.4 ... NaN
2 6501.3 ... NaN
3 77964.3 ... NaN
4 NaN ... NaN
... ... ...
6056 3990.6 ... NaN
6057 2323.0 ... NaN
6058 NaN ... NaN
6059 11882.9 ... NaN
6060 26112.4 ... NaN
[6061 rows x 40 columns]
Solved: the file contained multiple JSON objects.
with open('pathto\json', 'r') as f:
    data = f.readlines()
data = map(lambda x: pd.read_json(x), data)
df = pd.concat(data)
print(df)
rt ... followsCheck
0 6014.9 ... NaN
1 4832.0 ... NaN
2 4801.8 ... NaN
3 41796.2 ... NaN
4 NaN ... NaN
... ... ...
6056 3990.6 ... NaN
6057 2323.0 ... NaN
6058 NaN ... NaN
6059 11882.9 ... NaN
6060 26112.4 ... NaN
[24323 rows x 40 columns]
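One detail worth noting: each per-line dataframe keeps its own 0-based index, so the concatenated frame repeats index labels across chunks. A minimal variation that renumbers the rows, assuming each line of the file is an independent JSON document as above:
import pandas as pd
with open('pathto\json', 'r') as f:
    frames = [pd.read_json(line) for line in f]
# ignore_index=True gives the combined frame a fresh 0..n-1 row index
df = pd.concat(frames, ignore_index=True)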
Thank you, guys! =)
I want to create a dataframe from census data. I want to calculate the number of people that filed a tax return for each specific earnings group.
For now, I wrote this
census_df = pd.read_csv('../zip code data/19zpallagi.csv')
sub_census_df = census_df[['zipcode', 'agi_stub', 'N02650', 'A02650', 'ELDERLY', 'A07180']].copy()
num_of_returns = ['Number_of_returns_1_25000', 'Number_of_returns_25000_50000', 'Number_of_returns_50000_75000',
'Number_of_returns_75000_100000', 'Number_of_returns_100000_200000', 'Number_of_returns_200000_more']
for i, column_name in zip(range(1, 7), num_of_returns):
    sub_census_df[column_name] = sub_census_df[sub_census_df['agi_stub'] == i]['N02650']
I have 6 groups attached to each zip code. I want to get one row per zip code, with the number of returns for each group appearing just once as a column. I already tried changing NaNs to 0 and using groupby('zipcode').sum(), but I get about 50 million returns summed for zip code 0, where it seems there should only be around 800k.
Here is the dataframe that I currently get:
zipcode agi_stub N02650 A02650 ELDERLY A07180 Number_of_returns_1_25000 Number_of_returns_25000_50000 Number_of_returns_50000_75000 Number_of_returns_75000_100000 Number_of_returns_100000_200000 Number_of_returns_200000_more Amount_1_25000 Amount_25000_50000 Amount_50000_75000 Amount_75000_100000 Amount_100000_200000 Amount_200000_more
0 0 1 778140.0 10311099.0 144610.0 2076.0 778140.0 NaN NaN NaN NaN NaN 10311099.0 NaN NaN NaN NaN NaN
1 0 2 525940.0 19145621.0 113810.0 17784.0 NaN 525940.0 NaN NaN NaN NaN NaN 19145621.0 NaN NaN NaN NaN
2 0 3 285700.0 17690402.0 82410.0 9521.0 NaN NaN 285700.0 NaN NaN NaN NaN NaN 17690402.0 NaN NaN NaN
3 0 4 179070.0 15670456.0 57970.0 8072.0 NaN NaN NaN 179070.0 NaN NaN NaN NaN NaN 15670456.0 NaN NaN
4 0 5 257010.0 35286228.0 85030.0 14872.0 NaN NaN NaN NaN 257010.0 NaN NaN NaN NaN NaN 35286228.0 NaN
And here is what I want to get:
zipcode Number_of_returns_1_25000 Number_of_returns_25000_50000 Number_of_returns_50000_75000 Number_of_returns_75000_100000 Number_of_returns_100000_200000 Number_of_returns_200000_more
0 0 778140.0 525940.0 285700.0 179070.0 257010.0 850.0
Here is one way to do it, using groupby and summing the desired columns:
num_of_returns = ['Number_of_returns_1_25000', 'Number_of_returns_25000_50000', 'Number_of_returns_50000_75000',
'Number_of_returns_75000_100000', 'Number_of_returns_100000_200000', 'Number_of_returns_200000_more']
df.groupby('zipcode', as_index=False)[num_of_returns].sum()
zipcode Number_of_returns_1_25000 Number_of_returns_25000_50000 Number_of_returns_50000_75000 Number_of_returns_75000_100000 Number_of_returns_100000_200000 Number_of_returns_200000_more
0 0 778140.0 525940.0 285700.0 179070.0 257010.0 0.0
This question needs more information to give a proper answer. For example, you leave out what is meant by certain columns in your data frame:
- `N1: Number of returns`
- `agi_stub: Size of adjusted gross income`
According to the IRS, this has the following levels:
Size of adjusted gross income: "0 = No AGI Stub
1 = 'Under $1'
2 = '$1 under $10,000'
3 = '$10,000 under $25,000'
4 = '$25,000 under $50,000'
5 = '$50,000 under $75,000'
6 = '$75,000 under $100,000'
7 = '$100,000 under $200,000'
8 = '$200,000 under $500,000'
9 = '$500,000 under $1,000,000'
10 = '$1,000,000 or more'"
I got the above from https://www.irs.gov/pub/irs-soi/16incmdocguide.doc
With this information, I think what you want to find is the number of people who filed a tax return for each of the income levels of agi_stub.
If that is what you mean, then this can be achieved by:
import pandas as pd
data = pd.read_csv("./data/19zpallagi.csv")
## select only the desired columns
data = data[['zipcode', 'agi_stub', 'N1']]
## solution to your problem?
df = data.pivot_table(
    index='zipcode',
    values='N1',
    columns='agi_stub',
    aggfunc=['sum']
)
## bit of cleaning up.
PREFIX = 'agi_stub_level_'
df.columns = [PREFIX + level for level in df.columns.get_level_values(1).astype(str)]
Here's the output.
In [77]: df
Out[77]:
agi_stub_level_1 agi_stub_level_2 ... agi_stub_level_5 agi_stub_level_6
zipcode ...
0 50061850.0 37566510.0 ... 21938920.0 8859370.0
1001 2550.0 2230.0 ... 1420.0 230.0
1002 2850.0 1830.0 ... 1840.0 990.0
1005 650.0 570.0 ... 450.0 60.0
1007 1980.0 1530.0 ... 1830.0 460.0
... ... ... ... ... ...
99827 470.0 360.0 ... 170.0 40.0
99833 550.0 380.0 ... 290.0 80.0
99835 1250.0 1130.0 ... 730.0 190.0
99901 1960.0 1520.0 ... 1030.0 290.0
99999 868450.0 644160.0 ... 319880.0 142960.0
[27595 rows x 6 columns]
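If you want the columns to carry readable income brackets instead of numeric codes, you could rename them afterwards. A small sketch, assuming the six agi_stub levels in the 2019 zip-code file correspond to the brackets named in the question (that mapping is an assumption here, not taken from the IRS doc above):
# hypothetical mapping from agi_stub level to a readable bracket label
labels = {
    1: 'Number_of_returns_1_25000',
    2: 'Number_of_returns_25000_50000',
    3: 'Number_of_returns_50000_75000',
    4: 'Number_of_returns_75000_100000',
    5: 'Number_of_returns_100000_200000',
    6: 'Number_of_returns_200000_more',
}
df.columns = [labels[int(col.split('_')[-1])] for col in df.columns]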
;
;
ACHTUNG;Dies ist das Ergebnis einer Testversion. Alle Ergebnisse ohne Gewaehr.
;Bei Rueckfragen oder Unstimmigkeiten wenden Sie sich an aron.proebsting#mwtest.de;
;
;
;
PSD4_Status;|;
PSD5_Status;|;
mux;<-;PSD6_CAN;PSD6_Status;
PSD6_Status;|;
cycle_state;<-;PSD6_Status;PSD5_Status;PSD4_Status;
PsdEhr_out;<-;PsdEhr_ProcessMessageCycle();PSD6_CAN;PSD6_Status;PSD5_Status;cycle_state;PSD4_Status;
Entfernung_Abzweigung;<-;Aktuelle_Pos.inhibitTime;Aktuelle_Pos.id;Aktuelle_Pos.lane;Aktuelle_Pos.longitudinalError;Aktuelle_Pos.isLocationUnique;Aktuelle_Pos.length;Child_Segment.geometry.curvatureStart;Child_Segment.geometry.curvatureEnd;Child_Segment.geometry.branchAngle;Child_Segment.attributes.lanes;Child_Segment.attributes.streetClass;Child_Segment.attributes.ramp;Child_Segment.attributes.isMostProbablePath;Child_Segment.attributes.isStraightestPath;Child_Segment.attributes.isADASQuality;Child_Segment.attributes.isBuiltUpArea;Child_Segment.attributeIndex;Child_Segment.speedLimitIndex;Child_Segment.id;Child_Segment.parentId;Child_Segment.identity;Child_Segment.completeFlags;Child_Segment.childSegments[0];Child_Segment.childSegments[1];Child_Segment.childSegments[2];Child_Segment.childSegments[3];Child_Segment.childSegments[4];Get_Child_It.indexStart;Get_Child_It.indexCurrent;Get_Child_It.id;Aktuelles_Segment.geometry.curvatureStart;Aktuelles_Segment.geometry.curvatureEnd;Aktuelles_Segment.geometry.length;Aktuelles_Segment.geometry.branchAngle;Aktuelles_Segment.attributes.lanes;Aktuelles_Segment.attributes.streetClass;Aktuelles_Segment.attributes.ramp;Aktuelles_Segment.attributes.isMostProbablePath;Aktuelles_Segment.attributes.isStraightestPath;Aktuelles_Segment.attributes.isADASQuality;Aktuelles_Segment.attributes.isBuiltUpArea;Aktuelles_Segment.attributeIndex;Aktuelles_Segment.speedLimitIndex;Aktuelles_Segment.id;Aktuelles_Segment.parentId;Aktuelles_Segment.identity;Aktuelles_Segment.completeFlags;Aktuelles_Segment.childSegments[1];Aktuelles_Segment.childSegments[2];Aktuelles_Segment.childSegments[3];Aktuelles_Segment.childSegments[4];Child_from_Parent.geometry.curvatureStart;Child_from_Parent.geometry.curvatureEnd;Child_from_Parent.geometry.length;Child_from_Parent.geometry.branchAngle;Child_from_Parent.attributes.lanes;Child_from_Parent.attributes.ramp;Child_from_Parent.attributes.isMostProbablePath;Child_from_Parent.attributes.isStraightestPath;Child_from_Parent.attributes.isADASQuality;Child_from_Parent.attributes.isBuiltUpArea;Child_from_Parent.attributeIndex;Child_from_Parent.speedLimitIndex;Child_from_Parent.id;Child_from_Parent.parentId;Child_from_Parent.identity;Child_from_Parent.completeFlags;Child_from_Parent.childSegments[0];Child_from_Parent.childSegments[1];Child_from_Parent.childSegments[2];Child_from_Parent.childSegments[3];Child_from_Parent.childSegments[4];Min_Strassenklasse;Aktuelles_Segment.childSegments[0];Child_Segment.geometry.length;Child_from_Parent.attributes.streetClass;PsdEhr_ProcessMessageCycle();PSD6_CAN;PSD6_Status;PSD5_Status;cycle_state;PSD4_Status;
Steigung_gueltig;<-;Aktuelle_Pos.length;Aktuelle_Pos.inhibitTime;Aktuelle_Pos.lane;Aktuelle_Pos.longitudinalError;Aktuelle_Pos.isLocationUnique;Aktuelles_Segment.geometry.curvatureStart;Aktuelles_Segment.geometry.curvatureEnd;Aktuelles_Segment.geometry.branchAngle;Aktuelles_Segment.attributes.lanes;Aktuelles_Segment.attributes.streetClass;Aktuelles_Segment.attributes.ramp;Aktuelles_Segment.attributes.isMostProbablePath;Aktuelles_Segment.attributes.isStraightestPath;Aktuelles_Segment.attributes.isADASQuality;Aktuelles_Segment.attributes.isBuiltUpArea;Aktuelles_Segment.attributeIndex;Aktuelles_Segment.speedLimitIndex;Aktuelles_Segment.id;Aktuelles_Segment.parentId;Aktuelles_Segment.identity;Aktuelles_Segment.completeFlags;Aktuelles_Segment.childSegments[0];Aktuelles_Segment.childSegments[1];Aktuelles_Segment.childSegments[2];Aktuelles_Segment.childSegments[3];Aktuelles_Segment.childSegments[4];Aktuelle_Pos.id;Suchweite;Steigung_innerhalb_Suchweite.distance;Steigung_innerhalb_Suchweite.attribute.nextAttribute;Steigung_innerhalb_Suchweite.attribute.offset;Steigung_innerhalb_Suchweite.attribute.type;Steigung_innerhalb_Suchweite.segmentId;Steigung_innerhalb_Suchweite_It.searchDistance;Steigung_innerhalb_Suchweite_It.currentIndex;Steigung_innerhalb_Suchweite_It.currentDistance;Steigung_innerhalb_Suchweite_It.searchType;Steigung_innerhalb_Suchweite_It.searchDirection;Steigung_innerhalb_Suchweite_It.currentId;Steigung_innerhalb_Suchweite_It.currentOffset;Steigung_innerhalb_Suchweite.attribute.value;Aktuelles_Segment.geometry.length;PsdEhr_ProcessMessageCycle();PSD6_CAN;PSD6_Status;PSD5_Status;cycle_state;PSD4_Status;
Kruemmung_gueltig;<-;Aktuelle_Pos.length;Aktuelle_Pos.inhibitTime;Aktuelle_Pos.lane;Aktuelle_Pos.longitudinalError;Aktuelle_Pos.isLocationUnique;Aktuelles_Segment.geometry.curvatureStart;Aktuelles_Segment.geometry.curvatureEnd;Aktuelles_Segment.geometry.length;Aktuelles_Segment.geometry.branchAngle;Aktuelles_Segment.attributes.lanes;Aktuelles_Segment.attributes.streetClass;Aktuelles_Segment.attributes.ramp;Aktuelles_Segment.attributes.isMostProbablePath;Aktuelles_Segment.attributes.isStraightestPath;Aktuelles_Segment.attributes.isADASQuality;Aktuelles_Segment.attributes.isBuiltUpArea;Aktuelles_Segment.attributeIndex;Aktuelles_Segment.speedLimitIndex;Aktuelles_Segment.id;Aktuelles_Segment.parentId;Aktuelles_Segment.identity;Aktuelles_Segment.completeFlags;Aktuelles_Segment.childSegments[0];Aktuelles_Segment.childSegments[1];Aktuelles_Segment.childSegments[2];Aktuelles_Segment.childSegments[3];Aktuelles_Segment.childSegments[4];Aktuelle_Pos.id;PsdEhr_ProcessMessageCycle();PSD6_CAN;PSD6_Status;PSD5_Status;cycle_state;PSD4_Status;
BuiltUpArea;<-;Aktuelle_Pos.length;Aktuelle_Pos.inhibitTime;Aktuelle_Pos.lane;Aktuelle_Pos.longitudinalError;Aktuelle_Pos.isLocationUnique;Aktuelles_Segment.geometry.curvatureStart;Aktuelles_Segment.geometry.curvatureEnd;Aktuelles_Segment.geometry.length;Aktuelles_Segment.geometry.branchAngle;Aktuelles_Segment.attributes.lanes;Aktuelles_Segment.attributes.streetClass;Aktuelles_Segment.attributes.ramp;Aktuelles_Segment.attributes.isMostProbablePath;Aktuelles_Segment.attributes.isStraightestPath;Aktuelles_Segment.attributes.isADASQuality;Aktuelles_Segment.attributeIndex;Aktuelles_Segment.speedLimitIndex;Aktuelles_Segment.id;Aktuelles_Segment.parentId;Aktuelles_Segment.identity;Aktuelles_Segment.completeFlags;Aktuelles_Segment.childSegments[0];Aktuelles_Segment.childSegments[1];Aktuelles_Segment.childSegments[2];Aktuelles_Segment.childSegments[3];Aktuelles_Segment.childSegments[4];Aktuelles_Segment.attributes.isBuiltUpArea;Aktuelle_Pos.id;PsdEhr_ProcessMessageCycle();PSD6_CAN;PSD6_Status;PSD5_Status;cycle_state;PSD4_Status;
ADASQuality;<-;Aktuelle_Pos.length;Aktuelle_Pos.inhibitTime;Aktuelle_Pos.lane;Aktuelle_Pos.longitudinalError;Aktuelle_Pos.isLocationUnique;Aktuelles_Segment.geometry.curvatureStart;Aktuelles_Segment.geometry.curvatureEnd;Aktuelles_Segment.geometry.length;Aktuelles_Segment.geometry.branchAngle;Aktuelles_Segment.attributes.lanes;Aktuelles_Segment.attributes.streetClass;Aktuelles_Segment.attributes.ramp;Aktuelles_Segment.attributes.isMostProbablePath;Aktuelles_Segment.attributes.isStraightestPath;Aktuelles_Segment.attributes.isBuiltUpArea;Aktuelles_Segment.attributeIndex;Aktuelles_Segment.speedLimitIndex;Aktuelles_Segment.id;Aktuelles_Segment.parentId;Aktuelles_Segment.identity;Aktuelles_Segment.completeFlags;Aktuelles_Segment.childSegments[0];Aktuelles_Segment.childSegments[1];Aktuelles_Segment.childSegments[2];Aktuelles_Segment.childSegments[3];Aktuelles_Segment.childSegments[4];Aktuelles_Segment.attributes.isADASQuality;Aktuelle_Pos.id;PsdEhr_ProcessMessageCycle();PSD6_CAN;PSD6_Status;PSD5_Status;cycle_state;PSD4_Status;
NumberOfChilds;<-;Aktuelle_Pos.length;Aktuelle_Pos.inhibitTime;Aktuelle_Pos.lane;Aktuelle_Pos.longitudinalError;Aktuelle_Pos.isLocationUnique;Aktuelles_Segment.geometry.curvatureStart;Aktuelles_Segment.geometry.curvatureEnd;Aktuelles_Segment.geometry.length;Aktuelles_Segment.geometry.branchAngle;Aktuelles_Segment.attributes.lanes;Aktuelles_Segment.attributes.streetClass;Aktuelles_Segment.attributes.ramp;Aktuelles_Segment.attributes.isMostProbablePath;Aktuelles_Segment.attributes.isStraightestPath;Aktuelles_Segment.attributes.isADASQuality;Aktuelles_Segment.attributes.isBuiltUpArea;Aktuelles_Segment.attributeIndex;Aktuelles_Segment.speedLimitIndex;Aktuelles_Segment.id;Aktuelles_Segment.parentId;Aktuelles_Segment.identity;Aktuelles_Segment.completeFlags;Aktuelles_Segment.childSegments[1];Aktuelles_Segment.childSegments[2];Aktuelles_Segment.childSegments[3];Aktuelles_Segment.childSegments[4];Child_from_Parent.geometry.curvatureStart;Child_from_Parent.geometry.curvatureEnd;Child_from_Parent.geometry.length;Child_from_Parent.geometry.branchAngle;Child_from_Parent.attributes.lanes;Child_from_Parent.attributes.ramp;Child_from_Parent.attributes.isMostProbablePath;Child_from_Parent.attributes.isStraightestPath;Child_from_Parent.attributes.isADASQuality;Child_from_Parent.attributes.isBuiltUpArea;Child_from_Parent.attributeIndex;Child_from_Parent.speedLimitIndex;Child_from_Parent.id;Child_from_Parent.parentId;Child_from_Parent.identity;Child_from_Parent.completeFlags;Child_from_Parent.childSegments[0];Child_from_Parent.childSegments[1];Child_from_Parent.childSegments[2];Child_from_Parent.childSegments[3];Child_from_Parent.childSegments[4];Min_Strassenklasse;Aktuelles_Segment.childSegments[0];Child_Segment.geometry.curvatureStart;Child_Segment.geometry.curvatureEnd;Child_Segment.geometry.length;Child_Segment.geometry.branchAngle;Child_Segment.attributes.lanes;Child_Segment.attributes.streetClass;Child_Segment.attributes.ramp;Child_Segment.attributes.isMostProbablePath;Child_Segment.attributes.isStraightestPath;Child_Segment.attributes.isADASQuality;Child_Segment.attributes.isBuiltUpArea;Child_Segment.attributeIndex;Child_Segment.speedLimitIndex;Child_Segment.parentId;Child_Segment.identity;Child_Segment.completeFlags;Child_Segment.childSegments[0];Child_Segment.childSegments[1];Child_Segment.childSegments[2];Child_Segment.childSegments[3];Child_Segment.childSegments[4];Get_Child_It.indexStart;Get_Child_It.indexCurrent;Get_Child_It.id;Child_Segment.id;Child_from_Parent.attributes.streetClass;Aktuelle_Pos.id;PsdEhr_ProcessMessageCycle();PSD6_CAN;PSD6_Status;PSD5_Status;cycle_state;PSD4_Status;
This is how my csv file currently looks. I want to split it into columns, like Excel's Text to Columns with the ; separator. I cannot do it in Excel because I want to automate this process, as there are many files like this. I am new to Python, so I am not sure how to proceed. Some suggestions would be really helpful.
Do you have to keep every row in your csv file? This will be a slight problem, because you do not have enough delimiters per row to account for each column. This code will open your file, check how many delimiters each row needs, add the appropriate number of delimiters, save the new csv file with those delimiters, then open the new csv file using pandas read_csv:
import pandas as pd
import numpy as np
path = "Text.csv"
text = [f for f in open(path)]
# Find the maximum number of delimiters (;) in any given row
numDelims = []
for line in text:
    count = line.count(';')
    numDelims.append(count)
maxDelims = np.max(numDelims)
# Add the missing number of delimiters to each row to account for the columns
for x in range(len(text)):
    text[x] = text[x].replace("\n", ";"*(maxDelims-numDelims[x])+"\n")
# Save the new csv file with all the additional delimiters
newFile = "Save.csv"
# Save to a new text file
with open(newFile, "w+") as file:
    file.writelines(text)
# Read the file back in as a pandas dataframe
df = pd.read_csv("Save.csv", sep=";")
df
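As a side note, pandas can read ragged rows directly if you hand it explicit column names, which avoids rewriting the file. A minimal sketch, assuming no row has more than 85 fields (count the delimiters first if unsure):
import pandas as pd
# supplying names= pads short rows with NaN instead of raising a tokenizing error
df = pd.read_csv("Text.csv", sep=";", header=None, names=range(85))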
You can try to read your file using pandas (pandas.read_csv), such as:
import pandas as pd
pd.read_csv('pathofyourfile', sep=';')
You can use read_csv with the sep and skiprows parameters:
import pandas as pd
df = pd.read_csv('test.csv', sep=';', skiprows=13)
print(df)
Output:
Entfernung_Abzweigung <- Aktuelle_Pos.inhibitTime \
0 Steigung_gueltig <- Aktuelle_Pos.length
1 Kruemmung_gueltig <- Aktuelle_Pos.length
2 BuiltUpArea <- Aktuelle_Pos.length
3 ADASQuality <- Aktuelle_Pos.length
4 NumberOfChilds <- Aktuelle_Pos.length
Aktuelle_Pos.id Aktuelle_Pos.lane \
0 Aktuelle_Pos.inhibitTime Aktuelle_Pos.lane
1 Aktuelle_Pos.inhibitTime Aktuelle_Pos.lane
2 Aktuelle_Pos.inhibitTime Aktuelle_Pos.lane
3 Aktuelle_Pos.inhibitTime Aktuelle_Pos.lane
4 Aktuelle_Pos.inhibitTime Aktuelle_Pos.lane
Aktuelle_Pos.longitudinalError Aktuelle_Pos.isLocationUnique \
0 Aktuelle_Pos.longitudinalError Aktuelle_Pos.isLocationUnique
1 Aktuelle_Pos.longitudinalError Aktuelle_Pos.isLocationUnique
2 Aktuelle_Pos.longitudinalError Aktuelle_Pos.isLocationUnique
3 Aktuelle_Pos.longitudinalError Aktuelle_Pos.isLocationUnique
4 Aktuelle_Pos.longitudinalError Aktuelle_Pos.isLocationUnique
Aktuelle_Pos.length \
0 Aktuelles_Segment.geometry.curvatureStart
1 Aktuelles_Segment.geometry.curvatureStart
2 Aktuelles_Segment.geometry.curvatureStart
3 Aktuelles_Segment.geometry.curvatureStart
4 Aktuelles_Segment.geometry.curvatureStart
Child_Segment.geometry.curvatureStart \
0 Aktuelles_Segment.geometry.curvatureEnd
1 Aktuelles_Segment.geometry.curvatureEnd
2 Aktuelles_Segment.geometry.curvatureEnd
3 Aktuelles_Segment.geometry.curvatureEnd
4 Aktuelles_Segment.geometry.curvatureEnd
Child_Segment.geometry.curvatureEnd ... \
0 Aktuelles_Segment.geometry.branchAngle ...
1 Aktuelles_Segment.geometry.length ...
2 Aktuelles_Segment.geometry.length ...
3 Aktuelles_Segment.geometry.length ...
4 Aktuelles_Segment.geometry.length ...
Aktuelles_Segment.childSegments[0] \
0 NaN
1 NaN
2 NaN
3 NaN
4 Child_Segment.id
Child_Segment.geometry.length \
0 NaN
1 NaN
2 NaN
3 NaN
4 Child_from_Parent.attributes.streetClass
Child_from_Parent.attributes.streetClass PsdEhr_ProcessMessageCycle() \
0 NaN NaN
1 NaN NaN
2 NaN NaN
3 NaN NaN
4 Aktuelle_Pos.id PsdEhr_ProcessMessageCycle()
PSD6_CAN PSD6_Status PSD5_Status cycle_state PSD4_Status Unnamed: 84
0 NaN NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN NaN
3 NaN NaN NaN NaN NaN NaN
4 PSD6_CAN PSD6_Status PSD5_Status cycle_state PSD4_Status NaN
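One caveat: with skiprows=13 the Entfernung_Abzweigung line is consumed as the header row, even though it is really data (hence the Unnamed: 84 column above, implying 85 fields). If every line after the preamble should be kept as data, a variation to try (a sketch under that same 85-field assumption):
import pandas as pd
# header=None keeps the first data line as a row; names= gives the 85 columns explicit labels
df = pd.read_csv('test.csv', sep=';', skiprows=13, header=None, names=range(85))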
So I have to perform a rolling window on a set of rows inside a dataframe. The problem is that when I do full_df = full_df.rolling(window=5).mean(), the output of full_df.head(2000) shows all NaN values. Does anyone know why this happens? I have to perform a time series exercise with this.
This is the dataset: https://github.com/plotly/datasets/blob/master/all_stocks_5yr.csv
This is what I have:
df = pd.read_csv('all_stocks_5yr.csv', usecols=["date", "close", "Name"])
gp = df.groupby("Name")
my_dict = {key: group['close'].to_numpy() for key, group in gp}
full_df = pd.DataFrame.from_dict(my_dict, orient='index')
for i in full_df:
    full_df = full_df.rolling(window=5).mean()
(The output, an image in the original post, showed a dataframe filled entirely with NaN values.)
First off, your loop for i in full_df is not doing what you think; instead of running the rolling mean along each row, you're running it over and over again on the whole dataframe, averaging down each column (across tickers).
If we just do the rolling average once, the way you're implementing it:
full_df = full_df.rolling(window=5).mean()
print(full_df)
0 1 2 3 ... 1255 1256 1257 1258
A NaN NaN NaN NaN ... NaN NaN NaN NaN
AAL NaN NaN NaN NaN ... NaN NaN NaN NaN
AAP NaN NaN NaN NaN ... NaN NaN NaN NaN
AAPL NaN NaN NaN NaN ... NaN NaN NaN NaN
ABBV 48.56684 48.37228 47.95056 48.07312 ... 102.590 98.768 101.212 100.510
... ... ... ... ... ... ... ... ... ...
XYL 45.58400 45.60000 45.74000 45.96200 ... 64.504 61.854 61.596 61.036
YUM 51.14200 51.01800 51.17400 51.28400 ... 66.902 64.420 63.914 63.668
ZBH 48.59000 48.49200 48.57000 48.75000 ... 75.154 73.112 72.704 72.436
ZION 44.84400 44.76600 44.89400 45.08200 ... 73.972 71.734 71.516 71.580
ZTS 45.08600 45.02600 45.27400 45.39200 ... 83.002 80.224 80.000 80.116
[505 rows x 1259 columns]
The first four rows are all NaN because the rolling mean isn't defined for fewer than 5 rows.
If we do it again (making a total of two times):
full_df = full_df.rolling(window=5).mean()
print(full_df.head(9))
0 1 2 ... 1256 1257 1258
A NaN NaN NaN ... NaN NaN NaN
AAL NaN NaN NaN ... NaN NaN NaN
AAP NaN NaN NaN ... NaN NaN NaN
AAPL NaN NaN NaN ... NaN NaN NaN
ABBV NaN NaN NaN ... NaN NaN NaN
ABC NaN NaN NaN ... NaN NaN NaN
ABT NaN NaN NaN ... NaN NaN NaN
ACN NaN NaN NaN ... NaN NaN NaN
ADBE 49.619072 49.471424 49.192048 ... 108.3420 110.4848 110.4976
You can see the first 8 rows are all NaN, since a window ending at the eighth row reaches back into the NaNs that the first pass left in the first four rows. Given the size of your data frame (505 rows), if you ran the rolling mean 127 times the entire df would be consumed with NaN values, and your for loop runs it even more times than that (once per column, i.e. 1259 times), which is why your df is filled with NaN values.
Also, note that you're averaging across different stock tickers, which doesn't make sense. What I believe you want to be doing is averaging along the rows, not down the columns, in which case you simply need to do:
full_df = full_df.rolling(axis = 'columns', window=5).mean()
print(full_df)
0 1 2 3 4 5 ... 1253 1254 1255 1256 1257 1258
A NaN NaN NaN NaN 44.72600 44.1600 ... 73.926 73.720 73.006 71.744 70.836 69.762
AAL NaN NaN NaN NaN 14.42600 14.3760 ... 53.142 53.308 53.114 52.530 52.248 51.664
AAP NaN NaN NaN NaN 78.74000 78.7600 ... 120.742 120.016 118.074 115.468 114.054 112.642
AAPL NaN NaN NaN NaN 67.32592 66.9025 ... 168.996 168.330 166.128 163.834 163.046 161.468
ABBV NaN NaN NaN NaN 35.87200 36.1380 ... 116.384 117.992 116.384 113.824 112.888 113.168
... ... ... ... ... ... ... ... ... ... ... ... ... ...
XYL NaN NaN NaN NaN 27.84600 28.0840 ... 73.278 73.598 73.848 73.698 73.350 73.256
YUM NaN NaN NaN NaN 64.58000 64.3180 ... 85.504 85.168 84.454 83.118 82.316 81.424
ZBH NaN NaN NaN NaN 75.85600 75.8660 ... 126.284 126.974 126.886 126.044 125.316 124.048
ZION NaN NaN NaN NaN 24.44200 24.4820 ... 53.838 54.230 54.256 53.748 53.466 53.464
ZTS NaN NaN NaN NaN 33.37400 33.5600 ... 78.720 78.434 77.772 76.702 75.686 75.112
Again, the first four columns are still NaN here.
To correct for that, we add one more term:
full_df = full_df.rolling(axis = 'columns', window=5, min_periods = 1).mean()
print(full_df)
0 1 2 3 4 5 ... 1253 1254 1255 1256 1257 1258
A 45.0800 44.8400 44.766667 44.7625 44.72600 44.1600 ... 73.926 73.720 73.006 71.744 70.836 69.762
AAL 14.7500 14.6050 14.493333 14.5350 14.42600 14.3760 ... 53.142 53.308 53.114 52.530 52.248 51.664
AAP 78.9000 78.6450 78.630000 78.7150 78.74000 78.7600 ... 120.742 120.016 118.074 115.468 114.054 112.642
AAPL 67.8542 68.2078 67.752800 67.4935 67.32592 66.9025 ... 168.996 168.330 166.128 163.834 163.046 161.468
ABBV 36.2500 36.0500 35.840000 35.6975 35.87200 36.1380 ... 116.384 117.992 116.384 113.824 112.888 113.168
... ... ... ... ... ... ... ... ... ... ... ... ... ...
XYL 27.0900 27.2750 27.500000 27.6900 27.84600 28.0840 ... 73.278 73.598 73.848 73.698 73.350 73.256
YUM 65.3000 64.9250 64.866667 64.7525 64.58000 64.3180 ... 85.504 85.168 84.454 83.118 82.316 81.424
ZBH 75.8500 75.7500 75.646667 75.7350 75.85600 75.8660 ... 126.284 126.974 126.886 126.044 125.316 124.048
ZION 24.1400 24.1750 24.280000 24.3950 24.44200 24.4820 ... 53.838 54.230 54.256 53.748 53.466 53.464
ZTS 33.0500 33.1550 33.350000 33.4000 33.37400 33.5600 ... 78.720 78.434 77.772 76.702 75.686 75.112
In the above data frame the first column is just the value at time 0, the second is the average of times 0 and 1, the third is the average of times 0, 1, and 2, etc. The window keeps growing until it reaches window=5, at which point it moves along with your rolling average. Note that you can also center the rolling mean rather than use a trailing window; see the pandas rolling documentation.
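For instance, a centered variant of the same call might look like this (a sketch; axis='columns' follows the snippet above, though newer pandas versions deprecate it in favour of transposing the frame):
# center=True aligns each mean with the middle of its 5-value window
full_df = full_df.rolling(axis='columns', window=5, min_periods=1, center=True).mean()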
I'm not quite sure what you are trying to do. Could you explain in more detail what the goal of your operation is? I assume you are trying to build a moving (rolling) average with a 5-day interval across each asset and calculate the mean prices for each interval.
But first, let me answer why you see all the NaNs:
What you are doing with the code below is just repeating the same operation over and over again, and its result is always NaNs. That is because of the way the dict reshapes the data: the first rows all have NaNs, so their average will also be NaN. And since you overwrite the variable full_df with the result of this computation, your dataframe shows only NaNs.
for i in full_df:
    full_df = full_df.rolling(window=5).mean()
Let me explain in more detail. You were (probably) trying to iterate over the dataframe (using a window of 5 days) and compute the mean. The function full_df.rolling(window=5).mean() already does exactly that, and the output is a new dataframe with the mean of each window over the entire dataframe full_df. By running this function in a loop, without additional indexing, you are only running the same function across the entire dataframe over and over again.
Maybe this will get you what you want:
import pandas as pd
df = pd.read_csv("all_stocks_5yr.csv", index_col=[0,6])
means = df.rolling(window=5).mean()
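If the goal really is a 5-day moving average per asset, a per-ticker version could look like this (a sketch, assuming the Name/close column layout of the linked CSV):
import pandas as pd
df = pd.read_csv("all_stocks_5yr.csv")
# compute the rolling mean of the close price within each ticker separately
df["close_5d_mean"] = df.groupby("Name")["close"].transform(lambda s: s.rolling(window=5).mean())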
For school I have to make a project about wifi signals and I am trying to put the data in a dataframe.
There are 208,000 rows of data.
And when it comes to the code below, the code does not complete. It behaves as if it is stuck in an infinite loop.
But when I use only 1000 rows my program works. So I think that my lists are too small, if that is possible.
Do bigger lists exist in Python? Or is it because I use bad coding?
Thanks in advance.
Edit 1:
(data is the original dataframe and WifiInfo is a column of that)
I have this format:
df = pd.DataFrame(columns=['Sender','Time','Date','Place','X','Y','Bezetting','SSID','BSSID','Signal'])
And I am trying to fill SSID, BSSID and Signal from the column WifiInfo; for this I have to split the data.
This is what one WifiInfo value looks like:
ODISEE#88-1d-fc-41-dc-50:-83,ODISEE#88-1d-fc-2c-c0-00:-72,ODISEE#88-1d-fc-41-d2-d0:-82,CiscoC5976#58-6d-8f-19-14-38:-78,CiscoC5959#58-6d-8f-19-13-f4:-93,SNB#c8-d7-19-6f-be-b7:-99,ODISEE#88-1d-fc-2c-c5-70:-94,HackingDemo#58-6d-8f-19-11-48:-156,ODISEE#88-1d-fc-30-d4-40:-85,ODISEE#88-1d-fc-41-ac-50:-100
My current approach looks like:
for index, row in data.iterrows():
    bezettingList = list()
    ssidList = list()
    bssidList = list()
    signalList = list()
    #WifiInfo splitting
    wifis = row.WifiInfo.split(',')
    for wifi in wifis:
        #split wifi and add to List
        ssid, bssid = wifi.split('#')
        bssid, signal = bssid.split(':')
        ssidList.append(ssid)
        bssidList.append(bssid)
        signalList.append(int(signal))
    #add bezettingen to List
    bezettingen = row.Bezetting.split(',')
    for bezetting in bezettingen:
        bezettingList.append(bezetting)
    #add list to dataframe
    df.loc[index,'SSID'] = ssidList
    df.loc[index,'BSSID'] = bssidList
    df.loc[index,'Signal'] = signalList
    df.loc[index,'Bezetting'] = bezettingList
df.head()
IIUC, you need to first explode the row by commas so that this:
SSID BSSID Signal WifiInfo
0 NaN NaN NaN ODISEE#88-1d-fc-41-dc-50:-83,ODISEE#88- ...
becomes this:
SSID BSSID Signal WifiInfo
0 NaN NaN NaN ODISEE#88-1d-fc-41-dc-50:-83
1 NaN NaN NaN ODISEE#88-1d-fc-2c-c0-00:-72
2 NaN NaN NaN ODISEE#88-1d-fc-41-d2-d0:-82
3 NaN NaN NaN CiscoC5976#58-6d-8f-19-14-38:-78
4 NaN NaN NaN CiscoC5959#58-6d-8f-19-13-f4:-93
5 NaN NaN NaN SNB#c8-d7-19-6f-be-b7:-99
6 NaN NaN NaN ODISEE#88-1d-fc-2c-c5-70:-94
7 NaN NaN NaN HackingDemo#58-6d-8f-19-11-48:-156
8 NaN NaN NaN ODISEE#88-1d-fc-30-d4-40:-85
9 NaN NaN NaN ODISEE#88-1d-fc-41-ac-50:-100
# use `.explode`
data = data.assign(WifiInfo=data.WifiInfo.str.split(',')).explode('WifiInfo')
Now you could use .str.extract:
data['SSID'] = data['WifiInfo'].str.extract(r'(.*)#')
data['BSSID'] = data['WifiInfo'].str.extract(r'#(.*):')
data['Signal'] = data['WifiInfo'].str.extract(r':(.*)')
SSID BSSID Signal WifiInfo
0 ODISEE 88-1d-fc-41-dc-50 -83 ODISEE#88-1d-fc-41-dc-50:-83
1 ODISEE 88-1d-fc-2c-c0-00 -72 ODISEE#88-1d-fc-2c-c0-00:-72
2 ODISEE 88-1d-fc-41-d2-d0 -82 ODISEE#88-1d-fc-41-d2-d0:-82
3 CiscoC5976 58-6d-8f-19-14-38 -78 CiscoC5976#58-6d-8f-19-14-38:-78
4 CiscoC5959 58-6d-8f-19-13-f4 -93 CiscoC5959#58-6d-8f-19-13-f4:-93
5 SNB c8-d7-19-6f-be-b7 -99 SNB#c8-d7-19-6f-be-b7:-99
6 ODISEE 88-1d-fc-2c-c5-70 -94 ODISEE#88-1d-fc-2c-c5-70:-94
7 HackingDemo 58-6d-8f-19-11-48 -156 HackingDemo#58-6d-8f-19-11-48:-156
8 ODISEE 88-1d-fc-30-d4-40 -85 ODISEE#88-1d-fc-30-d4-40:-85
9 ODISEE 88-1d-fc-41-ac-50 -100 ODISEE#88-1d-fc-41-ac-50:-100
If you want to keep data grouped after column explosion, I'd assign an ID for each group of entries first:
data['Group'] = pd.factorize(data['WifiInfo'])[0]+1
SSID BSSID Signal WifiInfo Group
0 NaN NaN NaN ODISEE#88-1d-fc-41-dc-50:-83,ODISEE#88- ... 1
1 NaN NaN NaN ASD#22-1d-fc-41-dc-50:-83,QWERTY#88- ... 2
# after you explode the column
SSID BSSID Signal WifiInfo Group
ODISEE 88-1d-fc-41-dc-50 -83 ODISEE#88-1d-fc-41-dc-50:-83 1
ODISEE 88-1d-fc-2c-c0-00 -72 ODISEE#88-1d-fc-2c-c0-00:-72 1
...
...
ASD 22-1d-fc-41-dc-50 -83 ASD#88-1d-fc-41-dc-50:-83 2
QWERTY 88-1d-fc-2c-c0-00 -72 QWERTY#88-1d-fc-2c-c0-00:-72 2
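As a variation, the three extracts can be collapsed into a single pass with named groups, casting the signal to an integer as well (a sketch, assuming every entry matches the SSID#BSSID:Signal pattern):
parts = data['WifiInfo'].str.extract(r'(?P<SSID>[^#]+)#(?P<BSSID>[^:]+):(?P<Signal>-?\d+)')
parts['Signal'] = parts['Signal'].astype(int)
data[['SSID', 'BSSID', 'Signal']] = parts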
The Overview:
In our project, we are working with a CSV file that contains some data. We will call it smal.csv. It is a bit of a chunky file that will later be used for some other algorithms. (Here is the gist in case the link to smal.csv is too badly formatted for your browser.)
The file will be loaded like this:
filename = "smal.csv"
keyname = "someKeyname"
self.data[keyname] = spectral_data(pd.read_csv(filename, header=[0, 1], verbose=True))
The spectral_data class looks like this. As you can see, we do not actually keep the dataframe as is.
class spectral_data(object):
    def __init__(self, df):
        try:
            uppercols = df.columns.levels[0]
            lowercols = list(df.columns.levels[1].values)
        except:
            df.columns = pd.MultiIndex.from_tuples(list(df.columns))
            uppercols = df.columns.levels[0]
            lowercols = list(df.columns.levels[1].values)
        for i, val in enumerate(lowercols):
            try:
                lowercols[i] = float(val)
            except:
                lowercols[i] = val
        levels = [uppercols, lowercols]
        df.columns.set_levels(levels, inplace=True)
        self.df = df
After we've loaded it we'd like to concatenate it with another set of data, also loaded like smal.csv was.
Our concatenation is done like this.
new_df = pd.concat([self.data[dataSet1].df, self.data[dataSet2].df], ignore_index=True)
However, ignore_index=True does not help, because the running sample number we want renumbered is an ordinary column, not the index. We cannot simply remove that column either; it is necessary for other parts of our program.
The Objective:
I'm trying to concatenate a couple of data frames together; however, what I thought was the index is not actually the index of the data frame. Thus the command
pd.concat([df1.df, df2.df], ignore_index=True)
will not work. I thought maybe using iloc to change each individual cell would work, but I feel like this is not the most intuitive way to approach this.
How can I get a data frame that looks like this
[396 rows x 6207 columns]
Unnamed: 0_level_0 meta ... wvl
Unnamed: 0_level_1 Sample ... 932.695 932.89
0 1 NaN ... -12.33 9.67
1 2 NaN ... 11.94 3.94
2 3 NaN ... -2.67 28.33
3 4 NaN ... 53.22 -13.78
4 1 NaN ... 43.28
5 2 NaN ... 41.33 47.33
6 3 NaN ... -21.94 12.06
7 4 NaN ... -30.94 -1.94
8 5 NaN ... -24.78 40.22
Turn into this.
[396 rows x 6207 columns]
Unnamed: 0_level_0 meta ... wvl
Unnamed: 0_level_1 Sample ... 932.695 932.89
0 1 NaN ... -12.33 9.67
1 2 NaN ... 11.94 3.94
2 3 NaN ... -2.67 28.33
3 4 NaN ... 53.22 -13.78
4 5 NaN ... 43.28
5 6 NaN ... 41.33 47.33
6 7 NaN ... -21.94 12.06
7 8 NaN ... -30.94 -1.94
8 9 NaN ... -24.78 40.22
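One possible approach (a sketch, not a confirmed solution: it assumes the running sample number is simply the first column of the concatenated frame) is to concatenate first and then renumber that column by position:
new_df = pd.concat([df1.df, df2.df], ignore_index=True)
# overwrite the first column (the sample counter) with a fresh 1..n sequence
new_df.iloc[:, 0] = range(1, len(new_df) + 1)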