Trying to salvage a messed up dataset - python
I've been tasked with the tedious job of salvaging a mangled dataset that looks like this in its .txt file form:
"DATE"|"ID"|"TITLE"|"NAME"|"SURNAME"|"INCOME"|"ADDRESS1"|"JOB"
||MR|ROBIN|WILLIAMS|291029102|
2019-01-01||MRS||DREW||SECOND_TO_THE_RIGHT_AND_STRAIGHT_ON_TILL_MORNING||
2021-02-04|||JONATHAN|SIMMONS|3012000|SOMEWHERE_IN_BEVERLY_HILLS|
|MARK|ZUCKERBURG||SILICON_VALLEY|
|||PARKER|JONES||SOMEWHERE_IN_HIS_HOMETOWN_IN_
NORTHERN_ITALY
2019-06-03|4444301243451|MRS|CARMEN|SANDIEGO||WHERE_IN_THE_WORLD_IS_
CARMEN_SANDIEGO|THIEF
2018-02-25|2395022812223|MR|ARNOLD|SCHWARZENEGGER|MUSCLE_BEACH_VENI
CE_CALIFORNIA_USA|ACTOR
|DOCTOR|STRANGE|NEW_YORK_SA
NCTUM|SORCERER_SUPREME
||HAL|JORDAN||EARTH|GREEN_LANTERN
2019-01-02|0212931229475|MR|KEVIN|BACON|129340212|SOME_PLACE_IN_THE_US|ACTOR
2019-01-03|2939583749642|MRS|NICOLE|KIDMAN|291928494|BEVERLY_HILLS|ACTOR
2019-03-15|2947959371923|MR|WARREN|BUFFET|2000000000|OMAHA_NEBRASKA|INVESTOR
2020-02-45|2847939172394|MRS|JULIE|ARIS|200034|SOMEWHERE_IN_THE_UK|LECTURER
There are two major problems with this dataset that I've been struggling with. The first is that certain rows don't have enough delimiters (e.g. the first data row). The second is that long values in the address column have somehow caused the address to "spill over" onto the next row. When I asked the people in charge of extracting the data from their databases why it looked like this, the explanation was that people had pressed "Enter" while typing in long addresses, which also means the data was stored this way in the database to begin with, ruling out any possibility of a re-extraction fixing these errors. According to them, the easiest way to rectify the second problem is to press backspace at the head of each problematic row, like this:
"DATE"|"ID"|"TITLE"|"NAME"|"SURNAME"|"INCOME"|"ADDRESS1"|"JOB"
||MR|ROBIN|WILLIAMS|291029102|
2019-01-01||MRS||DREW||SECOND_TO_THE_RIGHT_AND_STRAIGHT_ON_TILL_MORNING||
2021-02-04|||JONATHAN|SIMMONS|3012000|SOMEWHERE_IN_BEVERLY_HILLS|
|MARK|ZUCKERBURG||SILICON_VALLEY|
2019-01-02|0212931229475|MR|KEVIN|BACON|129340212|SOME_PLACE_IN_THE_US|ACTOR
2019-01-03|2939583749642|MRS|NICOLE|KIDMAN|291928494|BEVERLY_HILLS|ACTOR
2019-03-15|2947959371923|MR|WARREN|BUFFET|2000000000|OMAHA_NEBRASKA|INVESTOR
2020-02-45|2847939172394|MRS|JULIE|ARIS|200034|SOMEWHERE_IN_THE_UK|LECTURER
While technically not a bad solution, I'm not about to do this manually for a dataset that has over 2 million rows.
My first idea was to look for escape characters in the data file using Notepad++, but I found no sign of any. My second idea was to read the data file row by row and append each row to a dataframe with the following code:
import time
import pandas as pd

start = time.time()

with open("DATAFILE.txt", 'rt', encoding='UTF8') as f:
    data = f.read()

fl = data.splitlines()
len(fl)  # 350231 rows

df1 = pd.DataFrame()
for i in range(0, 10000):
    # skip the header line and split each row on the pipe delimiter
    df1 = df1.append([fl[i + 1].split('|')], ignore_index=True)

print('Total Time:', round((time.time() - start), 2), '(sec)')
This got me nowhere, as it gave me exactly the same results as read_csv, which is why I'd very much appreciate some advice on how to solve this problem.
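As an aside, the same frame can be built in one pass instead of appending row by row (DataFrame.append is deprecated and has been removed in recent pandas versions), though it produces exactly the same mis-parsed rows:

import pandas as pd

# identical rows to the loop above, built in a single pass
df1 = pd.DataFrame([line.split('|') for line in fl[1:10001]])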
Example of a row with empty values filled in with NaN:
"DATE"|"ID"|"TITLE"|"NAME"|"SURNAME"|"INCOME"|"ADDRESS1"|"JOB"
NaN |NaN|MR|ROBIN|WILLIAMS|291029102|NaN
Note that this row is also missing a delimiter for the JOB column.
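For what it's worth, here is a rough sketch of the kind of automated "backspace" join I have in mind. It assumes that every genuine record starts either with a date or with a pipe (an empty DATE field), while spilled address fragments start with neither, and that rows that are short on delimiters can only be padded at the end; both assumptions hold for the sample above but are unverified for the full file:

import re
import pandas as pd

# a real record starts with "YYYY-MM-DD|" or with "|"; spilled fragments start with neither
new_record = re.compile(r'^(\d{4}-\d{2}-\d{2}\||\|)')

records = []
with open("DATAFILE.txt", 'rt', encoding='UTF8') as f:
    header = next(f).strip().replace('"', '').split('|')
    for raw in f:
        line = raw.rstrip('\n')
        if records and not new_record.match(line):
            # spill-over: glue the fragment back onto the previous record
            records[-1] += line
        else:
            records.append(line)

# pad rows that are short on delimiters; this can only assume the missing
# fields are the trailing ones, interior gaps are not recoverable this way
rows = [(r.split('|') + [''] * len(header))[:len(header)] for r in records]
df = pd.DataFrame(rows, columns=header).replace('', pd.NA)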
Related
How to get rid of rows with pandas in a CSV where the value of cells in a specific column is under 100 Billion?
I'm trying to filter through a CSV and make a new CSV which is exactly the same except that it gets rid of any rows that have a value greater than 100 billion in the 'marketcap' column. The code I've written so far just spits out the same CSV as the original and doesn't cut any lines from the old CSV to the new CSV. Code:

db = pd.read_csv('SF1_original.csv')
db = db[db['marketcap'] <= 100000000000]
db.to_csv('new_SF1_original.csv')

Example of old CSV (it's long, don't look through the whole thing, it's just to give you an idea):

ticker,dimension,calendardate,datekey,reportperiod,lastupdated,accoci,assets,assetsavg,assetsc,assetsnc,assetturnover,bvps,capex,cashneq,cashnequsd,cor,consolinc,currentratio,de,debt,debtc,debtnc,debtusd,deferredrev,depamor,deposits,divyield,dps,ebit,ebitda,ebitdamargin,ebitdausd,ebitusd,ebt,eps,epsdil,epsusd,equity,equityavg,equityusd,ev,evebit,evebitda,fcf,fcfps,fxusd,gp,grossmargin,intangibles,intexp,invcap,invcapavg,inventory,investments,investmentsc,investmentsnc,liabilities,liabilitiesc,liabilitiesnc,marketcap,ncf,ncfbus,ncfcommon,ncfdebt,ncfdiv,ncff,ncfi,ncfinv,ncfo,ncfx,netinc,netinccmn,netinccmnusd,netincdis,netincnci,netmargin,opex,opinc,payables,payoutratio,pb,pe,pe1,ppnenet,prefdivis,price,ps,ps1,receivables,retearn,revenue,revenueusd,rnd,roa,roe,roic,ros,sbcomp,sgna,sharefactor,sharesbas,shareswa,shareswadil,sps,tangibles,taxassets,taxexp,taxliabilities,tbvps,workingcapital
A,ARQ,1999-12-31,2000-03-15,2000-01-31,2020-09-01,53000000,7107000000,,4982000000,2125000000,,10.219,-30000000,1368000000,1368000000,1160000000,131000000,2.41,0.584,665000000,111000000,554000000,665000000,281000000,96000000,0,0.0,0.0,202000000,298000000,0.133,298000000,202000000,202000000,0.3,0.3,0.3,4486000000,,4486000000,50960600000,,,354000000,0.806,1.0,1086000000,0.484,0,0,4337000000,,1567000000,42000000,42000000,0,2621000000,2067000000,554000000,51663600000,1368000000,-160000000,2068000000,111000000,0,1192000000,-208000000,-42000000,384000000,0,131000000,131000000,131000000,0,0,0.058,915000000,171000000,635000000,0.0,11.517,,,1408000000,0,114.3,,,1445000000,131000000,2246000000,2246000000,290000000,,,,,0,625000000,1.0,452000000,439000000,440000000,5.116,7107000000,0,71000000,113000000,16.189,2915000000

Example of new CSV (exactly the same, when this line should have been cut):

,ticker,dimension,calendardate,datekey,reportperiod,lastupdated,accoci,assets,assetsavg,assetsc,assetsnc,assetturnover,bvps,capex,cashneq,cashnequsd,cor,consolinc,currentratio,de,debt,debtc,debtnc,debtusd,deferredrev,depamor,deposits,divyield,dps,ebit,ebitda,ebitdamargin,ebitdausd,ebitusd,ebt,eps,epsdil,epsusd,equity,equityavg,equityusd,ev,evebit,evebitda,fcf,fcfps,fxusd,gp,grossmargin,intangibles,intexp,invcap,invcapavg,inventory,investments,investmentsc,investmentsnc,liabilities,liabilitiesc,liabilitiesnc,marketcap,ncf,ncfbus,ncfcommon,ncfdebt,ncfdiv,ncff,ncfi,ncfinv,ncfo,ncfx,netinc,netinccmn,netinccmnusd,netincdis,netincnci,netmargin,opex,opinc,payables,payoutratio,pb,pe,pe1,ppnenet,prefdivis,price,ps,ps1,receivables,retearn,revenue,revenueusd,rnd,roa,roe,roic,ros,sbcomp,sgna,sharefactor,sharesbas,shareswa,shareswadil,sps,tangibles,taxassets,taxexp,taxliabilities,tbvps,workingcapital
0,A,ARQ,1999-12-31,2000-03-15,2000-01-31,2020-09-01,53000000.0,7107000000.0,,4982000000.0,2125000000.0,,10.219,-30000000.0,1368000000.0,1368000000.0,1160000000.0,131000000.0,2.41,0.584,665000000.0,111000000.0,554000000.0,665000000.0,281000000.0,96000000.0,0.0,0.0,0.0,202000000.0,298000000.0,0.133,298000000.0,202000000.0,202000000.0,0.3,0.3,0.3,4486000000.0,,4486000000.0,50960600000.0,,,354000000.0,0.8059999999999999,1.0,1086000000.0,0.484,0.0,0.0,4337000000.0,,1567000000.0,42000000.0,42000000.0,0.0,2621000000.0,2067000000.0,554000000.0,51663600000.0,1368000000.0,-160000000.0,2068000000.0,111000000.0,0.0,1192000000.0,-208000000.0,-42000000.0,384000000.0,0.0,131000000.0,131000000.0,131000000.0,0.0,0.0,0.057999999999999996,915000000.0,171000000.0,635000000.0,0.0,11.517000000000001,,,1408000000.0,0.0,114.3,,,1445000000.0,131000000.0,2246000000.0,2246000000.0,290000000.0,,,,,0.0,625000000.0,1.0,452000000.0,439000000.0,440000000.0,5.1160000000000005,7107000000.0,0.0,71000000.0,113000000.0,16.189,2915000000.0

I've seen two questions somewhat related to this on StackOverflow, but they haven't helped me much. This one uses the CSV library instead of pandas (which is an option for me). This one is more helpful since it uses pandas, but it still hasn't been interacted with and isn't exactly the same as my use case.
You can get the indexes of the rows with "marketcap" over 100 billion like so:

df.loc[df["marketcap"] > 100000000000]["marketcap"].index

All that's left to do is drop them from the DataFrame:

df.drop(df.loc[df["marketcap"] > 100000000000]["marketcap"].index, inplace=True)

Reading from the CSV and writing to the CSV is already correctly taken care of in your code.
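Putting the pieces together, the whole script might look something like this sketch; index=False is only there to avoid writing the row index as an extra first column, which is what produces the leading unnamed column visible in the new-CSV example above:

import pandas as pd

db = pd.read_csv('SF1_original.csv')
db.drop(db.loc[db["marketcap"] > 100000000000]["marketcap"].index, inplace=True)
db.to_csv('new_SF1_original.csv', index=False)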
Separating csv columns with delimiter / sep
My goal is to separate data stored in cells into multiple columns in the same row. For example, I would like to take data that looks like this:

Row 1: [<1><2>][<3><4>][][]
Row 2: [<1><2>][<3><4>][][]

into data that looks like this:

Row 1: [1][2][3][4]
Row 2: [1][2][3][4]

I tried using the code below to pull the csv and separate each line at the ">":

df = pd.read_csv('file.csv', engine='python', sep="\*>", header=None)

However, the code did not function as anticipated. Instead, the separation occurred at seemingly random and unpredictable points (I'm sure there's a pattern but I don't see it), and each break created another row as opposed to another column. For example:

Row 1: [<1>][<2>]
Row 2: [<3>]
Row 3: [<4>]

I thought the issue might lie with reading the CSV file, so I tried re-scraping the site with the separator included, but it produced the same results, so I'm assuming it's an issue with the separator call. However, I only landed on that call after trying many others that caused various errors. For example, when I tried sep = '>' I got the following error:

ParserError: '>' expected after '"'

and when I tried sep = '\>', I got the following error:

ParserError: Expected 36 fields in line 1106, saw 120. Error could possibly be due to quotes being ignored when a multi-char delimiter is used.

These errors sent me looking through multiple resources, including this and this among others. However, I have found no resources that successfully demonstrate how to separate each column within a row using a '>' delimiter. If anyone knows how to do this, please let me know. Your help is much appreciated!

Update: Here is an actual screenshot of the CSV file for a better understanding of what I was trying to demonstrate above. My end goal is to have all the data in columns so that each column holds one descriptive factor, as opposed to many as they do now.
Would this work:

string = "[<1><2>][<3><4>][][]"
string = string.replace("[", "")
string = string.replace("]", "")
string = string.replace("<", "[")
string = string.replace(">", "]")
print(string)

Result:

[1][2][3][4]
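If the data is already sitting in a pandas column rather than a single string, the same chain of replacements could be applied to the whole column at once. A small sketch, where the column name 0 and the sample values are just placeholders:

import pandas as pd

df = pd.DataFrame({0: ["[<1><2>][<3><4>][][]", "[<1><2>][<3><4>][][]"]})
df[0] = (df[0].str.replace("[", "", regex=False)
              .str.replace("]", "", regex=False)
              .str.replace("<", "[", regex=False)
              .str.replace(">", "]", regex=False))
print(df[0].tolist())  # ['[1][2][3][4]', '[1][2][3][4]']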
I ended up using Google Sheets. Once you upload the csv there is a header titled "data" and then a sub-section titled "split text to columns." If you want a faster way to do this with code, you can also do the following with pandas:

# new data frame with split value columns
new = data["Name"].str.split(" ", n=1, expand=True)

# making separate first name column from new data frame
data["First Name"] = new[0]

# making separate last name column from new data frame
data["Last Name"] = new[1]

# Dropping old Name columns
data.drop(columns=["Name"], inplace=True)

# df display
data
Swapping dataframe column data without changing the index for the table
While compiling a pandas table to plot certain activity on a tool, I have encountered a rare error in the data that creates an extra 2 columns for certain entries. This means that one of my computed columns goes into the table 2 cells further on than the other and kills the plot. I was hoping to find a way to pull the contents of a single cell in a row and swap it into the cell beside it, which contains irrelevant information in the error case, but which is used for the plot for all the other rows. I've tried a couple of different ways to swap the data around but keep hitting errors. My attempts to fix it include:

for rows in df['server']:
    if '%USERID' in line:
        df['server'] = df[7]  # both versions of this and below
        df['server'].replace(df['server'], df[7])
    else:
        pass

if '%USERID' in df['server']:  # Attempt to fix missing server name
    df['server'] = df[7]
else:
    pass

if '%USERID' in df['server']:
    return row['7'], row['server']
else:
    pass

I'd like the data from column '7' to be replicated in 'server', but only in the error case, i.e. where the data in the cell contains a string starting with '%USERID'.
Turns out I was over-thinking this one. I took a step back, worked the code a bit and solved it. Rather than trying to smash together a one-size-fits-all bit of code for all the data, I built separate lists for the general data and the 2 exceptions I found by writing a nested loop, and created 3 data frames. These were easy enough to then manipulate individually and finally concatenate together. All working fine now.
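For reference, the in-place swap described in the question can probably also be done with a boolean mask instead of splitting the frame. A short sketch, using the column names from the question with made-up example rows:

import pandas as pd

df = pd.DataFrame({
    'server': ['web01', '%USERID_JSMITH', 'web03'],  # hypothetical sample data
    7: ['ignore', 'web02', 'ignore'],
})

# copy column 7 into 'server' only for the error rows, i.e. where the
# 'server' cell holds a string starting with '%USERID'
mask = df['server'].astype(str).str.startswith('%USERID')
df.loc[mask, 'server'] = df.loc[mask, 7]
print(df)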
Pandas returns a different word count than notepad++ and excel. Which one is correct?
I have a .csv file with 3 columns and 500.000+ lines. I'm trying to get insight into this dataset by counting occurrences of certain tags. When I started, I used Notepad++'s count function for the tags I found and noted the results by hand. Now that I want to automate that process, I use pandas to do the same thing, but the results differ quite a bit. The results for all tags summed up are:

Notepad++: 91.500
Excel: 91.677
Python/pandas: 91.034

Quite a difference, and I have no clue how to explain this or how to validate which result I can trust and use. My python code looks like this and is fully functional:

# CSV.READ | Delimiter: ; | Datatype: string | Using only first 3 columns
df = pd.read_csv("xxx.csv", sep=';', dtype="str")

# fills nan with "Empty" to allow indexing
df = df.fillna("Empty")

# counts and sorts occurrences of object(3) category
occurences = df['object'].value_counts()

# filter columns with "Category:" tags
tags_occurences = df[df['object'].str.contains("Category:")]

# displays and counts
tags_occurences2 = tags_occurences['object'].value_counts()

Edit: I already iterated through the other columns, which turned up another 120 tags, but there is still a discrepancy. In Excel and Notepad++ I just open Ctrl+F and search for "Category:" using their count functions. Has anyone had a similar experience or can explain what might cause this? Are Excel and Notepad++ making errors while counting? I can't imagine pandas (being used in ML and Data Science a lot) would have such flaws.
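One sanity check I'm considering is counting the substring directly in the raw text, independent of any CSV parsing, roughly like this (the encoding is a guess):

# raw occurrence count of the tag, independent of any CSV parsing
with open("xxx.csv", encoding="utf-8") as f:
    raw_count = sum(line.count("Category:") for line in f)
print(raw_count)

Note that str.contains and value_counts only tell you which rows contain the tag at least once, so a row holding several "Category:" tags is counted once by pandas but several times by a Ctrl+F count, which might explain part of the gap.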
Python - read excel data when format changes every time
I get an excel file from someone and I need to read the data every month. The format is not stable each time, and by "not stable" I mean:

Where the data starts changes: e.g. Section A may start at row 4, column D this time, but next time it may start at row 2, column E.

Under each section there are tags. The number of tags may change as well, but every time I only need the data in tag_2 and tag_3 (these two will always show up).

The only data that I need is from tag_2 and tag_3, for each month (month1 - month8). I want to find a way, using Python, to first locate the section name, then find tag_2 and tag_3 under that section, then get the data for month1 to month8 (the number of months may change as well). Please note that I do NOT want to locate the data by specifying locations in the excel file, since the locations change every time. How do I do this? The end product should be a pandas dataframe that has the monthly data for tag_2 and tag_3, with a column that says which section the data comes from. Thanks.
I think you can directly read it as a comma separated text file. Based on what you need, you can look at tag2 and tag3 for each line:

with open(filename, "r") as fs:
    for line in fs:
        cell_list = line.split(",")
        # At this point you will have all elements on the line as a list;
        # you can check the size and implement your logic
Assuming that the (presumably manually pasted) block of information is unlikely to end up in the very bottom-right corner of the excel sheet, you could simply iterate over rows and columns (set maximum values for each to prevent long search times) until you find a familiar value (such as "Section A") and go from there. Unless I misunderstood you, the rest of the format should be consistent between months, so you can simply assume that "month_1" is always one cell up and two to the right of that initial spot. I have not personally worked with excel sheets in python, so I cannot state whether the following is possible in python, but it definitely works in Excel VBA: you could just as well use the Range.Find() method to find the value "Section A" and continue with the same process as above, perhaps writing any results to a txt file and calling your python script from there if necessary. I hope this helps a little.
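Sketching roughly what that row/column scan might look like in Python with openpyxl (untested on my side; the file name, sheet choice, search bounds and the month-cell offset are all assumptions):

from openpyxl import load_workbook

wb = load_workbook("monthly_report.xlsx", data_only=True)  # hypothetical file name
ws = wb.active

anchor = None
# bounded scan so a huge sheet doesn't take forever
for row in ws.iter_rows(min_row=1, max_row=100, max_col=50):
    for cell in row:
        if cell.value == "Section A":
            anchor = (cell.row, cell.column)
            break
    if anchor:
        break

if anchor:
    r, c = anchor
    # assumed offset: month_1 sits one row up and two columns to the right of the anchor
    month_1 = ws.cell(row=r - 1, column=c + 2).value
    print(month_1)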