Extract text from a column by delimiters in Python
So far I've only used Power Query to clean and automate files, and I want to step up my game and move to Python. I'm having some issues and have no one to ask, so I'm coming to you for help. I'm completely new to Python and learning from YouTube videos and the Python for Data Analysis book, so please bear with me for a moment.
To learn, I've been working on a project using a sample csv file. The file covers several dates and has multiple columns with different data. What I want to do is split the file into different csv files based on the date in the column "DateFull", which stores dates in a yyyy-mm-dd 00:00:00 format, and name the new csv files after the date.
Looking at YouTube videos, I came up with this piece of code:
import pandas as pd

df = pd.read_csv("sample_file.csv")
split_dates = df['DateFull'].unique()

for date in split_dates:
    df1 = df[df['DateFull'] == date]
    split_file_name = "Samplefile_" + str(date) + ".csv"
    df1.to_csv(split_file_name, index=False)
But when I run it, it errors out because the whole value, date and time, ends up in the filename, and that isn't acceptable (the colons in 00:00:00 aren't valid in a Windows filename). I've been looking into the split method to separate the DateFull column at the whitespace, but I don't know how to incorporate that into the code.
It's obvious that I don't have any idea how the structure or logic of the code should look, but my plan was to use df['DateFull'].str.split() to create two new columns: one with just the date and one with the 00:00:00 part. Then I'd remove the last one and the original DateFull so the trimmed date column replaces it, and use that one to split the csv.
I know I'm probably overcomplicating it and there's an easier way, maybe just removing the time part from the original column. If that's possible, it would be amazing to know how to do it. But I'd also like to see how to do it with my approach, since I'd be practicing more methods, even if the resulting code is redundant.
Any help would be greatly appreciated.
Thank you so much
You can find the documentation for the split() function here.
To do this with split():
str(date).split(" ")[0]
This splits on the whitespace and returns the first (0-indexed) value in the resulting list. With this change, your for loop would look like this:
for date in split_dates:
    df1 = df[df['DateFull'] == date]
    split_file_name = "Samplefile_" + str(date).split(" ")[0] + ".csv"
    df1.to_csv(split_file_name, index=False)
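For the simpler route the asker mentions (stripping the time part from the column itself), here is a sketch using pandas datetime accessors; the frame below is hypothetical stand-in data for the sample csv:

```python
import pandas as pd

# Hypothetical stand-in for the sample csv
df = pd.DataFrame({
    'DateFull': ['2023-01-01 00:00:00', '2023-01-01 00:00:00', '2023-01-02 00:00:00'],
    'value': [1, 2, 3],
})

# Parse once and keep only the date part, so str(date) is filename-safe
df['DateFull'] = pd.to_datetime(df['DateFull']).dt.date

# groupby replaces the manual unique()/filter loop
for date, group in df.groupby('DateFull'):
    group.to_csv(f"Samplefile_{date}.csv", index=False)
```

The groupby loop yields one sub-frame per unique date, so no explicit boolean filtering is needed.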
Related
Converting 0-1 values in a dataset to the name of the column if the cell value is 1
I have a csv dataset with 0-1 values for the features of the elements. I want to iterate over each cell and replace the 1 values with the name of the cell's column. There are more than 500,000 rows and 200 columns, and because the table is exported from another annotation tool which I update often, I want to find a way in Python to do it automatically. This is not the real table, but a sample I was using while trying to write the code. I tried a few things, but without success. I would really appreciate it if you could share your knowledge with me; it would be a huge help. The final result I want is of the type: (abonojnë, token_pos_verb). If you know any method to do this in Excel without the help of Python, it would be even better. Thank you, Brikena

Text,Comment,Role,ParentID,doc_completeness,lemma,MultiWord_Expr,token,pos,punctuation,verb,noun,adjective
abonojnë,,,,,,,1,1,0,1,0,0
çokasin,,,,,,,1,1,0,1,0,1
gërgasin,,,,,,,1,1,0,1,0,0
godasin,,,,,,,1,1,0,1,0,0
përkasin,,,,,,,1,1,1,1,0,0
përdjegin,,,,,,,1,1,0,1,0,0
lakadredhin,,,,,,,1,1,0,1,1,0
përdredhin,,,,,,,1,1,0,1,0,0
spërdredhin,,,,,,,1,1,0,1,0,0
përmbledhin,,,,,,,1,1,0,1,0,0
shpërdredhin,,,,,,,1,1,0,1,0,0
arsejnë,,,,,,,1,1,0,1,1,0
çapëlejnë,,,,,,,1,1,0,1,0,0
Using pandas, this is quite easy:

# pip install pandas
import pandas as pd

# read data (here example with csv, but use "read_excel" for excel)
df = pd.read_csv('input.csv').set_index('Text')

# reshape and export (here use "to_excel" for excel format)
(df.mul(df.columns).where(df.eq(1))
   .stack().rename('xxx')
   .groupby(level=0).apply('_'.join)
 ).to_csv('output.csv')

output file:

Text,xxx
abonojnë,token_pos_verb
arsejnë,token_pos_verb_noun
godasin,token_pos_verb
gërgasin,token_pos_verb
lakadredhin,token_pos_verb_noun
përdjegin,token_pos_verb
përdredhin,token_pos_verb
përkasin,token_pos_punctuation_verb
përmbledhin,token_pos_verb
shpërdredhin,token_pos_verb
spërdredhin,token_pos_verb
çapëlejnë,token_pos_verb
çokasin,token_pos_verb_adjective
An update for those who may find it helpful in the future. Thank you to @mozway for helping me. A friend of mine suggested working with an Excel formula, because the pandas solution with groupby eliminates duplicates. I need all the duplicates: since this is an annotated corpus, it's normal that repeated words should appear in every context, not only the first occurrence. The alternative is this: use a second sheet in the Excel file, write the formula =IF(Sheet1!B2=1,Sheet2!B$1,"") in the first cell with 0-1 values, and drag it across all the other cells. This keeps all occurrences of the words. It's quick and it works like magic. I hope this can be helpful to others who want to convert a 0-1 dataset to feature names without having to code.
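For completeness, a duplicate-preserving version is also possible in pandas: joining the column names row by row, instead of grouping on the index, keeps every occurrence of a repeated word. A minimal sketch with hypothetical inline data (feature columns shortened from the question's sample):

```python
import pandas as pd

# Hypothetical sample with a repeated word
df = pd.DataFrame({
    'Text': ['abonojnë', 'abonojnë', 'çokasin'],
    'token': [1, 1, 1],
    'pos': [1, 1, 1],
    'verb': [1, 1, 1],
    'adjective': [0, 0, 1],
})

feature_cols = df.columns[1:]

# Join the names of the columns that hold 1, one row at a time,
# so duplicate rows survive (unlike the groupby(level=0) version)
df['xxx'] = df[feature_cols].apply(
    lambda row: '_'.join(c for c in feature_cols if row[c] == 1), axis=1)

print(df['xxx'].tolist())
# ['token_pos_verb', 'token_pos_verb', 'token_pos_verb_adjective']
```

The per-row apply is slower than the vectorized stack/groupby approach, but on a few hundred thousand rows it should still finish in seconds.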
Pandas df.loc with regex
I'm working with a data set consisting of several csv files of nearly the same form. Each csv describes a particular date and labels the data by state/province. However, the format of one of the column headers in the data set was altered from Province/State to Province_State, so that all csv's created before a certain date use the first format and all csv's created after that date use the second. I'm trying to sum up all the entries corresponding to a particular state. At present, the code I'm working with is as follows:

daily_data.loc[daily_data[areaLabel] == location].sum()

where daily_data is the dataframe containing the csv data, location is the name of the state I'm looking for, and areaLabel is a variable storing either 'Province/State' or 'Province_State' depending on the result of a date check. I would like to eliminate the date check by, e.g., conditioning on a regular expression like Province(/|_)State, but I'm having a lot of trouble finding a way to index into a pandas dataframe by regular expression. Is this doable (and in a way that would make the code more elegant rather than less)? If so, I'd appreciate it if someone could point me in the right direction.
Use filter to get the columns that match your regex:

>>> df.filter(regex="Province(/|_)State").columns[0]
'Province/State'

Then use this to select only rows that match your location:

df[df[df.filter(regex="Province(/|_)State").columns[0]] == location].sum()

This however assumes that there are no other columns that would match the regex.
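To make the answer concrete, here is a self-contained sketch with hypothetical data for both header variants (the character class [/_] is equivalent to the (/|_) group):

```python
import pandas as pd

# Hypothetical frames using the two header variants
old = pd.DataFrame({'Province/State': ['Ontario', 'Ontario', 'Quebec'],
                    'cases': [3, 4, 5]})
new = pd.DataFrame({'Province_State': ['Ontario'],
                    'cases': [7]})

def cases_for(df, location):
    # Find whichever header variant this frame uses
    col = df.filter(regex=r"Province[/_]State").columns[0]
    return df.loc[df[col] == location, 'cases'].sum()

print(cases_for(old, 'Ontario'))  # 7  (3 + 4)
print(cases_for(new, 'Ontario'))  # 7
```

Wrapping the lookup in a small function keeps the regex in one place, which is the elegance the asker was after.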
Merging and cleaning up csv files in Python
I have been using pandas but am open to all suggestions; I'm not an expert at scripting and am at a complete loss. My goal is the following:

1. Merge multiple CSV files. I was able to do this in pandas and have a dataframe with the merged dataset (screenshot of how the merged dataset looks).

2. Delete the duplicated "GEO" columns after the first set. This part doesn't let me use df = df.loc[:,~df.columns.duplicated()] because they are not technically duplicated: the repeated column names end with .1, .2, etc., which I'm guessing the concat adds. The other problem is that some columns share a column name but hold different datasets. I have been using the first row as the index, since it's always the same coded values, but this row is unnecessary and will be deleted later in the script. This is my biggest problem right now.

3. Delete certain columns, such as the ones with the "Margins". I use ~df2.columns.str.startswith for this and have no trouble with it.

4. Replace spaces, ":" and ";" with underscores in the first row. I have no clue how to do this.

5. Insert a new column, write a '=TEXT(B1,0)' formula for the whole column (the formula changing to B2, B3, etc.), then copy the column and paste as values. I was able to do this in openpyxl, although I was having trouble and was not able to try the final output thanks to Excel trouble:

source = excel.Workbooks.Open(filename)
excel.Range("C1:C1337").Select()
excel.Selection.Copy()
excel.Selection.PasteSpecial(Paste=constants.xlPasteValues)

Not sure if it works, and I was wondering whether this is possible in pandas or win32com, or whether I should stay with openpyxl. Thanks all!
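The duplicated-column step above can be handled with a regex on the column labels rather than columns.duplicated(); a minimal sketch, assuming the suffixed repeats really are named GEO.1, GEO.2, etc. (hypothetical frame below):

```python
import pandas as pd

# Hypothetical frame with the numbered duplicates pandas creates on concat
df = pd.DataFrame([[1, 2, 3, 4]],
                  columns=['GEO', 'value', 'GEO.1', 'GEO.2'])

# Keep the first GEO column, drop the numbered repeats
df = df.loc[:, ~df.columns.str.match(r'GEO\.\d+$')]

print(list(df.columns))  # ['GEO', 'value']
```

Columns that merely share a name but hold different data are untouched, since the pattern only matches the literal GEO.<number> suffix form.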
Pandas returns a different word count than Notepad++ and Excel. Which one is correct?
I have a .csv file with 3 columns and 500,000+ lines. I'm trying to get insight into this dataset by counting occurrences of certain tags. When I started, I used Notepad++'s count function for the tags I found and noted the results by hand. Now that I want to automate that process, I use pandas to do the same thing, but the results differ quite a bit. Results for all tags summed up are:

Notepad++: 91,500
Excel: 91,677
Python/pandas: 91,034

Quite a difference, and I have no clue how to explain this or how to validate which result I can trust and use. My Python code looks like this and is fully functional:

# CSV.READ | Delimiter: ; | Datatype: string | Using only first 3 columns
df = pd.read_csv("xxx.csv", sep=';', dtype="str")

# fill nan with "Empty" to allow indexing
df = df.fillna("Empty")

# count and sort occurrences of the object (3rd) category
occurrences = df['object'].value_counts()

# filter rows with "Category:" tags
tags_occurrences = df[df['object'].str.contains("Category:")]

# display the tag counts
tags_occurrences2 = tags_occurrences['object'].value_counts()

Edit: I already iterated through the other columns, which turned up another 120 tags, but there is still a discrepancy. In Excel and Notepad++ I just open Ctrl+F, search for "Category:" and use their count functions. Has anyone had a similar experience, or can anyone explain what might cause this? Are Excel and Notepad++ making errors while counting? I can't imagine pandas (being used in ML and data science a lot) would have such flaws.
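One possible source of the discrepancy, offered as a guess rather than a diagnosis: str.contains counts matching rows, not matching occurrences, so a cell that holds "Category:" twice counts once in pandas but twice in an editor's find-all. A sketch of the difference, with hypothetical data:

```python
import pandas as pd

# Hypothetical column where one cell holds the tag twice
s = pd.Series(['Category:a', 'Category:b Category:c', 'other'])

rows_with_tag = s.str.contains('Category:').sum()  # rows matching at least once
total_hits = s.str.count('Category:').sum()        # every occurrence

print(rows_with_tag, total_hits)  # 2 3
```

Comparing the two sums on the real file would show whether multi-tag cells explain the gap.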
Python - read excel data when format changes every time
I get an excel file from someone and need to read the data every month. The format is not stable each time, and by "not stable" I mean: Where the data starts changes: e.g. Section A may start at row 4, column D this time, but next time at row 2, column E. Under each section there are tags. The number of tags may change as well, but every time I only need the data in tag_2 and tag_3 (these two will always show up). The only data I need is from tag_2 and tag_3, for each month (month1 - month8). I want to find a way, using Python, to first locate the section name, then find tag_2 and tag_3 under that section, then get the data for month1 to month8 (the number of months may change as well). Please note that I do NOT want to locate the data by specifying fixed locations in excel, since the locations change every time. How do I do this? The end product should be a pandas dataframe with monthly data for tag_2 and tag_3, plus a column that says which section the data came from. Thanks.
I think you can directly read it as a comma separated text file. Based on what you need, you can look at tag_2 and tag_3 for each line.

with open(filename, "r") as fs:
    for line in fs:
        cell_list = line.split(",")
        # At this point you have all elements on the line as a list;
        # you can check the size and implement your logic
Assuming that the (presumably manually pasted) block of information is unlikely to end up in the very bottom-right corner of the excel sheet, you could simply iterate over rows and columns (set maximum values for each to prevent long search times) until you find a familiar value (such as "Section A") and go from there. Unless I misunderstood you, the rest of the format should be consistent between months, so you can then assume that, e.g., "month_1" is always one cell up and two to the right of that initial spot.

I have not personally worked with excel sheets in Python, so I cannot say whether the following is possible there, but it definitely works in Excel VBA: you could just as well use the Range.Find() method to find the value "Section A" and continue with the same process as above, perhaps writing any results to a txt file and calling your Python script from there if necessary.

I hope this helps a little.
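The row/column scan described above can be sketched without any excel library, using a nested list as a stand-in for a sheet's rows (the grid layout and function name below are hypothetical):

```python
# Hypothetical grid standing in for rows read from an excel sheet
grid = [
    ['', '', '', ''],
    ['', '', 'Section A', ''],
    ['', '', 'tag_2', 20],
    ['', '', 'tag_3', 30],
]

def find_anchor(rows, target, max_rows=100, max_cols=50):
    """Return (row, col) of the first cell equal to target, or None."""
    for r, row in enumerate(rows[:max_rows]):
        for c, value in enumerate(row[:max_cols]):
            if value == target:
                return r, c
    return None

print(find_anchor(grid, 'Section A'))  # (1, 2)
```

Once the anchor is found, the tag rows and month columns can be read at fixed offsets from (r, c), which is exactly the "go from there" step the answer describes.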